基于LASSO算法筛选重要变量后的分类模型

前言

上一节,我们筛选得到差异的蛋白质后,可以做后续分析如与可表征疾病临床特征的指标做关联分析,也可以基于差异蛋白进行再次筛选重要蛋白质,最后基于这些蛋白质构建分类模型。

PS:在预后分析过程中,基于重要蛋白质丰度值和对应LASSO系数计算预后得分,再结合得分和生存时间状态计算相关性,最后也可以对预后得分进行阈值划分人群区分高低风险患者。另外有更取巧的是直接对基因区分在疾病人群高表达和低表达两类基因群构建基因集,利用GSVA包的ssgsea算法对样本的不同基因集打分,基于打分构建分类模型。

本文只介绍:

  1. LASSO模型筛选非零系数的重要蛋白质;
  2. 基于重要蛋白质构建分类器。

导入数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
library(dplyr)
library(tibble)
library(ggplot2)
library(data.table)
library(glmnet)

# rm(list = ls())
options(stringsAsFactors = F)
options(future.globals.maxSize = 1000 * 1024^2)

subtype <- c("PAT", "PCT", "LMS")
type.col <- c("#E860A4", "#9748BC", "#5AADD8")

PAT_PCT_DEG <- fread("PAT_PCT_limma_Gene_Unpaired.csv")
ExprSet <- readRDS("GeneExprSet_filtered.RDS")

Feature Selection

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
LASSO_fun <- function(dataset=ExprSet,
restest=PAT_PCT_DEG,
group_name=subtype[1:2]){
# dataset=ExprSet
# restest=PAT_PCT_DEG
# group_name=subtype[1:2]

phen <- pData(dataset) %>%
filter(SubType %in% group_name)
prof <- exprs(dataset)
feature <- restest %>% filter(Enrichment != "Nonsignif")

prof_cln <- prof[rownames(prof)%in%feature$GeneID,
colnames(prof)%in%rownames(phen)] %>% t()

if(!any(rownames(prof_cln) == rownames(phen))){
stop("Order of SampleID is wrong")
}

dat_target <- phen$SubType
dat_table <- model.matrix(~., scale(prof_cln) %>% data.frame())

# feature selection
set.seed(123)
lasso <- cv.glmnet(x=dat_table,
y=dat_target,
family='binomial',
nfolds = 10,
alpha = 1,
nlambda = 100)

lasso.mk <- data.frame(as.matrix(coef(lasso, lasso$lambda.min))) %>%
setNames("Score") %>%
rownames_to_column("Feature") %>%
slice(-1) %>%
filter(Score!=0)

res <- list(fit=lasso,
marker=lasso.mk)
return(res)
}

PAT_PCT_LASSO <- LASSO_fun(
dataset=ExprSet,
restest=PAT_PCT_DEG,
group_name=subtype[1:2])
DT::datatable(PAT_PCT_LASSO$marker)

Note: 响应变量是分类变量,需要设置二项分布binomial

Classification

分类器:logistic regression和support vector machine两类方法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
model_fun <- function(dataset=ExprSet,
datfit=PAT_PCT_LASSO,
model="glm",
group_name=subtype[1:2],
prob=0.7){
#dataset=ExprSet
#datfit=PAT_PCT_LASSO
#model="glm"
#group_name=subtype[1:2]
#prob=0.7

phen <- pData(dataset) %>%
filter(SubType %in% group_name)
prof <- exprs(dataset)
prof_cln <- prof[rownames(prof)%in%datfit$marker$Feature,
colnames(prof)%in%rownames(phen)] %>% t()

if(!any(rownames(prof_cln) == rownames(phen))){
stop("Order of SampleID is wrong")
}

mdat <- phen %>% rownames_to_column("SampleID") %>%
dplyr::select(SampleID, SubType) %>%
inner_join(prof_cln %>% data.frame() %>%
rownames_to_column("SampleID"),
by ="SampleID") %>%
column_to_rownames("SampleID") %>%
mutate(SubType=factor(SubType, levels = group_name))

# build model
set.seed(123)
require(sampling)
require(pROC)
sample <- sampling::strata(mdat,
stratanames = "SubType",
size = rev(round(as.numeric(table(mdat$SubType)) * prob)),
method = "srswor")
trainData <- mdat[sample$ID_unit, ]
testData <- mdat[-sample$ID_unit, ]

if(model == "glm"){
fit <- glm(SubType~., family="binomial", data=trainData)
# summary(fit)
predicted <- predict(fit, testData, type="response")
# head(predicted)
rocobj <- roc(testData$SubType, predicted)
auc <- round(auc(testData$SubType, predicted), 4)
}else if(model == "svm"){
require(e1071)
set.seed(123)
tuned <- tune.svm(SubType~., data=trainData, gamma=10^(-6:1), cost=10^(-10:10))
fit <- svm(SubType~., data=trainData, scale=TRUE,
gamma=tuned$best.parameters$gamma,
cost=tuned$best.parameters$cost)
predicted <- predict(fit, testData, type="response")
# head(predicted)
rocobj <- roc(response=testData$SubType,
predictor=factor(predicted, ordered=T))
auc <- round(auc(response=testData$SubType, predictor=factor(predicted, ordered=T)), 4)
}
pl <- ggroc(rocobj, color = "red", linetype = 1, size = 1, alpha = 1, legacy.axes = T)+
geom_abline(intercept = 0, slope = 1, color="grey", size = 1, linetype=1)+
labs(x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensivity or Recall)")+
annotate("text",x = .75, y = .25,label=paste("AUC =", auc),
size = 5, family="serif")+
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))+
theme_bw()+
theme(panel.background = element_rect(fill = 'transparent'),
axis.ticks.length = unit(0.4, "lines"),
axis.ticks = element_line(color='black'),
axis.line = element_line(size=.5, colour = "black"),
axis.title = element_text(colour='black', size=12,face = "bold"),
axis.text = element_text(colour='black',size=10),
text = element_text(size=8, color="black", family="serif"))

return(pl)
}

PAT_PCT_svm <- model_fun(dataset=ExprSet,
datfit=PAT_PCT_LASSO,
model="svm",
group_name=subtype[1:2],
prob=0.7)

PAT_PCT_svm

Notes:

  1. 分成抽样过程需要注意两组size是否一致,这也是为何加rev函数的原因;

  2. 分类器构建的样本量要足够大,否则出现的ROC曲线是一条直线,这是因为specificites值都相同;

  3. 样本划分比例过高或过低都可能造成模型过拟合或泛化不足,当然样本量太小大概率出现过拟合的情况。

systemic information

1
sessionInfo()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 8 (Core)

Matrix products: default
BLAS/LAPACK: /disk/share/anaconda3/lib/libopenblasp-r0.3.10.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] e1071_1.7-6 ISLR_1.2 pROC_1.16.2
[4] sampling_2.8 glmnet_4.0-2 Matrix_1.3-2
[7] patchwork_1.0.1 enrichplot_1.10.1 clusterProfiler_3.18.0
[10] ggvenn_0.1.8 eulerr_6.1.0 ggrepel_0.9.1.9999
[13] SummarizedExperiment_1.20.0 GenomicRanges_1.42.0 GenomeInfoDb_1.26.4
[16] IRanges_2.24.1 S4Vectors_0.28.1 MatrixGenerics_1.2.1
[19] matrixStats_0.58.0 data.table_1.14.0 umap_0.2.7.0
[22] vegan_2.5-6 lattice_0.20-41 permute_0.9-5
[25] Rtsne_0.15 convert_1.64.0 marray_1.68.0
[28] limma_3.46.0 Biobase_2.50.0 BiocGenerics_0.36.0
[31] ggplot2_3.3.3 tibble_3.1.0 dplyr_1.0.5

loaded via a namespace (and not attached):
[1] utf8_1.2.1 reticulate_1.18 tidyselect_1.1.0 RSQLite_2.2.5
[5] AnnotationDbi_1.52.0 htmlwidgets_1.5.3 BiocParallel_1.24.1 lpSolve_5.6.15
[9] scatterpie_0.1.5 munsell_0.5.0 codetools_0.2-18 DT_0.18
[13] withr_2.4.1 colorspace_2.0-0 GOSemSim_2.16.1 knitr_1.31
[17] ggsignif_0.6.0 DOSE_3.16.0 labeling_0.4.2 GenomeInfoDbData_1.2.4
[21] polyclip_1.10-0 bit64_4.0.5 farver_2.1.0 downloader_0.4
[25] vctrs_0.3.7 generics_0.1.0 xfun_0.20 R6_2.5.0
[29] graphlayouts_0.7.1 locfit_1.5-9.4 bitops_1.0-6 cachem_1.0.4
[33] fgsea_1.16.0 DelayedArray_0.16.3 assertthat_0.2.1 promises_1.2.0.1
[37] scales_1.1.1 ggraph_2.0.4 gtable_0.3.0 tidygraph_1.2.0
[41] rlang_0.4.10 splines_4.0.2 rstatix_0.7.0 broom_0.7.6
[45] BiocManager_1.30.12 yaml_2.2.1 reshape2_1.4.4 abind_1.4-5
[49] crosstalk_1.1.1 backports_1.2.1 httpuv_1.5.5 qvalue_2.22.0
[53] tools_4.0.2 ellipsis_0.3.1 jquerylib_0.1.3 RColorBrewer_1.1-2
[57] proxy_0.4-25 Rcpp_1.0.6 plyr_1.8.6 zlibbioc_1.36.0
[61] purrr_0.3.4 RCurl_1.98-1.3 ggpubr_0.4.0 openssl_1.4.3
[65] viridis_0.5.1 cowplot_1.1.0 haven_2.3.1 cluster_2.1.0
[69] tinytex_0.31 magrittr_2.0.1 RSpectra_0.16-0 DO.db_2.9
[73] openxlsx_4.2.3 xtable_1.8-4 hms_1.0.0 mime_0.10
[77] evaluate_0.14 rio_0.5.16 readxl_1.3.1 gridExtra_2.3
[81] shape_1.4.5 compiler_4.0.2 crayon_1.4.1 shadowtext_0.0.7
[85] htmltools_0.5.1.1 mgcv_1.8-34 later_1.1.0.1 tidyr_1.1.3
[89] DBI_1.1.1 tweenr_1.0.1 MASS_7.3-53.1 car_3.0-10
[93] cli_2.4.0 igraph_1.2.6 forcats_0.5.0 pkgconfig_2.0.3
[97] rvcheck_0.1.8 foreign_0.8-81 foreach_1.5.1 bslib_0.2.4
[101] XVector_0.30.0 stringr_1.4.0 digest_0.6.27 rmarkdown_2.7
[105] cellranger_1.1.0 fastmatch_1.1-0 edgeR_3.32.1 curl_4.3
[109] shiny_1.6.0 lifecycle_1.0.0 nlme_3.1-150 jsonlite_1.7.2
[113] carData_3.0-4 viridisLite_0.4.0 askpass_1.1 fansi_0.4.2
[117] pillar_1.6.0 fastmap_1.1.0 survival_3.2-10 GO.db_3.12.1
[121] glue_1.4.2 zip_2.1.1 iterators_1.0.13 bit_4.0.4
[125] ggforce_0.3.2 class_7.3-18 stringi_1.4.6 sass_0.3.1
[129] blob_1.2.1 memoise_2.0.0

Reference

  1. LASSO Regression in R | An Efficient Way of Learning Regression

参考文章如引起任何侵权问题,可以与我联系,谢谢。


------------- The End Thanks for reading --------