二分类变量的logistic regression计算及ROC可视化

离散型的二分类变量(预测值只有两种,如0或1)在没有做转化是无法进行回归分析(一般回归分析适用于连续型变量),我们可以借助广义线性回归模型logistic回归对二分类变量进行转化(对预测值使用了逻辑函数,也即是预测值的范围是0到1),最终实现回归分析。

logistic regression基础

  • logistic 函数:

  • logistic方程和logistic曲线:逻辑曲线在z=0时,十分敏感,在z>>0或z<<0处,都不敏感,将预测值限定为(0,1)

加载数据

1
2
3
4
# install.packages("ISLR") # 使用该包的数据

data <- ISLR::Default
head(data)
1
2
3
4
5
6
7
 default student   balance    income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559

训练模型

  • 构建训练集和测试集

  • 训练模型

1
2
3
4
5
6
7
set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7,0.3))
train <- data[sample, ]
test <- data[-sample, ]

fit <- glm(default ~ student+balance+income, family="binomial", data=train)
summary(fit)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Call:
glm(formula = default ~ student + balance + income, family = "binomial",
data = train)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.5586 -0.1353 -0.0519 -0.0177 3.7973

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.148e+01 6.234e-01 -18.412 <2e-16 ***
studentYes -4.933e-01 2.857e-01 -1.726 0.0843 .
balance 5.988e-03 2.938e-04 20.384 <2e-16 ***
income 7.857e-06 9.965e-06 0.788 0.4304
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2021.1 on 6963 degrees of freedom
Residual deviance: 1065.4 on 6960 degrees of freedom
AIC: 1073.4

Number of Fisher Scoring iterations: 8

ROC曲线

  • 获取预测值: type的值可以是“link”, “response”, “terms” (需要了解下有何区别)

    • link:
    • response:因变量分类变量预测为”Yes|No”的概率值
    • terms:自变量预测的概率值(待定)
1
2
predicted <- predict(fit, test, type="response")
head(predicted)
1
2
           4            6            7           15           17           18 
3.259679e-04 1.648925e-03 1.762607e-03 9.694038e-03 1.536866e-05 1.709650e-04

Notes:每个样本对应的预测值是数字而不是我以为的 default的 ”Yes|No“
Notes:测试样本的default预测为”Yes|No“的概率值(待定)

  • 获取ROC曲线:通过pROC的roc和auc函数分别获取roc对象和auc值
    1
    2
    3
    4
    library(ggplot2)
    library(pROC)
    rocobj <- roc(test$default, predicted)
    auc <- round(auc(test$default, predicted),4)
  • 可视化结果

    • legacy.axes 控制specificity(x 轴)是否升序排列
  • geom_abline 添加对角线

  • annotate 添加AUC结果

  • coord_cartesian 设置坐标轴范围

  • family=”serif” 设置新罗马字体

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
ggroc(rocobj, color = "red", linetype = 1, size = 1, alpha = 1, legacy.axes = T)+
geom_abline(intercept = 0, slope = 1, color="grey", size = 1, linetype=1)+
labs(x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensivity or Recall)")+
annotate("text",x = .75, y = .25,label=paste("AUC =", auc),
size = 5, family="serif")+
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))+
theme_bw()+
theme(panel.background = element_rect(fill = 'transparent'),
axis.ticks.length = unit(0.4, "lines"),
axis.ticks = element_line(color='black'),
axis.line = element_line(size=.5, colour = "black"),
axis.title = element_text(colour='black', size=12,face = "bold"),
axis.text = element_text(colour='black',size=10,face = "bold"),
text = element_text(size=8, color="black", family="serif"))

R Information

1
sessionInfo()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ISLR_1.2 forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
[7] tidyverse_1.3.0 xgboost_1.3.1.1 mlbench_2.1-1 survminer_0.4.8 ggpubr_0.4.0 survcomp_1.40.0
[13] prodlim_2019.11.13 survival_3.2-7 caretEnsemble_2.0.1 pROC_1.16.2 caret_6.0-86 ggplot2_3.3.3
[19] lattice_0.20-41 data.table_1.13.6 tibble_3.0.4 dplyr_1.0.2

loaded via a namespace (and not attached):
[1] readxl_1.3.1 backports_1.2.0 plyr_1.8.6 splines_4.0.3 digest_0.6.27
[6] SuppDists_1.1-9.5 foreach_1.5.1 htmltools_0.5.0 fansi_0.4.1 magrittr_1.5
[11] openxlsx_4.2.3 recipes_0.1.15 modelr_0.1.8 gower_0.2.2 colorspace_2.0-0
[16] rvest_0.3.6 haven_2.3.1 xfun_0.19 crayon_1.3.4 jsonlite_1.7.1
[21] libcoin_1.0-7 zoo_1.8-8 iterators_1.0.13 glue_1.4.2 gtable_0.3.0
[26] ipred_0.9-9 questionr_0.7.3 car_3.0-10 kernlab_0.9-29 abind_1.4-5
[31] scales_1.1.1 mvtnorm_1.1-1 DBI_1.1.0 rstatix_0.6.0 miniUI_0.1.1.1
[36] Rcpp_1.0.5 xtable_1.8-4 Cubist_0.2.3 foreign_0.8-80 km.ci_0.5-2
[41] Formula_1.2-4 stats4_4.0.3 lava_1.6.8.1 httr_1.4.2 ellipsis_0.3.1
[46] pkgconfig_2.0.3 farver_2.0.3 nnet_7.3-14 dbplyr_2.0.0 utf8_1.1.4
[51] tidyselect_1.1.0 labeling_0.4.2 rlang_0.4.8 reshape2_1.4.4 later_1.1.0.1
[56] munsell_0.5.0 cellranger_1.1.0 tools_4.0.3 cli_2.1.0 generics_0.1.0
[61] broom_0.7.3 evaluate_0.14 fastmap_1.0.1 yaml_2.2.1 bootstrap_2019.6
[66] ModelMetrics_1.2.2.2 knitr_1.30 fs_1.5.0 zip_2.1.1 survMisc_0.5.5
[71] caTools_1.18.0 randomForest_4.6-14 pbapply_1.4-3 nlme_3.1-150 mime_0.9
[76] xml2_1.3.2 compiler_4.0.3 rstudioapi_0.12 curl_4.3 e1071_1.7-4
[81] ggsignif_0.6.0 reprex_0.3.0 klaR_0.6-15 stringi_1.5.3 highr_0.8
[86] Matrix_1.2-18 gbm_2.1.8 ggsci_2.9 survivalROC_1.0.3 KMsurv_0.1-5
[91] vctrs_0.3.4 pillar_1.4.6 lifecycle_0.2.0 combinat_0.0-8 cowplot_1.1.1
[96] bitops_1.0-6 httpuv_1.5.4 R6_2.5.0 promises_1.1.1 KernSmooth_2.23-18
[101] gridExtra_2.3 C50_0.1.3.1 rio_0.5.16 codetools_0.2-18 MASS_7.3-53
[106] assertthat_0.2.1 withr_2.3.0 parallel_4.0.3 hms_0.5.3 grid_4.0.3
[111] rpart_4.1-15 labelled_2.7.0 timeDate_3043.102 class_7.3-17 rmarkdown_2.5
[116] inum_1.0-1 carData_3.0-4 partykit_1.2-11 shiny_1.5.0 lubridate_1.7.9
[121] rmeta_3.0

参考

  1. logistic回归介绍

  2. logistic回归模型基础

  3. How to Plot a ROC curve using ggplot2

参考文章如引起任何侵权问题,可以与我联系,谢谢。


------------- The End Thanks for reading --------