How to choose the optimal features in Random Forest using R

Random Forest is an ensemble machine learning algorithm that combines multiple decision trees. It can be used for classification or regression, and it handles both linear and nonlinear data elegantly.

Random Forest reduces the correlation among trees by choosing only a subsample of the feature space at each split. Essentially, it de-correlates the trees and prunes them by setting a stopping criterion for node splits, which I will cover in more detail later.
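In randomForest, that per-split subsample size is the mtry argument and the stopping criterion is nodesize. A minimal sketch of these two knobs, assuming a hypothetical data frame df with a factor response Group:

# mtry: number of features sampled at each split (default sqrt(p) for classification)
# nodesize: minimum node size; nodes this small are not split further
fit <- randomForest(Group ~ ., data = df,
                    mtry = floor(sqrt(ncol(df) - 1)),
                    nodesize = 1,
                    ntree = 500)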

However, choosing the optimal features with which to rebuild the model is a key step. Here, we give a detailed step-by-step tutorial.

Importing packages

R packages used in this tutorial.

knitr::opts_chunk$set(message = FALSE, warning = FALSE)
library(dplyr)
library(tibble)
library(randomForest)
library(ggplot2)
library(data.table)
library(caret)
library(pROC)

# rm(list = ls())
options(stringsAsFactors = F)
options(future.globals.maxSize = 1000 * 1024^2)

Importing data

clean_data.csv comes from the Breast-cancer-risk-prediction repository. Note that https://github.com/Jean-njoroge/Breast-cancer-risk-prediction/blob/master/data/clean-data.csv is the HTML page, not the file itself; download the raw file instead, e.g. wget https://raw.githubusercontent.com/Jean-njoroge/Breast-cancer-risk-prediction/master/data/clean-data.csv, and rename it to clean_data.csv to match the code below.
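If you prefer to stay in R, a minimal sketch using the same (assumed) raw-file URL:

# download the raw CSV from GitHub and save it under the name the code expects
url <- "https://raw.githubusercontent.com/Jean-njoroge/Breast-cancer-risk-prediction/master/data/clean-data.csv"
download.file(url, destfile = "clean_data.csv", mode = "wb")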

clean_data.csv contains 569 samples of malignant and benign tumor cells. The Breast Cancer dataset is available from the machine learning repository maintained by the University of California, Irvine.

  1. The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M = malignant, B = benign), respectively.
  2. Columns 3-32 contain 30 real-valued features computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.

dataset <- data.table::fread("clean_data.csv")

head(dataset[, 1:6])
   V1 diagnosis radius_mean texture_mean perimeter_mean
1:  0         M       17.99        10.38         122.80
2:  1         M       20.57        17.77         132.90
3:  2         M       19.69        21.25         130.00
4:  3         M       11.42        20.38          77.58
5:  4         M       20.29        14.34         135.10
6:  5         M       12.45        15.70          82.57
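Before modeling, a quick sanity check on the class balance (a sketch; counts omitted here):

# B = benign, M = malignant
table(dataset$diagnosis)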

Data Partition

We split the dataset into a training set and a test set: the former is used to build the random forest model and the latter to evaluate its performance.

We use caret::createDataPartition with p = 0.7 to create the trainData and testData sets.

mdat <- dataset %>%
  dplyr::select(-V1) %>%
  dplyr::rename(Group = diagnosis) %>%
  dplyr::mutate(Group = factor(Group)) %>%
  data.frame()
colnames(mdat) <- make.names(colnames(mdat))  # ensure syntactically valid column names


set.seed(123)
trainIndex <- caret::createDataPartition(
          mdat$Group, 
          p = 0.7, 
          list = FALSE, 
          times = 1)

trainData <- mdat[trainIndex, ]
X_train <- trainData[, -1]
y_train <- trainData[, 1]

testData <- mdat[-trainIndex, ]
X_test <- testData[, -1]
y_test <- testData[, 1]
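Because createDataPartition stratifies on Group, the class proportions should be preserved in both subsets; a quick check (a sketch, output not shown):

# class proportions should be roughly equal across the full, train, and test sets
round(rbind(full  = prop.table(table(mdat$Group)),
            train = prop.table(table(y_train)),
            test  = prop.table(table(y_test))), 3)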

Building model

Fitting a model with default parameters using randomForest on the trainData set.

set.seed(123)
rf_fit <- randomForest(Group ~ ., data = trainData, importance = TRUE, proximity = TRUE)
rf_fit
Call:
 randomForest(formula = Group ~ ., data = trainData, importance = TRUE,      proximity = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 5

        OOB estimate of  error rate: 4.26%
Confusion matrix:
    B   M class.error
B 241   9  0.03600000
M   8 141  0.05369128

The OOB estimate of the error rate is 4.26%, indicating that the accuracy of this RF model is 95.74%.
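The OOB error is also stored in the fitted object, one row per tree; a minimal sketch for pulling the final value programmatically:

# err.rate is a matrix with one row per tree and columns OOB, B, M;
# the last row is the error after all 500 trees
oob_final <- rf_fit$err.rate[nrow(rf_fit$err.rate), "OOB"]
round(1 - oob_final, 4)  # accuracy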

Biomarkers ordered by MeanDecreaseAccuracy

Ordering the features’ importance by MeanDecreaseAccuracy.

imp_biomarker <- tibble::as_tibble(round(importance(rf_fit), 2), rownames = "Features") %>% 
  dplyr::arrange(desc(MeanDecreaseAccuracy))

head(imp_biomarker)
# A tibble: 6 × 5
  Features                 B     M MeanDecreaseAccuracy MeanDecreaseGini
  <chr>                <dbl> <dbl>                <dbl>            <dbl>
1 concave.points_worst 14.6  12.5                  18.0            24.0 
2 perimeter_worst      13.7  11.6                  16.2            26.8 
3 radius_worst         13.1  11.4                  16.1            21.2 
4 area_worst           11.9  10.5                  14.5            20.5 
5 concavity_worst       8.94  9.74                 13.3             6.46
6 texture_worst         8.27 10.6                  12.9             3.27

MeanDecreaseAccuracy measures each feature's contribution: it reflects how much the out-of-bag accuracy drops when that feature's values are permuted. For example, concave.points_worst has the largest value (18.0), so scrambling it hurts the model most. The higher a feature's MeanDecreaseAccuracy, the more it contributes to the model.
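randomForest also ships a built-in plot for this; a one-liner sketch:

# dotchart of the top features by both importance measures
varImpPlot(rf_fit, n.var = 10, main = "Variable importance")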

Cross validation

We run five replicates of 5-fold cross-validation (randomForest::rfcv) to obtain the mean error across replicates, which guards against picking a feature count that over- or under-fits by chance. The optimal number of features is then the one with the minimum mean error.

error.cv <- c()
for (i in 1:5){
  print(i)
  set.seed(i)
  fit <- rfcv(trainx = X_train, 
              trainy = y_train, 
              cv.fold = 5, 
              scale = "log", 
              step = 0.9)
  error.cv <- cbind(error.cv, fit$error.cv)
}

n.var <- as.numeric(rownames(error.cv))
colnames(error.cv) <- paste('error', 1:5, sep = '.')
err.mean <- apply(error.cv, 1, mean)
err.df <- data.frame(num = n.var, 
                     err.mean = err.mean,
                     error.cv) 
head(err.df[, 1:6])
   num   err.mean    error.1    error.2    error.3    error.4
30  30 0.04110276 0.03759398 0.04260652 0.04260652 0.04260652
27  27 0.04160401 0.04010025 0.04010025 0.04761905 0.04010025
24  24 0.04260652 0.04260652 0.04260652 0.04260652 0.04761905
22  22 0.04360902 0.04260652 0.04260652 0.04511278 0.04511278
20  20 0.04210526 0.04260652 0.04511278 0.04260652 0.04260652
18  18 0.04060150 0.04260652 0.04260652 0.04010025 0.04260652
  • optimal number of biomarkers chosen by min cv.error
optimal <- err.df$num[which.min(err.df$err.mean)]  # first row if several feature counts tie
main_theme <- 
  theme(
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.line.x = element_line(linewidth = 0.5, color = "black"),
    axis.line.y = element_line(linewidth = 0.5, color = "black"),
    axis.ticks = element_line(color = "black"),
    axis.text = element_text(color = "black", size = 12),
    legend.position = "right",
    legend.background = element_blank(),
    legend.key = element_blank(),
    legend.text = element_text(size = 12),
    text = element_text(family = "sans", size = 12))

pl <- 
  ggplot(data = err.df, aes(x = num)) + 
    geom_line(aes(y = error.1), color = 'grey', linewidth = 0.5) + 
    geom_line(aes(y = error.2), color = 'grey', linewidth = 0.5) +
    geom_line(aes(y = error.3), color = 'grey', linewidth = 0.5) +
    geom_line(aes(y = error.4), color = 'grey', linewidth = 0.5) +
    geom_line(aes(y = error.5), color = 'grey', linewidth = 0.5) +
    geom_line(aes(y = err.mean), color = 'black', linewidth = 0.5) + 
    geom_vline(xintercept = optimal, color = 'red', linewidth = 0.36, linetype = 2) + 
    coord_trans(x = "log2") +
    scale_x_continuous(breaks = c(1, 5, 10, 20, 30)) +
    labs(x = 'Number of features', y = 'Cross-validation error rate') + 
    annotate("text", 
             x = optimal, 
             y = max(err.df$err.mean), 
             label = paste("Optimal = ", optimal, sep = ""),
             color = "red") +
    main_theme
pl

The figure shows that the mean cross-validation error is lowest when the number of features is 18, so the optimal number of biomarkers is 18.

  • importance of optimal biomarkers

Displaying the MeanDecreaseAccuracy of the optimal biomarkers:

imp_biomarker[1:optimal, ] %>%
  dplyr::select(Features, MeanDecreaseAccuracy) %>%
  dplyr::arrange(MeanDecreaseAccuracy) %>%
  dplyr::mutate(Features = forcats::fct_inorder(Features)) %>%
  ggplot(aes(x = Features, y = MeanDecreaseAccuracy))+
    geom_bar(stat = "identity", fill = "white", color = "blue") +
    labs(x = "", y = "Mean decrease accuracy") +
    coord_flip() +
    main_theme

Rebuilding the model with the optimal biomarkers

Here, we use the optimal candidates to rebuild the RF model. The accuracy of the new model increases by 0.5 percentage points (the OOB error rate drops from 4.26% to 3.76%).

selected_columns <- c("Group", imp_biomarker[1:optimal, ]$Features)

trainData_optimal <- trainData %>%
  dplyr::select(all_of(selected_columns))

testData_optimal <- testData %>%
  dplyr::select(all_of(selected_columns))

set.seed(123)
rf_fit_optimal <- randomForest(Group ~ ., data = trainData_optimal, importance = TRUE, proximity = TRUE)
rf_fit_optimal
Call:
 randomForest(formula = Group ~ ., data = trainData_optimal, importance = TRUE,      proximity = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 3.76%
Confusion matrix:
    B   M class.error
B 242   8  0.03200000
M   7 142  0.04697987
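To see the two models side by side, the OOB errors can be pulled straight from the fitted objects (a sketch):

# final OOB error of the full-feature model vs. the 18-feature model
data.frame(model = c("all 30 features", "optimal 18 features"),
           OOB_error = c(rf_fit$err.rate[rf_fit$ntree, "OOB"],
                         rf_fit_optimal$err.rate[rf_fit_optimal$ntree, "OOB"]))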
  • ConfusionMatrix
group_names <- c("B", "M")
pred_raw <- predict(rf_fit_optimal, newdata = testData_optimal, type = "response")
print(caret::confusionMatrix(pred_raw, testData_optimal$Group))
pred_prob <- predict(rf_fit_optimal, newdata = testData_optimal, type = "prob")  
Confusion Matrix and Statistics

          Reference
Prediction   B   M
         B 105   3
         M   2  60
                                          
               Accuracy : 0.9706          
                 95% CI : (0.9327, 0.9904)
    No Information Rate : 0.6294          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9367          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9813          
            Specificity : 0.9524          
         Pos Pred Value : 0.9722          
         Neg Pred Value : 0.9677          
             Prevalence : 0.6294          
         Detection Rate : 0.6176          
   Detection Prevalence : 0.6353          
      Balanced Accuracy : 0.9668          
                                          
       'Positive' Class : B 
  • performance of classifier
Evaluate_index <- function(
    DataTest, 
    PredProb = pred_prob, 
    label = group_names[1], 
    PredRaw = pred_raw) {
  
  # DataTest = testData
  # PredProb = pred_prob
  # label = group_names[1]
  # PredRaw = pred_raw
  
  # ROC object
  rocobj <- roc(DataTest$Group, PredProb[, 1])
  
  # confusionMatrix
  con_matrix <- table(PredRaw, DataTest$Group)
  
  # index
  TP <- con_matrix[1, 1]
  FN <- con_matrix[2, 1]
  FP <- con_matrix[1, 2]
  TN <- con_matrix[2, 2]
  
  rocbj_df <- data.frame(threshold = round(rocobj$thresholds, 3),
                         sensitivities = round(rocobj$sensitivities, 3),
                         specificities = round(rocobj$specificities, 3),
                         value = rocobj$sensitivities + 
                           rocobj$specificities)
  max_value_row <- which(max(rocbj_df$value) == rocbj_df$value)[1]
  
  threshold <- rocbj_df$threshold[max_value_row]
  sen <- round(TP / (TP + FN), 3) # caret::sensitivity(con_matrix)
  spe <- round(TN / (TN + FP), 3) # caret::specificity(con_matrix)
  acc <- round((TP + TN) / (TP + TN + FP + FN), 3) # Accuracy
  pre <- round(TP / (TP + FP), 3) # precision
  rec <- round(TP / (TP + FN), 3) # recall
  #F1S <- round(2 * TP / (TP + TN + FP + FN + TP - TN), 3)# F1-Score
  F1S <- round(2 * TP / (2 * TP + FP + FN), 3)# F1-Score
  youden <- sen + spe - 1 # youden index
  
  index_df <- data.frame(Index = c("Threshold", "Sensitivity",
                                   "Specificity", "Accuracy",
                                   "Precision", "Recall",
                                   "F1 Score", "Youden index"),
                         Value = c(threshold, sen, spe,
                                   acc, pre, rec, F1S, youden)) %>%
    stats::setNames(c("Index", label))
  
  return(index_df)
}

Evaluate_index(
    DataTest = testData, 
    PredProb = pred_prob, 
    label = group_names[1], 
    PredRaw = pred_raw)
         Index     B
1    Threshold 0.556
2  Sensitivity 0.981
3  Specificity 0.952
4     Accuracy 0.971
5    Precision 0.972
6       Recall 0.981
7     F1 Score 0.977
8 Youden index 0.933
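As a cross-check, most of these metrics can also be read directly from caret: sensitivity, specificity, precision, recall, and F1 all live in the $byClass slot of the confusion matrix computed above (a sketch):

# same metrics via caret, with B as the positive class
cm <- caret::confusionMatrix(pred_raw, testData_optimal$Group, positive = "B")
round(cm$byClass[c("Sensitivity", "Specificity", "Precision", "Recall", "F1")], 3)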
  • AUROC
AUROC <- function(
    DataTest, 
    PredProb = pred_prob, 
    label = group_names[1], 
    DataProf = optimal) {
  
  # ROC object
  rocobj <- roc(DataTest$Group, PredProb[, 1])
  
  # Youden index: cutoff point
  # plot(rocobj,
  #      legacy.axes = TRUE,
  #      of = "thresholds", 
  #      thresholds = "best", 
  #      print.thres="best")
  
  # AUROC data
  roc <- data.frame(tpr = rocobj$sensitivities,
                    fpr = 1 - rocobj$specificities)
  
  # AUC 95% CI
  rocobj_CI <- roc(DataTest$Group, PredProb[, 1], 
                   ci = TRUE, percent = TRUE)
  roc_CI <- round(as.numeric(rocobj_CI$ci)/100, 3)
  roc_CI_lab <- paste0(label, 
                       " (", "AUC=", roc_CI[2], 
                       ", 95%CI ", roc_CI[1], "-", roc_CI[3], 
                       ")")
  # ROC dataframe
  rocbj_df <- data.frame(threshold = round(rocobj$thresholds, 3),
                         sensitivities = round(rocobj$sensitivities, 3),
                         specificities = round(rocobj$specificities, 3),
                         value = rocobj$sensitivities + 
                           rocobj$specificities)
  max_value_row <- which(max(rocbj_df$value) == rocbj_df$value)[1]  # first row if ties
  threshold <- rocbj_df$threshold[max_value_row]
  
  # plot
  pl <- ggplot(data = roc, aes(x = fpr, y = tpr)) +
    geom_path(color = "red", linewidth = 1) +
    geom_abline(intercept = 0, slope = 1, 
                color = "grey", linewidth = 1, linetype = 2) +
    labs(x = "False Positive Rate (1 - Specificity)",
         y = "True Positive Rate",
         title = paste0("AUROC (", DataProf, " Features)")) +
    annotate("text", 
             x = 1 - rocbj_df$specificities[max_value_row] + 0.15, 
             y = rocbj_df$sensitivities[max_value_row] - 0.05, 
             label = paste0(threshold, " (", 
                            rocbj_df$specificities[max_value_row], ",",
                            rocbj_df$sensitivities[max_value_row], ")"),
             size=5, family="serif") +
    annotate("point", 
             x = 1 - rocbj_df$specificities[max_value_row], 
             y = rocbj_df$sensitivities[max_value_row], 
             color = "black", size = 2) +    
    annotate("text", 
             x = .75, y = .25, 
             label = roc_CI_lab,
             size = 5, family = "serif") +
    coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
    theme_bw() +
    theme(panel.background = element_rect(fill = "transparent"),
          plot.title = element_text(color = "black", size = 14, face = "bold"),
          axis.ticks.length = unit(0.4, "lines"),
          axis.ticks = element_line(color = "black"),
          axis.line = element_line(linewidth = .5, color = "black"),
          axis.title = element_text(color = "black", size = 12, face = "bold"),
          axis.text = element_text(color = "black", size = 10),
          text = element_text(size = 8, color = "black", family = "serif"))
  
  res <- list(rocobj = rocobj,
              roc_CI = roc_CI_lab,
              roc_pl = pl)
  
  return(res)
}

AUROC_res <- AUROC(
    DataTest = testData, 
    PredProb = pred_prob, 
    label = group_names[1], 
    DataProf = optimal)

AUROC_res$roc_pl
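pROC can also draw the same curve in one line, which is handy for a quick look (a sketch using the predicted probability of the positive class B):

# quick ROC plot and AUC straight from pROC
rocobj <- pROC::roc(testData_optimal$Group, pred_prob[, "B"])
pROC::ggroc(rocobj, legacy.axes = TRUE)
pROC::auc(rocobj)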

  • AUPRC
AUPRC <- function(
    DataTest, 
    PredProb = pred_prob, 
    DataProf = optimal) {
  
  # ROC object
  rocobj <- roc(DataTest$Group, PredProb[, 1])
  
  # p-r value 
  dat_PR <- coords(rocobj, "all", ret = c("precision", "recall"))
  
  # AUPRC data
  prc <- data.frame(precision = dat_PR$precision,
                    recall = dat_PR$recall)
  
  # plot
  pl <- ggplot(data = prc, aes(x = recall, y = precision)) +
    geom_path(color = "red", linewidth = 1) +
    labs(x = "Recall",
         y = "Precision",
         title = paste0("AUPRC (", DataProf, " Features)")) +
    coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
    theme_bw() +
    theme(panel.background = element_rect(fill = "transparent"),
          plot.title = element_text(color = "black", size = 14, face = "bold"),
          axis.ticks.length = unit(0.4, "lines"),
          axis.ticks = element_line(color = "black"),
          axis.line = element_line(linewidth = .5, color = "black"),
          axis.title = element_text(color = "black", size = 12, face = "bold"),
          axis.text = element_text(color = "black", size = 10),
          text = element_text(size = 8, color = "black", family = "serif"))
  
  res <- list(dat_PR = dat_PR,
              PC_pl = pl)
  
  return(res)
}

AUPRC_res <- AUPRC(
    DataTest = testData, 
    PredProb = pred_prob, 
    DataProf = optimal)

AUPRC_res$PC_pl
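pROC does not return the area under the precision-recall curve itself, but it can be approximated from the returned coordinates with the trapezoidal rule (a sketch; extreme thresholds with undefined precision are dropped):

# order by recall, drop NaN precision at the ends, then integrate
pr <- na.omit(AUPRC_res$dat_PR[order(AUPRC_res$dat_PR$recall), ])
auprc <- sum(diff(pr$recall) *
             (head(pr$precision, -1) + tail(pr$precision, -1)) / 2)
round(auprc, 3)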

Session information

devtools::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       macOS Monterey 12.2.1
 system   x86_64, darwin17.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Asia/Shanghai
 date     2023-04-27
 rstudio  2022.07.2+576 Spotted Wakerobin (desktop)
 pandoc   2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)

─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package      * version    date (UTC) lib source
 assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.1.0)
 blogdown       1.13.3     2022-11-01 [1] Github (rstudio/blogdown@5dddefa)
 brio           1.1.3      2021-11-30 [1] CRAN (R 4.1.0)
 cachem         1.0.6      2021-08-19 [1] CRAN (R 4.1.0)
 callr          3.7.0      2021-04-20 [1] CRAN (R 4.1.0)
 caret        * 6.0-92     2022-04-19 [1] CRAN (R 4.1.2)
 class          7.3-20     2022-01-13 [1] CRAN (R 4.1.2)
 cli            3.4.1      2022-09-23 [1] CRAN (R 4.1.2)
 codetools      0.2-18     2020-11-04 [1] CRAN (R 4.1.2)
 colorspace     2.0-3      2022-02-21 [1] CRAN (R 4.1.2)
 crayon         1.5.0      2022-02-14 [1] CRAN (R 4.1.2)
 data.table   * 1.14.6     2022-11-16 [1] CRAN (R 4.1.2)
 DBI            1.1.2      2021-12-20 [1] CRAN (R 4.1.0)
 desc           1.4.1      2022-03-06 [1] CRAN (R 4.1.2)
 devtools       2.4.3      2021-11-30 [1] CRAN (R 4.1.0)
 digest         0.6.30     2022-10-18 [1] CRAN (R 4.1.2)
 dplyr        * 1.0.10     2022-09-01 [1] CRAN (R 4.1.2)
 e1071          1.7-9      2021-09-16 [1] CRAN (R 4.1.0)
 ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
 evaluate       0.17       2022-10-07 [1] CRAN (R 4.1.2)
 fansi          1.0.2      2022-01-14 [1] CRAN (R 4.1.2)
 farver         2.1.0      2021-02-28 [1] CRAN (R 4.1.0)
 fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.1.0)
 forcats        0.5.1      2021-01-27 [1] CRAN (R 4.1.0)
 foreach        1.5.2      2022-02-02 [1] CRAN (R 4.1.2)
 fs             1.5.2      2021-12-08 [1] CRAN (R 4.1.0)
 future         1.28.0     2022-09-02 [1] CRAN (R 4.1.2)
 future.apply   1.8.1      2021-08-10 [1] CRAN (R 4.1.0)
 generics       0.1.2      2022-01-31 [1] CRAN (R 4.1.2)
 ggplot2      * 3.4.0      2022-11-04 [1] CRAN (R 4.1.2)
 globals        0.16.1     2022-08-28 [1] CRAN (R 4.1.2)
 glue           1.6.2      2022-02-24 [1] CRAN (R 4.1.2)
 gower          1.0.0      2022-02-03 [1] CRAN (R 4.1.2)
 gtable         0.3.0      2019-03-25 [1] CRAN (R 4.1.0)
 hardhat        1.2.0      2022-06-30 [1] CRAN (R 4.1.2)
 htmltools      0.5.3      2022-07-18 [1] CRAN (R 4.1.2)
 ipred          0.9-12     2021-09-15 [1] CRAN (R 4.1.0)
 iterators      1.0.14     2022-02-05 [1] CRAN (R 4.1.2)
 jsonlite       1.8.3      2022-10-21 [1] CRAN (R 4.1.2)
 knitr          1.40       2022-08-24 [1] CRAN (R 4.1.2)
 labeling       0.4.2      2020-10-20 [1] CRAN (R 4.1.0)
 lattice      * 0.20-45    2021-09-22 [1] CRAN (R 4.1.2)
 lava           1.6.10     2021-09-02 [1] CRAN (R 4.1.0)
 lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.1.2)
 listenv        0.8.0      2019-12-05 [1] CRAN (R 4.1.0)
 lubridate      1.8.0      2021-10-07 [1] CRAN (R 4.1.0)
 magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.1.2)
 MASS           7.3-55     2022-01-13 [1] CRAN (R 4.1.2)
 Matrix         1.4-0      2021-12-08 [1] CRAN (R 4.1.0)
 memoise        2.0.1      2021-11-26 [1] CRAN (R 4.1.0)
 ModelMetrics   1.2.2.2    2020-03-17 [1] CRAN (R 4.1.0)
 munsell        0.5.0      2018-06-12 [1] CRAN (R 4.1.0)
 nlme           3.1-155    2022-01-13 [1] CRAN (R 4.1.2)
 nnet           7.3-17     2022-01-13 [1] CRAN (R 4.1.2)
 parallelly     1.32.1     2022-07-21 [1] CRAN (R 4.1.2)
 pillar         1.7.0      2022-02-01 [1] CRAN (R 4.1.2)
 pkgbuild       1.3.1      2021-12-20 [1] CRAN (R 4.1.0)
 pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
 pkgload        1.2.4      2021-11-30 [1] CRAN (R 4.1.0)
 plyr           1.8.6      2020-03-03 [1] CRAN (R 4.1.0)
 prettyunits    1.1.1      2020-01-24 [1] CRAN (R 4.1.0)
 pROC         * 1.18.0     2021-09-03 [1] CRAN (R 4.1.0)
 processx       3.5.2      2021-04-30 [1] CRAN (R 4.1.0)
 prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.1.0)
 proxy          0.4-26     2021-06-07 [1] CRAN (R 4.1.0)
 ps             1.6.0      2021-02-28 [1] CRAN (R 4.1.0)
 purrr          0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
 R6             2.5.1      2021-08-19 [1] CRAN (R 4.1.0)
 randomForest * 4.7-1      2022-02-03 [1] CRAN (R 4.1.2)
 Rcpp           1.0.10     2023-01-22 [1] CRAN (R 4.1.2)
 recipes        1.0.1      2022-07-07 [1] CRAN (R 4.1.2)
 remotes        2.4.2      2021-11-30 [1] CRAN (R 4.1.0)
 reshape2       1.4.4      2020-04-09 [1] CRAN (R 4.1.0)
 rlang          1.0.6      2022-09-24 [1] CRAN (R 4.1.2)
 rmarkdown      2.17       2022-10-07 [1] CRAN (R 4.1.2)
 rpart          4.1.16     2022-01-24 [1] CRAN (R 4.1.2)
 rprojroot      2.0.2      2020-11-15 [1] CRAN (R 4.1.0)
 rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.1.0)
 scales         1.2.1      2022-08-20 [1] CRAN (R 4.1.2)
 sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.1.0)
 stringi        1.7.8      2022-07-11 [1] CRAN (R 4.1.2)
 stringr        1.4.1      2022-08-20 [1] CRAN (R 4.1.2)
 survival       3.4-0      2022-08-09 [1] CRAN (R 4.1.2)
 testthat       3.1.2      2022-01-20 [1] CRAN (R 4.1.2)
 tibble       * 3.1.8      2022-07-22 [1] CRAN (R 4.1.2)
 tidyselect     1.1.2      2022-02-21 [1] CRAN (R 4.1.2)
 timeDate       3043.102   2018-02-21 [1] CRAN (R 4.1.0)
 usethis        2.1.5      2021-12-09 [1] CRAN (R 4.1.0)
 utf8           1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
 vctrs          0.5.1      2022-11-16 [1] CRAN (R 4.1.2)
 withr          2.5.0      2022-03-03 [1] CRAN (R 4.1.2)
 xfun           0.34       2022-10-18 [1] CRAN (R 4.1.2)
 yaml           2.3.6      2022-10-18 [1] CRAN (R 4.1.2)

 [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

Summary

This tutorial covered Random Forest classification with optimal features in R:

  1. Data partition.
  2. Feature importance.
  3. Cross-validation for OOB errors.
  4. Selecting the optimal features and rebuilding the model.
  5. Performance evaluation based on the optimal features.
  6. Final results with AUROC, AUPRC, etc.
