Machine Learning Based Feature Selection Of Rna Seq Data For Covid 19 Classification.

Table of Contents

Background
Machine learning in RNA Seq analysis
Objective
Data source
Workflow
Installation of dependencies
Loading libraries
Reading the count data and experiment summary
Data Pre-processing and EDA
Differential expression analysis
Annotation, Functional analysis and Visualization
Feature selection
Random forest model with cross-validation
Conclusion

Background

Deoxyribonucleic acid (DNA), often called the molecule of life is found in almost all living organisms. According to central dogma of molecular biology, the genetic information contained within DNA in the form of genes is carried by Ribonucleic acid (RNA), which finally gets translated to protein (DNA -> RNA -> Protein).

RNA Sequencing (RNA Seq), a powerful technique which uses high-throughput Next Generation Sequencing (NGS), determines the quantity of RNA present within a biological sample. It further provides information on the expression levels of genes in response to physiological or pathological condition. For example, RNA Seq analysis performed on samples obtained from Covid infected and uninfected individuals sheds information on the human genes which are either up-regulated (higher expression) or down-regulated (lower expression) in response to infection. Identifying these differentially expressed genes (DEG’s) is the key to understand the biological differences between the healthy and diseased state and furthermore provides valuable insights into pathology of the disease. Moreover, the identified DEG’s can serve as candidate biomarkers for diagnostics or as therapeutic targets for treatment.

Machine learning in RNA-Seq analysis

Differential gene expression analysis yields huge list of genes considered significant in spite of stringent statistical tests and making the search of actual genes a daunting task. There are several challenges associated with this high dimensional data in terms of computational cost and memory. The problem could be overcome in part by utilizing the dimensionality reduction techniques of machine learning algorithms. Dimensionality reduction involves:

Feature selection - works by removing the redundant or irrelevant features and selects a subset of existing features without transformation
Feature extraction – selects the most relevant features and transforms the features into low dimensional space

Feature selection algorithms are further categorized as:

Filter methods – picks features based on correlation to target variable. Independent of any algorithm and generally used as preprocessing step.

Examples: Correlation coefficient, Chi square test, Fisher score

Wrapper methods – picks features based on classifier performance. Computationally expensive due to the repeated learning and cross validation.

Examples: Recursive feature elimination (RFE), Boruta, Genetic algorithms

Embedded methods - combines the qualities of both filter and wrapper methods.

Examples: Lasso regularization, Decision trees

Objective

To handle the curse of dimensionality of RNA Seq, two most effective feature selection algorithms - Boruta and RFE have been explored. Both are wrapper algorithms built around random forest classification algorithm. Finally, a random forest model is built with the selected features to eventually aid in the classification of Covid-19. The study can serve two important purposes:

DEG’s identified through feature selection can have great implication towards clinical aspects as biomarkers
Machine learning models built on selected DEG’s would be much more efficient in classifying Covid-19 than a model built with all the identified DEG’s

Data Source

Read counts for the study have been obtained from “126 RNA Seq datasets of COVID-19” from Kaggle. The data included counts of 191887 genes for 100 Covid affected and 26 control/healthy samples.

The Original data used in this application is sourced from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE157103 by Kaggle and transformed to a usable format. We have kept a zipped file in usable format which has been used in the code below. Kindly download the data from the link by clicking on Covid RNA Sequence Data.

Workflow

Entire work has been done using “R” environment. Differential expression analysis has been done with Limma Bioconductor package using the Voom method.

Workflow for Feature selection of RNA-Seq data for Covid-19 classification

Figure 1: Workflow for Feature selection of RNA-Seq data for Covid-19 classification

Installation of dependencies

        # We have used R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
        # make sure to install Rtools for Building R Packages, not essential but good to have
        # Install the Rtools from here https://cran.r-project.org/bin/windows/Rtools/
        writeLines('PATH="${RTOOLS40_HOME}\\\\usr\\\\bin;${PATH}"', con = "~/.Renviron")
        # restart RStudio
        # run this command to check 
        Sys.which("make")
        # you should see output like this , "C:\\rtools40\\usr\\bin\\make.exe"
        # you can test a installation by running this installation 
        # install.packages("jsonlite", type = "source") , although not essential
        install.packages("jsonlite", type = "source")
        install.packages("BiocManager")
        BiocManager::install("edgeR")
        BiocManager::install("Glimma")
        BiocManager::install("biomaRt")
        install.packages("gplots")
        install.packages("gdata")
        install.packages("caret")
        install.packages("Boruta")
        BiocManager::install("VennDetail")

Loading the required libraries

        library(edgeR)
        library(Glimma)
        library(limma)
        library(biomaRt)
        library(ggplot2)
        library(RColorBrewer)
        library(gplots)
        library(gdata)
        library(caret)
        library(Boruta)
        library(dplyr)
        library(VennDetail)
        library(data.table)
        library(DESeq2)

Reading the count data

    counts <- read.csv("COVID_RNA_Seq.csv")
    rownames(counts) <- counts$X
    counts$X <- NULL
    dim(counts)
191887    126

    head(counts)
                covid001 covid002 covid003 covid004  ...     control023 control024 control025 control026
ENST00000000233  725.157  481.218  722.342 610.7530  ...      3045.0000  1140.6400 2872.11000   1824.930
ENST00000000412    0.000    0.000    0.000   0.0000  ...         0.0000     0.0000    0.00000      0.000
ENST00000000442  164.537  127.731  142.824  76.3582  ...        12.5659    12.4255    3.65413    211.593
ENST00000001008  230.460  135.377  150.590 334.1970  ...       275.6140   429.5400  685.06200    482.911
ENST00000001146    0.000    0.000    0.000   0.0000  ...         0.0000     0.0000    0.00000      0.000
ENST00000002125    0.000    0.000    0.000   0.0000  ...        22.3258     0.0000    0.00000      0.000

Table 1: Input data with samples as columns and genes as rows

Reading experiment summary with sample information

    info <- read.csv("COVID_RNA_info.csv")
    group <- factor(info$Status)

Data Preprocessing

    # DGEList object is used by edgeR package to store count data. 
    counts <- DGEList(counts)

    # To check the genes which have zero counts across 126 samples
    table(rowSums(counts$counts==0)==126)

 FALSE   TRUE 
167220  24667

Note that, Around 13% genes have zero counts.

    # Filtering zero counts and other low expressed genes
    keep <- filterByExpr(counts, group=group)
    counts <- counts[keep, keep.lib.sizes=FALSE]
    dim(counts)
64401   126

We note that the counts have reduced from 191887 to 64401.

Exploratory Data Analysis

Plotting the library sizes

    barplot(counts$samples$lib.size, names=colnames(counts), las=2, ann=FALSE, cex.names=0.75)
    mtext(side = 1, text = "Samples", line = 4)
    mtext(side = 2, text = "Library size (millions)", line = 3)
    title("Barplot of library sizes")

Bar plot of library sizes of samples

Figure 2 : Bar plot of library sizes of samples

Plotting the log distribution of counts

    logcounts <- cpm(counts,log = TRUE)
    # Log distribution of first twenty samples
    boxplot(logcounts[,1:20],xlab="",ylab="Log normalized cpm",las=2)
    abline(h=median(logcounts),col='red')
    title("Boxplot of Log normalised cpm")

Bar plot of library sizes of samples

Figure 3: Box plot of Log normalised cpm of first twenty samples

Multidimensional scale plotting

    # PCA analysis to determine the source of variation in data
    labels <- paste(info$Sample.name, info$Status)
    glMDSPlot(counts, labels=labels, groups=group, folder="mds")

Figure 4: MDS plot of dimensions 1 & 2 and Variance of different dimensions

Normalization

    # Generate normalisation factors to be used downstream
    countdata <- calcNormFactors(counts)

Differential expression analysis

    # Creating Design matrix without intercept- indicates the comparisons to be made 
    design <- model.matrix( ~0+group)

    # Voom transforms the data based on normalization factors calculated in previous step
    v <- voom(countdata, design=design, plot=TRUE)

Plot of Voom transformed data based on normalization factors

Figure 5: Plot of Voom transformed data based on normalization factors

    # Plotting the voom normalized counts of first twenty samples
    boxplot(v$E,xlab="",ylab="Voom normalized cpm",las=2)
    abline(h=median(v$E),col='red')
    title("Boxplot of Voom normalised cpm")

Box plot of Voom normalised cpm of first twenty samples

Figure 6: Box plot of Voom normalised cpm of first twenty samples

    # Fitting the linear model and making contrasts
    fit <- lmFit(v, design)
    cont.matrix <- makeContrasts(infection=groupcovid-groupcontrol,levels=design)
    fit.cont <- contrasts.fit(fit, cont.matrix)

    # Calling the eBayes function to perform Bayes shrinkage on the variances
    fit.cont <- eBayes(fit.cont)

    # Summary of identified DEG’s
    diff <- decideTests(fit.cont,adjust.method = 'BH',method = 'global',p.value = 0.01,lfc = 2)
    summary(diff)

       infection
Down         607
NotSig     62733
Up          1061

Note, we can observe the Number of Significant up and down-regulated genes in the table above.

    # List of DEG’s whose p-value is < 0.01 and log fold change is > 2
    sig <- topTable(fit.cont,adjust.method = "BH",p.value = 0.01,lfc = 2,number = 10000)
    sig <- unique(sig)
    sig <- setDT(sig,keep.rownames = 'ensembl_transcript_id')

After removing duplicates, the number of DEG’s reduced to 1613.

Annotation of DEG’s

mart <- useMart(biomart = "ensembl",dataset = "hsapiens_gene_ensembl")
results <- getBM(attributes = c("ensembl_gene_id","ensembl_transcript_id","hgnc_symbol",'description'),filters = "ensembl_transcript_id",values = sig$ensembl_transcript_id,mart=mart)
final_annot <- merge(sig,results,all=TRUE,sort=FALSE)

Final annotation included 1613 DEG’s of 126 samples.

Getting Up and Down-regulated DEG’s

    up <- final_annot[final_annot$logFC > 2,]
    down <- final_annot[final_annot$logFC < 2,]
    up <- up[order(up$logFC),]
    print(up)

      ensembl_transcript_id    logFC    AveExpr        t      P.Value    adj.P.Val           B ensembl_gene_id hgnc_symbol                                                                       description
   1:       ENST00000354906 2.000701 -2.8160218 3.253365 1.460221e-03 9.951213e-03 -1.16577080 ENSG00000154305        MIA3             MIA SH3 domain ER export factor 3 [Source:HGNC Symbol;Acc:HGNC:24008]
   2:       ENST00000518382 2.001017 -0.2068060 7.841295 1.550888e-12 2.171278e-09 17.15288614 ENSG00000104738        MCM4 minichromosome maintenance complex component 4 [Source:HGNC Symbol;Acc:HGNC:6947]
   3:       ENST00000507960 2.002609 -0.8125808 3.537745 5.637281e-04 5.068281e-03 -0.34531675 ENSG00000002549        LAP3                      leucine aminopeptidase 3 [Source:HGNC Symbol;Acc:HGNC:18449]
   4:       ENST00000517947 2.002836 -0.1276449 3.636493 3.998144e-04 3.929530e-03 -0.04873638 ENSG00000145907       G3BP1         G3BP stress granule assembly factor 1 [Source:HGNC Symbol;Acc:HGNC:30292]
   5:       ENST00000485604 2.003443 -1.2675928 3.433180 8.052045e-04 6.541337e-03 -0.65184008 ENSG00000025293       PHF20                         PHD finger protein 20 [Source:HGNC Symbol;Acc:HGNC:16098]
  ---                                                                                                                                                                                                       
1013:       ENST00000391614 5.216978  3.7936011 5.442061 2.599326e-07 1.364297e-05  6.60287022 ENSG00000132465      JCHAIN        joining chain of multimeric IgA and IgM [Source:HGNC Symbol;Acc:HGNC:5713]
1014:       ENST00000633299 5.367055  0.7144479 3.779857 2.399643e-04 2.687175e-03  0.36634975 ENSG00000282745     IGHV6-1              immunoglobulin heavy variable 6-1 [Source:HGNC Symbol;Acc:HGNC:5662]
1015:       ENST00000526285 5.980966  0.1975059 4.045652 8.986362e-05 1.278680e-03  1.17194772 ENSG00000173020        GRK2             G protein-coupled receptor kinase 2 [Source:HGNC Symbol;Acc:HGNC:289]
1016:       ENST00000612474 6.532706  0.7547895 3.975473 1.169861e-04 1.553727e-03  0.95335256 ENSG00000276491    IGHV4-34             immunoglobulin heavy variable 4-34 [Source:HGNC Symbol;Acc:HGNC:5650]
1017:       ENST00000510614 6.676199  1.8495323 4.201154 4.954081e-05 8.118264e-04  1.71871372 ENSG00000132465      JCHAIN        joining chain of multimeric IgA and IgM [Source:HGNC Symbol;Acc:HGNC:5713]

    down <- down[order(down$logFC),]
    print(down)

     ensembl_transcript_id     logFC    AveExpr         t      P.Value    adj.P.Val         B ensembl_gene_id hgnc_symbol                                                                                     description
  1:       ENST00000587489 -5.700515  1.1134213 -8.823871 7.221547e-15 3.929708e-11 23.290907 ENSG00000184922       FMNL1                                                formin like 1 [Source:HGNC Symbol;Acc:HGNC:1212]
  2:       ENST00000502702 -5.494837 -0.6258168 -7.880262 1.257789e-12 1.975680e-09 18.206728 ENSG00000071127        WDR1                                          WD repeat domain 1 [Source:HGNC Symbol;Acc:HGNC:12754]
  3:       ENST00000511747 -5.445175 -0.8218833 -9.040463 2.166347e-15 1.550165e-11 24.201293 ENSG00000087266      SH3BP2                                SH3 domain binding protein 2 [Source:HGNC Symbol;Acc:HGNC:10825]
  4:       ENST00000354161 -5.255502  1.2435909 -9.186280 9.600430e-16 8.832533e-12 25.241268 ENSG00000146828     SLC12A9                           solute carrier family 12 member 9 [Source:HGNC Symbol;Acc:HGNC:17435]
  5:       ENST00000356909 -4.955808  2.1155965 -7.430572 1.381057e-11 1.046338e-08 16.009975 ENSG00000138964       PARVG                                                parvin gamma [Source:HGNC Symbol;Acc:HGNC:14654]
 ---                                                                                                                                                                                                                     
592:       ENST00000378543 -2.003592 -2.6714107 -4.394128 2.317643e-05 4.474176e-04  2.448801 ENSG00000162585      FAAP20                       FA core complex associated protein 20 [Source:HGNC Symbol;Acc:HGNC:26428]
593:       ENST00000533334 -2.003192  0.3647045 -5.782388 5.372443e-08 4.037231e-06  8.109110 ENSG00000105397        TYK2                                           tyrosine kinase 2 [Source:HGNC Symbol;Acc:HGNC:12440]
594:       ENST00000519276 -2.002662  2.7983683 -3.314567 1.195481e-03 8.632151e-03 -1.419526 ENSG00000164332      UBLCP1          ubiquitin like domain containing CTD phosphatase 1 [Source:HGNC Symbol;Acc:HGNC:28110]
595:       ENST00000533784 -2.000853 -0.9185426 -3.312018 1.205544e-03 8.674701e-03 -1.079109 ENSG00000158555       GDPD5 glycerophosphodiester phosphodiesterase domain containing 5 [Source:HGNC Symbol;Acc:HGNC:28804]
596:       ENST00000529867 -2.000117 -3.1892896 -4.244543 4.184459e-05 7.167110e-04  1.900040 ENSG00000167792      NDUFV1               NADH:ubiquinone oxidoreductase core subunit V1 [Source:HGNC Symbol;Acc:HGNC:7716]

Up-regulated DEG’s were 1017 and down-regulated were 596 in number.

Functional analysis of DEG’s (Gene ontology)

Gene ontology analysis of the identified DEG’s was done using David functional analysis tools. Most of the DEG’s have been implicated in the host immunity and immune response pathways.

Functional GO analysis of identified DEG’s

Figure 7: Functional GO analysis of identified DEG’s

Visualization of DEG’s

Heat-map

    diffgenes <- v$E[diff[,1]!=0,]
    mycol <- colorRampPalette(colors=c("green","white","blue"))(100)
    hr <- hclust(as.dist(1-cor(t(diffgenes), method="pearson")), method="complete")
    mycl <- cutree(hr, k=2)
    mycolhc <- rainbow(length(unique(mycl)), start=0.1, end=0.9) 
    mycolhc <- mycolhc[as.vector(mycl)] 
    heatmap.2(diffgenes, Rowv=as.dendrogram(hr), Colv=NA, 
              col=mycol, scale="row", labRow=NA,
              density.info="none", trace="none", RowSideColors=mycolhc, 
              cexRow=1, cexCol=1, margins=c(8,22))

Heat-map of DEG’s between Covid and control samples

Figure 8 : Heat-map of DEG’s between Covid and control samples

Volcano plot and MDS plot

    sample.cols <- c("limegreen", "purple")[group]
    glXYPlot(x=fit.cont$coef, y=fit.cont$lod, xlab="logFC", ylab="logodds",status=diff,groups=group,sample.cols = sample.cols)
    glMDPlot(fit.cont,status = diff[,"infection"], values=c(-1,1),cols = c("yellow","blue","magenta"))

Figure 9 : Volcano plot and MDS plot depicting DEG’s

    # Getting counts of selected DEG's
    counts <- read.csv("COVID_RNA_Seq.csv")
    colnames(counts)[1] <- 'ensembl_transcript_id'
    diff_counts2 <- merge(final_annot,counts,all=TRUE,sort=FALSE)
    diff_counts2 <- slice(diff_counts2,(1:1613))
    col <-c('ensembl_transcript_id','logFC','AveExpr','t','P.Value','adj.P.Val','B','ensembl_gene_id','description')
    diff_counts2 <- diff_counts2 %>% select(-one_of(col))

    # Transpose such that DEG's (hgnc_symbol) are columns and samples as rows
    col_name <- diff_counts2$hgnc_symbol
    diff_counts_t <- as.data.frame(t(diff_counts2[,-1]))
    colnames(diff_counts_t) <- col_name
    table(duplicated(colnames(diff_counts_t)))
    diff_counts_t <- diff_counts_t[,!duplicated(colnames(diff_counts_t))]
    diff_counts_t$Condition <- info$Status
    write.csv(diff_counts_t,'data.csv')

On removing duplicates, the final count of DEG’s was 1291.

Feature selection of DEG’s

Now that the DEG’s were identified, feature selection was applied to reduce dimensionality.

1. Boruta algorithm was run on the above identified DEG’s (1291) with maximum runs of 500.

    data <- read.csv("data.csv")
    rownames(data) <- data$X
    data$X <- NULL
    bor <- Boruta(factor(Condition)~., data = data, maxRuns=500,doTrace=0)
    boruta_df <- attStats(bor)
    boruta_genes <- boruta_df[boruta_df$decision=='Confirmed',]
    par(mfrow=c(1,2))
    plot(bor$finalDecision, col=c('red','green','yellow'))
    plotImpHistory(bor)

94 attributes were confirmed important by the Boruta algorithm

Plot of Final decision and ImpHistory of Boruta algorithm

Figure 10 : Plot of Final decision and ImpHistory of Boruta algorithm

2. Recursive feature elimination was run on the identified DEG’s using the function of random forest along with ten folds of repeated k-fold cross validation (using caret package).

    x <- data[,1:1291]
    y <- data[,1292]
    y <- factor(y)
    control <- rfeControl(functions = rfFuncs,method = 'repeatedcv',repeats = 10, number = 5)
    res <- rfe(x,y,sizes=c(1:5,10,15,20),rfeControl=control)
    rfe_genes <- data.frame(feature = row.names(varImp(res))[1:100],
                            importance = varImp(res)[1:100, 1])

    # Plotting the importance of first ten DEG's
    varimp_data2 <- data.frame(feature = row.names(varImp(res))[1:10],
                               importance = varImp(res)[1:10, 1])
    ggplot(data = varimp_data2, 
           aes(x = reorder(feature, -importance), y = importance, fill = feature)) +
           geom_bar(stat="identity") + labs(x = "Features", y = "Variable Importance") + 
           geom_text(aes(label = round(importance, 2)), vjust=1.6, color="white", size=4) + 
           theme_bw() + theme(legend.position = "none")

Top 100 attributes confirmed important by the RFE algorithm were chosen

Plot of Top 10 Feature importance of RFE algorithm

Figure 10 : Plot of Final decision and ImpHistory of Boruta algorithm

DEG’s commonly selected by Boruta and RFE

To find out the number of DEG’s that are common to both algorithms, Venn analysis was done.

    boruta_genes$genes <- rownames(boruta_genes)
    ven <- venndetail(list(BORUTA=boruta_genes$genes,RFE=rfe_genes$feature))
    plot(ven)

Plot of Top 10 Feature importance of RFE algorithm

Figure 11 : Venn diagram of genes common to Boruta and RFE

78 genes were commonly selected by both the algorithms. Next, we would build a random forest model with the selected features/genes for the classification of Covid-19.

Random forest model with ten-fold cross validation

A random forest model with ten-fold cross-validation was built using the caret package. Four models were built so as to assess the importance of feature selection.

a. Model 1 with all identified DEG’s (no feature selection was applied)

    # Data includes 1291 DEG's of 126 samples
    data$Condition <- factor(data$Condition)
    trControl <- trainControl(method = "cv",number = 10,search = "grid")
    rf_crossval <- train(Condition~.,data = data,method = "rf",metric = "Accuracy",trControl = trControl)

b. Model 2 with genes selected by Boruta algorithm

    # 94 DEG's selected by Boruta algorithm
    # Sub setting Boruta selected genes from data
    bor <- copy(rownames(boruta_genes))
    bor_rf <- data[,c(bor)]
    bor_rf$Condition <- factor(info$Status)
    trControl <- trainControl(method = "cv",number = 10,search = "grid")
    rf_crossval_boruta <- train(Condition~.,data = bor_rf,method = "rf",metric = "Accuracy",trControl = trControl)

c. Model 3 with genes selected by RFE algorithm

    # 100 DEG's selected by RFE algorithm
    # Sub setting RFE selected genes from data
    rfe <- copy(rfe_genes$feature)
    rfe_rf <- data[,c(rfe)]
    rfe_rf$Condition <- factor(info$Status)
    trControl <- trainControl(method = "cv",number = 10,search = "grid")
    rf_crossval_rfe <- train(Condition~.,data = rfe_rf,method = "rf",metric = "Accuracy",trControl = trControl)

d. Model 4 with genes common to both algorithms

    # 78 DEG's selected by both algorithms
    # Sub setting the shared genes from data
    both <- intersect(boruta_genes$genes,rfe_genes$feature)
    both <- copy(both)
    shared <- data[,c(both)]
    shared$Condition <- factor(info$Status)
    trControl <- trainControl(method = "cv",number = 10,search = "grid")
    rf_crossval_bor_rfe <- train(Condition~.,data = shared,method = "rf",metric = "Accuracy",trControl = trControl)

Sensitivity, Specificity and AUC score were obtained individually for each of the model by including class Probabilities and two class Summary in train Control and by changing the metric to “ROC”.

Evaluation of Models

Performance of the models in classifying Covid-19 was assessed using the metrics: Accuracy, Sensitivity, Specificity and Area under ROC curve (AUC).

    rf_crossval$results

  mtry  Accuracy     Kappa AccuracySD   KappaSD
1    2 0.9288462 0.7453651 0.04414863 0.1684855
2   50 0.9288462 0.7378999 0.05713165 0.2150076
3 1291 0.9057692 0.6510883 0.07028908 0.2894309

    rf_crossval_boruta$results

  mtry  Accuracy     Kappa AccuracySD   KappaSD
1    2 0.9685897 0.8865753 0.05443059 0.2025097
2   47 0.9294872 0.7565949 0.06757004 0.2351235
3   93 0.9141026 0.7018150 0.07639897 0.2613480

    rf_crossval_rfe$results

  mtry  Accuracy     Kappa AccuracySD   KappaSD
1    2 0.9756410 0.9129717 0.03926046 0.1443566
2   51 0.9365385 0.7723934 0.06313009 0.2315501
3  100 0.9442308 0.8043868 0.05506440 0.1996204

    rf_crossval_bor_rfe$results

  mtry  Accuracy     Kappa AccuracySD   KappaSD
1    2 0.9615385 0.8830818 0.05439283 0.1604058
2   42 0.9294872 0.7566853 0.05701566 0.1984533
3   83 0.9294872 0.7566853 0.05701566 0.1984533

Performance metrics of the four models

Figure 12 : Performance metrics of the four models

As evident from figure 12, Models 2 & 3 with roughly 90-100 predictors/DEG’s performed better than Model 1 with 1291 predictors. Model 4, with just 78 predictors achieved an outstanding AUC score of 1.00. Thus, feature selection efficiently reduced the dimensionality without compromising the variance in the data and further strengthened the overall performance of the machine learning model.

The relative feature importance of the top 20 DEG’s commonly selected by both algorithms is depicted below.

    plot(varImp(rf_crossval_bor_rfe),col=c('blue','red','green','yellow'),top=20)

Ranking of the Top 20 DEG’s

Figure 13 : Ranking of the Top 20 DEG’s

Conclusion

In conclusion, it could be highlighted that machine learning based feature selection plays a vital role in identifying key DEG’s from the high dimensional RNA-Seq data which further helps in better understanding of the pathologic condition and to differentiate the diseased from the healthy. It is also noteworthy to mention that the selected DEG’s can serve as potential biomarkers for diagnostic and therapeutic purposes.

About the Author:

Annapurna Satti

Annapurna Satti, an accomplished Life Sciences professional, is an Alumni of University of Hyderabad. She has more than 15 years of research and development (R&D) experience in eminent academic institutes such as Center for DNA Fingerprinting and Diagnostics (CDFD) and Center for Cellular and Molecular Biology (CCMB). As a machine learning enthusiast, she aims to leverage Artificial Intelligence (AI) tools to boost and accelerate research in the field of biotechnology.

Mohan Rai

Mohan Rai is an Alumni of IIM Bangalore , he has completed his MBA from University of Pune and Bachelor of Science (Statistics) from University of Pune. He is a Certified Data Scientist by EMC.Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and a Business Owner. He has more than 18 years’ experience in the field of Analytics and has worked as an Analytics SME on domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.

Machine learning based Feature selection of RNA-Seq data for Covid-19 Classification

Plotting the library sizes

Plotting the log distribution of counts

Multidimensional scale plotting

Heat-map

Volcano plot and MDS plot

Modal

Are you sure to delete this information ?