Application of Machine Learning Algorithms in Modeling the Role of the Microbiome in Colorectal Cancer Diagnosis and Therapy: Part 2

Bioinformatics Framework design and Methodology - Machine Learning Modelling Results for the colorectal cancer drug-resistance mechanism

Published in Towards Data Science

9 min read · Dec 26, 2022

Photo by National Cancer Institute on Unsplash

This article is a follow-up to the introductory part, where I presented the research methodology and the bioinformatics framework design for observing the colorectal cancer drug-resistance mechanism and carcinogenesis. The main scientific aim was to design and develop a comprehensive bioinformatics framework and machine learning pipelines, organized as a two-phase methodology, for modelling and interpreting the key biomarkers that can play a significant role in understanding the therapy-resistance mechanism and carcinogenesis in patients diagnosed with colorectal cancer. Since I have already presented the dataset demographics and the data processing and transformation operations, here I will elaborate on the results of the drug-resistance case study. The study group consisted of 21 samples from patients with Newly Developed Adenoma (NDA), labelled as resistant, and the remaining 26 samples from patients with a Clean Intestine (CIT), labelled as not resistant.

Following the design of the methodology, I will present and generally elaborate on the ML modelling and statistical analysis results retrieved after I executed every building block of the implemented framework.

ML Modeling Screening Phase Results

The modelling screening phase, labelled as ‘algorithm benchmark analysis’, is of huge importance since no gold standard is available for processing and presenting reliable results from the bioinformatics analysis of microbiome data. In this phase, I prototyped the well-known supervised learning classifiers from scikit-learn. The data was randomly shuffled and divided into two separate datasets for training (70%) and testing (30%). Additionally, I tried the k-fold cross-validation method before creating some of the models.
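As a rough illustration of the screening step, the sketch below shuffles a placeholder dataset, splits it 70/30, and ranks a handful of scikit-learn classifiers by test accuracy. The synthetic data from `make_classification`, the particular classifiers, the 5-fold setting, and the `benchmark` helper are all assumptions made for illustration; the article does not list the exact models or settings that were benchmarked.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: 47 samples standing in for the genus-abundance table.
X, y = make_classification(n_samples=47, n_features=20, n_informative=5,
                           random_state=0)

# Random shuffle followed by a 70% / 30% train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=0)

# Hypothetical candidate list; swap in whichever classifiers are screened.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "gaussian_naive_bayes": GaussianNB(),
}


def benchmark(models, X_train, X_test, y_train, y_test):
    """Fit every model on the training split and return its test accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


scores = benchmark(candidates, X_train, X_test, y_train, y_test)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")

# Optional k-fold check (k = 5 here) for one of the candidates.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("decision tree 5-fold mean accuracy:", round(cv_scores.mean(), 3))
```

With only a few dozen samples, the accuracy of a single split can swing noticeably with the random seed, which is one reason to look at the k-fold scores as well.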

The screening modeling phase results are summarized in the following table:

Image by Author - ML Screening phase (algorithm benchmark analysis)

Since the idea behind the screening phase was to identify the most promising approach as determined by the accuracy metric, I concluded that the most promising option was the Decision Tree approach, which achieved a preliminary overall accuracy of 0.764. Using the decision tree (the ‘gini’ attribute selection measure combined with the ‘best’ splitting strategy) provides an additional benefit, since a key advantage of decision trees is their comprehensibility. Although it has a simple visual representation, this approach is beneficial because it forces the root split on some feature abundance distributions. This is very important considering the nature of the study, where we need an appropriate biological interpretation based on the model behavior itself. For this reason, I continued the modelling with the tree-based Random Forest algorithm, assuming that the performance metrics would improve further by taking advantage of majority voting across the trees.
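To make the comprehensibility point concrete, here is a minimal sketch of a scikit-learn decision tree configured with the ‘gini’ criterion and the ‘best’ splitter, printed as readable rules. The data is synthetic and the `genus_i` feature names are hypothetical placeholders, not the genera from the study.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder abundance matrix with hypothetical genus names.
X, y = make_classification(n_samples=47, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"genus_{i}" for i in range(X.shape[1])]

# 'gini' attribute selection measure with the 'best' splitting strategy.
tree = DecisionTreeClassifier(criterion="gini", splitter="best", random_state=0)
tree.fit(X, y)

# Node 0 is the root; tree_.feature holds the feature index used at each node.
root_feature = feature_names[tree.tree_.feature[0]]
print("root split on:", root_feature)

# Text rendering of the learned rules, one line per node.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

The root split is the first question the model asks of every sample, so the feature chosen there is a natural starting point for biological interpretation.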

ML Modeling Results

Considering that bioinformatics working environments are not standardized, I assumed it was essential to test and explore the Random Forest algorithm under different circumstances with different initial states. Thus, I applied the Random Forest classifier implementations from two different experimental environments, Python-based scikit-learn and KNIME. I tried different data normalization and scaling techniques, splitting ratios and classifier parameters to maximize the models’ performance metrics. I designed the process following the two-phase strategy, using the first phase’s most significant features as a narrowed input scope for the second phase. The main idea of this concept was to identify and observe the most significant features resulting from the second phase.

After the data normalization and scaling, I calculated Cronbach’s alpha and Cohen’s kappa coefficients. The Cronbach’s alpha thresholds can be explained by research stage: early stage of research (0.5 or 0.6/0.7); applied research (0.8); when making an important decision (0.9). Usually, a Cronbach’s alpha value > 0.75 is considered acceptable for microbiome-related studies. Cohen’s kappa coefficient, on the other hand, is interpreted as follows: <0.4 is considered poor; 0.4–0.75 is considered moderate to good; >0.75 represents excellent data agreement. The results from these calculations are presented in the table below:

Image by Author - Cronbach’s alpha and Cohen’s kappa coefficients for the resistant and non-resistant CRC post-operative individuals’ groups
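For readers who want to see how the two coefficients are defined, the sketch below implements the textbook formulas in plain Python. The article does not describe exactly which columns or label pairs were fed into these calculations, so the toy inputs and the function names `cronbach_alpha` and `cohen_kappa` are assumptions for illustration only.

```python
from collections import Counter
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns of equal length.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals))
    """
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)


# Toy inputs: two perfectly consistent items, and two labelings that agree
# exactly as often as chance would predict.
alpha = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 1])
print(alpha, kappa)
```

Both functions assume well-formed input: at least two items with non-constant totals for alpha, and labelings that are not all in a single class for kappa.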

The general ML modelling performance metrics for the resistant and non-resistant CRC post-operative individuals’ groups are presented in the following table. Besides the accuracy, I also calculated the models’ sensitivity and specificity as significant indicators of model behavior and predictiveness. Studies of this kind usually report these metrics since high accuracy alone does not guarantee that the model is reliable (that is, not biased or overfitted).

Image by Author - General ML modeling performance metrics for the resistant and non-resistant CRC post-operative individuals’ groups
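The sketch below shows why accuracy alone can mislead. It computes sensitivity and specificity from the confusion counts for a toy case in which a model that never predicts the positive class still reaches 80% accuracy. Treating label 1 as ‘resistant’ is an assumption for illustration, and the labels are invented, not taken from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fn, fp, tn


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = tp / (tp + fn); specificity = tn / (tn + fp)."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: 2 positives, 8 negatives, and a model that always says "0".
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"accuracy={accuracy:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Here the accuracy looks respectable while the sensitivity is zero, which is exactly the kind of bias the extra metrics are meant to expose.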

It is worth emphasizing that for the first modelling phase, I used the following algorithm parameter values: n_estimators = 55, max_depth = 5, max_features = 3, with 25% of the data held out for testing, using stratified sampling on the additionally introduced ‘resistance’ target feature. For the second phase, I configured n_estimators = 25, max_depth = 4, max_features = 3, again with 25% of the data held out for testing.
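A minimal scikit-learn sketch of the two-phase idea with the parameter values quoted above might look as follows. The synthetic data, the `fit_phase` helper, and the `top_k = 15` cut-off used to narrow the feature set are assumptions; the article does not state how many first-phase features were carried into the second phase.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the normalized genus-abundance table.
X, y = make_classification(n_samples=47, n_features=40, n_informative=6,
                           random_state=0)


def fit_phase(X, y, n_estimators, max_depth, max_features, seed=0):
    """Train a random forest on a stratified 75/25 split; return model and accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth,
        max_features=max_features, random_state=seed)
    model.fit(X_tr, y_tr)
    return model, model.score(X_te, y_te)


# Phase 1: all features, parameters as quoted in the text.
phase1, acc1 = fit_phase(X, y, n_estimators=55, max_depth=5, max_features=3)

# Keep the most important phase-1 features (top_k is a hypothetical cut-off).
top_k = 15
selected = np.argsort(phase1.feature_importances_)[::-1][:top_k]

# Phase 2: narrowed input scope, second set of parameters.
phase2, acc2 = fit_phase(X[:, selected], y,
                         n_estimators=25, max_depth=4, max_features=3)

print(f"phase 1 accuracy: {acc1:.3f}, phase 2 accuracy: {acc2:.3f}")
```

Fixing `random_state` makes a run reproducible, but because the forest is stochastic it is worth repeating the procedure with several seeds before trusting any single feature ranking.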

Additionally, I calculated the Area Under the Curve (AUC) value, which represents an aggregated measure of a binary classifier’s performance across all possible threshold values (that is, its ability to discriminate between the two classes).

Image by Author - AUC value for the resistant and non-resistant CRC post-operative individuals’ groups
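One way to read the AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it directly from that definition, counting ties as half a win. The labels and scores are toy values, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one of the four positive/negative pairs is ranked the wrong way.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Because it depends only on the ranking of the scores, this value does not change with the decision threshold chosen for the final classification.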

I also calculated the Precision, Recall and F1-Score metrics for both subgroups. These are alternative evaluation metrics that assess the predictive skill of a model through its class-wise performance rather than the overall performance captured by accuracy. The results are displayed in the following table:

Image by Author - Detailed ML modeling performance metrics for the resistant and non-resistant CRC post-operative individuals’ group
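Class-wise metrics of this kind can be obtained with scikit-learn’s `precision_recall_fscore_support`. The sketch below uses invented labels, and the mapping of 1 to ‘resistant’ and 0 to ‘not resistant’ is an assumption for illustration.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels: 1 = resistant, 0 = not resistant (assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]

# With the default average=None, one value per class is returned,
# in the order given by labels.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1])

for index, name in enumerate(["not resistant", "resistant"]):
    print(f"{name}: precision={precision[index]:.3f} "
          f"recall={recall[index]:.3f} f1={f1[index]:.3f} "
          f"support={support[index]}")
```

Reporting the metrics per class makes it visible when a model performs well on one group at the expense of the other.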

I also tried the XGBoost and AdaBoost algorithms, which resulted in no significant improvements compared with the forest-based approach described above. Therefore, I identified the second-phase Python-based random forest classifier as the best-performing model and selected its most important features as a reference set for further statistical analysis.

Statistical Analysis Results

The taxonomic analysis of the raw data, which assumed improved taxonomic precision since the bacterial references are constantly changing, detected 3603 different bacterial taxonomic units. The gut microbiome consisted of 20 unique phyla, 35 classes, 72 orders, 119 families, and 259 unique genera, with additional genus-level data explored. The taxonomy on the genus level was unavailable for 1506 bacteria (1506/3603; 41.8%). Among the remaining bacteria (2097; 58.2%), the most significant genera among the resistant samples fall within the Benjamini-Hochberg p-value interval from 0.009 to 0.024.

Thus, in the resistant group, I found Bacteroides (0.009) and Lachnoclostridium (0.017) to be genera of biological interest for further analysis and interpretation. Accordingly, the most significant genera among the non-resistant samples fall within the Benjamini-Hochberg p-value interval from 0.001 to 0.047. In the non-resistant group, I found Ruminococcus (0.002), Lachnospiraceae FCS020 group (0.019), Desulfovibrio (0.012) and Clostridium sensu stricto 1 (0.016).
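The Benjamini-Hochberg procedure itself is short enough to sketch in plain Python. The function below turns a list of raw p-values into adjusted ones. The article does not say which statistical test produced the raw p-values, so the input values here are invented and `benjamini_hochberg` is a hypothetical helper name.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy raw p-values for four hypothetical genera.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Each raw p-value is scaled by the number of tests divided by its rank, which controls the expected share of false discoveries when many genera are tested at once.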

To complete the general picture, the statistical analysis results for genera abundances in the resistant and non-resistant groups are visualized in the following diagram:

Image by Author - Median abundances for the most significant genera in resistant and non-resistant groups

Highly Contributing Features

The comparison of the resistant and non-resistant groups of samples yielded a total of 86 unique genera. Of these, the ML algorithm singled out 28 (32.6%) as the most important features, with Benjamini-Hochberg p-values between the groups ranging from 0.002 to 0.049. I observed the most significant differentiation between the resistant and non-resistant groups in the following genera: Ruminococcus, Oscillospiraceae-UCG-002, Eubacterium eligens group, Barnesiella, Bacteroides, Oscillospiraceae group, Desulfovibrio, Oscillospiraceae-UCG-005, Clostridium sensu stricto 1, Lachnoclostridium, and Lachnospiraceae FCS020 group (p-values of 0.002, 0.003, 0.005, 0.007, 0.009, 0.010, 0.012, 0.014, 0.016, 0.017, and 0.019, respectively).

Aggregated Features Contribution Analysis Results

The main aim of this novel approach was to explore which genera are most often seen together and how they jointly contribute to the resistance class. Owing to the stochastic nature of the algorithm, the aggregated contribution analysis can be done multiple times, considering all generated models that match the performance metrics of the reference one. The proposed aggregate analysis supports the thesis that resistance is due not to a single pathogenic genus in the patient microbiome but to several bacterial genera that live in symbiosis. As expected, the aggregated contributions are lower than the individual ones, but they uncover additional insights regarding the composition of the entire trajectory along the algorithm’s prediction path.
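The exact aggregation procedure is described in the original publication. As a simplified stand-in, the sketch below walks each sample’s decision path through every tree of a random forest and credits the shift in predicted class probability (leaf versus root) to the set of features encountered along that path. The synthetic data, the `joint_path_contributions` helper, and this particular way of attributing contributions are assumptions, not the author’s implementation.

```python
from collections import defaultdict

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data; feature indices stand in for genera.
X, y = make_classification(n_samples=47, n_features=8, n_informative=4,
                           n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=25, max_depth=4, max_features=3,
                                random_state=0).fit(X, y)


def joint_path_contributions(forest, X, target_class=1):
    """Average probability shift credited to each feature set seen on a path."""
    totals = defaultdict(float)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        # Per-node class distribution, normalized so each row sums to 1.
        values = tree.value[:, 0, :]
        proba = values / values.sum(axis=1, keepdims=True)
        paths = estimator.decision_path(X)  # sparse (n_samples, n_nodes)
        leaves = estimator.apply(X)         # leaf node index per sample
        for s in range(X.shape[0]):
            nodes = paths.indices[paths.indptr[s]:paths.indptr[s + 1]]
            # Leaves carry a negative feature index, so they are skipped.
            features = frozenset(int(tree.feature[n]) for n in nodes
                                 if tree.feature[n] >= 0)
            if features:
                shift = proba[leaves[s], target_class] - proba[0, target_class]
                totals[features] += shift
    n_paths = len(forest.estimators_) * X.shape[0]
    return {features: total / n_paths for features, total in totals.items()}


contribs = joint_path_contributions(forest, X, target_class=1)
top = sorted(contribs.items(), key=lambda kv: kv[1], reverse=True)[:5]
for features, value in top:
    print(sorted(features), round(value, 4))
```

Since every path contributes to exactly one feature set, the individual values are small, which is in line with the observation that aggregated contributions are lower than individual ones.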

The detailed aggregated feature significances supporting resistant behavior (contribution to the resistant class prediction) are presented in the following table:

Image by Author - Aggregated bacteria significance contributions to the resistant class

Accordingly, the detailed aggregated feature significances supporting non-resistant behavior (contribution to the non-resistant class prediction) are presented in the following table:

Image by Author - Aggregated bacteria significance contributions to the not resistant class

The aggregated contribution relations establish a fundamental ground for more profound future scientific research.

* The full observations and findings can be found in the original publication.

Bacterial Abundance Results

I used the initially generated OTU tables to create a potential metabolomic profile with the iVikodak workflow. Although this type of inference should be performed on metatranscriptomics datasets, the OTU-based profiles can still give us insights into the bacteria’s potential roles in specific KEGG pathways. Based on species abundance levels, we can infer the influence of metabolites produced by the bacteria and their impact on cellular mechanisms.

The abundance frequency patterns covered in the analysis and segregated according to the diagnosis and control groups are visually presented in the following diagram:

Image by Author - Genera abundance frequency patterns segregated by diagnostic and control groups

Besides the already mentioned genera specific to the resistant or non-resistant representatives, the observed abundances show that some bacteria are present only in specific groups, such as Parasutterella and Lachnospira, which are found only in the control group. These bacteria are known to participate in everyday protein catabolism in the human colon.

The bacterial abundance tendency in the non-resistant samples is summarized in the table below, where p-values were calculated using the Benjamini-Hochberg statistical method.

Image by Author - Bacterial abundance tendency in the non-resistant samples

Biological Analysis and Interpretation

Bacteroides, the most frequent genus among the microbiome samples that we analyzed with our algorithm, has already been reported in several studies as significantly associated with human CRC development. This genus was identified as an important feature of the model we used to compare the resistant and non-resistant groups, in favor of the resistant group (p = 0.003, mean abundance 28). Enterotoxigenic Bacteroides bacteria have a critical impact on CRC development and proliferation, considering that their biofilm production for colonization results in a series of inflammatory reactions that encourage chronic intestinal inflammation and tissue damage. Moreover, functional studies on mice verified that enterotoxigenic Bacteroides could directly promote intestinal carcinogenesis.

In this context, Alistipes bacteria, which are significantly increased in the non-resistant group, live in symbiosis with Bacteroides species because both are resistant to vancomycin, kanamycin, and colistin. Both also have similar pathways for amino acid fermentation, supporting colon inflammation and adenoma development.

Additionally, the most compelling genus, with the most significant p-value (0.002), was Ruminococcus, which is in favor of non-resistant patients. This study highlights the fundamental role of gut microbiota in cancer development and progression along with chemotherapy outcomes. Understandably, the Barnesiella species shows a high correlation with the non-resistant group since its metabolites indicate infiltration of interferon-γ-producing γδT cells in cancer tissues. Furthermore, it is shown that this species can interfere with the impact of anticancer and immunomodulatory agents and prevent cancer treatment.

The bacterial functions involved in the resistance mechanism, as compiled from the study, are summarized in the following table:

Image by Author - Summary of the resistance mechanism bacteria functions

It is worth emphasizing here that although we are familiar with the individual impact of a single genus in the patient microbiome, we are still far from answering why several genera are frequently found together and whether resistance depends on the presence of one genus or of several genera together.

Thank you for your interest in reading this article. The next one will follow the same principle, but for the second case study, related to samples that share the same histology information for tubular adenoma.

Part 1 - Introductory article - Bioinformatics Framework design and Methodology Overview

Part 3 - Bioinformatics Framework design and Methodology - Machine Learning Modelling Results for understanding the colorectal cancer carcinogenesis

Originally published at https://www.linkedin.com.