Studying Bacteria with Tabular Data

Savion Watson
CUNY CSI MTH513
May 9, 2022

Introduction

While machine learning is usually associated with calculation-heavy fields such as artificial intelligence or engineering, it can also directly aid biological studies. For example, a recent study led by Ryan Wood demonstrated how computer simulations were able to accurately predict antibiotic-resistant bacteria from lossy, compressed data covering many bacterial species.

With antibiotic resistance on the rise worldwide, identifying which bacteria contribute to it could allow scientists to develop more effective medicine. As a means of testing Wood's methods and improving our machine learning skills, our research team (Andi Kolari, Adam Bougaev, and me) competed in an online competition on Kaggle.com, a data science website. Through the Tabular Playground Competition, we applied our knowledge of machine learning with Scikit-learn and Pandas to identify bacteria species from Wood's data.

Competition Background

As mentioned, the data from Wood's study was lossy. Only 10-mer snippets of DNA were sampled and used to identify ten species of bacteria. Each snippet is recorded as a rough histogram rather than a sequence, with no information about the order of its molecules. A snippet is exactly 10 molecules long, and there are four base molecules: Adenine "A", Cytosine "C", Guanine "G", and Thymine "T". Thus, there were 286 possible combinations.

Combination calculation, where order doesn't matter and repeats are allowed: C(n + r - 1, r) = C(4 + 10 - 1, 10) = C(13, 10) = 286.
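As a quick sanity check, this count can be reproduced in a couple of lines of Python using only the standard library:

```python
from math import comb

# Number of 10-mer compositions over 4 bases (order ignored, repeats allowed):
# C(n + r - 1, r) with n = 4 bases and r = 10 positions.
print(comb(4 + 10 - 1, 10))  # 286
```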

Our initial data consisted of a testing set (test.csv), a training set (train.csv), and a sample submission (sample_submission.csv) for Kaggle. Each possible combination of molecules was a column in both test.csv and train.csv, and each entry recorded a measurement error for that combination relative to a given DNA sample. The bacteria species themselves were identified by a column labeled "target" in the training set. There were 10 possible species:

  • Streptococcus Pyogenes
  • Streptococcus Pneumoniae
  • Salmonella Enterica
  • Enterococcus Hirae
  • Escherichia Coli
  • Escherichia Fergusonii
  • Campylobacter Jejuni
  • Bacteroides Fragilis
  • Klebsiella Pneumoniae
  • Staphylococcus Aureus

Each DNA sample had an identification number on its row, and sample_submission.csv consisted of just these IDs and the "target" column.

Truncated Sample of Tabular Dataset
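For reference, a minimal sketch of loading and inspecting the files with Pandas, assuming they sit in the working directory (not our exact notebook code):

```python
import pandas as pd

# Load the competition files.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# train holds the 286 combination columns plus a "target" label;
# test holds the same combination columns without the label.
print(train.shape, test.shape)
print(train["target"].value_counts())
```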

Methods and Research Process

Up to this point in our Machine Learning class, our team was not particularly well versed with categorical data. Our only prior competition was Titanic — Machine Learning from Disaster in which we were tasked with simply predicting which passengers would have survived the infamous sinking of the Titanic. Since we had only four features to work with, and most features were categorical, we were able to run a simple Random Forest Classifier.

However, the Tabular Playground Competition had far more columns than the Titanic dataset, each holding a small numerical rate. All of these columns were also used to predict one of ten possible outcomes, not just "Survived" or "Did Not Survive". Every possible combination of molecules was undoubtedly a factor in determining the species, but the sheer number of them created a high-dimensional space that made the problem intimidating to approach. Our biggest challenge was finding efficient ways to accurately analyze such a large dataset.

Modeling and Transforming

One-Hot Encoding

Initially, we tried to use One-Hot Encoding to transform our data before modeling. One-Hot Encoding creates new columns that indicate the presence of specific categories: it takes categories with no particular ranking (nominal variables) and simplifies them into binary indicators. We were inspired to use this after skimming through lessons in the "Intermediate Machine Learning" course Kaggle offers.

Since the columns were simply different combinations and we wanted an easier way to digest the measurement errors, this seemed logical. However, as we implemented it, we began to realize how problematic it would be.

Code implementation
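Our original notebook isn't reproduced here; the following is only a rough sketch of what such an attempt looks like with scikit-learn's OneHotEncoder (the row_id column name is an assumption), and it illustrates the problem: every distinct numeric value becomes its own binary column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("train.csv")
feature_cols = [c for c in train.columns if c not in ("row_id", "target")]

# Treating each continuous measurement as a "category" explodes the width:
# every distinct value in every column becomes a separate binary indicator.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(train[feature_cols])  # sparse matrix

print(encoded.shape)  # far wider than the original 286 columns
```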

To start, ironically, the large spread of values in our data made it difficult for One-Hot Encoding to be used effectively. When we took a more critical look at the Kaggle course during implementation, it suggested limiting this approach to features with at most 15 categories, even if they were all nominal.

Furthermore, even if we had fewer categories, One-Hot Encoding works off binary indicators. Simply rounding each measurement error to 1 or 0 would eliminate the nuance the dataset needed, since the values acted as a measure of how strongly each type of combination was present in each sample. Multiple species could have more of one combination than others, so a binary presence value wouldn't make sense.

Going forward, we decided to leave the initial data alone and simply find a working model to elaborate on.

K-Nearest Neighbors

Next, we looked into K-Nearest Neighbors, since it was a method we had recently learned at the time. K-Nearest Neighbors works by measuring the distance between data points and classifying each new point based on the labels of its k closest neighbors. Since we were trying to assign a single species to each DNA sample, a K-Nearest Neighbors classifier felt appropriate.

This was also problematic: although the output would ultimately be discrete (one sample could not identify with more than one DNA species), our inputs were once again too vast. Every additional input to K-Nearest Neighbors adds another dimension along which distance must be measured. The dataset had a whopping 286 unique combination columns to analyze, and since all of them needed to be analyzed in tandem for each sample, using only one column at a time per classifier was not an option.

Code Implementation
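Again, not our original code, but a minimal sketch of the kind of K-Nearest Neighbors baseline described above (the row_id column name is an assumption; n_neighbors=5 is simply scikit-learn's default):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("train.csv")
X = train.drop(columns=["row_id", "target"], errors="ignore")
y = train["target"]

# Hold out part of the training set to estimate accuracy locally.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Distances are measured across all 286 feature axes at once, which is
# exactly where the curse of dimensionality starts to hurt.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)
print(accuracy_score(y_val, knn.predict(X_val)))
```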

This unfortunate property we had been struggling with is known colloquially as the Curse of Dimensionality: a high number of dimensions creates ever more axes along which neighboring points must be measured. If close neighbors cannot be found adequately in every dimension, K-Nearest Neighbors will not function properly.

Random Forest Classifier

Ultimately, we found that Random Forest Classification was the most appropriate model to use. Random Forest Classification is an extension of the Decision Tree Classifier, which builds tree diagrams that branch over possible decisions to ultimately reach a conclusion. Having too many branches leads to overfitting, where the model is too specific to our dataset, and too few branches lead to underfitting, where the model is too general to apply to any dataset.

Given how many dimensions we had to work with, simply using a Decision Tree with many leaves could lead to overfitting, but a Random Forest Classifier accounts for this by taking the majority vote of multiple trees. Despite initially testing forests of 100 trees, we found the best results from just two trees (n_estimators=2), likely due to the sheer number of leaves each tree would have. This finally led us to an accuracy score of .71457.

Code Implementation
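A minimal sketch of the Random Forest setup described above (the submission format and row_id column are assumptions based on typical Kaggle competitions, not details given in this write-up):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X = train.drop(columns=["row_id", "target"], errors="ignore")
y = train["target"]
X_test = test.drop(columns=["row_id"], errors="ignore")

# A forest of only two trees, matching the n_estimators=2 setting above.
model = RandomForestClassifier(n_estimators=2, random_state=0)
model.fit(X, y)

submission = pd.DataFrame({"row_id": test["row_id"], "target": model.predict(X_test)})
submission.to_csv("submission.csv", index=False)
```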

It became clear that the Curse of Dimensionality was quite real, making it difficult to approach our challenge, much less improve our strategies.

If Ryan Wood and his team were able to draw important conclusions by simplifying parts of their data, we could likely do the same. With some help from our professor, we successfully reverse-engineered Wood’s compression process and extracted the most relevant information from the original data, without actually changing it.

Data Reconstruction and Principal Component Analysis

If feeding in our raw dataset was too much, yet no details could be omitted, then we had to change the way we interpreted our data.

First, we decided to retrieve the more precise data behind Wood's experiment. While the challenge was to work with the lossy data, none of the rules prohibited reverse engineering, and since we still worked from the lossy data to approximate the original rather than using another file, it didn't feel like cheating.

In the original study, bias values were subtracted to simplify the data. After dropping duplicate values, the biases were recovered by applying a multinomial distribution to each combination column.

Duplicate Dropping
Multinomial Distribution Formula
Formula Implementation
Output snippet of biases
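The reverse-engineering notebook isn't reproduced here. The sketch below shows one way the multinomial bias for each combination column could be computed, assuming column names that encode the base composition (e.g. "A2T3G1C4") and a uniform 1/4 probability per base; both of those details are assumptions, not facts from this write-up.

```python
import re
from math import factorial

import pandas as pd

def composition_bias(a: int, t: int, g: int, c: int) -> float:
    # Expected fraction of 10-mers with this base composition under a
    # uniform multinomial: 10! / (a! t! g! c!) * (1/4)**10.
    coeff = factorial(10) // (factorial(a) * factorial(t) * factorial(g) * factorial(c))
    return coeff * 0.25 ** 10

train = pd.read_csv("train.csv")
feature_cols = [c for c in train.columns if c not in ("row_id", "target")]

biases = {}
for col in feature_cols:
    counts = {base: int(n) for base, n in re.findall(r"([ATGC])(\d+)", col)}
    biases[col] = composition_bias(counts["A"], counts["T"], counts["G"], counts["C"])
```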

These biases had been subtracted from the original measurements, which were also divided by 10k in the experiment, so we added the biases back and re-scaled the data. The original paper also notes that each sample took a different number of trials to produce; the more trials behind a sample, the less variance its measurements have. The quantity of trials was estimated using the Greatest Common Divisor (GCD) of the measurement errors for each sample.
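A small sketch of the GCD idea, assuming the bias-corrected values become (approximately) integer counts once rescaled by the factor mentioned above; the 10,000 scale and the `corrected` DataFrame name are assumptions carried over from the bias sketch.

```python
from functools import reduce
from math import gcd

import numpy as np
import pandas as pd

def estimate_gcd(row: pd.Series, scale: int = 10_000) -> int:
    # Rescale the bias-corrected measurements back to integer counts, then
    # take their GCD as an estimate tied to the number of trials per sample.
    ints = np.round(row.to_numpy() * scale).astype(int)
    return reduce(gcd, ints, 0)

# `corrected` would be the bias-corrected, rescaled feature DataFrame:
# gcds = corrected.apply(estimate_gcd, axis=1)
```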

To visualize our data, we used Principal Component Analysis (PCA). Principal Component Analysis works by capturing maximum variance through a linear projection map. Maximizing variance is important since it shows the most vital patterns in our data.

Since the map is only a projection of our data, we obtain a small set of features derived from our data points while still preserving the data's overall shape for proper analysis. None of the original data was actually changed.

PCA Implementation
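A minimal PCA sketch with scikit-learn; here `X` stands for the bias-corrected, rescaled feature matrix and `gcds` for the per-sample estimates from the sketches above (both variable names are assumptions, not our exact code):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the reconstructed features onto their first two principal
# components, which capture the directions of maximum variance.
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# Coloring by the estimated GCD makes the variance pattern visible.
plt.scatter(components[:, 0], components[:, 1], c=gcds, s=2, cmap="viridis")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.colorbar(label="estimated GCD")
plt.show()
```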

While all the data points are important, having 20,000 samples, each with 286 unique measurements, creates over 5 million data points. It was integral that we minimized noise and maximized variance so that the data points we did utilize gave us the best predictions possible. PCA also helped illustrate the relationship between variance and GCD with graphs:

Low GCDs with High variance
High GCDs with Low variance

Finally, to calculate our score, the GCD most closely associated with each sample was used as an additional feature to aid its species prediction with an Extra Trees Classifier. It is similar to a Random Forest Classifier, but it uses random splits instead of the best splits from a random subset, which suited our low-variance points. Our reduced dimensions also meant we could train many more trees (n_estimators=300) in a reasonable amount of time.
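A minimal sketch of the final model, where `X_reduced` stands for the PCA-reduced features with the GCD estimate appended and `y` for the species labels (both variable names are assumptions):

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Extra Trees: like a random forest, but split thresholds are drawn at
# random rather than optimized, which further decorrelates the trees.
model = ExtraTreesClassifier(n_estimators=300, n_jobs=-1, random_state=0)

scores = cross_val_score(model, X_reduced, y, cv=5)
print(scores.mean())
```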

This yielded much more accurate results, with an excellent accuracy score of .96932.

Conclusion

Overall, when studying large datasets, it is important to understand when and how to extract the information most crucial to your task. Due to the sheer size and complexity of our data, our team progressed through the following stages:

  • Misinterpreting the data and applying unfit models
  • Grasping the nature of the data and applying a simple model
  • Getting more accurate measurements (despite being challenged to work with lossy ones)
  • Reducing the data to its most essential parts
  • Applying that reduction to a better-fitting model

Without pushing ourselves to truly address our task, we would never have been able to test our skills, receive such high scores, and break the Curse of Dimensionality.
