Accounting for Biased Data in Machine Learning: Malaria

Alexs Thompson
KC AI Lab, LLC
Oct 10, 2018
common antimalarial drugs (found here)

written by Kate Young

Recently KCAIL partnered with a group of researchers at Kansas State University to use machine learning methods to predict the potency of various compounds in preventing malaria. This area of research is called Quantitative Structure-Activity Relationship (QSAR) modeling. A QSAR model uses the chemical properties of a compound to predict its biological activity. Because determining potency experimentally is very time-consuming and expensive, our goal was to narrow the list of candidate compounds down to just a few that were more likely to be potent than the others.

We were asked to predict the potency of 23 compounds using 1,444 given properties (the test set). Our training data consisted of 47 compounds with the same 1,444 properties (the training set). We were also given a subset of properties for a few thousand other compounds. In most circumstances, given the high dimensionality and small sample size, we would ask for more training data. In this case, however, the lack of data was the very problem we were trying to solve: if it were feasible to determine potency for all 23 compounds, there would be no reason for us to predict it. In addition to the small sample size, the 47 compounds with known potencies had been chosen specifically because researchers had reason to believe they could be potent. This introduces selection bias into the training data.

We recognized these two big problems early on: our training set was a biased sample and very small relative to the number of features. Because gathering more data was not an option, we had to account for these problems in our model.

Accounting for a Biased Training Set

Because the 47 compounds in our training set were chosen specifically because researchers had reason to believe they could be potent, we can assume their potencies are higher, on average, than those of known compounds in general. Therefore we cannot apply a model trained on this data to all compounds; we can only apply it to similar compounds. How do we know if the 23 compounds in our test set are similar enough to the compounds in our training set?

Our first step was to visually represent all known compounds. Using t-SNE for dimensionality reduction, we can visualize the properties of all compounds in two dimensions. A scatter plot of this data confirmed our suspicions. It shows the compounds in the test set, the training set, and a random sample of other compounds with known chemical properties. Compounds clustered together have similar features.

visual representation of known compounds

This clearly shows the bias in our training set. Applying our model to the vast majority of compounds would be a mistake, as we would be extrapolating beyond the scope of the model. Fortunately, the compounds in our test set look a little better (the train set is in orange, and the test set is in blue). The test and train compounds tend to clump together except for the four compounds in the top right-hand corner.

test and train compounds
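
For reference, here is a minimal sketch of how a projection like this could be produced with scikit-learn's TSNE. The variable names (all_properties, group) are illustrative, not taken from our codebase:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# all_properties: rows = compounds, columns = the 1,444 chemical properties
# group: array labeling each row as 'other', 'train', or 'test' (illustrative)
embedding = TSNE(n_components=2, random_state=0).fit_transform(all_properties)

for name, color in [('other', 'lightgray'), ('train', 'orange'), ('test', 'blue')]:
    mask = group == name
    plt.scatter(embedding[mask, 0], embedding[mask, 1], c=color, label=name, s=20)
plt.legend()
plt.show()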

While the test set and the train set look somewhat similar, we can do more to ensure that the training set is a good representation of the test set by selecting only features that are distributed similarly in both sets. We did this using the two-sample Kolmogorov-Smirnov test, a non-parametric test of whether two samples come from the same distribution. Before running a feature selection algorithm, we looped through each feature and ran the Kolmogorov-Smirnov test to compare the feature's distribution in the test set to its distribution in the train set. Our hypotheses were as follows.

Null Hypothesis: The test and train sets are sampled from the same distribution

Alternative Hypothesis: The test and train sets are sampled from different distributions

We rejected the null hypothesis when our p-value was less than 0.1. When the null hypothesis was rejected, we determined that there was enough evidence to suggest that the feature in the test and train set came from different distributions and we excluded it from the model.

We chose 10% as our significance level because a less conservative test was sufficient given the number of features we had to choose from. This lowers the probability of a Type II error (including a feature in the model even though its test and train distributions differ), at the cost of a higher probability of a Type I error (excluding a feature even though its test and train distributions are the same).

Here is the Python code that tests all the features:

from scipy import stats as st

# test and train are the row indices of the test-set and train-set compounds in df
filtered_features = list()
a = 0.10

for f in df.columns:
    ks = st.ks_2samp(df.loc[test, f], df.loc[train, f])
    # keep the feature only if we fail to reject the null (p-value >= a)
    if ks[1] >= a:
        filtered_features.append(f)

Accounting for a Small Training Set

After narrowing down the list of features, we ran a feature selection algorithm and fit a model to the selected features. Unfortunately, due to the small sample size, the accuracy of each model was highly volatile, and the high dimensionality made it very likely that our models were overfit. We concluded that, given these limitations, we could not rely on a single model to accurately predict potency. Our solution was to fit our model many times and derive confidence intervals for the predicted potency of each compound. This gave us, as well as our stakeholder, higher confidence in our results despite the limitations of the data. It also let us show visually which compounds we could predict potency for with high confidence and which we could not.
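
As a rough sketch of the idea, assuming a LASSO-style linear model standing in for both the feature selection and the fit (the estimator, the alpha value, and the variable names here are illustrative, not necessarily what we actually used):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import resample

models, predictions = [], []

for i in range(1000):
    # resample the 47 training compounds and refit the model each time
    X_boot, y_boot = resample(X_train[filtered_features], y_train, random_state=i)
    model = Lasso(alpha=0.1).fit(X_boot, y_boot)
    models.append(model)
    predictions.append(model.predict(X_test[filtered_features]))

predictions = np.array(predictions)
# 90% interval for each test compound's predicted potency
lower, upper = np.percentile(predictions, [5, 95], axis=0)

The width of each compound's interval then reflects how much its prediction moves as the training sample changes.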

Results

The results are shown in the visualization here. Our models produced very wide ranges of values for some compounds and much narrower ranges for others. We expected a large range of values for compounds that were vastly different from the compounds in the train set. In fact, there are four compounds for which the model failed dramatically: the four compounds positioned far from all the others in our t-SNE visualization above.

Note: IC50 is the measure of potency — a high IC50 value indicates low potency, while a low IC50 value indicates high potency.

We used the same technique to determine confidence intervals for feature importance (visualized here). We estimated each feature's importance in two ways: the percentage of all fitted models that selected the feature, and the feature's coefficient in each model, from which we derived a confidence interval for the estimated coefficient.
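
Continuing the sketch above, both measures fall out of the same list of fitted models: selection frequency from how often a coefficient is non-zero, and a coefficient interval from its spread across models (again assuming a linear model whose zeroed coefficients mark unselected features).

import numpy as np

# models: the fitted models collected in the loop above
coefs = np.array([m.coef_ for m in models])

# share of models in which each feature was selected (non-zero coefficient)
selection_rate = (coefs != 0).mean(axis=0)

# 90% interval for each feature's coefficient across the fitted models
coef_lower, coef_upper = np.percentile(coefs, [5, 95], axis=0)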

Overall, we learned that while biased data and small sample sizes are a problem in machine learning, this does not mean that machine learning methods cannot be used at all. We just need to understand all the limitations of our data in order to understand the limitations of our model.

To see our findings and the code we produced, here is the GitHub repository for you to read, clone, and build from.

If you want to become a data superhero, send us a message:

kcail.com
