(ML)²: Myths and Legends of Machine Learning

by: Aaron Daugherty, PhD (Senior Scientist, twoXAR)
&
Riley Parsons (Bioinformatics Intern, twoXAR)

“Handling the black box” / Illustration by Riley Parsons

Skepticism is (and should be) a vital part of any science; statistics and data science are no exception. Statistician George Box nicely summed it up when he said, “all models are wrong, but some are useful”. Box reminds us that statistical models are just that: models. A simplified representation of the real world will always have shortcomings. But we shouldn’t forget the last bit of Box’s saying: “some [models] are useful”. Although they are challenging to build, carefully constructed statistical models can be extremely useful.

In our last post, we gave a brief introduction to machine learning: what it is and how it works. In giving that high-level introduction, we skipped over some of the limitations of machine learning, but we’re back today to address those gaps. To drive the discussion, we will use some of the skepticism we’ve encountered when describing our own technology and cover the following questions:

  • Isn’t real-world data too messy?
  • Isn’t machine learning just a black box?
  • So, data drives everything?

Authors’ note: for a high-level overview on how our technology works, read the “So how did we do it?” section in our previous post: Synergizing against breast cancer.

Isn’t real-world data too messy?

The concern here is that if you have too much messy data or noise (random/inconsistent data), it will drown out the signal (the data answering your question). For example, if there is an airplane taking off next to you, you aren’t going to be able to hear your friend talking to you. In data science, there is absolutely a point at which real-world data is too messy. However, machine learning enables use of some pretty messy data. You still need to run all your input data through thorough quality control checks before using it, but even a little signal amidst the noise is valuable, and this is where the importance of overlapping evidence comes into play.

At twoXAR, we use diverse data sources and methods to describe various facets of a disease or drug. Each method attempts to find a signal in “messy” input data. As a result, each method has its own shortcomings, but also its own predictive power. On its own, each of these independent methods can be a weak predictor; we gain our power by combining their results using a machine learning framework. Aggregating these features further distills and reveals overlapping insights from heterogeneous data. The final prediction then contains the drugs that are most likely to improve the disease or symptom we have targeted.
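To make the idea of overlapping evidence concrete, here is a minimal sketch in Python. It is not our actual pipeline: the data are simulated, the sizes and noise levels are made-up assumptions, and a plain average stands in for a real machine learning framework. Each hypothetical "method" produces a score that is mostly noise with a small amount of signal, and we compare how well each method separates effective from ineffective drugs on its own versus after the scores are combined.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: the chance a random positive outscores a random negative.

    Assumes no tied scores, which holds for the continuous noise used here.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum_pos = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = random.Random(0)            # fixed seed so the sketch is repeatable
N_DRUGS, N_METHODS = 1000, 10     # hypothetical sizes
SIGNAL, NOISE_SD = 0.5, 1.0       # each method: weak signal, strong noise

# 1 = drug truly helps (simulated ground truth), 0 = it does not.
labels = [1 if i % 2 == 0 else 0 for i in range(N_DRUGS)]

# Every method sees the same weak signal but has its own independent noise.
method_scores = [
    [SIGNAL * label + rng.gauss(0.0, NOISE_SD) for label in labels]
    for _ in range(N_METHODS)
]

# Simplest possible aggregation: average the methods' scores for each drug.
combined = [sum(per_drug) / N_METHODS for per_drug in zip(*method_scores)]

individual_aucs = [auc(scores, labels) for scores in method_scores]
combined_auc = auc(combined, labels)

print("best single method AUC:", round(max(individual_aucs), 3))
print("combined AUC:          ", round(combined_auc, 3))
```

Because the noise in each simulated method is independent, it partly cancels when the scores are averaged, while the shared signal does not. Real data sources are rarely fully independent, so the gain in practice is usually smaller than in this toy setup.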

For another take on overlapping signals overcoming messy data, see this recent article on Stanford professor Purvesh Khatri.

Isn’t machine learning just a black box?

Sometimes all you need from a machine learning algorithm is an answer and you don’t care how it was calculated. For example, when looking for a movie to watch on Netflix, do you care how a recommendation was generated? Frankly, I don’t care how an algorithm finds it, I just want a movie I like. This isn’t the case when you’re working in drug discovery. When there are no insights into how data contributed to a drug prediction, we are missing essential information on how to translate that drug prediction into a medicine. Conversely, by providing the evidence supporting a prediction, we are generating testable hypotheses — the key to successful science.

The rationale for wanting to avoid a black box prediction in drug discovery is clear, but is it possible to do so? Absolutely! While some machine learning models are black boxes, there are plenty of statistical models that are much easier to interpret (e.g. logistic regression, decision trees, and many more).

At twoXAR, we improve interpretability through the layering of data and methods. Our algorithms connect drugs to disease in a transparent and comprehensible manner. For example, if a drug targets two proteins which are misexpressed in a disease, our algorithms may score that drug highly, but we are also able to see how much each protein contributed to that drug’s high score.
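As an illustration only, the two-protein example can be sketched in a few lines of Python. The scoring rule (a simple sum), the protein names, and the misexpression values below are all hypothetical and are not our actual algorithm; the point is just that an additive score can be reported together with the contribution of each piece of evidence.

```python
def score_drug(targets, misexpression):
    """Score a drug by summing the misexpression evidence for each of its targets.

    Returns the total score and a per-protein breakdown, so the prediction
    can be traced back to the evidence behind it.
    """
    contributions = {protein: misexpression.get(protein, 0.0) for protein in targets}
    total = sum(contributions.values())
    return total, contributions

# Hypothetical disease data: how strongly each protein is misexpressed.
misexpression = {"PROTEIN_A": 2.0, "PROTEIN_B": 0.5, "PROTEIN_C": 1.2}

# Hypothetical drug that targets two of those proteins.
total, contributions = score_drug(["PROTEIN_A", "PROTEIN_B"], misexpression)

for protein, value in contributions.items():
    print(f"{protein}: {value:.2f} ({value / total:.0%} of score)")
print(f"total score: {total:.2f}")
```

Here most of the score comes from PROTEIN_A, which immediately suggests a testable hypothesis about how the drug might act on the disease.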

Put another way, as data moves through our pipeline, it is aggregated by algorithms that are clear and interpretable. The transparency of this process gives insight into how a drug might be scored highly by one algorithm (e.g. one that uses chemical structure), but lower by another (e.g. one that uses genetic data). Therefore, in our final predictions, we report not only which drugs are most likely to treat a disease or symptom but also the aspects of each drug that led to the prediction.

So, data drives everything?

There can definitely be too much of a good thing, and in fact many have noted that being blindly data-driven can actually be detrimental (here and here for example). However, combining data-derived insights with good old-fashioned human expertise can prove to be the most powerful approach.

In drug discovery, which has historically been a human expertise-driven field, concern around over-reliance on machine learning is legitimate. We don’t want algorithms to only identify mimics of existing treatments, or to focus only on treating symptoms rather than the underlying ailment (e.g. a painkiller for rheumatoid arthritis). Instead, we want to be able to pragmatically fill specific needs in the treatment of a disease, such as developing a drug that acts through a different mechanism than existing treatments or that modifies the disease itself.

Accordingly, the twoXAR team has developed our technology in a way that can be tailored to fill specific needs. The key is that our algorithms augment expert knowledge rather than replace or ignore it. While we leverage machine learning as a tool to ingest and identify patterns from massive amounts of data (the “busy work”), we rely on our scientists to interpret machine-generated results and put forward subsequent hypotheses.

For example, “steering” of predictions can be done by tuning our individual algorithms (i.e. finding the optimal settings). Most machine learning methods have settings that can modify how they run and, thus, their outcome; ours are no different. This is analogous to identifying which in vivo assay best models a disease and then making sure that assay is working robustly. As we tune our algorithms, the results can be visualized and inspected at every step. This is why we always need a trained scientist working with the algorithms to interpret the results and make sure we are headed in the right direction: filling an unmet need.
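For readers who like to see this in code, below is a minimal sketch of what tuning a single setting can look like. Everything in it is an assumption made for illustration: the six drugs, their scores, the single `weight` setting that blends a chemical-structure score with a genetic score, and the yardstick (how many known treatments land in the top three) are all invented and do not reflect our real algorithms or data.

```python
# Hypothetical drugs: (name, structure_score, genetic_score, is_known_treatment)
DRUGS = [
    ("drug_A", 0.9, 0.2, True),
    ("drug_B", 0.8, 0.1, False),
    ("drug_C", 0.3, 0.9, True),
    ("drug_D", 0.2, 0.8, False),
    ("drug_E", 0.7, 0.7, True),
    ("drug_F", 0.1, 0.3, False),
]
TOP_K = 3

def rank_drugs(weight):
    """Blend the two scores with the given weight and rank drugs, best first."""
    blended = [
        (weight * structure + (1 - weight) * genetic, name, known)
        for name, structure, genetic, known in DRUGS
    ]
    return sorted(blended, reverse=True)

def known_in_top_k(weight):
    """Count known treatments among the top-ranked drugs for this setting."""
    return sum(known for _, _, known in rank_drugs(weight)[:TOP_K])

# Try each candidate setting and keep the result of every step for inspection.
history = []
for weight in [0.0, 0.25, 0.5, 0.75, 1.0]:
    hits = known_in_top_k(weight)
    top_names = [name for _, name, _ in rank_drugs(weight)[:TOP_K]]
    history.append({"weight": weight, "hits": hits, "top": top_names})
    print(f"weight={weight:.2f}  known treatments in top {TOP_K}: {hits}  {top_names}")

best = max(history, key=lambda step: step["hits"])
print("selected setting:", best["weight"])
```

The `history` list is the important part: it keeps every intermediate result so that a scientist can look at what each setting actually ranked highly, rather than accepting the "best" number at face value.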

Wrapping up

In today’s society, you don’t have to look far to understand the power of machine learning. Computers can rapidly generate statistical models that are simplified representations of real processes. These simplifications inherently introduce error into these models. However, that doesn’t mean such models are useless. Throughout the process of developing twoXAR’s drug discovery platform, we have maintained a healthy skepticism regarding the limitations of computerized models. This mindset has guided us to create a robust set of algorithms in a scalable software infrastructure that confronts common concerns of machine learning. This in turn enables our scientists to rapidly generate data-informed hypotheses for testing in the lab.