“Navigating Machine Learning” / Illustration by Riley Parsons

How machines are able to help you find a parking spot, a great place to stay, and the next medication you might take

by: Riley Parsons
Bioinformatics Intern,
twoXAR

These three accomplishments are all possible today because of machine learning.

Machine learning continues to disrupt markets and transform people’s everyday lives. Yet the public is far removed from the actual technology that drives these changes. To many, the idea of machine learning may elicit images of complex mathematical formulas and sentient robots. In fact, many of the general ideas behind machine learning are approachable to a wider audience. Understanding the basic concepts in this field can help to dispel common misconceptions and build intuition for what is happening “under the hood”.

Before dissecting the essential topics in this area of study, it is necessary to address the terminology that surrounds it. An amalgam of buzzwords and marketing terms (think artificial intelligence, neural networks, and deep learning) surrounds the subject of machine learning. These words tend to obfuscate the actual science and limit wider understanding. Although terms like “artificial intelligence” are frequently conflated with “machine learning”, consistent language will be used throughout this piece. “Machine learning” is a vivid descriptor of the fundamental processes behind this interdisciplinary field and thus will be used to describe it.

Foundations

The idea of machine learning isn’t brand new. In fact, the term was coined in 1959 by Arthur Samuel, an IBM researcher who later became a Stanford professor. Samuel is said to have described it as “a field of study that gives computers the ability to learn without being explicitly programmed”.

Machine learning algorithms solve optimization problems. Specifically, they attempt to identify a “statistical model” that best describes a set of data. Machines “learn” the best model by repeatedly iterating on a previous result. Perhaps the best known statistical model is linear regression, or the creation of a “line of best fit”. While optimizing a linear regression model in two dimensions (e.g. blood pressure vs. age) is trivial, creating a model for data that contains more variables requires more computational power. As Moore’s Law continues to hold true, computers have become faster at solving these optimization problems. As the efficiency of machine learning increases, it will continue to penetrate industries and disrupt traditional paradigms.
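The “repeated iteration” described above can be sketched in a few lines of Python. This is only a minimal illustration: it assumes gradient descent on mean squared error as the optimization method (the article does not name one), and the data points, the `fit_line` function, and its `learning_rate` and `steps` parameters are all hypothetical.

```python
# Hypothetical toy data: age (years) vs. systolic blood pressure (mmHg).
ages = [25, 35, 45, 55, 65]
pressures = [118, 122, 127, 133, 138]

def fit_line(xs, ys, learning_rate=0.1, steps=1000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    n = len(xs)
    # Center and scale x so that a single learning rate converges quickly.
    mean_x = sum(xs) / n
    scale = max(abs(x - mean_x) for x in xs)
    zs = [(x - mean_x) / scale for x in xs]

    slope, intercept = 0.0, 0.0  # start from an arbitrary guess
    for _ in range(steps):
        # Each pass improves on the previous result.
        errors = [slope * z + intercept - y for z, y in zip(zs, ys)]
        grad_slope = 2 / n * sum(e * z for e, z in zip(errors, zs))
        grad_intercept = 2 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept

    # Convert the fitted line back to the original units of x.
    real_slope = slope / scale
    real_intercept = intercept - real_slope * mean_x
    return real_slope, real_intercept

slope, intercept = fit_line(ages, pressures)
print(f"pressure = {slope:.2f} * age + {intercept:.2f}")
```

With only two variables the loop settles on the line of best fit almost immediately; the same idea applied to many variables is where the extra computational power is needed.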

Data

The pipeline of data from research databases, governments, and connected devices has driven growth in machine learning during the 21st century. Machine learning algorithms live and die by the data they consume. Because of that, special care goes into managing this data from which algorithms learn. Data is extracted, transformed, and loaded into a standardized and predictable structure or “schema” before it is entered into an algorithm.

The University of California, Irvine hosts many data sets in its Machine Learning Repository. One dataset comprises information on diabetic patients admitted to 130 US hospitals over 10 years. The data contains rows of patient data, with each column representing a “feature” or attribute of the patient. These features include race, gender, age, and time spent in the hospital. The data also contains information on the medications the patient was taking at the time of admission. This information can be used as a “target” for a machine learning algorithm. The resulting statistical model would attempt to predict which drug a diabetic patient is likely to be prescribed based on other features.

Although out of the scope of this piece, it is important to note that there is a subcategory of “unsupervised” machine learning algorithms. This piece has a good description of the difference between supervised and unsupervised machine learning in its second paragraph.

Once data is in a structure that algorithms understand, there is one more thing to do: the creation of “training” and “testing” datasets. This means some of the available data is used to fit the model and the rest is reserved to test the model’s performance. This test will determine whether a model is predictive when exposed to “real-world” data.
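As a rough sketch of that split, the Python below shuffles a set of records and holds some of them back for testing. The patient records, the `train_test_split` helper, and the 80/20 ratio are assumptions made for illustration, not details of any particular dataset or library.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold out a fraction of them for testing."""
    shuffled = rows[:]                      # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records: (age, days_in_hospital, was_prescribed_drug)
patients = [(30 + i, 1 + i % 7, i % 2 == 0) for i in range(20)]

train_rows, test_rows = train_test_split(patients)
print(len(train_rows), "rows for fitting,", len(test_rows), "rows held out")
```

Shuffling before splitting matters: if the records were sorted by age or admission date, the held-out rows would not resemble the rows the model was fit on.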

Fine-tuning an Algorithm

A machine learning algorithm must be appropriate for the problem it attempts to solve. Should the fitted model classify data into discrete groups or return a continuous number? The decision between a classification and regression model is fundamental in machine learning and statistics. A classification problem attempts to categorize a sample, based on its features, into a specific group (e.g. patients that take a certain drug versus those that do not). Regression algorithms use a sample’s features to return a continuous numeric value (e.g. the percentage chance that a patient will be prescribed a certain drug).
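The difference between the two kinds of output can be seen by running both on the same features. The sketch below assumes a k-nearest-neighbors approach, chosen only because it is short; the records, feature choices, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical records: ((age, number_of_medications), takes_drug, days_in_hospital)
records = [
    ((34, 2), "no", 2.0),
    ((41, 3), "no", 3.0),
    ((58, 6), "yes", 6.0),
    ((63, 7), "yes", 7.0),
    ((70, 9), "yes", 9.0),
]

def nearest(features, k=3):
    """Return the k records whose features are closest to the given ones."""
    def squared_distance(record):
        return sum((a - b) ** 2 for a, b in zip(record[0], features))
    return sorted(records, key=squared_distance)[:k]

def classify(features, k=3):
    """Classification: return the most common label among the neighbors."""
    votes = Counter(label for _, label, _ in nearest(features, k))
    return votes.most_common(1)[0][0]

def regress(features, k=3):
    """Regression: return the average numeric value among the neighbors."""
    neighbors = nearest(features, k)
    return sum(value for _, _, value in neighbors) / len(neighbors)

new_patient = (60, 6)
print(classify(new_patient))  # a discrete group: "yes" or "no"
print(regress(new_patient))   # a continuous number of days
```

The features and the neighbors are identical in both calls; only the kind of answer requested changes.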

Machine learning algorithms cannot simply be fit on training data and expected to produce reliable predictions “out of the box”. Further input is needed to control the algorithm’s approach to fitting the model. Fine-tuning the settings, or “hyperparameters”, of an algorithm can alter the model’s predictive power. For example, these settings might control how many features in the training set are included as predictors in the final statistical model. It is important that the selected set of hyperparameters leads to the most predictive model based on the test set of data.
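To make the idea of a hyperparameter concrete, here is a minimal sketch that tries several values of one setting and keeps whichever predicts held-out data best. The setting used here, a ridge-style penalty that shrinks the fitted slope, is an assumption picked for brevity, and the data, candidate values, and function names are hypothetical.

```python
# Hypothetical data: the training points happen to exaggerate the true trend.
train = [(1, 2.5), (2, 4.6), (3, 6.4), (4, 9.3)]
held_out = [(5, 10.2), (6, 11.9)]

def fit_slope(pairs, penalty):
    """Fit y = w * x while penalizing large w (ridge regression, no intercept)."""
    sum_xy = sum(x * y for x, y in pairs)
    sum_xx = sum(x * x for x, _ in pairs)
    return sum_xy / (sum_xx + penalty)

def mean_squared_error(w, pairs):
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)

# The penalty is a hyperparameter: it is chosen by us, not learned from the data.
candidates = [0.0, 1.0, 3.0, 10.0, 100.0]
errors = {
    penalty: mean_squared_error(fit_slope(train, penalty), held_out)
    for penalty in candidates
}
best_penalty = min(errors, key=errors.get)

for penalty in candidates:
    print(f"penalty={penalty:>5}: held-out error={errors[penalty]:.3f}")
print("selected penalty:", best_penalty)
```

Too small a penalty lets the model follow the quirks of the training points, while too large a penalty flattens the trend entirely; the held-out error is what arbitrates between them.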

Prediction

It is easy to fit a machine learner on training data and then generate accurate predictions using the features from the same data as input. This might create some sense of accomplishment, but it says nothing about the generalizability of the model. The model establishes its predictive power on the held-out set of test data. The ultimate goal is to develop a statistical model that is predictive even when presented with an unseen set of features.

The transition across the “train-test split” is like translating what you learn in the classroom into accuracy on a closed-note exam. As a machine learner fits a model on the training data, it runs the risk of “overfitting”. When a model is too sensitive to, or “memorizes”, random noise and idiosyncrasies in training data, it is less predictive on test data. Just like an overfit model, the student who only memorizes answers to practice problems will not pass the exam.
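The student analogy can be sketched directly. In the hypothetical example below, a “memorizer” simply repeats the answer of the closest training example, while a second model fits a straight line. The data is invented, and the test points are kept noise-free purely to keep the comparison simple.

```python
# Hypothetical training data: the true trend is y = 2x, plus alternating noise.
train = [(1, 3.5), (2, 2.5), (3, 7.5), (4, 6.5), (5, 11.5), (6, 10.5)]
# Hypothetical unseen data that follows the true trend.
test = [(1.4, 2.8), (2.6, 5.2), (3.4, 6.8), (4.6, 9.2)]

def memorizer(x):
    """Return the answer of the closest training example, noise included."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def fit_line(pairs):
    """Least-squares line of best fit, returned as a prediction function."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    variance = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_squared_error(model, pairs):
    return sum((model(x) - y) ** 2 for x, y in pairs) / len(pairs)

line = fit_line(train)
for name, model in [("memorizer", memorizer), ("line", line)]:
    train_error = mean_squared_error(model, train)
    test_error = mean_squared_error(model, test)
    print(f"{name}: train error={train_error:.3f}, test error={test_error:.3f}")
```

The memorizer scores perfectly on the practice problems and worse than the line on the exam, which is the signature of an overfit model.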

Extracting Value From Data

Machine learning is a broad and complex topic that borrows from many disciplines. Discussions in the media and public domain have helped to build a kind of mythology around the field.

With an enormous dataset of location specific driving behavior and mapped cities, how can Google predict the parking difficulty in a specific neighborhood?

Airbnb stores massive amounts of data on past bookings, their users, and available properties. How can the team create tailored search results that yield higher booking rates?

There is an immense amount of publicly and privately available biomedical data encompassing protein-drug interactions, pharmaceutical attributes, and clinical records. How can twoXAR leverage these discrete data sets to predict novel drug candidates for any given disease?

Through an understanding of the fundamental theories behind machine learning, it is clear how data scientists can address these complex problems with optimized predictive models.