When data shifts: Getting your ducks in a row

Machine learning (ML) is all the rage nowadays, with people captivated by its ability to make sense of voluminous amounts of data on its own. But the outputs of ML, and the degree to which it makes sense of data, depend to a large extent on how well we can manage the variety and changing nature of the data itself.

Often, for example, someone will have access to two related but distinct datasets. A farmer may have agricultural sensor datasets from farms in two different locations. A biologist may have datasets for the same experiment performed at different lab locations, which may use different equipment. A medical researcher may have health datasets from two separate hospitals. And a company may accumulate sales data before and during a promotional event.

The ML practitioner may want to merge these datasets into one dataset for combined analysis. For example, given the cost of lab experiments, a biotechnology researcher could leverage as much data as possible in an analysis by merging datasets from multiple labs into one joint dataset.

A data scientist also can “translate” one dataset to make it resemble another — providing a way to hypothesize about what a dataset would look like if it were generated under the other circumstances. For instance, the data scientist might ask: What would the plants at Farm A look like if they had been planted at Farm B?

We may also want to transfer knowledge from one dataset to another when one dataset contains extra information that would be useful in the other. As an example, a medical researcher may have particular diagnosis information in the dataset from one hospital that has not been captured in another hospital’s dataset.

ML engines learn and adapt on their own, analyzing patterns of data to draw conclusions. The goal is to predict the output, given the input. Those predictions can be inaccurate if the input data at model deployment has shifted even slightly from the dataset used to train the ML model, which compromises the model’s performance.

In the knowledge-transfer case, an ML model can first be trained to predict, say, a medical diagnosis on the first dataset; then, if the second dataset is translated to look like the first dataset, the ML model can be used to predict the diagnosis on the second dataset. More generally, this method can be applied whenever the new data flowing into the model is slightly different from the dataset that was used to train the model, a situation known as “dataset shift.”
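To make the train-translate-predict idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not our group's method: the "translation" is a simple per-feature match of mean and standard deviation, the "model" is a toy threshold rule, and the hospital data, function names and numbers are all hypothetical. Mean and standard deviation matching also assumes the two datasets contain a similar mix of cases.

```python
from statistics import mean, stdev

def fit_stats(rows):
    """Per-feature mean and standard deviation of a list of feature rows."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [stdev(c) for c in cols]

def translate(rows, src_stats, dst_stats):
    """Make source rows resemble the destination dataset by matching each
    feature's mean and standard deviation (a simple stand-in for translation)."""
    (src_mu, src_sd), (dst_mu, dst_sd) = src_stats, dst_stats
    return [
        [(x - sm) / ss * ds + dm
         for x, sm, ss, dm, ds in zip(row, src_mu, src_sd, dst_mu, dst_sd)]
        for row in rows
    ]

def train_threshold(rows, labels):
    """Toy model: a cutoff on feature 0, midway between the two class means."""
    pos = mean(r[0] for r, y in zip(rows, labels) if y == 1)
    neg = mean(r[0] for r, y in zip(rows, labels) if y == 0)
    return (pos + neg) / 2

def predict(threshold, rows):
    return [1 if r[0] > threshold else 0 for r in rows]

def accuracy(pred, labels):
    return sum(p == y for p, y in zip(pred, labels)) / len(labels)

# Hypothetical data: Hospital A has diagnosis labels for training.
hospital_a = [[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]]
labels_a = [0, 0, 0, 1, 1, 1]

# Hospital B records the same measurement on a shifted, rescaled instrument.
hospital_b = [[2 * row[0] + 50] for row in hospital_a]
labels_b = [0, 0, 0, 1, 1, 1]  # assumed known here only to score the result

threshold = train_threshold(hospital_a, labels_a)

# Applying the model directly to the shifted data.
acc_raw = accuracy(predict(threshold, hospital_b), labels_b)

# Translating Hospital B to look like Hospital A first, then predicting.
b_as_a = translate(hospital_b, fit_stats(hospital_b), fit_stats(hospital_a))
acc_translated = accuracy(predict(threshold, b_as_a), labels_b)

print(f"accuracy without translation: {acc_raw:.2f}")
print(f"accuracy after translation:   {acc_translated:.2f}")
```

In this toy setup the model trained on Hospital A labels every Hospital B patient as positive until the data is translated, which is the failure that dataset shift produces. Real datasets rarely differ by a clean shift and rescale, so a real translation step would need to be considerably more flexible.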

Our research group has been exploring new approaches to traditional ML methods to solve these types of dataset shift problems, so ML can “reason” in a probabilistic way in many real-world applications. The elegance of the algorithms and the processing power of the platform engine are not the only things that matter. “Wrangling” and “harmonization” of data also are vital to ML’s ability to unlock the gargantuan datasets of today’s connected world, in order to understand and predict outcomes in healthcare, agriculture, biology and other domains.

Results of our work will help machine learning evolve from “static ML,” in which one model is trained on one dataset and fails on other datasets, to “dynamic, adaptable ML,” which adapts to the ever-changing landscape of real-world input data, earning its stripes as the smart engine of innovation.

David I. Inouye, PhD
Assistant Professor, School of Electrical and Computer Engineering
Purdue Engineering Initiative in Data and Engineering Applications
College of Engineering, Purdue University
