The New School: How Machine Learning Levels Up from Traditional Modelling

Rufaro Samanga · Published in Palindrome Data · Feb 23, 2022 · 5 min read

Image: Getty Images/iStockphoto

The value of statistical models cannot be overstated. The ability to predict specific health outcomes with a certain level of confidence is invaluable in public health. From informing better resource allocation, to developing relevant programmatic interventions, to ensuring patients receive the differentiated care they require, models are at the heart of data analysis.

Now, depending on whether you ask a traditional statistician or a data scientist how to go about building a model, you may get some contrasting views. While they may both agree on the popular aphorism “all models are wrong, but some are useful,” they will often differ on the correct practical steps for delivering a final predictive model. And it is in better understanding some of these contrasting approaches that the so-called “old school” and “new school” of modelling can be bridged rather than remaining on opposite sides of the fence.

In this article, we’ll compare aspects of traditional statistical modelling to the capabilities of machine learning. We’ll also explore how machine learning techniques can be harnessed to build more useful models.

Model building fundamentals

A model, by definition, is a mathematical expression built from a set of explanatory variables (‘features’) of interest which, given certain statistical assumptions and parameters, is used to predict a specific outcome. For example, a model predicting viral load in HIV+ patients may have several explanatory variables of interest, including age, sex, adherence to treatment, and duration of treatment regimen. These variables, or inputs, are then used to predict a specific outcome or output, in this case viral load.
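To make this concrete, here is a minimal Python sketch, with made-up values and hypothetical column names (not from the original example), of a model mapping those explanatory variables to a predicted viral load:

```python
# A minimal sketch: hypothetical column names and made-up values, for illustration only.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Explanatory variables ('features') for a handful of fictional patients.
features = pd.DataFrame({
    "age": [34, 51, 27, 45],
    "sex": [0, 1, 1, 0],                   # encoded: 0 = male, 1 = female
    "adherence_pct": [95, 60, 80, 99],     # adherence to treatment
    "months_on_regimen": [12, 6, 24, 36],  # duration of treatment regimen
})
viral_load = [50, 12000, 400, 20]          # the outcome we want to predict (copies/mL)

# The fitted model is a mathematical expression:
# predicted viral load ~ intercept + sum(coefficient * feature)
model = LinearRegression().fit(features, viral_load)
print(model.intercept_, model.coef_)
```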

In traditional modelling, there are two main ways to go about building a model. The first approach uses input data (explanatory variables of interest), along with statistical assumptions about the data and specific calculations, to determine which algorithm best describes the given data. The second approach uses a specific set of explicit rules, for example “if the input is X then I expect Y”, to predict a specific outcome from the explanatory variables. Instead of relying on statistical calculations, the expert knowledge of, say, public health or medical professionals is used to inform these rules and thereby determine the prediction algorithm.
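As a rough illustration of these two routes (toy data and hypothetical rules, purely for the sake of the example), the sketch below contrasts a statistically fitted algorithm with an expert-rule-based one:

```python
# Toy data and hypothetical rules, for illustration only.
import numpy as np

# Approach 1: statistical calculations determine the algorithm that best describes the data.
# Here, ordinary least squares estimates the coefficients of a linear model.
X = np.array([[34, 95], [51, 60], [27, 80], [45, 99]], dtype=float)  # age, adherence %
y = np.array([50, 12000, 400, 20], dtype=float)                      # observed viral load
X_design = np.column_stack([np.ones(len(X)), X])                     # add an intercept term
coefficients, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Approach 2: explicit, expert-informed rules ("if the input is X then I expect Y").
def expert_rule_prediction(age, adherence_pct):
    """Hypothetical clinician-defined rules standing in for the prediction algorithm."""
    if adherence_pct < 70:
        return "elevated viral load expected"
    if age < 25 and adherence_pct < 85:
        return "elevated viral load expected"
    return "suppressed viral load expected"

print(coefficients)
print(expert_rule_prediction(age=51, adherence_pct=60))
```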

In contrast, the machine learning approach to developing a model looks a little different. There are two primary steps involved in building a model. The first is training the model on a ‘training’ dataset, which contains inputs, the known outputs associated with those inputs, and a training algorithm; this process, which involves hyperparameter tuning, is iterative, and its eventual result is a prediction algorithm. The second is testing: once the model has been trained and evaluated on held-out data, it can use the prediction algorithm to transform new, unseen input data into predicted outputs.
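A minimal sketch of that workflow, assuming scikit-learn and a synthetic stand-in for patient data (the classifier and hyperparameter grid are illustrative choices, not a prescription):

```python
# A sketch of the train / tune / predict cycle, using a synthetic stand-in for patient data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Inputs and known outputs (e.g. whether a treatment interruption occurred).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Training with hyperparameter tuning: an iterative search over candidate settings.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, None]},
    cv=5,
)
search.fit(X_train, y_train)

# The result is a prediction algorithm that turns new, unseen inputs into predicted outputs.
print("held-out accuracy:", search.score(X_test, y_test))
predictions = search.predict(X_test)
```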

Variable selection in models

How a set of inputs (or explanatory variables) is selected is another aspect on which traditional modelling and machine learning can be compared. In traditional modelling, which remains at the heart of statistical analysis in the academic research setting, previous literature is often relied on to determine which explanatory variables should ideally be present in the model. These could be variables of interest, potential confounders, or interaction terms.

Say, for instance, we’re looking to predict which HIV patients are at risk of experiencing an interruption in their antiretroviral treatment and care. Using traditional modelling, and depending on the kind of data at our disposal, a number of variables will go into the model, such as age, sex, treatment regimen duration, and the patient’s geographic location, but not necessarily all of those that were in the initial dataset. An array of variable selection methods, whether manual (conducted by the analyst themselves) or automated (various backward and forward selection approaches), is used to determine the important variables for a model. While we may start off with a dataset of hundreds of variables, these may ultimately be reduced to just 20.
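As a hedged sketch of what automated selection can look like in practice, the example below runs forward selection with scikit-learn’s SequentialFeatureSelector on a synthetic dataset, reducing 100 candidate variables to 20:

```python
# Forward selection on a synthetic dataset: many candidate variables, few retained.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# 100 candidate variables, only a handful of which are truly informative.
X, y = make_classification(n_samples=400, n_features=100, n_informative=8, random_state=1)

# Start with no variables and greedily add whichever most improves cross-validated
# performance, stopping at 20 features (echoing the example above).
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=20,
    direction="forward",
    cv=3,
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```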

Automated feature selection is often criticised because variables with clinical relevance may be “kicked out” of the model simply because they are not statistically significant. For instance, it may be well known in the medical community that patient viral load (VL) is an important variable in any model that attempts to accurately predict interruption in treatment. However, the nature of the data may be such that VL is omitted from the model because of a p-value greater than 0.05, even though the model would prove more useful had that particular variable been retained. It is for this reason that traditional model building often forces variables with clinical relevance into the model despite a lack of statistical significance.

In machine learning, however, feature construction techniques may also be used to create new variables: for example, using a standard date variable to derive variables for season, day of the week, or month. During or after model training, data scientists may use one of several techniques to identify the most important input features by calculating a ‘feature importance’ score, which may help to reduce dimensionality and improve model performance. These techniques include correlation coefficients, decision trees, and permutation importance scores.
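Below is a minimal sketch, on synthetic data with hypothetical column names, of both ideas: deriving new features from a date column and ranking features by permutation importance:

```python
# Synthetic data and hypothetical column names, for illustration only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Feature construction: derive month and day-of-week from a raw visit-date column.
visits = pd.DataFrame({"visit_date": pd.date_range("2021-01-01", periods=200, freq="D")})
visits["month"] = visits["visit_date"].dt.month
visits["day_of_week"] = visits["visit_date"].dt.dayofweek

# Assemble a toy feature table and a toy outcome (interruption in treatment: yes/no),
# derived here from adherence purely so the example runs end to end.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "adherence_pct": rng.uniform(40, 100, size=200),
    "month": visits["month"],
    "day_of_week": visits["day_of_week"],
})
y = (X["adherence_pct"] < 70).astype(int)

# Feature importance: measure how much shuffling each feature degrades performance.
model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.3f}")
```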

Admittedly, in spite of some pretty sophisticated statistical software and their respective packages, selecting variables as inputs for models in the traditional sense is often a little clumsy and more of an “art” than a science. In light of this, machine learning presents a number of opportunities to level up: developing predictive models that are more accurate, less reliant on the subjective judgments of the individual building the model, and built through a process that is potentially far more reproducible.

Traditional modelling’s reliance on univariate and bivariate statistics is also a hindrance compared with machine learning’s capacity to capture the compound effect of multiple, interacting variables, which is especially relevant in the biological sciences.

Pushing back against resistance

While machine learning has amassed considerable popularity and acceptance in the finance and tech space, there is still a level of resistance with regard to its application in public health and the general medical community. This may be because machine learning is often perceived as an “impersonal” process in fields that are admittedly “personal” by virtue of the fact that they deal with people and their wellbeing, and often intimately so. But this doesn’t have to be the case, because at the heart of the questions that machine learning is attempting to answer are people. Additionally, data scientists in these fields are not only aiming to improve predictive power but also to optimise the explainability, interpretability, and fairness of the model (towards different demographics and subpopulations) and to minimise its bias.

Rather than seeing the world of predictive analytics as old school versus new school or traditional modelling versus machine learning, how can the two worlds be a lot more collaborative? How can we leverage the power and capabilities of machine learning to level up aspects of traditional modelling that don’t work as well as we might like?


Rufaro Samanga · Palindrome Data

Rufaro holds a Master’s degree in epidemiology and biostatistics from the University of the Witwatersrand. She is also a culture and gender-politics journalist.