5 differences between Machine Learning and Statistical Modeling

The role of AI in scientific research

Richard Nagyfi
Cursor Insight
10 min read · Sep 17, 2019


Although Machine Learning is rooted in statistics (there are models based purely on Bayesian inference; the Naive Bayes classifier, for example, was widely used for e-mail spam filtering in the late 90s), these models were created to tackle problems different from those of statistical modeling. They were created as a means of processing large amounts of data from unknown populations, without much expensive human intervention.
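
As a concrete illustration of that spam-filter example, here is a minimal Naive Bayes sketch using scikit-learn; the toy messages and labels below are invented purely for illustration.

```python
# A minimal sketch of a Naive Bayes spam filter with scikit-learn.
# The example messages and labels are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now",               # spam
    "limited offer, click here",          # spam
    "meeting rescheduled to monday",      # ham
    "please review the attached report",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Turn raw text into word-count features, then fit the classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB()
model.fit(X, labels)

# Classify a new, unseen message.
print(model.predict(vectorizer.transform(["claim your free prize"])))
```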

Sadly, there are a lot of unfulfilled promises about what AI can actually do, and there is a general hostility towards Machine Learning models in many disciplines (especially in health care). This lengthy article aims to remove some of the confusion and stigma surrounding the technology by highlighting the main differences between Machine Learning and Statistical Modeling, and by explaining both the actual limitations and the strengths of the former.

The use of “AI” or “Machine Learning” can be misleading in the sense that there is no single algorithm that solves everything or can be used to tackle all kinds of problems. All models have their strengths, weaknesses and niche use cases where they excel, depending on the resources and data available. There is no single model that can always be a step in a research process. In this article, the comparisons assume Machine Learning models that best fit the related problem.

1. Different goals

Statistical Modeling and Machine Learning achieve different goals and are responses to different needs. Statistical tests were developed to understand whether differences in sample distributions are likely to be caused by chance or carry significance. They are used to decide whether a hypothesis actually holds given the evidence. Data is collected through carefully designed experiments, so the effect of certain events can be independently measured. Machine Learning models, on the other hand, handle large but biased samples of unknown populations that can even change over time. Sample collection is relatively cheap, or can even be a byproduct of other processes. The goal is to create models that “learn” from training samples to generalize well to real-world examples (the actual population). This is achieved by finding patterns in the samples and learning relationships between multiple variables. Machine Learning models therefore do not require an a priori hypothesis, and can give answers to unasked questions that previously seemed irrelevant or insignificant (like whether people eat more Pop-Tarts during hurricanes). Despite its volume, the data available for modeling is usually only tangentially related to the research questions, and needs to be transformed before it is suitable for modeling. The limits of experiment design are compensated for by getting even more data from different sources.
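
To make the contrast tangible, here is a small, hypothetical sketch of the two workflows on synthetic data: the t-test answers a pre-stated question about two groups, while the classifier is judged only by how well it predicts held-out examples. All numbers here are generated, not real measurements.

```python
# Synthetic illustration of the two workflows described above.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Statistical modeling: test a specific hypothesis about two groups.
control = rng.normal(loc=0.0, scale=1.0, size=50)
treatment = rng.normal(loc=0.4, scale=1.0, size=50)
t_stat, p_value = ttest_ind(control, treatment)
print(f"p-value for the group difference: {p_value:.3f}")

# Machine learning: no hypothesis up front, only predictive performance
# on data the model has never seen.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```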

2. Amounts of data points and features available

The amount of data available for Data Science tasks is usually many magnitudes larger than the sample sizes used in most research. More data is usually better; however, most Machine Learning models plateau in performance after a while. The number of input variables (features) is also higher, and even non-linear relationships can be modeled between them. More features are usually better, but even Machine Learning models suffer from the curse of dimensionality, which means that simply adding more and more features will eventually degrade performance, as the chance of finding patterns in unrelated events (which are basically just conjunctions of random noise) increases. It is a myth that Data Scientists just blindly add more raw features to models to make them automatically better.
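
A rough, synthetic demonstration of that effect: padding a small dataset with pure-noise features eventually drags down cross-validated accuracy. The dataset size and feature counts below are arbitrary choices for illustration.

```python
# Synthetic demonstration: adding pure-noise features eventually hurts.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 200
informative = rng.normal(size=(n, 5))
y = (informative.sum(axis=1) > 0).astype(int)

for n_noise in [0, 50, 500]:
    noise = rng.normal(size=(n, n_noise))
    X = np.hstack([informative, noise])
    score = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
    print(f"{n_noise:4d} noise features -> CV accuracy {score:.2f}")
```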

Statistical tests were developed to handle samples, not populations, and with larger sample sizes even small effects tend to come out as significant (this does not mean that statistical approaches do not work on larger samples). Most statistical tests assume a normal distribution as a result of the central limit theorem, while Machine Learning models can work with all kinds of distributions. Some models can also work with smaller (n<100) sample sizes. There are even models that can function with a lot of missing data, or be trained for rare events with only a few examples among a large population. The latter is called anomaly detection (for example, credit card fraud detection), which signals whenever some chain of events seems out of the ordinary, detecting events so rare that they would probably be discarded as outliers by statistical methods.
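
As a hypothetical example of anomaly detection, the sketch below flags a handful of extreme “transactions” among many ordinary ones with an Isolation Forest; real fraud detection would of course use far richer features than a single amount.

```python
# A minimal anomaly-detection sketch with an Isolation Forest.
# The "transactions" are synthetic, single-feature amounts.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
normal_amounts = rng.normal(loc=50, scale=10, size=(995, 1))            # typical purchases
fraudulent = np.array([[900.0], [1200.0], [850.0], [1000.0], [950.0]])  # rare, extreme ones
X = np.vstack([normal_amounts, fraudulent])

detector = IsolationForest(contamination=0.005, random_state=0).fit(X)
flags = detector.predict(X)          # -1 marks a suspected anomaly
print("flagged transactions:", X[flags == -1].ravel())
```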

A lot of a Data Scientist’s work is wrangling data: taking the available data and cleaning it, combining multiple features, filling in missing values, removing noise, creating more complex features out of existing ones and transforming them into formats that suit the selected Machine Learning models. This can take up to 80% of the work and requires some domain knowledge of the field the data originates from in order to do it successfully. Each transformation alters the outcome, so interpretation is not totally objective. Data Scientists have to be careful not to accidentally add their own biases when transforming the data. It is a myth that raw data is totally objective, as even sensors produce some noise instead of measuring true values. Understanding behavior through digital footprints can be more reliable than Likert-scale responses, but they still aren’t totally objective, due to unexpected errors during data collection.
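
A tiny, made-up wrangling step with pandas, showing the kind of imputation, derived features and encoding described above; every column name and value here is invented.

```python
# A small, hypothetical wrangling step with pandas: fill missing values,
# derive a new feature, and one-hot encode a categorical column.
import pandas as pd

raw = pd.DataFrame({
    "age": [34, None, 29, 41],
    "income": [52000, 61000, None, 73000],
    "country": ["HU", "DE", "HU", "AT"],
})

clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())         # impute missing ages
clean["income"] = clean["income"].fillna(clean["income"].mean())  # impute missing income
clean["income_per_year_of_age"] = clean["income"] / clean["age"]  # derived feature
clean = pd.get_dummies(clean, columns=["country"])                # one-hot encoding
print(clean)
```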

3. Different evaluation methods

For Machine Learning tasks, it is possible to measure error rates between the desired and received outcomes, or count the number of successful classifications on a test dataset and calculate its accuracy. However, since the real population is unknown, all evaluation scores are just estimates of how well the model should do on future data. Therefore, a high accuracy score alone can be misleading, especially if the number of examples in each class is unbalanced (if there are 99 negative examples and 1 positive example in the sample, a model that simply returns “negative” all the time reaches 99% accuracy). Having a “perfect”-looking model is rarely the goal of a Data Scientist, as more complex models can simply memorize the dataset instead of learning to generalize from its underlying patterns and being useful in the long term. Balancing between overfitting and underfitting the data requires skill and a lot of tinkering.
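
The 99-to-1 example from the paragraph above, written out: a “model” that always answers “negative” scores 99% accuracy while never finding the single positive case.

```python
# Always predicting "negative" looks accurate but misses every positive case.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0] * 99 + [1]   # 99 negatives, 1 positive
y_pred = [0] * 100        # a "model" that always says negative

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.99
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
```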

Even when more complex models yield better accuracy scores, they are often discarded in production, as they can be too expensive to implement and maintain. When selecting a Machine Learning model, engineers must also consider the time and resources needed to train and to run it: some models train quickly but take longer to make predictions. Others, like Neural Networks, take a lot of time to train but can give answers almost immediately afterwards. Therefore, there is no single number (like p-values or effect sizes) that can capture the usefulness of a Machine Learning model. There are several evaluation scores instead, which need to be interpreted depending on context. The field is more empirical, and models in production are constantly updated when new information becomes available, based on their previous successes. Data Science is more about results than about finding significant relationships, especially in business environments, where being able to make decisions quickly is sometimes more important than being totally right.
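
A rough timing sketch of that trade-off, assuming scikit-learn’s k-nearest neighbours (quick to “train”, slower to predict) and a small neural network (slower to train, quick to predict); the absolute numbers depend entirely on hardware and data size.

```python
# Rough timing sketch: some models train fast but predict slowly (k-NN),
# others train slowly but predict fast (a small neural network).
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 20))
y = (X[:, 0] > 0).astype(int)

for model in [KNeighborsClassifier(), MLPClassifier(max_iter=200, random_state=0)]:
    t0 = time.perf_counter()
    model.fit(X, y)
    t1 = time.perf_counter()
    model.predict(X)
    t2 = time.perf_counter()
    print(f"{type(model).__name__:>22}: fit {t1 - t0:.2f}s, predict {t2 - t1:.2f}s")
```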

4. Explainability and interpretability

Linear regressions are easy to understand and to explain, but they are oversimplifications for modeling cause and effect. Sure, the more gasoline I have in my car, the more distance I can proportionally travel, but there are extreme cases, when my tank starts leaking or my car falls off a cliff, resulting in unexpected traveled distances. These events would most likely be discarded as outliers in a dataset, allowing more robust, generalized models to be created, while also ignoring their existence and possible importance. The regression line also extrapolates to negative values, meaning that having a less-than-empty tank would allow me to drive backwards. Attributing cause and effect to just one or two variables can be misleading. Taking more of a certain medication might make everyone with a certain condition feel better, except for a few patients who are allergic to its components and get far worse when taking it. Can one honestly simplify this relationship to a positive correlation? What if the patients who will become allergic to the drug are not yet born, so there is no way they could have been included in the sample? Would the described positive effects still be valid after their birth? Please note that I am not arguing against the scientific method here, nor am I trying to discredit properly conducted research, but there are obvious limitations to how deeply linear models can explain the world, if everything we eat both causes and cures cancer at the same time.
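
The gasoline example in code: a linear fit on a few made-up fuel/distance pairs happily extrapolates to a “less than empty” tank and predicts a negative distance.

```python
# A linear fit extrapolates to nonsense outside the observed range.
import numpy as np
from sklearn.linear_model import LinearRegression

litres = np.array([[5], [10], [20], [40]])  # fuel in the tank (made-up values)
km = np.array([75, 150, 300, 600])          # distance travelled (made-up values)

line = LinearRegression().fit(litres, km)
print(line.predict([[-5]]))  # a "less than empty" tank predicts negative kilometres
```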

Unfortunately, there is a trade-off between modeling more complex relationships and being able to simply explain how these models work. In fact, we humans are terrible at explaining why we actually do things, or how we do them (could you explain how to ride a bicycle to someone who has never ridden one?). More complex models of the world will always be harder to interpret. Machine Learning models are mostly black boxes that can make decisions similar to humans’, but for totally different reasons. Even when it is possible to explain how a model comes to a conclusion, it is usually through some weighted combination of variables that makes little to no sense to humans. (Decision Trees are somewhat of an exception to this rule, but their default versions are seldom the best choice for a Machine Learning problem, as they easily overfit the data and learn its noise.) What most models can do instead is give insight into feature (input variable) importance. Instead of explaining the effect of a variable, its “relative usefulness” in the decision-making process can be understood.
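
A small sketch of what that looks like in practice, using a random forest on synthetic data where only the first two columns actually matter; the importances rank usefulness, they do not explain the effect of each variable.

```python
# Feature "relative usefulness" from a random forest on synthetic data:
# the informative columns get high importance, the noise columns get little.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
y = (2 * X[:, 0] + X[:, 1] > 0).astype(int)  # only the first two columns matter

forest = RandomForestClassifier(random_state=0).fit(X, y)
for name, importance in zip(["f0", "f1", "noise1", "noise2"], forest.feature_importances_):
    print(f"{name}: {importance:.2f}")
```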

So what’s the point of having black box systems that come up with decisions on their own through a combination of several variables, of which many might not even be closely related to the problem? Well, some properties are easier to measure than others, while some are harder or more expensive to observe. Users on streaming sites will not explicitly rate every single movie they have watched, but will implicitly give hints depending on how many times they have skipped, stopped or rewound certain scenes, and so on. It might not be possible to survey viewers from all over the world about their tastes and measure whether a five-star rating on one movie correlates with them enjoying another, similar title, but it is easy to tell whether or not clicking on a title multiple times is a good indicator of taste and could be useful for making recommendations.

5. Replication, peer review evaluation and documentation

Publicly available Data Science research excels in documentation and repeatability. State-of-the-art Neural Network models are freely available for download on GitHub with documentation and source code, allowing everyone with the expertise and decent hardware to tweak and alter the models even on their own PCs. Future experiments built upon these projects can be forked and pushed back to GitHub, allowing branching development of new ideas. Since the code is publicly available, anyone can review it and suggest or commit improvements. The projects can even be interactive and include code via Jupyter Notebooks, making it possible to logically separate both blocks of code and the steps of the research process. This means that the data cleaning process and feature selection are also transparent and easily reproducible with a few clicks, so practices like p-hacking are even less likely to remain unnoticed.

However, most Machine Learning models rely on random number generators when building models, splitting data or generating random distributions. This can result in slightly different outcomes each time they are run. To avoid this, random seeds can be fixed to ensure the same outcome across runs. Computer-generated random numbers are not actually random, they just seem random enough to use, and rely on an initial seed value (like the current time in milliseconds) to create the next series of random numbers. So as long as the initial seed is the same, the results will remain the same. This, however, does not work with distributed systems or when different kinds of computer architectures are used, as they might generate random numbers from seeds differently. This means that the very same types of Machine Learning models (even Neural Networks) could have slightly different outputs when trained on the very same data.
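
A minimal sketch of pinning the randomness on a single machine, assuming NumPy and scikit-learn; as the paragraph notes, this still does not guarantee bit-identical results across architectures or distributed setups.

```python
# Fixing seeds makes a run repeatable on the same machine and library versions,
# though not necessarily across architectures or distributed systems.
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier

random.seed(42)
np.random.seed(42)

X = np.random.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Passing random_state pins the model's own randomness as well.
model_a = RandomForestClassifier(random_state=42).fit(X, y)
model_b = RandomForestClassifier(random_state=42).fit(X, y)
print(np.array_equal(model_a.predict(X), model_b.predict(X)))  # True
```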

Summary

Machine Learning models generally lack transparency, can yield different results when trained multiple times on the very same data, will most likely learn some biases from the dataset, and need to be constantly maintained and updated with new data in order to remain useful. These are all problems that make them far less useful in scientific research. However, they are able to make use of datasets that are far too large for humans to comprehend. They are solutions created for problems where the relationships in the data are too complex for even humans to interpret. When all we have is a large amount of data and a well-defined outcome, the best answer to this problem humankind has managed to come up with so far is Machine Learning. As more data becomes available, these models will inevitably keep gaining ground in other areas as well.

Machine Learning has a lot to offer to research if done correctly. As simple linear models reach their limits in explaining the events around us, data-driven methods will most likely find their place in research. But instead of blindly relying on the outputs of black box models, it is safer to use a Machine Learning model’s results as a basis for further research, and to evaluate its outputs both empirically and statistically in subsequent, properly conducted experiments. Machine Learning should be another robust tool in a scientist’s toolkit, not a replacement for frequentist statistics or the scientific method. Relying on machines to help us make decisions and using their outputs in calculations is nothing new; it’s just that they weren’t called AI until now.


Richard Nagyfi
Cursor Insight

Data Science Researcher & PhD Student — Budapest, Hungary