Machine Learning — The Scientific Way

Ben Houghton
Published in Data & Waffles
6 min read · Jan 23, 2020

Machine Learning is often one of the top skills listed on a Data Scientist’s CV (it certainly is on mine). Some even view it as synonymous with Data Science, and for most Data Science teams it is a core element of the day-to-day.

A debate I often have with other Machine Learning practitioners is around how to build the set of features. There are generally two camps: the ‘throw in all possible features and let the algorithm do the work’ camp and the ‘build features based on hypotheses’ camp. In a world where auto-ML solutions like DataRobot and H2O are becoming the talk of the town, the ‘throw in everything’ method is becoming more and more popular, and it certainly accelerates how quickly machine learning models can be built. The hypothesis-driven method, being slower and often more frustrating, is increasingly neglected in a world where deadlines are often yesterday.

As a statistician first and foremost, my allegiance lies very much with the ‘hypothesis-first’ approach, and in this article I will explain why I believe its wide range of benefits outweighs its costs in speed and complexity. The definition of Data Science I generally use (see my first Medium post) has hypothesis-led analysis at its core, so taking this approach in the world of Machine Learning aligns directly with that thinking.

Building simpler models might be a good thing

It’s often tempting to think that building giant models with huge complexity and a lot of features is the way to build the most accurate and effective model, and auto-ML solutions certainly provide an efficient tool for doing this. However, for each super-complex model, there may well be a simpler one which is equally (or even more) accurate, has fewer variables and is easier to explain. The ocean-boiling approach to ML leads you away from finding these simpler gems in favour of heavy pieces of machinery.

In a world of GDPR and the heightened ML governance now found in most organisations, interpretability of Machine Learning models is becoming extremely important. Having few features in your model is a real advantage when you need to explain how an individual prediction came about.

Relationships can exist by chance

It’s easy to think that if we observe evidence of a relationship between two variables, then there must be one (whether through causation or some other factor). The truth is that relationships can also exist purely by chance, throwing off the results of models severely. This comic illustrates the argument below perfectly.

Let’s take an example: say we are running a linear regression of one variable, x, against a response variable, y (the simplest machine learning problem there is). We are hence looking for the following relationship:

y = ax + b + ε, where a and b are real coefficients and ε is a normally distributed error term.

After running the regression and calculating the estimates of the coefficients a and b, we might want to test whether x’s impact on y is “statistically significant”. This amounts to testing the null hypothesis “a is 0” against the alternative “a is non-zero”. Let’s say that from this test (a t-test), we obtain a p-value of 0.01. What does this actually tell us?

Of course, this is less than the usual threshold used in statistics of 0.05, so we might quite rightly go on to reject the null hypothesis. However, when we think about what this p-value actually tells us, we can start to understand how random relationships can often be observed. Our p-value in fact tells us that if the null hypothesis were true (i.e. a is zero, so x doesn’t affect y), then we would observe data at least as extreme as ours 1% of the time. In other words, even having seen this data, the null hypothesis could still be true and x may indeed have no effect on y; it’s just that the chances are small.
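As a minimal sketch of what this test looks like in practice (assuming the statsmodels library; the data is simulated, so all numbers here are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(size=100)  # true a = 2, b = 1, plus noise

X = sm.add_constant(x)    # add the intercept term b
fit = sm.OLS(y, X).fit()  # ordinary least squares

print(fit.params)   # estimates of [b, a]
print(fit.pvalues)  # p-values of the t-tests of "coefficient is zero"
```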

You can imagine that if we start to include hundreds or thousands of features in our linear regression (or indeed any other model), the chance that one or more of the variables gives us false hope of a relationship becomes more and more significant.
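A quick simulation makes the point (a sketch, assuming numpy and scipy; the sample and feature counts are arbitrary choices):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_samples, n_features = 100, 1000

y = rng.normal(size=n_samples)                # response: pure noise
X = rng.normal(size=(n_samples, n_features))  # features: pure noise

# Test every feature against y; no real relationship exists anywhere.
p_values = [pearsonr(X[:, j], y)[1] for j in range(n_features)]
print(sum(p < 0.05 for p in p_values), "features look 'significant'")
# Expect around 50: roughly 5% false positives purely by chance.
```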

Correlation and Causation

Anybody who has ever read a popular statistics book or taken a course in statistics will have heard that “correlation does not imply causation”. The classic example is that ice cream sales and sunglasses sales are highly correlated despite neither causing the other, the weather being the cause of both (see this article for more information). From a machine learning perspective, the phenomenon extends to models: an important feature is not necessarily a direct cause of changes in the target variable.

When we build a model using an ‘ocean’ of features, a few will likely be flagged as highly important, either by chance (as in the previous section) or through some genuine relationship. We may see this and immediately assume a direct relationship between the important feature and the target variable.

For example, we might (naively) build a model to predict ice cream sales using features such as the number of sunglasses people buy, the number of car parking spaces filled next to the beach and many others. It so happens that the number of sunglasses bought is the most important feature.

This leads to a number of issues. The biggest is that we may have built a model which, by relying on the spuriously important variable, struggles to adapt when that variable changes in ways that weren’t observed in the training set. For example, suppose a brand-new electric pair of sunglasses (?) is brought out and sales go through the roof, with none of the other variables necessarily being affected. With that variable being the most important in the model, the predicted number of ice creams bought will almost certainly be affected, leading to a less accurate prediction.
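A toy sketch of this failure mode (assuming scikit-learn; every name and number here is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 500

# Hidden common cause: sunshine drives both quantities.
sunshine = rng.uniform(0, 10, size=n)
sunglasses = 5 * sunshine + rng.normal(size=n)  # proxy feature
ice_cream = 3 * sunshine + rng.normal(size=n)   # target

model = LinearRegression().fit(sunglasses.reshape(-1, 1), ice_cream)

# 'Electric sunglasses' launch: sales double, weather unchanged.
print(model.predict(sunglasses.reshape(-1, 1)).mean())        # sensible
print(model.predict((2 * sunglasses).reshape(-1, 1)).mean())  # roughly doubled
# Predicted ice cream sales double too, even though true demand hasn't moved.
```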

For a subtler example, consider making a time series forecast within a machine learning framework. In this case, your features will be a variable’s values at lags t-1, t-2, t-3, …, t-n, and you are to predict the value at the current time. The features will most likely all be correlated, so an auto-ML method may have a lot of trouble working out the right structure for the model. A hypothesis-driven approach instead gives you a framework for deciding which variables to include (e.g. the most recent lags, or moving averages of previous ones), as sketched below.
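Here is what the hypothesis-driven version might look like (a sketch assuming pandas; the specific lags and window are illustrative choices, not prescriptions):

```python
import pandas as pd

# Stand-in series; in practice this would be your historical data.
sales = pd.Series(range(100), name="sales", dtype="float64")

features = pd.DataFrame({
    # Hypothesis: the most recent values carry most of the signal.
    "lag_1": sales.shift(1),
    "lag_2": sales.shift(2),
    # Hypothesis: a smoothed recent level beats a pile of raw lags.
    "ma_7": sales.shift(1).rolling(7).mean(),
}).dropna()  # drop rows without a full history
```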

Discouraging exploratory analysis…

With auto-ML tools sat there ready to use and a dataset ready to throw in, it’s very natural to simply feed the data into the tool and see what kinds of outcomes you get. This can tempt you to skip one of the most important parts of the ML development process: Exploratory Data Analysis (EDA).

Through EDA, you can discover a variety of issues that aren’t necessarily detected or flagged by auto-ML tools: wrongly interpreted data (i.e. data that doesn’t actually mean what you thought it did), anomalous data, missing data and, worst of all, target leakage. All of these can lead to misleading results. You may also gain additional insight into the kinds of variables you might want to create, potentially building a killer feature which your auto-ML solution couldn’t find.
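A few lines of pandas cover the basic checks (a sketch; the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("ice_cream.csv")  # hypothetical dataset

print(df.describe())     # spot anomalous values and odd ranges
print(df.isna().mean())  # share of missing values per column

# A feature almost perfectly correlated with the target is a classic
# symptom of target leakage and deserves a close look before modelling.
print(df.corr(numeric_only=True)["sales"].sort_values())
```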

The act of running exploratory data analysis can also find you what I like to call side-insights: additional nuggets of information uncovered through the visualisations and calculations involved. These can often prove valuable to stakeholders; because Data Scientists are looking to solve a different problem from a typical insights analysis, they may encounter a novel way of looking at the data which casts new light on the product or customer it represents.

…and discouraging complex feature engineering

Some of the best and most accurate models I have seen in my career have owed little to the complexity of the algorithms involved and much to the creativity and engineering behind the features. I have seen some fantastic models effectively fuelled by one or two brilliantly creative data points.

Auto-ML solutions often give the impression that feature engineering is done pretty much automatically, and thus drive you away from doing it manually. If you instead start from a list of hypotheses, your focus will be on building the data points relating to them, potentially building some which auto-ML solutions just can’t (yet) dream of.
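For instance, a hypothesis like ‘customers who have recently gone quiet behave differently’ translates into a feature that no generic tool will construct unprompted (a sketch, assuming pandas; the file, columns and date are all hypothetical):

```python
import pandas as pd

# Hypothetical raw transactions table with one row per purchase.
transactions = pd.read_csv("transactions.csv", parse_dates=["timestamp"])

# Hypothesis-driven feature: days since each customer's last transaction.
last_seen = transactions.groupby("customer_id")["timestamp"].max()
days_quiet = (pd.Timestamp("2020-01-23") - last_seen).dt.days
```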

Ben Houghton
Principal Data Scientist for Analytical Innovation at Quantexa