Lies, Big Lies, and Data Science?
I’m sure that with all the hype surrounding data science, machine learning, and artificial intelligence, you’ve been given the impression that data is infallible. People building these models are wizards who know what they are doing. Data scientists are here to revolutionize every industry: you have a problem, and they will come up with a solution, provided you have the data.
Nothing could be further from the truth. In reality, data science is more nuanced than plugging data into a model and getting results. If you’re half smart, you can figure out a way to “lie” your way to the results you want. If you aren’t well versed in machine learning or statistics, this article should help you figure out whether you’re being “lied” to.
Below are common ways you could be fooled. The research for this article comes from Darrell Huff’s appropriately named book “How to Lie with Statistics” and from other resources such as “Making Friends with Machine Learning”, the YouTube series by Google’s Chief Decision Scientist, Cassie Kozyrkov. Lastly, it draws on my own experiences of being “lied” to.
I’m above “average”
Often people are presented with statistics in the form of averages. For example, in the news, you’ll hear about average prices, income or rainfall, etc.
Have you ever wondered how the average was calculated or, more precisely, which “average” is being talked about? Most of us assume that the person is talking about the arithmetic mean: sum all the measurements or values and divide by the number of measurements.
There are of course other “averages” like taking the middlemost value (median), the most frequent value (mode), the multiplicative “average” i.e. geometric mean, etc.
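To make the difference concrete, here is a minimal Python sketch that computes each of these “averages” for one list of values. The numbers are made up for illustration, and the standard-library `statistics` module is simply my choice of tool, not something the sources above prescribe.

```python
import statistics

# Hypothetical scores, invented purely for illustration.
scores = [4, 8, 8, 15, 20, 25, 160]

mean = statistics.mean(scores)                # arithmetic mean: sum / count
median = statistics.median(scores)            # middlemost value once sorted
mode = statistics.mode(scores)                # most frequent value
geo_mean = statistics.geometric_mean(scores)  # multiplicative "average"

print(f"mean={mean:.1f} median={median} mode={mode} geometric mean={geo_mean:.1f}")
```

On skewed data like this, the single large value pulls the arithmetic mean well above both the median and the mode, so the choice of “average” changes the story.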
You might be wondering why this matters. To see why, let’s look at an example. Imagine you’re a cricket coach trying to make a crucial decision in the middle of a match. Your team is currently chasing and needs 30 runs to win. You only have one wicket left and you need to send out your last batsman. You have two choices for who to send out: Player X or Player Y. This is an important decision, as it might make or break your team’s chances of winning the series. How would you evaluate which player to send out?
Luckily for you, your team management hired a data scientist to help you with these sorts of dilemmas. You ask them for help, and they suggest checking the scoring distribution of each of these players (a fancy way of saying let’s look at how likely each player is to score within a given range of runs). They visualize the scoring distribution of each of the players.
Note: I understand that in cricket there are a plethora of variables to consider when evaluating players, but for this example let’s assume everything else is constant and the only differentiating factor is the runs each player scores.
Can you decide which player to send out in this situation? Looking at the arithmetic mean, you can say Player X is better, as he/she has a higher mean of 37.4 runs against Player Y’s 25.7 runs. But if you look at the median scores, you can see that Player Y scores above 20 runs 50% of the time and below 20 runs 50% of the time, which is higher than Player X’s median of 19 runs. (Maybe Player Y has a higher probability of making 30 runs and Player X’s higher mean is due to outliers.)
Confusing, isn’t it? The correct answer needs more analysis (see the part on distributions next to find out who to send out). Intuitively, the median is the better metric in this case, but it alone might not be sufficient.
Is it actually “Normal”?
Continuing the cricket example, the problem that you as the cricket coach are trying to solve equates to asking which player has a higher probability of scoring at least 30 runs. In mathematical notation, it is equivalent to asking which is larger: P(Player X score ≥ 30) or P(Player Y score ≥ 30).
When trying to solve such inequalities we look at the Cumulative Distribution Function (CDF). (CDF is a fancy name for a function that gives the probability of a random variable X (i.e. player scores) being less than or equal to a value c (30 runs in our case).) If you’ve ever encountered percentiles, you’ve been introduced to the concept of a CDF. For example, the 99th percentile corresponds to the value of X where CDF(X) = 0.99, and the median is the value at which the CDF is 0.5, i.e. the 50th percentile.
Don’t worry if you find this terminology confusing. All you need to know is that we need to look at the percentiles of each player’s score, and whichever player crosses 30 at a lower percentile has a higher probability of scoring at least 30 runs.
As you can see, for this problem Player X reaches 30 runs at the 59th percentile, which translates to a 41% probability of scoring at least 30 runs. That is higher than Player Y’s 29% probability of scoring at least 30 runs (the 71st percentile means that 71% of the time Player Y scores less than 30 runs; to get the probability of scoring at least 30 we subtract that from 1). Notice that the answer was not at all obvious, and there was no way of finding it using “averages” alone.
Look closely at the CDF and see how it is jagged, and how both players were almost neck and neck up until the median, after which Player X starts scoring more aggressively. The point is to show that commonly occurring phenomena can have a variety of different distributions.
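The percentile reasoning can be sketched in a few lines of Python. The innings below are hypothetical numbers I invented so that the means, medians and tail probabilities disagree in the same way as in the story; they are not the data behind the charts, and `prob_at_least` is just an illustrative helper name.

```python
from bisect import bisect_left
from statistics import mean, median

def prob_at_least(scores, threshold):
    """Empirical P(score >= threshold): 1 minus the share of scores strictly below it."""
    ordered = sorted(scores)
    below = bisect_left(ordered, threshold)  # count of scores < threshold
    return 1 - below / len(ordered)

# Hypothetical innings, invented for illustration only.
player_x = [0, 2, 5, 9, 14, 18, 20, 31, 45, 60, 88, 120]
player_y = [3, 8, 12, 15, 18, 20, 22, 25, 27, 29, 33, 41]

for name, scores in (("X", player_x), ("Y", player_y)):
    print(f"Player {name}: mean={mean(scores):.1f} median={median(scores)} "
          f"P(score >= 30)={prob_at_least(scores, 30):.2f}")
```

With these made-up numbers, Player Y has the higher median while Player X has both the higher mean and the higher chance of reaching 30, which is why neither “average” settles the question on its own.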
Probability distributions come in all shapes and sizes; they are very important when making inferences or modelling. One common way to “lie” is to infer information based off the wrong distribution. In school and most introductory statistics classes people are taught to assume a normal distribution. The Normal Distribution is ubiquitous because of a very important result in statistics called the ‘Central Limit Theorem’.
“The theorem states that if you take sufficiently large independent samples from any distribution with finite variance and average them, then the sample mean will be approximately normally distributed”.
Notice how this definition glosses over many important points, like what are “independent samples” and what does “sufficiently large sample” mean in practice? Does our sample size need to be 10, 100, 1000, or 1,000,000? Also, is it possible for our data’s underlying distribution to have infinite variance?
The answer is that it depends on the distribution you’re sampling from and how you’re sampling. There is a notion of how “normal” your distribution is, which can be assessed with the Kolmogorov–Smirnov test for normality. And yes, there are phenomena where the variance could be infinite; for example, it has been argued that the variance of daily stock returns is infinite. Unfortunately, these topics are way beyond the scope of this article, but you should check them out.
The important thing to remember is that you can’t assume a normal distribution just because of the central limit theorem. You first need to perform statistical tests, like the Kolmogorov–Smirnov test, to see whether the assumption holds, and only then try to make inferences. In the above cricketing example, if your “data scientist” just went ahead and assumed each player’s scores come from a normal distribution and looked at the CDFs, they might produce something like this.
The red area indicates regions where the players score negative runs, which makes no sense! For our original problem statement (which player is more likely to score 30 runs or higher) the normal approximation happens to give the correct result; it nevertheless gives wrong estimates for the probability of those outcomes, which suggests that the underlying distribution is “probably” not normal.
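To get a feel for what “sufficiently large” means, here is a small simulation sketch. It assumes an exponential distribution as the skewed source and uses skewness as a rough yardstick of “normality” (a normal curve has skewness 0); both choices, the seed and the sample sizes are mine, picked only for illustration.

```python
import random
import statistics

def skewness(values):
    """Population skewness: 0 for a symmetric shape such as the normal curve."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_means(sample_size, trials, rng):
    """Average `sample_size` independent exponential draws, repeated `trials` times."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(trials)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
skew_small = skewness(sample_means(sample_size=2, trials=2000, rng=rng))
skew_large = skewness(sample_means(sample_size=200, trials=2000, rng=rng))

print(f"skewness of sample means, n=2:   {skew_small:.2f}")
print(f"skewness of sample means, n=200: {skew_large:.2f}")
```

The means of tiny samples stay visibly skewed, while the means of larger samples drift towards symmetry. How fast that happens depends on the source distribution, which is exactly why no single sample size works everywhere.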
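The negative-runs problem is easy to reproduce. The sketch below fits a normal curve to a hypothetical list of innings (numbers invented for illustration, not the data behind the chart) using the standard library’s `NormalDist`, then compares what the normal model claims against what the sample shows.

```python
from statistics import NormalDist

# Hypothetical innings for one player, invented for illustration.
scores = [0, 2, 5, 9, 14, 18, 20, 31, 45, 60, 88, 120]

# Fit a normal curve by matching the sample mean and standard deviation.
fitted = NormalDist.from_samples(scores)

p_negative = fitted.cdf(0)     # normal model's P(score < 0): impossible in cricket
p_normal = 1 - fitted.cdf(30)  # normal model's P(score >= 30)
p_empirical = sum(s >= 30 for s in scores) / len(scores)  # observed share of innings

print(f"normal model, P(score < 0):   {p_negative:.2f}")
print(f"normal model, P(score >= 30): {p_normal:.2f}")
print(f"observed,     P(score >= 30): {p_empirical:.2f}")
```

For these made-up scores the normal model assigns a sizeable probability to negative runs and also disagrees noticeably with the observed share of innings reaching 30.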
Why use the normal distribution if it gives erroneous results? Well, the normal distribution is a model which, given certain assumptions, can be very useful. However, just like any model, it is “wrong”, and its usefulness depends on the assumptions under which it operates. This brings us to my next point: models, whether simple probability distributions or highly sophisticated neural networks, all work under a very tight set of assumptions, and any good data scientist must understand what those assumptions are, how they operate, and when and how they could be violated.
All models are “wrong”
It has been said that “all models are wrong but some models are useful.” In other words, any model is at best a useful fiction — there never was, or ever will be, an exactly normal distribution or an exact linear relationship. Nevertheless, enormous progress has been made by entertaining such fictions and using them as approximations. — George Box
The above quote from statistician George Box perfectly summarizes the point I want to convey. I’ve elaborated on some of the problems in using the normal distribution. Now let’s look at how models are “wrong”, and how not understanding how or why they are wrong can produce all sorts of “lies”.
Since it would be impossible to list all algorithms and tell how to apply each correctly, let’s look at just one and analyze how its wrong application could lead to erroneous results. Ordinary Least Squares Linear regression is one of the most basic and ubiquitous applied methods used for predicting continuous outcomes.
Assumptions for Linear Regression (OLS):-
- Linearity: The dependent variable y & independent variable(s) x are linearly related (exponent of x is 1)
- Residual Normality: The residuals, y(actual) - y(predicted), are normally distributed; you can use the Kolmogorov–Smirnov test to check whether this is true
- Independence & no multicollinearity: None of the x variables depend on each other; the correlation between independent variables should be zero or small
- Homoscedasticity: The error or residuals have the same variance for all values of x.
- No outliers or censoring: There should be no or few points that can bias y, for example, huge values way above the mean of y, or censoring of y at zero (common when what you’re regressing can’t be negative, like cricket scores)
Looking at the above chart, can you tell if the relationship between Y (Player’s score) and X (Balls Played) is linear, or would a higher-degree polynomial function like x² be more suitable? How would you know? One way to answer is to try out different mathematical functions and then use a cost function to determine which is the best fit. However, the answer is once again not obvious: a linear relationship would suffice for most values but would fail horribly on the lower end, when the player gets out with zero runs. Player scores are censored at zero, since it is impossible to score negative runs.
Linear regression would fail to give a reasonable answer here, because of the zero values and the outliers. One solution is a censored-regression model such as “Tobit”. A closely related two-part approach combines a binary classifier with a regression model: the classifier predicts whether the player will be out for zero, and if it predicts not out, a linear regression fitted on the remaining innings gives a more accurate fit.
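The low-end failure can be reproduced with a tiny least-squares fit. This is a sketch under stated assumptions: the (balls, runs) pairs are hypothetical, and `fit_line` is an illustrative helper implementing the textbook closed-form solution for a single predictor.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical innings: balls faced and runs scored, invented for illustration.
balls = [1, 3, 5, 10, 15, 20, 30, 40, 50, 60]
runs = [0, 0, 2, 6, 10, 18, 30, 45, 70, 95]

slope, intercept = fit_line(balls, runs)
residuals = [y - (intercept + slope * x) for x, y in zip(balls, runs)]
predicted_after_one_ball = intercept + slope * 1

print(f"runs ~ {intercept:.2f} + {slope:.2f} * balls")
print(f"predicted runs after 1 ball: {predicted_after_one_ball:.2f}")
```

For this made-up data the fitted line has a negative intercept, so it predicts negative runs for very short innings, which is the kind of nonsense a censored outcome produces under a plain linear fit.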
To check whether the residuals have a normal distribution, you can use a Quantile-Quantile plot (QQ plot). The x-axis shows the theoretical quantiles, i.e. the values you would expect if the data were perfectly normally distributed, and the y-axis shows the residuals of our model. The 45-degree red line is a benchmark: if all points lie on the line, then the residuals are perfectly normal. The deviation from the line indicates how far off our sample residuals are.
In this case, the residuals are not normally distributed because most points are far away from the red line. This indicates that if you want to make any statistical inference with the model, i.e. hypothesis testing, you should not use linear regression. Instead, construct a generalized linear model (GLM) with a better-fitting distribution for the residuals. Linear regression and other linear models such as logistic regression are both examples of GLMs, each with its own set of assumptions. An easier way to think about it is in terms of cakes: if linear regression is chocolate fudge cake, then GLMs are the set of all possible cakes.
Homoscedasticity in plain English means that the variance of y does not change with the value of x: the variance of y given x is the same when x is 20 as when x is 100. The above chart shows the residuals of our model against the fitted or predicted values. As the fitted values increase (in this case x increases too, since x & y are positively related), the residuals become more spread out, implying that the variance of y is probably not the same for all values of x. So the batsmen’s runs (Y) are heteroscedastic, not homoscedastic, meaning that the usual standard errors are unreliable and you cannot safely make statistical inferences about the relationship between x & y.
Important note: Heteroscedasticity, multicollinearity and residual non-normality are problems if you want to make statistical inferences using the estimates of your regression, which more often than not is more insightful than simple prediction! In this specific example, the estimated coefficient of x would tell us how many additional runs the player scores for each additional ball they play.
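The points behind a QQ plot are simple to compute by hand. The sketch below assumes a hypothetical, right-skewed set of residuals and one common plotting-position convention, (i + 0.5) / n; plotting libraries may use slightly different conventions, and `qq_points` is only an illustrative helper.

```python
from statistics import NormalDist, fmean, stdev

def qq_points(residuals):
    """Pair each sorted, standardized residual with its theoretical normal quantile."""
    n = len(residuals)
    centre, spread = fmean(residuals), stdev(residuals)
    observed = sorted((r - centre) / spread for r in residuals)
    # Plotting positions (i + 0.5) / n are an assumed convention, not the only one.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

# Hypothetical residuals with one large positive value, invented for illustration.
residuals = [-9, -7, -6, -5, -4, -3, -2, 0, 3, 8, 25]
points = qq_points(residuals)

for theoretical, observed in points:
    print(f"theoretical={theoretical:+.2f}  observed={observed:+.2f}")
```

Points where the observed value sits far from the theoretical one are the points that would fall away from the 45-degree line; here the largest residual does so by a wide margin.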
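A crude numerical version of this visual check is to compare how spread out the residuals are for low versus high fitted values. The sketch below does that with hypothetical numbers; it is an informal comparison in the spirit of a split-sample variance ratio, not a formal statistical test, and `spread_ratio` is an illustrative helper name.

```python
from statistics import variance

def spread_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values divided by the lower half."""
    ordered = [r for _, r in sorted(zip(fitted, residuals))]  # residuals by fitted value
    half = len(ordered) // 2
    return variance(ordered[-half:]) / variance(ordered[:half])

# Hypothetical fitted values and residuals, invented for illustration.
fitted = [5, 10, 15, 20, 25, 30, 35, 40]
residuals = [1, -1, 2, -2, 6, -7, 10, -12]

ratio = spread_ratio(fitted, residuals)
print(f"high-half / low-half residual variance: {ratio:.1f}")
```

A ratio close to 1 is what constant variance would look like; a ratio far above 1, as with these made-up residuals, matches the fanning-out pattern described above.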
I hope this post has helped you understand that data science is much more intricate than simply plugging and chugging without taking care of the implicit assumptions behind analyses.