Data Science and Statistics

A data scientist is a statistician who works in San Francisco

So goes a witty quip that I have tried and failed to find an attribution for. But in my experience the premise behind it, that data scientists do the same work as statisticians, isn’t actually true in essence or even in spirit. Among many data scientists I encounter, I sense a disdain for statistics to a degree that is detrimental. To understand why, start with a practical working definition of statistics that I just made up:

A set of tools and methods to make robust inferences from measurements of a sample to a population

A simple example is the relationship between height and age among adults in the US. The population here is the entire adult population of the US, and the sample would be, say, 10,000 adults you have paid to give you their measurements and ages. Tools from statistics help you determine whether the relationship between age and height that you learn from your 10,000 volunteers also adequately describes the millions of other adults you didn’t measure.

N=all

But the origins of much of classical statistics are more humble than this (as eloquently described in Data Analysis with Open Source Tools). Many statistical measures we use today were first applied to problems of agriculture and brewing: determining whether different treatments of crops and hops would improve yield. In that setting datasets were small and acquiring more datapoints was expensive. Figuring something out by using smart maths instead of spending time and money to get more or better data? So far so good.

Fast forward to a typical data science problem today and classical statistics seems hopelessly quaint. Consider a study of mobility patterns by a typical cell phone company with a decent market share covering, say, 10% of the population of a country. If we were to draw a conclusion based on the user base of this company, the sample size would be comparable to that of the entire population! This breakdown of the typical situation facing statisticians lies behind the phrase ‘n=all’.

P-values and High Dimensions

But the problems with statistics as historically practised don’t end with the number of datapoints; there is also the dimensionality of each one. The p-value measures the probability that you would observe an effect at least as large as the one found if the data were in fact random, i.e. under the null hypothesis. For a long time this was treated as a measure of the significance of a result, with the convention being that values less than 0.05 are ‘significant’ and values above are not.

Never mind the arbitrariness of that threshold; more worryingly, random data would, by definition, pass this significance test 5% of the time! (FiveThirtyEight featured a great interactive tool to see this for yourself.) When many variables or conditions are tested, a stricter criterion should therefore be applied to avoid cherry-picking those that happen to come in under the bar, i.e. p-hacking. Again, this is a modern problem: that of rich and cheap datasets with an abundance of variables and conditions.
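To make the 5% figure concrete, here is a minimal simulation of my own (using numpy and scipy, not taken from any of the sources above): two groups are drawn from exactly the same distribution, so every ‘significant’ difference is a false positive, and roughly one test in twenty slips under the 0.05 bar anyway.

```python
# Sketch: how often does pure noise pass a p < 0.05 significance test?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Two groups drawn from the *same* distribution: any 'effect' is pure chance.
    a = rng.normal(loc=0.0, scale=1.0, size=50)
    b = rng.normal(loc=0.0, scale=1.0, size=50)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"Fraction of null experiments with p < 0.05: {false_positives / n_experiments:.3f}")
# Prints a value close to 0.05: test enough variables and some will look 'significant'.
```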

Statistical Advocacy

As with many under-learned subjects, I suspect that the root of the problem is that statistics is not taught as well as it might be. The typical statistics curriculum is a dry walk-through of various tests, their necessary assumptions and terminology, perhaps with some toy test cases that the examiner might even force the unlucky student to crunch by hand. This is to be contrasted with data science ‘in the wild’, where you would often like to formalise a conclusion about a numerical finding with a statistical test. That is where these obscure methods come to the rescue!

Consider a scenario where we would like to determine the popularity of a post or product based on a number of upvotes and downvotes, e.g. Reddit posts or Amazon products. Let’s say we have 100 upvotes and 0 downvotes; in that case we can be pretty confident in assigning a high popularity. But what about early on, when there is only one upvote? Strictly speaking that represents a 100% approval rating, but surely the fact that it is based on a single rating should be borne in mind. Statistics provides just such a measure, the Wilson score interval: an extremely useful tool for a common problem that many have surely encountered. For me this is statistics at its best: formalising an intuitive measure of ‘how good’ a result is or ‘how strong’ an effect is.
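For the curious, here is a rough sketch of the lower bound of the Wilson score interval in Python. It is my own illustration, treating each vote as an independent Bernoulli trial, rather than code from any of the sources mentioned here.

```python
# Sketch: lower bound of the Wilson score interval for an upvote/downvote count.
import math

def wilson_lower_bound(upvotes: int, downvotes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the 'true' approval rating.

    z = 1.96 corresponds to a 95% confidence level.
    """
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    phat = upvotes / n                      # observed approval rating
    denom = 1 + z**2 / n
    centre = phat + z**2 / (2 * n)
    spread = z * math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)
    return (centre - spread) / denom

print(wilson_lower_bound(1, 0))    # ~0.21: one upvote is weak evidence
print(wilson_lower_bound(100, 0))  # ~0.96: same 100% approval, far more evidence
```

Ranking by this lower bound, rather than by the raw approval fraction, is what stops a single lonely upvote from outranking a hundred.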

I’m strongly of the opinion that the plethora of available MOOCs, tutorials, blogs and other resources has democratised data science for the better. However, I would argue that one of the casualties of this simplified, end-to-end approach to a data science problem is this kind of statistical rigour (closely followed by the omission of data acquisition and cleaning). It’s hard to Google for a statistical test for a particular situation when you don’t know how to formalise that situation in terms of distributions, trials and p-values, or whether such a test even exists. A worse scenario is when the test or tool you know and reach for is not the right one for the job: when all you have is a hammer, everything looks like a nail. For example, t-tests assume normality of the underlying distributions, and many common statistics don’t deal well with heavy-tailed distributions, as sketched below.
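As a small illustration of the heavy-tail point (my own example, using a Cauchy sample as the extreme case), compare how the sample mean and the sample median behave as the sample grows:

```python
# Sketch: the mean of a heavy-tailed (Cauchy) sample never settles down,
# while the median stays close to the true centre of 0.
import numpy as np

rng = np.random.default_rng(0)
for size in (100, 10_000, 1_000_000):
    sample = rng.standard_cauchy(size)
    print(size, np.mean(sample), np.median(sample))
# A single extreme draw can dominate the mean, so mean-based statistics
# (including the t-test) can be badly misled by heavy tails.
```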

So what’s to be done? A short but rigorous delve into the basics of frequentist statistics will reward you many times over: p-values, confidence intervals, bootstrapping, t-tests and the Kolmogorov-Smirnov test are good places to start (both Statistics in a Nutshell and Data Analysis with Open Source Tools served me well here). Economics, psychology and medicine are used to smaller datasets, so reading in these areas will help you see these tools in action. The fantastic Shape of Data blog capitalises on the graphical intuition behind many statistical ideas. Well-developed software packages such as Scikit-learn have great documentation and offer worked examples of statistical tests.
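As a hedged sketch of two of those tools in action (synthetic data; numpy and scipy assumed), here is a bootstrap confidence interval followed by a Kolmogorov-Smirnov test:

```python
# Sketch: a bootstrap 95% confidence interval and a Kolmogorov-Smirnov test
# on a synthetic, skewed sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.exponential(scale=2.0, size=500)   # deliberately non-normal data

# Bootstrap: resample with replacement many times and take percentiles of the
# resampled means to get an interval for the population mean.
boot_means = [np.mean(rng.choice(sample, size=sample.size, replace=True))
              for _ in range(5_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap 95% CI for the mean: ({lo:.2f}, {hi:.2f})")

# Kolmogorov-Smirnov test: does the sample look normally distributed?
# (Estimating the parameters from the same data makes this only approximate,
# but the tiny p-value makes the point.)
stat, p = stats.kstest(sample, "norm", args=(sample.mean(), sample.std()))
print(f"KS statistic = {stat:.3f}, p-value = {p:.3g}")
```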

But one can only brush up on palliative methods to a certain degree. The truth is that new data sources offer a brave new world for understanding human behaviour. Going door to door asking people to fill in surveys can offer rigorous sampling to counteract potential bias in who is included, but can do nothing to stop people lying, misunderstanding or misremembering. Digital footprints from social media, web searches or elsewhere offer an objective record of what someone said or did, yet they offer limited insight into who is included and less into whether those people are representative. For me, coming to terms with this is the greater challenge for data scientists.

Endnote

The elephant in the room here is the role of Bayesian statistics relative to the frequentist statistics I have been discussing. All I will offer for now is a promise to return to this topic and cover Bayesian methods in more detail in the future.