When Do Inferential Statistics Matter in Data Science?
A Conceptual Guide for the “When” of Inferential Statistics
You know what always confused me when I transitioned into data science for the first time?
It was statistics. I don’t mean that statistics in and of themselves confused me (although they can certainly be challenging to understand). What confused me was the role inferential statistics play in data science, and especially where statistics sit on the business side of data science.
You see, I came from a research laboratory in graduate school, so I had my fair share of statistics in my university education. I knew that to draw conclusions from data, we needed inferential statistics to help us judge whether patterns in the data at hand are likely to generalize to a population [the quick and dirty definition of inferential statistics].
But in business data science, as opposed to academic data science, we build mathematical models of data that sometimes seemed similar to the statistics I used at university (e.g. regression models) but at other times appeared to have almost no basis in inferential statistics (e.g. decision trees).
After years of working in the field, and years of letting this confusion get to me, I finally started to understand where statistics fit into data science and where other forms of math, not statistics per se, may be more important.
So, to help you reach that understanding and pull these ideas together more quickly than I did, here is my high-level list of data science stages and the degree to which inferential statistics play a role in each:
Data Storage & Access
Zero inferential statistics.
Understanding Data (Exploratory Data Analysis [EDA])
Lots of inferential statistics.
It helps to know how to use correlations, t-tests, ANOVAs, and multiple regressions, along with their associated p-values, to understand how much information each variable carries about the target variable.
Not only do statistics help us understand the informational value of data in relation to some target but they also help us feel more confident that the data may generalize beyond our specific data set.
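To make this concrete, here is a minimal sketch of three of those tests run with SciPy on simulated data. The variable names (feature, segment, target) and the data itself are made up for illustration; with real data you would also want to check each test's assumptions before trusting the p-values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: one numeric feature, one three-level category, one target.
n = 300
feature = rng.normal(size=n)
segment = rng.integers(0, 3, size=n)  # category codes 0, 1, 2
target = 2.0 * feature + 1.5 * segment + rng.normal(size=n)

# Correlation: is the numeric feature linearly related to the target?
r, p_corr = stats.pearsonr(feature, target)

# Welch's t-test: do segments 0 and 2 differ in mean target value?
t_stat, p_t = stats.ttest_ind(
    target[segment == 0], target[segment == 2], equal_var=False
)

# One-way ANOVA: does the mean target differ across all three segments?
f_stat, p_anova = stats.f_oneway(*(target[segment == k] for k in range(3)))

print(f"correlation: r={r:.2f}, p={p_corr:.3g}")
print(f"t-test:      t={t_stat:.2f}, p={p_t:.3g}")
print(f"ANOVA:       F={f_stat:.2f}, p={p_anova:.3g}")
```

A small p-value here suggests the relationship is unlikely to be a fluke of this particular sample, which is the "may generalize" part of the argument.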
Lots of inferential statistics, but it pays to know the difference between parametric and non-parametric models.
Here inferential statistics can be useful, but only for some machine learning models. Not every machine learning model can be tested against known distributional assumptions about the parameters being estimated. A parametric model (e.g. linear regression) has a fixed set of parameters, and under the model's assumptions each estimate has a known sampling distribution, which is what gives us standard errors and p-values. A non-parametric model (e.g. a decision tree) makes no such assumptions and has no fixed set of parameters that can be tested in this way. This is the core difference between parametric models and non-parametric models.
For more details on the distinction, check out this article.
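As a rough illustration of that distinction, the sketch below fits a simple linear regression with SciPy, which reports a standard error and p-value for the slope, and then builds a tiny k-nearest-neighbours regressor by hand. The k-NN model is my stand-in for a non-parametric model (swapped in for a decision tree only to keep the example short), and the data, the knn_predict helper, and k=5 are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 0.8 * x + rng.normal(scale=1.0, size=200)

# Parametric: linear regression estimates a fixed set of parameters
# (intercept and slope), and theory gives each one a standard error and p-value.
fit = stats.linregress(x, y)
print(f"slope={fit.slope:.3f}  stderr={fit.stderr:.3f}  p={fit.pvalue:.3g}")


# Non-parametric: k-nearest neighbours has no fixed coefficients to test;
# the "model" is the training data itself.
def knn_predict(x_train, y_train, x_new, k=5):
    """Predict each x_new as the mean y of its k nearest training points."""
    x_new = np.atleast_1d(x_new)
    dist = np.abs(x_train[None, :] - x_new[:, None])  # shape (n_new, n_train)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


# There is no slope to attach a p-value to, so judge it by held-out error.
x_test = rng.uniform(0, 10, size=50)
y_test = 3.0 + 0.8 * x_test + rng.normal(scale=1.0, size=50)
knn_mae = float(np.mean(np.abs(knn_predict(x, y, x_test) - y_test)))
print(f"k-NN held-out mean absolute error: {knn_mae:.3f}")
```

The regression output answers "is this slope distinguishable from zero?", while the k-NN model can only be judged by how well it predicts data it has not seen.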
Some inferential statistics.
Inferential statistics are great for testing models against each other in production. We can use hypothesis-testing methods to judge whether a difference in performance between models is likely to be real, rather than noise, before deciding which model should be used in production.
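One way this can look in practice is sketched below, assuming two models have scored the same set of requests and we have logged each model's absolute error per request. The errors are simulated, the 0.05 threshold is just a common convention, and a paired t-test is only one reasonable choice here; other setups (for example, an A/B split across users) would call for a different test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_requests = 2000

# Hypothetical logged errors: both models scored the same requests, so the
# errors share a per-request "difficulty" component and are naturally paired.
difficulty = rng.normal(size=n_requests)
err_a = np.abs(difficulty + rng.normal(scale=1.0, size=n_requests))
err_b = np.abs(difficulty + rng.normal(scale=0.5, size=n_requests))

# Paired t-test. Null hypothesis: the mean per-request difference in
# absolute error between model A and model B is zero.
t_stat, p_value = stats.ttest_rel(err_a, err_b)
mean_diff = float(np.mean(err_a - err_b))

alpha = 0.05  # conventional significance threshold (an assumption)
if p_value < alpha:
    better = "B" if mean_diff > 0 else "A"
    print(f"Model {better} has lower error (diff={mean_diff:.3f}, p={p_value:.3g})")
else:
    print(f"No clear difference between models (p={p_value:.3g})")
```

Pairing matters because it removes the request-to-request variation that both models share, which makes a real difference between them easier to detect.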
So, there it is: my conclusions regarding the role of inferential statistics in the data science lifecycle. Knowing this, and diving a bit deeper into the distinction between parametric and non-parametric model types, can help you as a data scientist make better decisions about when different models may be more or less beneficial for a given problem.
Understanding how to leverage inferential statistics also creates opportunities to apply its tools in your own data science work and build better solutions.
I hope you found this quick overview of inferential statistics in data science to be of value.