Demystifying Statistical Analysis 9: Crash Course on Statistical Analysis

YS Chng · Published in DataSeries · Jun 5, 2022

Photo by Nick Morrison on Unsplash

Having gone through numerous statistics courses throughout my education, I’ve always wondered why the concepts often aren’t simplified first as an introduction, to help students better appreciate the landscape of the subject before diving into the more complicated formulas.

Inspired by Google’s Chief Decision Scientist, Cassie Kozyrkov, I’ve been trying to demystify statistical concepts through this blog series of mine, by explaining them in a way that is more intuitive to understand. This article is probably long overdue, but I believe it is possible to explain what statistical analysis is about even to laypeople, without using any formulas, and so here it is.

Where does statistical analysis lie in the entire spectrum of data analysis?

First, to understand what statistical analysis is, we have to recognise that it is a type of data analysis that fulfils a very specific purpose. In a previous post on defining the different terms used in data science, I explained Data Analysis and Data Analytics using the Gartner Analytic Ascendancy Model. As that model is what most people are familiar with, I will use it to explain where Statistical Analysis appears.

Comparison of ascendancy terms used.

At first glance, I know the comparison table looks confusing and doesn’t seem right: why would Gartner call “Descriptive Statistics” “Descriptive Analytics”, and yet Cassie differentiate Analytics from Statistics? But bear with me and you’ll realise it’s because these terms have been defined and used very loosely. What you should pay attention to is the type of question being asked.

In my previous post, I mentioned that the terms “Analysis” and “Analytics” are used interchangeably, even though “Analysis” actually refers to the general process of examining information, while “Analytics” refers to the techniques used to conduct the analysis, usually in a quantitative manner.

The Data Science Fuzzy Buzzy.

Once we understand it this way, it becomes apparent that the term “Analytics” actually just means “Analysis Methods”, and so “Descriptive Statistics” being an analysis method wouldn’t be a contradiction! In that case, why does Cassie Kozyrkov differentiate Analytics and Statistics?

If you read Cassie’s article on “What’s the difference between analytics and statistics?”, you’ll find that she defines “Analytics” as the method used to explore the data, while “Statistics” is the method used to infer what’s beyond the data. If you match it to the type of questions being asked, the way she uses “Analytics” would then clearly be “Descriptive”, and “Statistics” would be “Inferential”!

Screenshot of Cassie’s article.

It looks like I’ve gone one big round and have yet to explain where “Statistical Analysis” appears in all these confusing terms, but this is where it gets mind-blowing. Although the word “Statistics” appears in the term “Descriptive Statistics”, whenever we say “Statistical Analysis”, we are actually referring to “Inferential Statistics” (the one in red in the comparison table), or what Cassie simply refers to as “Statistics”.

If you’ve managed to let that sink in, I’m going to throw another curveball: whenever we say “Statistical Analysis”, what’s done is actually “Hypothesis Testing”. But it makes perfect sense! To validate a finding in the data, we need inferential statistics; to make inferences using the data, we need to run statistical analyses; and to run statistical analyses, we need to form hypotheses and test them!

“Statistical Analysis” just means “Hypothesis Testing”.

Cassie also mentions in her article that descriptive analysis (or what she terms “Analytics”) helps us to form hypotheses, which improves the quality of our questions, while statistical analysis (or what she terms “Statistics”) helps us to test our hypotheses, which improves the quality of our answers.

I won’t be talking about “Machine Learning” and “Modelling & Simulation” in this article, but all you need to know is that these methods match the levels in the Gartner model as shown in the comparison table. Each method has its own specific purpose, and it would be inappropriate, for example, to use machine learning to answer questions about what or why something has happened, or what should be done.

Why do we even need statistical analysis?

Now that you know where statistical analysis sits in the whole business of data analysis, the natural question to ask next is why do we even need it in the first place?

We’re all probably very familiar with creating bar charts to make comparisons, or drawing scatterplots with a line cutting across to show the trend, which, by the way, is part of descriptive statistics. Cassie Kozyrkov has written a fun and easy-to-understand article on the terms used in descriptive statistics, which I highly recommend checking out if you’re not already familiar with them.

Fictitious charts meant for illustration.

As illustrated in the charts above, if the data is taken at face value, there is an elephant in the room: how sure are we that the information truly reflects the reality on the ground? That is where statistical analysis comes in. By using statistical analysis, we can find out whether the two bars are statistically different, and whether the perceived positive trendline is statistically valid.

In other words, statistical analysis gives us more confidence to talk about our descriptive findings.

An important thing to note is that there are actually a few ways to conduct statistical analysis. The most commonly used method is the Frequentist approach, but there is another school of thought called the Bayesian approach. I won’t go into details, but I’ve previously written an article introducing Bayesian Analysis in a simple way, and Cassie has also written one comparing the two approaches.

How do we conduct statistical analysis?

I hope I’ve managed to convince you why we shouldn’t interpret our data at face value, and how conducting statistical analyses gives us more confidence about our findings. But some of you may still be concerned that you have not been trained to do this, so I’ll share some tips on how to get started.

The first step in conducting statistical analyses is knowing how to choose the right statistical test, and in order to choose the right statistical test, we need to learn how to recognise the variables in our data.

Categorical and Continuous Variables.

You might have heard of other variable names such as “discrete variable” or “numerical variable” before, but regardless of the variable types you’ve learnt in the past, they should all fall into either Categorical Variables or Continuous Variables.

Categorical Variables are variables where the values are in the form of groups, instead of quantifiable numbers. Even if there is some order in the groups, such as “Small”, “Medium”, “Large”, or “Disagree”, “Neutral”, “Agree”, the variable is still considered categorical.

Continuous Variables are variables where the values are in quantifiable numbers, often taken from some sort of measurement. Categorical Variables that have some order (e.g. Likert scales) may sometimes be treated as Continuous Variables, but the legitimacy of such a practice is often debated academically.
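To make this concrete, here is a minimal sketch of how variable types might be checked programmatically. The DataFrame and its columns are hypothetical, made up purely for illustration:

```python
import pandas as pd

# Hypothetical survey data, made up for illustration
df = pd.DataFrame({
    "size":   ["Small", "Medium", "Large", "Medium"],  # categorical (ordered)
    "agree":  [1, 3, 4, 2],                            # Likert scale: debatable
    "height": [158.2, 171.5, 183.0, 169.4],            # continuous measurement
})

# A naive check: numeric dtype -> continuous, everything else -> categorical
for col in df.columns:
    kind = "continuous" if pd.api.types.is_numeric_dtype(df[col]) else "categorical"
    print(f"{col}: {kind}")
```

Notice that this naive check flags the Likert column as continuous, which is precisely the debate mentioned above: the data type alone doesn’t settle whether a variable should be treated as categorical or continuous.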

Besides knowing whether the variables in our data are categorical or continuous, we also need to know which ones are the Independent Variables (IV) and which ones are the Dependent Variables (DV). IV and DV don’t describe the nature of a variable, but rather the part of the analysis in which it is used.

Independent and Dependent Variables.

Independent Variables are variables containing values that go into the x-part of an analysis (where y = mx + c), used to make predictions of y. Hence, in other fields such as machine learning, they are more commonly referred to as predictors, but names aside, they are actually the same thing.

Dependent Variables are variables containing values that are often used as the measurements, and while they go into the y-part of the analysis (where y = mx + c), y is typically what we are interested in predicting, and hence they are also known as the predicted variable.

Regardless of the names being used, every analysis should consist of at least one IV and one DV, and knowing whether your IV and DV are categorical or continuous will help you to determine the statistical test to use. This is done by using the cheat sheet I created and introduced in my previous post.

Statistical Analysis Cheat Sheet.

The way this cheat sheet works is that you have to know whether your IV and DV are categorical or continuous, as well as the level at which they are categorical or continuous, and then you should be able to find the intersection that tells you which statistical test to use. For example, if your IV is categorical with more than two between-subject groups, and your DV (or measure) is continuous, the appropriate test to use would be a one-way ANOVA.
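To give a flavour of how the lookup works, here is a heavily simplified sketch in Python. The type labels and the tests included are assumptions for illustration, covering only the tests discussed in this article, not the full cheat sheet:

```python
# A heavily simplified sketch of the cheat-sheet lookup. The type labels
# and the tests included are assumptions for illustration; the actual
# cheat sheet covers more distinctions (e.g. within- vs between-subject).
CHEAT_SHEET = {
    ("categorical (2 groups)",  "categorical"): "chi-squared test",
    ("categorical (2 groups)",  "continuous"):  "independent t-test",
    ("categorical (3+ groups)", "continuous"):  "one-way ANOVA",
    ("continuous",              "continuous"):  "simple linear regression",
}

def choose_test(iv: str, dv: str) -> str:
    """Return the statistical test for a given IV/DV type combination."""
    return CHEAT_SHEET.get((iv, dv), "not covered in this sketch")

print(choose_test("categorical (3+ groups)", "continuous"))  # one-way ANOVA
```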

If you haven’t previously thought about t-tests and ANOVAs in this manner, it may take some time to get used to. But once you start to notice the connections between these different statistical tests, you will find that the cheat sheet actually makes a lot of sense.

Bonus: A quick introduction to 3 common statistical tests

1. Chi-Squared (χ²) Test

First of all, the chi-squared test has nothing to do with Shang-Chi. Also, it’s pronounced “kai-squared”.

Credits: reddit

Jokes aside, you will notice that the chi-squared test is used when both your IV and DV are categorical. This applies to a situation where, for example, you are comparing the responses of males vs females for a survey question on a scale of 1 to 4 (Strongly Disagree to Strongly Agree), especially if the agree responses are grouped together, separate from the disagree responses, just like below.

Categorical IV with Categorical DV.

A scale from Strongly Disagree to Strongly Agree is actually categorical to begin with, and by grouping responses together, the number of groups will be further reduced. Hence, the chi-squared test would be the most appropriate test to inform if the responses of males vs females are statistically different.
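Here is a minimal sketch of such a test in Python, using SciPy’s chi2_contingency. The counts in the table are made up purely for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts, made up for illustration:
# rows = Male / Female, columns = Disagree (1-2) / Agree (3-4)
observed = np.array([
    [30, 70],   # Male
    [45, 55],   # Female
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
# A p-value below 0.05 would suggest that the response distributions
# of males and females are statistically different.
```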

2. Independent t-Test

However, as mentioned earlier, Likert scales are often treated as continuous variables, and so the means (or averages) from a scale of 1 to 4 in a survey may be calculated and compared between groups. When this happens, the DV becomes continuous and an independent t-test needs to be used instead.

Categorical IV with Continuous DV.

The independent t-test is basically the test used to statistically compare means between two groups. If you have a hard time remembering what it refers to, just remember it as the “two-group means comparison analysis”. In fact, because the chi-squared test and t-test allow users to compare between two groups, they are often used for what is known as A/B testing, since A and B are two groups.

One thing to note is that the independent t-test is named as such because the comparison is meant for two distinctly independent groups. If the comparison is made within the same group at two different time points (e.g. pre-post), then the more appropriate test to use would be the dependent or paired t-test. You can read more about pre-post analysis in my previous post.
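Both variants are available in SciPy. Here is a minimal sketch using made-up data; the group means, spreads, and sample sizes are assumptions for illustration only:

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(42)

# Hypothetical mean Likert scores (1-4) for two independent groups
group_a = rng.normal(loc=2.8, scale=0.6, size=50)
group_b = rng.normal(loc=3.1, scale=0.6, size=50)

# Independent t-test: compares the means of two distinct groups
t_stat, p_value = ttest_ind(group_a, group_b)
print(f"independent t-test: t = {t_stat:.2f}, p = {p_value:.4f}")

# Paired (dependent) t-test: the same group measured at two time points
pre  = rng.normal(loc=2.5, scale=0.5, size=30)
post = pre + rng.normal(loc=0.3, scale=0.4, size=30)
t_stat, p_value = ttest_rel(pre, post)
print(f"paired t-test:      t = {t_stat:.2f}, p = {p_value:.4f}")
```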

Plaque in the Dublin Guinness Storehouse commemorating William Sealy Gosset.

A fun fact about the t-test is that it was developed by the Head Brewer of Guinness, William Sealy Gosset, in 1908. So from now on when you drink the stout, you will always be reminded of this statistical test.

3. Simple Linear Regression

When your DV remains continuous and your IV becomes continuous as well, you are no longer doing a means comparison between groups. What you have instead is a test of whether a linear relationship exists between the two variables, where one variable changes in accordance with the other. This test is also known as simple linear regression.

The regression equation is related to y = mx + c.

You might have seen complicated regression equations before, but if you recall learning about y = mx + c back in high school, the regression equation is actually related to that. Essentially, the coefficient β₁ in the regression equation (y = β₀ + β₁x) is simply the slope or gradient, which is m, and β₀ is the intercept or constant, which is c. Once you understand this concept, regression becomes a lot less intimidating.

Fictitious chart meant for illustration.

The scatterplot is the go-to chart whenever both the IV and DV are continuous, and running a simple linear regression is essentially the process of drawing the “best-fit line” through the scatterplot, and letting the analysis tell us whether a statistically significant relationship exists. That’s really all simple linear regression is about, and all other regressions fall back on this concept.
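Here is a minimal sketch of fitting that best-fit line with SciPy’s linregress. The data is synthetic, generated with a built-in trend purely for illustration:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)

# Hypothetical continuous IV (x) and DV (y) with a built-in positive trend
x = rng.uniform(0, 10, size=40)
y = 2.0 * x + 5.0 + rng.normal(scale=3.0, size=40)

# Fitting the best-fit line y = mx + c through the scatterplot
result = linregress(x, y)
print(f"slope (m)     = {result.slope:.2f}")
print(f"intercept (c) = {result.intercept:.2f}")
print(f"p-value       = {result.pvalue:.4f}")  # tests whether the slope != 0
```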

One additional thing to note is that the other statistical tests such as t-tests and ANOVAs are actually built upon linear regression, and it is possible to express these tests in regression format. In my blog series on statistical analysis, I share how this is done using examples such as the independent t-test and the one-way ANOVA.
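As a quick demonstration of that equivalence, here is a sketch comparing an independent t-test against the same comparison run as a regression with a dummy-coded group variable. It assumes statsmodels is available, and the data is made up for illustration:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_a = rng.normal(loc=2.8, scale=0.6, size=50)
group_b = rng.normal(loc=3.1, scale=0.6, size=50)

# Independent t-test (equal variances assumed, matching OLS)
t_stat, p_t = ttest_ind(group_a, group_b)

# The same comparison as a regression: DV ~ dummy-coded group (0 = A, 1 = B)
y = np.concatenate([group_a, group_b])
group = np.concatenate([np.zeros(50), np.ones(50)])
X = sm.add_constant(group)          # adds the intercept term (c)
fit = sm.OLS(y, X).fit()

print(f"t-test p-value:     {p_t:.6f}")
print(f"regression p-value: {fit.pvalues[1]:.6f}")  # slope of the group dummy
# The two p-values are identical: the independent t-test is a special
# case of linear regression with a single dummy-coded predictor.
```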

Conclusion

While I promised at the start to explain statistical analysis without using any formulas, I eventually still talked about y = mx + c. But this goes to show that even just using what we learnt in high school, we are able to grasp the basic concepts of statistical analysis.

I continue to hold the belief that it is possible to simplify the concepts of technical topics to the point where laypeople are able to understand, and I hope I’ve managed to achieve that for this article. After all, Albert Einstein once famously said, “If you can’t explain it simply, you don’t understand it well enough.”



YS Chng
A curious learner sharing knowledge on science, social science and data science. (learncuriously.wordpress.com)