Z test — Demystified

Anirudh Dayma
Published in Analytics Vidhya
6 min read · Apr 23, 2020
Photo by Ben Kolde on Unsplash

We all get nervous when we hear about statistical tests, and of the many tests out there, one of the most famous is the Z test. You might know when to use it, but you may never have wondered why its formula looks the way it does.

I promise you that by the end of this article you will feel like you own this test, and you will be amazed by how much you know about it.

Let's get started.

This article assumes that you have some knowledge of the Central Limit Theorem. If you don't, have a look at my previous post.

Before jumping into the Z test, we will go through some concepts that will help us better understand this article.

Normal Distribution:

Normal distribution

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about its mean. Most of the values lie near the mean, and as we move away from the mean in either direction, the probability density decreases. It is also called a bell-shaped curve. Its mean, median and mode coincide.

Standard Normal Distribution:

Source: mathsisfun

It is a special case of the normal distribution with mean = 0 and standard deviation = 1.

To convert a normal distribution to the standard normal distribution we use the z score, also called the standard score.

Formula for z score:

z = (x - µ) / σ

x = value that we want to standardize

µ = mean of the distribution of x

σ = standard deviation of the distribution of x
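As a quick illustration of the z score, z = (x - µ) / σ, here is a minimal Python sketch. The `heights` list is made-up data and `z_scores` is a hypothetical helper name; the point is only that after standardizing, the values have mean 0 and standard deviation 1.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values using their own mean and population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


heights = [150, 160, 170, 180, 190]  # made-up data, for illustration only
standardized = z_scores(heights)

print(standardized)
# After standardizing, the mean is (numerically) 0 and the standard deviation is 1.
print(mean(standardized), pstdev(standardized))
```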

Empirical Rule:

Source: SAS

It states that approximately 68%, 95% and 99.7% of the data lie within 1, 2 and 3 standard deviations of the mean of a normal distribution, respectively.
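If you want to check these percentages yourself, the sketch below uses `NormalDist` from Python's standard `statistics` module. The helper name `prob_within` is my own choice for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def prob_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

The printed values should come out close to 0.68, 0.95 and 0.997, which is where the rule gets its name "68-95-99.7".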

Central Limit theorem:

The Central Limit Theorem (CLT) says that the mean of the sampling distribution of the sample means is equal to the population mean, irrespective of the distribution of the population, provided the sample size is large enough (a common rule of thumb is greater than 30).

Let us try to understand the two key terms above. "Sampling distribution" means that the distribution is built from samples, and "of the sample means" implies that the statistic being plotted is the mean of each sample. In the Central Limit Theorem we draw a number of samples, each of size greater than 30, calculate the mean of each sample and then plot those means.

It also states that the sampling distribution of the sample means will approximately follow a normal distribution.

Mathematically it states that

Let μ be the population mean and σ be the population standard deviation. If we draw multiple samples of size n from the population, then according to the CLT the mean of the sampling distribution of sample means is given as

μ_X̄ = μ (mean of the sample means = population mean)

and the standard deviation of the sampling distribution of sample means is given as

σ_X̄ = σ / √n

The above term is also called the standard error. Any distribution has a standard deviation; in the CLT we have a distribution of sample means, and the standard deviation of the sample means is called the standard error of the mean (just a fancy term).

Similarly, if we plot a distribution of sample variances, the standard deviation of that distribution is called the standard error of the variance.
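To see the two CLT claims in action (mean of the sample means = population mean, and standard error = σ/√n), here is a minimal simulation sketch. The exponential population (which has mean 1 and standard deviation 1), the sample size of 40, the 5000 repetitions and the seed are all assumptions picked only for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)   # fixed seed so the run is repeatable
pop_mean = 1.0            # mean of an exponential distribution with rate 1
pop_sd = 1.0              # its standard deviation
n = 40                    # size of each sample
num_samples = 5000        # how many samples we draw

# Draw many samples from a skewed (clearly non-normal) population
# and record the mean of each one.
sample_means = []
for _ in range(num_samples):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)

print("mean of sample means:", mean_of_means, "vs population mean:", pop_mean)
print("sd of sample means:  ", sd_of_means, "vs sigma / sqrt(n):", pop_sd / sqrt(n))
```

Even though the population is heavily skewed, the two printed pairs should be close to each other, and a histogram of `sample_means` would look roughly bell shaped.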

Time for Climax

Now that we are done with the prerequisites, let's look at how the above topics relate to the Z test and try to connect the dots.

The Z test is used to check whether or not a sample comes from a population with mean μ. To do this we check whether the sample mean lies close to or far from the population mean. If the sample mean lies far away from the population mean, we say the sample comes from a different population; if it lies close, we say it is consistent with coming from the same population.

To do this we use a formula and check whether the absolute value of the z statistic is greater than or less than 1.96 (considering a two-tailed test with alpha = 5%).

Z statistic:

z = (X̄ - μ) / (σ / √n)

z = z statistic

X̄ = sample mean

μ = population mean

σ = population standard deviation

n = sample size
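Here is a small Python sketch of this check, using z = (X̄ - μ) / (σ / √n) and the two-tailed cutoff of 1.96. The function name `z_test` and the numbers in the usage example (a sample of 50 with mean 103, against a population with mean 100 and standard deviation 15) are made up for illustration.

```python
from math import sqrt


def z_test(sample_mean, pop_mean, pop_sd, n, critical=1.96):
    """Return the z statistic and whether |z| exceeds the critical value.

    critical=1.96 corresponds to a two-tailed test with alpha = 5%.
    """
    standard_error = pop_sd / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    return z, abs(z) > critical


# Hypothetical numbers, chosen only to show the call.
z, is_far = z_test(sample_mean=103, pop_mean=100, pop_sd=15, n=50)
print(z, is_far)
```

In this made-up case z comes out at about 1.41, which is below 1.96, so the sample mean is not far enough from the population mean to say the sample comes from a different population.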

Let us try to understand why we have this formula.

So we have a population with mean μ and a sample with mean X̄. Using the CLT we can imagine drawing many samples and plotting the distribution of their means. According to the CLT, the mean of this distribution of sample means will be equal to the population mean μ, and its standard deviation will be σ/√n, where σ is the population standard deviation and n is the sample size.

In the Z test we basically want to check how far the sample mean lies from the population mean. We will always encounter different sample means, and for each of them we would repeat the same set of calculations to check how far it is from the population mean.

To avoid redoing this work for every new scale, we standardize our distribution. But a z score can be compared against the standard normal distribution only if the value comes from a normal distribution, and we are not sure whether our distribution is normal. So how are we going to achieve this?

Remember that above we said we would imagine many sample means, and that our sample mean (X̄) comes from the distribution of sample means. The CLT says that the distribution of sample means follows a normal distribution (approximately, for a large enough sample size).

Our statistic, the sample mean (X̄), hence comes from a normal distribution, and so we can standardize it using the z score formula.

Z score formula:

z = (x - μ) / σ

So in the above formula we want to standardize x, which comes from a distribution with mean μ and standard deviation σ.

But in our case we want to standardize the sample mean X̄, which comes from a distribution with mean μ (the population mean, which equals the mean of the sample means) and standard deviation σ/√n. Substituting these values into the z score formula, we get

Z statistic:

z = (X̄ - μ) / (σ / √n)
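To convince ourselves that standardizing the sample mean this way really gives something that behaves like a standard normal value, here is a rough simulation sketch. The uniform population on [0, 1] (mean 0.5, standard deviation √(1/12)), the sample size, the number of repetitions and the seed are assumptions made only for this example.

```python
import random
from math import sqrt

rng = random.Random(0)    # fixed seed so the run is repeatable
mu = 0.5                  # mean of a uniform(0, 1) population
sigma = sqrt(1 / 12)      # its standard deviation
n = 40                    # size of each sample
num_samples = 5000        # how many samples we draw


def z_statistic(sample):
    """Standardize the sample mean: z = (x_bar - mu) / (sigma / sqrt(n))."""
    x_bar = sum(sample) / len(sample)
    return (x_bar - mu) / (sigma / sqrt(len(sample)))


z_values = [
    z_statistic([rng.random() for _ in range(n)])
    for _ in range(num_samples)
]

# If the z statistic is roughly standard normal, about 95% of the
# values should fall between -1.96 and 1.96.
share_inside = sum(abs(z) < 1.96 for z in z_values) / num_samples
print(share_inside)
```

The population here is flat, not bell shaped, yet the share of z values inside ±1.96 should still land near 0.95, which is what the CLT argument above predicts.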

Generally in hypothesis tests we consider alpha = 5%, so in a two-tailed Z test the critical values are 1.96 and -1.96. Where does 1.96 come from?

Source: sfu.ca

The above graph is a standard normal distribution, which, as we have seen, is itself a normal distribution. Using the Empirical rule we know that approximately 95% of values lie within 2 standard deviations of the mean of a normal distribution. Considering the total area under the graph as 100%, 100 - 95 = 5% is left, and our shaded region is 2.5 + 2.5 = 5%. So 1.96 standard deviations ≈ 2 standard deviations. More precisely, for a normal distribution almost exactly 95% of values lie within 1.96 standard deviations of the mean, hence the value 1.96.
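You don't have to take 1.96 on faith. The sketch below recovers it with `NormalDist.inv_cdf` from Python's standard `statistics` module: with 2.5% in each tail, the upper critical value is the point that has 97.5% of the area to its left. The variable names are my own.

```python
from statistics import NormalDist

alpha = 0.05                      # 5% significance level, two-tailed
std_normal = NormalDist()         # mean 0, standard deviation 1

# 2.5% in the upper tail means 97.5% of the area lies to the left.
critical = std_normal.inv_cdf(1 - alpha / 2)

# Area between -critical and +critical should be 95%.
central_area = std_normal.cdf(critical) - std_normal.cdf(-critical)

print(round(critical, 2))         # 1.96
print(round(central_area, 4))     # 0.95
```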

I hope that you now have an in-depth understanding of the Z test and feel like you own it. If you are amazed and feel that I have kept the promise I made at the start of the article, then please do clap! clap! and clap!, as this will motivate me to come up with more such intuitive articles.

Feel free to drop comments or questions below. You can find me on LinkedIn.
