Z-Score and How It’s Used to Determine an Outlier

Published in

Clarusway

5 min read · Feb 1, 2021

One of the most commonly used tools for detecting outliers is the Z-score. A Z-score is simply the number of standard deviations that a data point lies away from the mean.

In your future data science life, Z-scores are going to be a really useful way to think about how usual or how unusual a certain data point is. That becomes really valuable once we start making inferences based on our data.

In this story, we will take a deep dive into our notebooks and learn how to detect outliers using Z-Score.

First of all, we need to import our libraries.

To create our data, we generate 200 samples from a normal distribution centered around the value 100 with a standard deviation of 5.
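A minimal sketch of this step could look like the following. The seeded generator (and the seed value 42) is an assumption added here so the run is reproducible; the variable name `data` is also just illustrative.

```python
import numpy as np

# Assumption: a seeded generator so the same numbers come out on every run.
rng = np.random.default_rng(42)

# 200 samples from a normal distribution with mean 100 and standard deviation 5.
data = rng.normal(loc=100, scale=5, size=200)

print(data[:5])    # first few samples
print(data.shape)  # (200,)
```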

Once we have the array, as shown below, we are ready to go.

First, let's see what our data looks like. We use seaborn to visualize it.

As the chart below shows, most of the values fall between 90 and 110. That is what we expect from a normal distribution with a mean of 100 and a standard deviation of 5: about 95% of the values lie within two standard deviations of the mean.
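If you want to check the "most values are between 90 and 110" claim numerically instead of reading it off a chart, a quick sketch like this works. The seed and variable names are assumptions for illustration.

```python
import numpy as np

# Assumed setup: same kind of sample as in the story (seed is arbitrary).
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=5, size=200)

# Share of samples inside mean +/- 2 standard deviations, i.e. [90, 110].
within = np.mean((data >= 90) & (data <= 110))
print(f"{within:.1%} of the samples fall between 90 and 110")
```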

To see the Z-score in action, we need outliers in our array, so we change the data points 90 and 50 as seen below.
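One way this step might look is sketched below. It assumes that 90 and 50 are index positions in the array, and the replacement values 150 and 40 are hypothetical, chosen only because they sit far away from the rest of the data.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed for reproducibility
data = rng.normal(loc=100, scale=5, size=200)

# Assumption: 90 and 50 are positions in the array.
# The values 150 and 40 are hypothetical extremes, not taken from the story.
data[90] = 150.0
data[50] = 40.0

print(data[[90, 50]])
```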

Now we need to visualize our new array.

In the chart below, you can easily see the outliers we created in our array.

Now we are ready to go. We create a data frame from our distribution using the code below, with a single column named “Data”.
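Building the data frame can be sketched like this. The array setup (seed, outlier positions, and outlier values) is assumed for illustration; only the column name “Data” comes from the story.

```python
import numpy as np
import pandas as pd

# Assumed setup: seeded samples plus two hypothetical outliers.
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=5, size=200)
data[90], data[50] = 150.0, 40.0

# One column named "Data", one row per sample.
df = pd.DataFrame({"Data": data})
print(df.head())
```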

Our data frame with the column “Data” is as seen below.

Calculating The Z-Score

Now that we have a data frame with outliers, we can calculate the Z-score for each value in a column.

We create a new column in the data frame called "Data_zscore": for each row, we subtract the column's mean from the row value and divide by the column's standard deviation. The pandas std() method normalizes by N-1 by default. This can be changed using the ddof (Delta Degrees of Freedom) argument, which defaults to 1. The divisor used in the calculation is N - ddof, where N represents the number of elements. We get the output as below:

A Z-score is essentially how many standard deviations my actual value is away from the mean value. Based on the business context, you can define the Z-score threshold that classifies a point as an outlier. Here we use 3 as the threshold value, as seen below:
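As a sketch, the calculation could be written as follows. The data setup is assumed (seed and outlier values are hypothetical), and `ddof=1` is spelled out only to make the default visible.

```python
import numpy as np
import pandas as pd

# Assumed setup: seeded samples plus two hypothetical outliers.
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=5, size=200)
data[90], data[50] = 150.0, 40.0
df = pd.DataFrame({"Data": data})

# (value - column mean) / column standard deviation.
# ddof=1 is the pandas default, so the divisor is N - 1.
df["Data_zscore"] = (df["Data"] - df["Data"].mean()) / df["Data"].std(ddof=1)
print(df.head())
```

By construction, the new column has a mean of roughly 0 and a standard deviation of roughly 1.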
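A minimal sketch of the thresholding step is shown here. Storing the flag as 0/1 integers is an assumption; a boolean column would work just as well. The data setup is also assumed.

```python
import numpy as np
import pandas as pd

# Assumed setup: seeded samples plus two hypothetical outliers.
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=5, size=200)
data[90], data[50] = 150.0, 40.0
df = pd.DataFrame({"Data": data})
df["Data_zscore"] = (df["Data"] - df["Data"].mean()) / df["Data"].std(ddof=1)

threshold = 3  # threshold used in the story

# Flag rows whose absolute Z-score exceeds the threshold (1 = outlier, 0 = not).
df["outlier"] = (df["Data_zscore"].abs() > threshold).astype(int)
print(df.head())
```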

abs() is one of the simplest pandas functions. It returns an object with the absolute value of each element, and it only applies to objects that are entirely numeric. Missing values (NaN) are passed through and stay NaN. abs() can also be used with complex numbers, where it returns their magnitude.

So we have created a new column named “outlier”. Let's look at our data frame.

We get the output as seen below. Now we have the Data, Data_zscore, and outlier columns.

Let's find the rows that the Z-score has classified as outliers.
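Here is a tiny illustration of how abs() behaves on a pandas Series. The sample values are made up for the example.

```python
import numpy as np
import pandas as pd

# Negative, positive, and missing values.
s = pd.Series([-1.5, 2.0, np.nan])
print(s.abs().tolist())  # [1.5, 2.0, nan]

# For complex numbers, abs() returns the magnitude.
c = pd.Series([3 + 4j])
print(c.abs().tolist())  # [5.0]
```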
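Selecting the flagged rows can be sketched with a simple boolean filter. The setup below is assumed (seed, outlier positions and values, and the 0/1 encoding of the “outlier” column).

```python
import numpy as np
import pandas as pd

# Assumed setup: seeded samples plus two hypothetical outliers.
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=5, size=200)
data[90], data[50] = 150.0, 40.0
df = pd.DataFrame({"Data": data})
df["Data_zscore"] = (df["Data"] - df["Data"].mean()) / df["Data"].std(ddof=1)
df["outlier"] = (df["Data_zscore"].abs() > 3).astype(int)

# Keep only the rows flagged as outliers.
outlier_rows = df[df["outlier"] == 1]
print(outlier_rows)
```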

As the output above shows, the Z-score found the outliers that we previously added to our array. For larger data sets, we can count the outliers with the code below. Here we have 2 outliers.

We can also print the number of outliers with the code below.
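Counting can be done in a couple of ways. This sketch assumes the “outlier” column holds 0/1 values, so a plain sum gives the number of outliers; the data setup is again assumed.

```python
import numpy as np
import pandas as pd

# Assumed setup: seeded samples plus two hypothetical outliers.
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=5, size=200)
data[90], data[50] = 150.0, 40.0
df = pd.DataFrame({"Data": data})
df["Data_zscore"] = (df["Data"] - df["Data"].mean()) / df["Data"].std(ddof=1)
df["outlier"] = (df["Data_zscore"].abs() > 3).astype(int)

# Rows per class (0 = not an outlier, 1 = outlier).
print(df["outlier"].value_counts())

# Since the flag is 0/1, the sum is the number of outliers.
n_outliers = int(df["outlier"].sum())
print(f"Number of outliers: {n_outliers}")
```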

In the example above, we set the Z-score threshold to 3 manually to get a better understanding of how the Z-score is used to determine outliers. Once that is clear, we can go further and automate the process with the code blocks below. We run them one by one. Just make sure your df name matches the name used in the code blocks.

As seen above, our code blocks visualized the outliers and calculated the optimum Z-score threshold to use in our data frame to determine the outliers. In this example, it is calculated as 1, which can be seen on the 3rd chart.

We can repeat the outlier detection with the new Z-score threshold (1) we got from our code blocks.
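One simple way to explore candidate thresholds is to count how many points each one flags. This is offered as a rough sketch and is not necessarily what the code blocks above do. The candidate list and the data setup are assumptions, so the counts will not necessarily match the numbers in this story.

```python
import numpy as np
import pandas as pd

# Assumed setup: seeded samples plus two hypothetical outliers.
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=5, size=200)
data[90], data[50] = 150.0, 40.0
df = pd.DataFrame({"Data": data})
z = (df["Data"] - df["Data"].mean()) / df["Data"].std(ddof=1)

# Hypothetical candidate thresholds to compare.
thresholds = [1.0, 1.5, 2.0, 2.5, 3.0]
counts = {t: int((z.abs() > t).sum()) for t in thresholds}

for t, n in counts.items():
    print(f"threshold {t}: {n} points flagged")
```

A stricter (larger) threshold can only flag the same number of points or fewer, which makes it easy to see where the count stops changing much.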

With the new value, we detected 5 outliers. We may drop the outlier rows and visualize the new data frame we get.
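Dropping the flagged rows can be sketched as follows. `drop_outliers` is a hypothetical helper written for this example, and `threshold=3` is used purely for illustration; pass whichever threshold you settled on. The data setup is assumed as well.

```python
import numpy as np
import pandas as pd

def drop_outliers(frame, column, threshold):
    """Hypothetical helper: keep rows whose absolute Z-score is within the threshold."""
    z = (frame[column] - frame[column].mean()) / frame[column].std(ddof=1)
    return frame[z.abs() <= threshold].reset_index(drop=True)

# Assumed setup: seeded samples plus two hypothetical outliers.
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=5, size=200)
data[90], data[50] = 150.0, 40.0
df = pd.DataFrame({"Data": data})

clean_df = drop_outliers(df, "Data", threshold=3)
print(len(df), "rows before,", len(clean_df), "rows after")
```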

As seen below, we get a normal distribution without any outliers. Our df is now ready for further machine learning steps. For larger data sets, we can apply the method to each column one by one and get rid of the outliers.

Conclusion

If you know the mean and the standard deviation, you can compute the Z-score. Take your data point, subtract the mean from it, and then divide by the standard deviation. That gives you your Z-score.

You can use the Z-score to determine outliers. Once you have identified them, it is up to you whether to delete them or to handle them with a log transform, winsorizing, or similar methods. After that, your data is ready to be used as machine learning features. Here we focused only on how to use the Z-score and did not go into the discussions around it. For further info, you may follow the new stories.
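As a quick worked example of that recipe, here is a tiny function for a single value. The name `z_score` and the sample numbers are just for illustration.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mean) / std

# With a mean of 100 and a standard deviation of 5:
print(z_score(115, 100, 5))  # 3.0, three standard deviations above the mean
print(z_score(90, 100, 5))   # -2.0, two standard deviations below the mean
```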

idenw/statistics-with-python

github.com

Written by Iden W.