Statistics for Data Analysts: Descriptive Statistics with Python

Margaret Awojide
Published in CodeX
Aug 16, 2022

Statistics is the bedrock of data analysis. Mastering the tools or language you use for analysis is not enough; a working knowledge of statistics is needed to draw accurate conclusions from data. Statistics appears at every stage of the data analysis workflow, from data collection to presenting insights.

The definition of statistics itself shows its relevance to data analysis. Statistics, in simple terms, helps to transform data into information from which insights can be derived. As Carly Fiorina (former CEO, Hewlett Packard) said of the goal of data analysis:

“The goal is to turn data into information and information into insights”

The purpose of statistics is to describe or predict data. Upon this premise, statistics can be divided into 2 categories: Descriptive Statistics and Inferential Statistics.

This article focuses on descriptive statistics using Python. For illustration, we will use a case study of student performance in an exam. The data used can be found here. The data contains personal details of the students, such as gender and race/ethnicity, as well as their test scores in mathematics, reading and writing.

Descriptive Statistics

Descriptive statistics is the act of summarizing or describing data for better understanding. To convey information from your data, it needs to be described and/or summarized in a way that makes sense to the target audience. Descriptive statistics can also be divided into 2 main categories: Measures of Central Tendency & Measures of Dispersion.

Measures of Central Tendency

The measures of central tendency attempt to use a single value to describe a set of data by identifying the central position of the data. The 3 most prominent measures are the Mean, Median and Mode.

Mean

The mean, or average, is a measure that attempts to summarize data by calculating the central position using the formula:

mean = (sum of all values) / (number of values)

The mean can only be applied to quantitative data. In the student performance dataset, only the test scores are quantitative, so they are the only columns for which an average can be computed.

The mean is very simple to calculate and gives a single value that summarizes the data. However, the mean is very sensitive to outliers/extreme values. For instance, the incomes of 5 men selected at random might be $100, $200, $100, $100000000 and $50. The mean of this data, roughly $20 million, is not a fair representation of any of the five men. For data with a lot of extreme values, the mean might not be the best choice.
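
To see how strongly a single extreme value pulls the mean, here is a minimal sketch using the five incomes from the example above. It uses Python's built-in statistics module rather than NumPy or Pandas, and the variable names are illustrative only.

```python
import statistics

# Incomes of the five men from the example above, in dollars
incomes = [100, 200, 100, 100_000_000, 50]

mean_income = statistics.mean(incomes)      # pulled up by the one extreme value
median_income = statistics.median(incomes)  # middle value of the sorted incomes

print("Mean:", mean_income)      # 20000090 (about 20 million)
print("Median:", median_income)  # 100
```

Four of the five men earn $200 or less, yet the mean lands in the tens of millions, while the median stays at $100.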

In Python, there are several ways the mean can be calculated, some of which are:

# From scratch
import math

def mean(list1):
    total = math.fsum(list1)  # accurate floating-point sum
    n = len(list1)
    return total / n

values = [1, 2, 3, 4, 5]
print('From Scratch:', mean(values))

# Using NumPy
import numpy as np
print('Using Numpy:', np.mean(values))

# Using Pandas for the student data
import pandas as pd
data = pd.read_csv("student_data.csv")
avg_reading = data['reading score'].mean()

Median

The median is another measure of central tendency that finds a central value for a set of quantitative data: the middle value once the data is sorted (or the average of the two middle values when the number of data points is even). The median is also very easy to compute and, unlike the mean, it is not affected by outliers. However, the median does not take the magnitude of every data point into account. It emphasizes position rather than value and is sometimes referred to as a measure of location/position.

# Using NumPy
import numpy as np
np.median(values)

# Using Pandas for the student data
import pandas as pd
data['writing score'].median()
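
To make the "position rather than value" idea concrete, the median can also be sketched from scratch. This is a minimal illustration, not how NumPy or Pandas implement it; the function name and sample lists are made up for the example.

```python
def median(values):
    """Return the middle value of a list of numbers."""
    if not values:
        raise ValueError("median() needs at least one value")
    ordered = sorted(values)  # position only makes sense on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([5, 1, 3]))         # 3
print(median([4, 1, 3, 2]))      # 2.5
print(median([1, 2, 3, 1000]))   # 2.5
```

Replacing 4 with 1000 in the second list leaves the median unchanged, because only the positions of the middle values matter.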

Mode

The mode is the most frequently occurring value in a set of data. Unlike the other measures of central tendency, it works well for qualitative/categorical data. The mode is also sometimes applied to discrete data. Note that a set of data can have more than one mode. For a small dataset with few or no repeated values, the mode is not an informative measure.

import pandas as pd
# mode() returns a Series, because several values can tie for most frequent
data['gender'].mode()
data['race/ethnicity'].mode()
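
The two caveats above (ties, and data with no repeats) can be illustrated with a short sketch. It uses statistics.multimode from the standard library (Python 3.8+) instead of Pandas, and the sample lists are hypothetical, not taken from the student dataset.

```python
import statistics

# Hypothetical categorical data with a tie: "B" and "A" each appear twice
grades = ["B", "A", "C", "B", "A", "D"]
tied_modes = statistics.multimode(grades)
print(tied_modes)  # ['B', 'A']

# With no repeated values, every value is returned as a "mode",
# which tells us nothing useful about the data
no_repeats = [10, 20, 30]
all_values = statistics.multimode(no_repeats)
print(all_values)  # [10, 20, 30]
```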

Measures of Dispersion

The measures of dispersion describe how scattered or spread out a dataset is. For most measures of dispersion, the variation of the data from the mean is calculated. Some examples of these measures are: Range, Mean Absolute Deviation & Variance/Standard Deviation.

Range

This is the simplest measure of dispersion/variability. It is simply the difference between the maximum and minimum values in a set of data. As you might have guessed, the range is not a reliable measure of dispersion because it is based on only two data points. The range does not change even when the values in between change.

def value_range(list1):
    # Named value_range to avoid shadowing Python's built-in range()
    return max(list1) - min(list1)

value_range(values)
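
As a small illustration of why the range can mislead, the sketch below compares two made-up lists that share the same minimum and maximum but are distributed very differently in between. The helper name and data are assumptions for the example.

```python
def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

# Same extremes, very different spread in the middle
tightly_packed = [0, 50, 50, 50, 100]
widely_spread = [0, 1, 50, 99, 100]

print(value_range(tightly_packed))  # 100
print(value_range(widely_spread))   # 100
```

Both lists report the same range, even though the second is clearly more scattered.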

Mean Absolute Deviation

The Mean Absolute Deviation is a measure of dispersion that finds the average of the absolute deviations of the data points from their mean. It is a better measure than the range because every data point contributes to it, not just the two extremes.

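# Series.mad() was removed in pandas 2.0, so compute it directly
mad_reading = (data['reading score'] - data['reading score'].mean()).abs().mean()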
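
For readers who want to see the calculation step by step, here is a minimal from-scratch sketch of the Mean Absolute Deviation on a small list. The function name is illustrative.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    n = len(values)
    center = sum(values) / n
    return sum(abs(x - center) for x in values) / n

values = [1, 2, 3, 4, 5]
# Mean is 3; absolute deviations are 2, 1, 0, 1, 2; their average is 6 / 5
mad_values = mean_absolute_deviation(values)
print(mad_values)  # 1.2
```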

Variance/Standard Deviation

Like other measures of dispersion, the variance describes how dispersed a dataset is. It is calculated as the average of the squared deviations from the mean. The standard deviation is the square root of the variance. The standard deviation is the most popular measure of dispersion because it uses every data point and is expressed in the same units as the data, which makes it easy to interpret.

# Using Pandas (std() returns the sample standard deviation, ddof=1, by default)
data['reading score'].std()
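
One detail worth knowing: dividing the squared deviations by n gives the population variance, while dividing by n - 1 gives the sample variance. Pandas' std() uses the sample version by default, whereas NumPy's np.std() uses the population version by default. The sketch below shows the difference with the standard library's statistics module on a small illustrative list.

```python
import statistics

values = [1, 2, 3, 4, 5]

# Population versions: squared deviations averaged over n
pop_var = statistics.pvariance(values)
pop_std = statistics.pstdev(values)

# Sample versions: squared deviations divided by n - 1
sample_var = statistics.variance(values)
sample_std = statistics.stdev(values)

print(f"Population variance: {pop_var}, std: {pop_std:.3f}")   # 2 and 1.414
print(f"Sample variance: {sample_var}, std: {sample_std:.3f}")  # 2.5 and 1.581
```

The gap between the two shrinks as the dataset grows, but it is a common reason why NumPy and Pandas report slightly different standard deviations for the same data.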

Conclusion

Descriptive statistics comes in handy during the data exploration phase of data analysis. Sometimes, a statistic might be presented to the target audience to support a point. An example is a report by Healthline in which the CDC (Centers for Disease Control and Prevention) stated that the average height of an American man, 20 years and above, is 5ft 9in.

Thanks for reading! The next blog post will discuss Inferential Statistics for Data Analysts. The exploratory analysis of the student data, using the descriptive statistics concepts in this article, can be found here. I hope you learnt something! Kindly follow me so that you'll know when the next article drops.
