Learn Basic Statistics with Python

Shahzaib Khan
18 min read · Jul 24, 2020


Find out how to describe, summarize, and represent your data visually using NumPy, SciPy, Pandas, Matplotlib, and the built-in Python statistics library.

In the modern world, everything is data-driven. The amount of data produced every second runs into terabytes, so the ability to describe, summarize, and represent data visually is essential. That is where the field of statistics comes into play.

Statistics is the science of collecting, organizing, analyzing, and interpreting data. This knowledge helps you use the proper methods to collect the data, employ the correct analyses, and effectively present the results.

Data scientists must have a deep understanding of statistical concepts in order to carry out quantitative analysis on the available data. Learning statistics is therefore essential for success in data science.

Unlike other tutorials, this is a concise tutorial for people who think that reading is boring. I try to show everything with simple code examples; there are no long and complicated explanations with fancy words.

The current post focuses on the following topics:

Understanding Descriptive Statistics
Choosing Python Statistics Libraries
Measures of Central Tendency
Measures of Variability
Summary of Descriptive Statistics
Measures of Correlation Between Pairs of Data
Visualizing Data

Note: For this post we will be using Windows as our operating system and PyCharm as our IDE.

Understanding Descriptive Statistics

Descriptive statistics is about describing and summarizing data. It uses two main approaches:

  1. The quantitative approach: describes and summarizes data numerically.
  2. The visual approach: illustrates data with charts, plots, histograms, and other graphs.

You can apply descriptive statistics to one or many datasets or variables. When you describe and summarize a single variable, you’re performing univariate analysis. When you search for statistical relationships among a pair of variables, you’re doing bivariate analysis. Similarly, multivariate analysis is concerned with multiple variables at once.

Types of Measures

In this tutorial, you’ll learn about the following types of measures in descriptive statistics:

  • Central tendency tells you about the centers of the data. Useful measures include the mean, median, and mode.
  • Variability tells you about the spread of the data. Useful measures include variance and standard deviation.
  • Correlation or joint variability tells you about the relation between a pair of variables in a dataset. Useful measures include covariance and the correlation coefficient.

You’ll learn how to understand and calculate these measures with Python.

Population and Samples

In statistics, the population is the set of all elements or items that you’re interested in. Populations are often vast, which makes collecting and analyzing all of their data impractical. That’s why statisticians usually try to make some conclusions about a population by choosing and examining a representative subset of that population.

This subset of a population is called a sample. Ideally, the sample should preserve the essential statistical features of the population to a satisfactory extent. That way, you’ll be able to use the sample to glean conclusions about the population.
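To make this concrete, here is a minimal sketch using Python’s built-in random and statistics modules (the population values are invented for illustration):

import random
import statistics

population = list(range(1, 101))        # a small population we know completely
sample = random.sample(population, 10)  # a random sample of 10 elements

print(statistics.mean(population))      # the true population mean: 50.5
print(statistics.mean(sample))          # the sample mean approximates it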

Outliers

An outlier is a data point that differs significantly from the majority of the data taken from a sample or population. There are many possible causes of outliers, but here are a few to start you off:

  • Natural variation in data
  • Change in the behavior of the observed system
  • Errors in data collection

Data collection errors are a particularly prominent cause of outliers. For example, the limitations of measurement instruments or procedures can mean that the correct data is simply not obtainable. Other errors can be caused by miscalculations, data contamination, human error, and more.

There isn’t a precise mathematical definition of outliers. You have to rely on experience, knowledge about the subject of interest, and common sense to determine if a data point is an outlier and how to handle it.
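As a simple illustration, here is one common rule of thumb (by no means the only one): flag points that lie more than two standard deviations from the mean. The data values are made up:

import statistics

data = [2, 3, 3, 4, 5, 100]
mean = statistics.mean(data)
stdev = statistics.stdev(data)

# flag values more than 2 standard deviations from the mean
outliers = [v for v in data if abs(v - mean) / stdev > 2]
print(outliers)  # [100]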

Choosing Python Statistics Libraries

There are many Python statistics libraries out there for you to work with, but in this tutorial, you’ll be learning about some of the most popular and widely used ones:

  • Python’s statistics is a built-in Python library for descriptive statistics. You can use it if your datasets are not too large or if you can’t rely on importing other libraries.
  • NumPy is a third-party library for numerical computing, optimized for working with single- and multi-dimensional arrays. Its primary type is the array type called ndarray. This library contains many routines for statistical analysis.
  • SciPy is a third-party library for scientific computing based on NumPy. It offers additional functionality compared to NumPy, including scipy.stats for statistical analysis.
  • Pandas is a third-party library for numerical computing based on NumPy. It excels in handling labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.
  • Matplotlib is a third-party library for data visualization. It works well in combination with NumPy, SciPy, and Pandas.

Installing Statistics Libraries

To start with, we will be using PyCharm, a Python IDE (Integrated Development Environment).

You can download the free PyCharm Community Edition from the official JetBrains website.

Once you have set up Python and PyCharm, let’s install the statistics libraries.

To install libraries in PyCharm, click File and go to Settings. Under Settings, choose your Python project and select Python Interpreter.

You will see a + button. Click it and search for the packages in the search field one by one. The package appears on the left side, with its description and version on the right side.

Note that the statistics module is built into Python, so there is nothing to install for it.

For the third-party libraries, select a package and click Install Package at the bottom left. It will install the package.

Now repeat the same steps for the other libraries: NumPy, SciPy, Pandas, and Matplotlib. Look for their packages and install them.

How to test whether the libraries are installed?

After installing the packages, you can easily check whether each one is installed: just review the package list under Python Interpreter, where every installed package appears with its version.
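You can also check from code. A minimal sketch, assuming the standard import names: import each library and print its version. If every import succeeds, the installation worked.

import statistics  # built into Python, nothing to install
import numpy as np
import scipy
import pandas as pd
import matplotlib

print(np.__version__)
print(scipy.__version__)
print(pd.__version__)
print(matplotlib.__version__)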

So we are all set to start learning Statistics with Python.

Measures of Central Tendency

The measures of central tendency show the central or middle values of datasets. There are several definitions of what’s considered to be the center of a dataset. In this tutorial, you’ll learn how to identify and calculate these measures of central tendency: the mean, the weighted mean, the median, and the mode.

Mean

The sample mean, also called the sample arithmetic mean or simply the average, is the arithmetic average of all the items in a dataset. You can calculate the mean with pure Python using sum() and len(), without importing libraries:

x = [8.0, 1, 2.5, 4, 28.0]
mean = sum(x) / len(x)
print(mean)

Output:

8.7

Although this is clean and elegant, you can also use the NumPy library:

import numpy as np
x = [8.0, 1, 2.5, 4, 28.0]
mean = np.mean(x)
print(mean)

Note — If there are nan values among your data, then np.mean() will return nan as the output, so you will need to handle it. For that, use another NumPy function, np.nanmean():

import math
import numpy as np

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
mean = np.nanmean(x_with_nan)
print(mean)

Output:

8.7
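The built-in statistics library and Pandas offer the same measure; note that a Pandas Series skips nan values by default, so it behaves like np.nanmean() here:

import math
import statistics
import pandas as pd

x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

print(statistics.mean(x))            # 8.7
print(pd.Series(x_with_nan).mean())  # 8.7 (nan is ignored by default)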

Weighted Mean

The weighted mean, also called the weighted arithmetic mean or weighted average, is a generalization of the arithmetic mean that enables you to define the relative contribution of each data point to the result.

The weighted mean is very handy when you need the mean of a dataset containing items that occur with given relative frequencies.

For example, say that you have a set in which 20% of all items are equal to 2, 50% of the items are equal to 4, and the remaining 30% of the items are equal to 8. You can calculate the mean of such a set like this:

import numpy as np

x = [2, 4, 8]
w = [0.2, 0.5, 0.3]
weighted_mean = np.average(x, weights=w)
print(weighted_mean)

Output:

4.8

Note — Here we have used the NumPy function average(), which accepts optional weights.
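For comparison, here is the same calculation in pure Python: the sum of weight times value, divided by the sum of the weights.

x = [2, 4, 8]
w = [0.2, 0.5, 0.3]

# sum of weight * value over the sum of weights
weighted_mean = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
print(weighted_mean)  # 4.8, up to floating-point rounding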

Median

The sample median is the middle element of a sorted dataset.

import numpy as np

x = [8.0, 1, 2.5, 4, 28.0]
median = np.median(x)
print(median)

Output:

4.0

Just like the mean, you can also find the median when there are nan values among your data:

import math
import numpy as np

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
median = np.nanmedian(x_with_nan)
print(median)

Output:

4.0
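Note that with an even number of elements, the median is the average of the two middle values of the sorted data. A quick sketch:

import numpy as np

y = [8.0, 1, 2.5, 4]  # sorted: 1, 2.5, 4, 8
print(np.median(y))   # (2.5 + 4) / 2 = 3.25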

Mode

The sample mode is the value in the dataset that occurs most frequently. To find the mode, we can use the built-in Python statistics library:

import statistics

z = [2, 3, 2, 8, 12]
mode = statistics.mode(z)
print(mode)

Output:

2
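Be aware that on Python versions before 3.8, statistics.mode() raises an error when the data has more than one modal value. With Python 3.8+, statistics.multimode() returns all of them; the extra value here is made up to force a tie:

import statistics

z = [2, 3, 2, 8, 12, 3]         # both 2 and 3 appear twice
print(statistics.multimode(z))  # [2, 3]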

Measures of Variability

The measures of central tendency aren’t sufficient to describe data. You’ll also need the measures of variability that quantify the spread of data points.

Variance

The sample variance quantifies the spread of the data. It shows numerically how far the data points are from the mean.

import statistics

x = [8.0, 1, 2.5, 4, 28.0]
mean = statistics.mean(x)
variance = statistics.variance(x, mean)
print(variance)

Output:

123.2

Standard Deviation

The sample standard deviation is another measure of data spread. The standard deviation is often more convenient than the variance because it has the same unit as the data points. Once you get the variance, you can calculate the standard deviation with pure Python:

import statistics

x = [8.0, 1, 2.5, 4, 28.0]
mean = statistics.mean(x)
variance = statistics.variance(x, mean)
standard_deviation = variance ** 0.5
print(standard_deviation)

Output:

11.099549540409287

Although this solution works, you can also use statistics.stdev() directly, which prints the same value:

import statistics

x = [8.0, 1, 2.5, 4, 28.0]
standard_deviation = statistics.stdev(x)
print(standard_deviation)
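NumPy offers the same measure through np.std(). Note that NumPy defaults to the population formula (ddof=0), so pass ddof=1 to match the sample standard deviation computed by statistics.stdev():

import numpy as np

x = [8.0, 1, 2.5, 4, 28.0]
print(np.std(x, ddof=1))  # ≈ 11.0995, the sample standard deviation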

Skewness

The sample skewness measures the asymmetry of a data sample: the extent to which the distribution is distorted, or skewed, to the left or right of a normal distribution. In other words, skewness is the extent to which a distribution differs from a normal distribution.

Positively and negatively skewed distributions:

skewness = 0 : the distribution is symmetric (e.g., normally distributed).
skewness > 0 : the right tail is longer; more weight in the right tail of the distribution.
skewness < 0 : the left tail is longer; more weight in the left tail of the distribution.

You can also calculate the sample skewness with scipy.stats.skew():

import numpy as np
import scipy.stats

x = [8.0, 1, 2.5, 4, 28.0]
y = np.array(x)
skewness = scipy.stats.skew(y, bias=False)
print(skewness)

Output:

1.9470432273905927
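Pandas provides the same bias-corrected sample skewness through the .skew() method of Series objects:

import pandas as pd

x = [8.0, 1, 2.5, 4, 28.0]
print(pd.Series(x).skew())  # ≈ 1.947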

Percentiles

Percentiles are used in statistics to give you a number below which a given percentage of the values fall.

Example: Let’s say we have an array of the ages of all the people who live on a street.

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or younger.

The NumPy module has a method for finding the specified percentile:

import numpy as np

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = np.percentile(ages, 75)
print(x)
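Output:

43.0

You can also pass a list of percentiles to compute several at once, for example the quartiles:

import numpy as np

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
print(np.percentile(ages, [25, 50, 75]))  # [11. 31. 43.]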

Range

The range of data is the difference between the maximum and minimum element in the dataset. You can get it with the function np.ptp():

import numpy as np

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = np.ptp(ages)
print(x)

Output:

80

Note — Max value is 82 and min value is 2, hence difference is 80.

Summary of Descriptive Statistics

SciPy and Pandas offer useful routines to quickly get descriptive statistics with a single function or method call. You can use scipy.stats.describe() like this:

import scipy.stats

x = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
result = scipy.stats.describe(x, ddof=1, bias=False)
print(result)

Output:

DescribeResult(nobs=21, minmax=(2, 82), mean=32.38095238095238, variance=540.047619047619, skewness=0.6679020341687003, kurtosis=-0.029954628545992623)

describe() returns an object that holds the following descriptive statistics:

  • nobs: the number of observations or elements in your dataset
  • minmax: the tuple with the minimum and maximum values of your dataset
  • mean: the mean of your dataset
  • variance: the variance of your dataset
  • skewness: the skewness of your dataset
  • kurtosis: the kurtosis of your dataset

You can access particular values with dot notation:

import scipy.stats

x = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
result = scipy.stats.describe(x, ddof=1, bias=False)
print(result.nobs)
print(result.minmax[0])
print(result.minmax[1])
print(result.variance)
print(result.skewness)
print(result.kurtosis)

Output:

21
2
82
540.047619047619
0.6679020341687003
-0.029954628545992623

Pandas has similar, if not better, functionality. Series objects have the method .describe():

import pandas as pd

x = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
z = pd.Series(x)
result = z.describe()
print(result)

Output:

count    21.000000
mean     32.380952
std      23.238925
min       2.000000
25%      11.000000
50%      31.000000
75%      43.000000
max      82.000000
dtype: float64

It returns a new Series that holds the following:

  • count: the number of elements in your dataset
  • mean: the mean of your dataset
  • std: the standard deviation of your dataset
  • min and max: the minimum and maximum values of your dataset
  • 25%, 50%, and 75%: the quartiles of your dataset

You can also access each item of result with its label:

import pandas as pd

x = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
z = pd.Series(x)
result = z.describe()
print(result['mean'])
print(result['std'])
print(result['min'])
print(result['max'])
print(result['25%'])
print(result['50%'])
print(result['75%'])

Output:

32.38095238095238
23.238924653426178
2.0
82.0
11.0
31.0
43.0

Measures of Correlation Between Pairs of Data

Variables within a dataset can be related for lots of reasons.

For example:

  • One variable could cause or depend on the values of another variable.
  • One variable could be weakly associated with another variable.
  • Two variables could depend on a third unknown variable.

It can be useful in data analysis and modeling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation.

You’ll see the following measures of correlation between pairs of data:

  • Positive correlation exists when larger values of 𝑥 correspond to larger values of 𝑦 and vice versa.
  • Negative correlation exists when larger values of 𝑥 correspond to smaller values of 𝑦 and vice versa.
  • Weak or no correlation exists if there is no such apparent relationship.

In a scatter plot, negative correlation appears as a downward trend, weak correlation as no clear trend, and positive correlation as an upward trend.

The two statistics that measure the correlation between datasets are covariance and the correlation coefficient.

Before we look at correlation methods, let’s define a dataset we can use to test the methods.

Let us randomly generate two related variables.

# generate related variables
import numpy as np
from numpy.random import randn, seed

# seed the random number generator so results are reproducible
seed(1)
# prepare data: data2 depends on data1 plus some noise
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# summarize
print('data1: mean=%.3f stdv=%.3f' % (np.mean(data1), np.std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (np.mean(data2), np.std(data2)))

Output:

data1: mean=100.776 stdv=19.620
data2: mean=151.050 stdv=22.358

Though this is getting a little ahead of ourselves, let us now generate a graph to see how the points are plotted. We will use a scatter plot.

Note — We will cover visual plotting in a later section. This is just to visualize what our dataset looks like right now.

With the plot added, the code above looks like this:

# generate related variables
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from matplotlib import pyplot
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))
# plot
pyplot.scatter(data1, data2)
pyplot.show()

Output:

data1: mean=100.776 stdv=19.620
data2: mean=151.050 stdv=22.358

A scatter plot of the two variables is created. Because we contrived the dataset, we know there is a relationship between the two variables. This is clear when we review the generated scatter plot where we can see an increasing trend.

Before we look at calculating some correlation scores, we must first look at an important statistical building block, called covariance.

Covariance

Covariance is a measure of how much two random variables vary together. It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.

from numpy import cov
from numpy.random import randn, seed

# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate covariance matrix
covariance = cov(data1, data2)
print(covariance)

The covariance and covariance matrix are used widely within statistics and multivariate analysis to characterize the relationships between two or more variables.

Running the example calculates and prints the covariance matrix.

Output:

[[385.33297729 389.7545618 ]
 [389.7545618  500.38006058]]

A problem with covariance as a statistical tool on its own is that it is hard to interpret: the diagonal entries above are simply the variances of data1 and data2, while the off-diagonal entries hold their covariance, whose magnitude depends on the scale of the variables. This leads us to Pearson’s correlation coefficient next.

Pearson’s Correlation

Pearson’s correlation coefficient is a statistic that measures the linear relationship, or association, between two continuous variables. It is a widely used measure of association because it is based on covariance but normalized, so it reports the magnitude of the association as well as the direction of the relationship.

The SciPy function pearsonr() calculates Pearson’s correlation coefficient between two data samples of the same length.

# generate related variables
from numpy.random import randn, seed
from scipy.stats import pearsonr

# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)

Output:

Pearsons correlation: 0.888

We can see that the two variables are positively correlated and that the correlation coefficient is about 0.89. This suggests a high level of correlation, i.e., a value above 0.5 and close to 1.0.
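Alternatively, NumPy’s corrcoef() computes the full correlation matrix; the off-diagonal entry is Pearson’s r. A sketch using the same generated data:

import numpy as np
from numpy.random import randn, seed

seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# the [0, 1] entry of the 2x2 matrix is the correlation of data1 with data2
print('%.3f' % np.corrcoef(data1, data2)[0, 1])  # 0.888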

Spearman’s Correlation

Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables.

If you are unsure of the distribution and possible relationships between two variables, the Spearman correlation coefficient is a good tool to use.

from numpy.random import randn
from numpy.random import seed
from scipy.stats import spearmanr

# seed random number generator
seed(1)

# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# calculate spearman's correlation
corr, _ = spearmanr(data1, data2)
print('Spearmans correlation: %.3f' % corr)

Output:

Spearmans correlation: 0.872

We already know that the relationship between the variables is linear. Nevertheless, the nonparametric rank-based approach also shows a strong correlation between the variables, about 0.87.
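One way to see why the two results are so close: Spearman’s coefficient is simply Pearson’s r computed on the ranks of the data. A sketch using scipy.stats.rankdata():

from numpy.random import randn, seed
from scipy.stats import pearsonr, rankdata

seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# replace each value by its rank, then apply Pearson's formula
corr, _ = pearsonr(rankdata(data1), rankdata(data2))
print('%.3f' % corr)  # 0.872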

Visualizing Data

In addition to calculating the numerical quantities like mean, median, or variance, you can use visual methods to present, describe, and summarize data. In this section, you’ll learn how to present your data visually using the following graphs:

  • Histograms
  • Pie charts
  • Scatter Plot
  • Bar charts

You will be using Matplotlib, which is the oldest and most widely used Python library for data visualization.

Histograms

A histogram is an accurate representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable. It is a kind of bar graph.

To construct a histogram, follow these steps:

  • Divide the entire range of values into a series of intervals, called bins.
  • Count how many values fall into each interval.

The bins are usually specified as consecutive, non-overlapping intervals of a variable.

Here is an example:

import matplotlib.pyplot as plt

x = [1,1,2,3,3,5,7,8,9,10,
10,11,11,13,13,15,16,17,18,18,
18,19,20,21,21,23,24,24,25,25,
25,25,26,26,26,27,27,27,27,27,
29,30,30,31,33,34,34,34,35,36,
36,37,37,38,38,39,40,41,41,42,
43,44,45,45,46,47,48,48,49,50,
51,52,53,54,55,55,56,57,58,60,
61,63,64,65,66,68,70,71,72,74,
75,77,81,83,84,87,89,90,90,91
]

plt.hist(x, bins=[0,10,20,30,40,50,60,70,80,90,99])
plt.show()

Output: a histogram showing the count of values in each bin.

Pie charts

A pie chart can only display one series of data. Pie charts show the size of items (called wedges) in one data series, proportional to the sum of the items. The data points in a pie chart are shown as a percentage of the whole pie.

The following code uses the pie() function to display a pie chart of the students enrolled in various computer language courses.

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.axis('equal')
langs = ['C', 'C++', 'Java', 'Python', 'PHP']
students = [23,17,35,29,12]
ax.pie(students, labels=langs, autopct='%1.2f%%')
plt.show()

Note — Here autopct formats the numeric labels: '%1.2f%%' shows each wedge’s share as a percentage with two decimal places.

Output: a pie chart with one labeled wedge per language, each annotated with its percentage share.

Scatter Plot

Scatter plots plot data points on the horizontal and vertical axes to show how much one variable is affected by another.

The script below plots a scatter diagram of the grades of boys and girls against the grades range, in two different colors.

import matplotlib.pyplot as plt

girls_grades = [89, 90, 70, 89, 100, 80, 90, 100, 80, 34]
boys_grades = [30, 29, 49, 48, 100, 48, 38, 45, 20, 30]
grades_range = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.scatter(grades_range, girls_grades, color='r')
ax.scatter(grades_range, boys_grades, color='b')
ax.set_xlabel('Grades Range')
ax.set_ylabel('Grades Scored')
ax.set_title('scatter plot')
plt.show()

Output: a scatter plot with girls’ grades in red and boys’ grades in blue.

Bar charts

A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally.

The following is a simple example of a Matplotlib bar plot. It shows the number of students enrolled in various courses offered at an institute.

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
langs = ['C', 'C++', 'Java', 'Python', 'PHP']
students = [23,17,35,29,12]
ax.bar(langs, students)
plt.show()

Output: a bar chart with one bar per language, its height equal to the number of students.

When comparing several quantities across the same categories, we might want a grouped bar chart, with bars of one color for each data series.

import numpy as np
import matplotlib.pyplot as plt

data = [[30, 25, 50, 20],
        [40, 23, 51, 17],
        [35, 22, 45, 19]]
X = np.arange(4)  # one x position per category
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(X + 0.00, data[0], color = 'b', width = 0.25)
ax.bar(X + 0.25, data[1], color = 'g', width = 0.25)
ax.bar(X + 0.50, data[2], color = 'r', width = 0.25)
plt.show()

Output: a grouped bar chart with three colored bars (blue, green, red) in each of the four categories.


Closing Remarks:

I hope this was helpful for you. Feel free to share your ideas.



Shahzaib Khan

Developer / Data Scientist / Computer Science Enthusiast. Founder @ Interns.pk You can connect with me @ https://linkedin.com/in/shahzaibkhan/