Descriptive Statistics with Python

Valentina Alto
Aug 21 · 6 min read

Descriptive Statistics is the branch of Statistics concerned with brief descriptive coefficients that summarize a given data set; those coefficients are themselves called ‘descriptive statistics’.

Descriptive statistics hence provide information about the data you want to analyze and describe them in some manner. More specifically, they collect from the target dataset (the population) some descriptive measurements, called parameters. This differs from Inferential Statistics, which aims at making assumptions and inferences about a population starting from a smaller representation of it, called a sample. Indeed, it often happens that the size of the population is far too large to be manipulated, so you’d rather rely on a sample. From your sample you can then compute some sample measurements (sample mean, sample variance and so forth) which serve as estimates (or statistics) of the real parameters.
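As a quick illustration of this distinction, here is a minimal sketch with synthetic data (the distribution, its parameters and the sizes are made up for the example): the sample mean only approximates the population mean.

import numpy as np

rng = np.random.default_rng(42)

# A hypothetical 'population' of one million values
population = rng.normal(loc=170, scale=10, size=1_000_000)

# A much smaller sample drawn from it without replacement
sample = rng.choice(population, size=100, replace=False)

print('Population mean (parameter): {:.3f}'.format(population.mean()))
print('Sample mean (statistic):     {:.3f}'.format(sample.mean()))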

Here I’m going to dwell on descriptive statistics using statistical packages in Python. For this purpose, I’m going to use the well-known Iris dataset (available in seaborn), analyzing only one feature (‘sepal_length’).

import seaborn as sns

# Load the Iris dataset bundled with seaborn
iris = sns.load_dataset("iris")

# Keep the numeric features only, then select 'sepal_length'
X = iris.drop('species', axis=1)
x = X['sepal_length']

Now, among descriptive statistics, we can isolate two categories which I’m going to focus on separately: measures of central tendency and measures of dispersion.

Measures of Central Tendency

These metrics tell us how our data behave in their ‘middle’. What ‘middle’ means, though, depends on the metric we are talking about, so let’s look at each of them:

  • Mean: it is the average value of our data and it is very easy to compute: just take the sum of your values, divide it by the number of values and, voilà, you have your mean.

We can manually compute it on Python:

mean = x.sum()/len(x)
print('Mean: {}'.format(mean))
Output: Mean: 5.843333333333335

or use the built-in pandas method:

print('Mean: {}'.format(x.mean()))
Output: Mean: 5.843333333333335
  • Median: it is the number that lies in the middle of an ordered list of numbers (sorted in either ascending or descending order).

If the number of values is even, the median is obtained as the average of the two central values.
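For instance (a toy list, just for illustration), with Python’s statistics module:

import statistics

# With an even number of values, the median averages the two central ones
print(statistics.median([1, 3, 5, 7]))
Output: 4.0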

Let’s see which is the median value of our feature:

median = x.median()
print('Median: {}'.format(median))
Output: Median: 5.8
  • Mode: it is defined as the value that appears most frequently in our data. When a value appears repeatedly throughout the data, it also pulls the average towards itself, which makes the mode an important measure of the central tendency of our data.

Let’s compute it for our feature:

# .mode() returns a Series, since there may be ties; take the first value
mode = x.mode()[0]
print('Mode: {}'.format(mode))
Output: Mode: 5.0

Now, before concluding this paragraph about the measures of central tendency, it is worth spending a few words on the relationship among mean, median and mode.

As anticipated, all of them describe the behavior of data in their ‘middle’. Compared with each other, they can also suggest something about the shape of the probability distribution of our data. Indeed, in a symmetric distribution (one where the probability of lying any given distance on one side of the center of symmetry equals the probability of lying the same distance on the other side), mean, median and mode coincide.

For example, the Normal distribution is symmetric, and one can easily verify that mean = median = mode.
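As a quick empirical check (a minimal sketch with a synthetic Normal sample; the parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=100_000)

# For a symmetric distribution, mean and median (almost) coincide
print('Mean:   {:.4f}'.format(np.mean(sample)))
print('Median: {:.4f}'.format(np.median(sample)))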

On the other hand, if we have either mode>median>mean or mean>median>mode, we are facing an asymmetric distribution: the first case (mean>median>mode) corresponds to a positive asymmetry, while the second (mode>median>mean) corresponds to a negative one. A way to check the symmetry of a distribution is through the measure of skewness which, in its population version, is the standardized third central moment:

$\mathrm{skewness} = \frac{E[(X-\mu)^3]}{\sigma^3}$

Positive skewness indicates a positive asymmetry, while negative skewness indicates a negative asymmetry.

Now, recalling the measures obtained from our dataset:

Mean: 5.843333333333335
Median: 5.8
Mode: 5.0

We can see that mean>median>mode, hence the distribution should exhibit a positive skewness. Let’s check it out:

import matplotlib.pyplot as plt

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True,
                                    gridspec_kw={"height_ratios": (0.2, 1)})
sns.boxplot(x, ax=ax_box)
sns.distplot(x, ax=ax_hist)
for ax in (ax_box, ax_hist):
    ax.axvline(mean, color='r', linestyle='--', label='Mean')
    ax.axvline(median, color='g', linestyle='-', label='Median')
    ax.axvline(mode, color='b', linestyle='-', label='Mode')
ax_hist.legend()
ax_box.set(xlabel='')
plt.show()

Let’s compute the Skewness:

from scipy.stats import skew
print('Skewness: {}'.format(skew(x)))
Output: Skewness: 0.3117530585022963

As you can see, the skewness is greater than zero, hence it indicates a positive asymmetry, confirming the fact that mean>median>mode.
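To connect this number with the formula above, here is a minimal sketch computing the (biased) skewness directly from the third central moment, which is what scipy.stats.skew returns by default:

import numpy as np

# Biased skewness: third central moment divided by sigma cubed
manual_skew = np.mean((x - x.mean())**3) / np.std(x)**3
print('Skewness (manual): {}'.format(manual_skew))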

Measures of Dispersion

If the measures of central tendency describe the ‘average’ behavior of our data, measures of dispersion, on the other hand, focus on how much our data tend to vary.

  • Range: it is the most intuitive measure of dispersion. It is computed as the difference between the maximum and minimum values, and it suggests how spread out our data are.

r = x.max() - x.min()
print('Range: ', r)
Output: Range: 3.6000000000000005

We can also visualize that dispersion with the aid of a boxplot:

fig1, ax1 = plt.subplots()
ax1.set_title('BoxPlot')
ax1.boxplot(x)
plt.show()
  • Interquartile Range: quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities. More specifically, the q quantile (also called the 100·q-th percentile) leaves to its left a fraction q of the probability. Furthermore, the 0.25, 0.5 and 0.75 quantiles are known, respectively, as the first (Q1), second (Q2) and third (Q3) quartiles, since they split the distribution into four equally probable parts. Note that the second quartile (Q2) corresponds to the median, since it leaves 50% of the probability to its left and 50% to its right.
import numpy as np

Q1 = np.percentile(x, 25)
Q2 = np.percentile(x, 50)
Q3 = np.percentile(x, 75)
print('Q1: {}'.format(Q1))
print('Q2: {}'.format(Q2))
print('Q3: {}'.format(Q3))
Output: Q1: 5.1
Q2: 5.8
Q3: 6.4
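The same values can be obtained with the pandas quantile method (both np.percentile and pandas default to linear interpolation between data points):

print(x.quantile([0.25, 0.5, 0.75]))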

Let’s visualize them on our distribution graph:

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True,
                                    gridspec_kw={"height_ratios": (0.2, 1)})
sns.boxplot(x, ax=ax_box)
sns.distplot(x, ax=ax_hist)
for ax in (ax_box, ax_hist):
    ax.axvline(Q1, color='r', linestyle='--', label='Q1')
    ax.axvline(Q2, color='g', linestyle='-', label='Q2')
    ax.axvline(Q3, color='b', linestyle='-', label='Q3')
ax_hist.legend()
ax_box.set(xlabel='')
plt.show()

Interquantile ranges are differences between two quantiles. In particular, in the boxplot above you can see the interquartile range (IQR), computed as the difference between Q3 and Q1.

IQR=Q3-Q1
print('Interquartile Range: ',IQR)
Output: Interquartile Range: 1.3000000000000007
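As a side note on boxplots: by convention, their whiskers extend up to 1.5 × IQR beyond Q1 and Q3, and points falling outside those fences are flagged as potential outliers. A minimal sketch of that rule:

# Conventional 1.5 * IQR fences for flagging potential outliers
lower_fence = Q1 - 1.5*IQR
upper_fence = Q3 + 1.5*IQR
outliers = x[(x < lower_fence) | (x > upper_fence)]
print('Fences: [{}, {}]'.format(lower_fence, upper_fence))
print('Potential outliers: {}'.format(list(outliers)))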
  • Variance: it summarizes how much your data differ from the mean, and it is computed as:

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

where $\mu$ is the mean of the population and $N$ its size.

However, since it often happens that the square of your unit of measurement is meaningless (namely, what does it mean to say “the variance of my portfolio is 120 squared dollars”?), a more manageable measure is used: the standard deviation, which is nothing but the square root of the variance:

$\sigma = \sqrt{\sigma^2}$

Let’s compute the Standard Deviation of our feature:

# np.std defaults to the population standard deviation (ddof=0)
sigma = np.std(x)
print('Standard Deviation: ', sigma)
Output: Standard Deviation: 0.8253012917851409
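One caveat worth remembering: when you work with a sample rather than the whole population, the sample standard deviation divides by n - 1 instead of N. NumPy defaults to the population version, while the pandas .std() method defaults to the sample one:

print('Population std (ddof=0): {}'.format(np.std(x)))
print('Sample std (ddof=1):     {}'.format(np.std(x, ddof=1)))
print('pandas .std() default:   {}'.format(x.std()))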

Conclusion

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the data and, together with simple graphical analysis, they form the basis of virtually every statistical and Machine Learning analysis.

Hence, it is always a good practice to start with those simple computations, before diving into the building of complex models.
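Conveniently, pandas bundles most of the measures covered here into a single call:

print(x.describe())  # count, mean, std, min, quartiles (25%, 50%, 75%), max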
