Essential Statistics for Data Assessment
In this article we are going to learn about essential statistics for data assessment, also often referred to as descriptive statistics. Descriptive statistics provide simple, quantitative summaries of datasets, usually combined with graphics. As we shall shortly see, they can demonstrate the tendency to centralization, provide measures of the variability of features, and much more besides.
Note that there is another kind of statistics, known as inferential statistics, which tries to infer properties of the underlying population from which the dataset was generated or sampled. Here, we assume the data covers a whole population rather than a sample drawn from a larger distribution.
The topics we shall cover include:
- Classifying numerical and categorical variables
- Understanding mean, median, and mode
- Learning about variance, standard deviation, percentiles, and skewness
- Knowing how to handle categorical variables and mixed data types
- Using bivariate and multivariate descriptive statistics
Practical examples are provided using Python, which is probably the most popular programming language for data science.
Classifying numerical and categorical variables
Descriptive statistics are all about variables: you must know what kind of variable you are describing before you can choose the appropriate statistics.
A variable is sometimes referred to as a feature or attribute in other literature. They all mean the same thing: a single column in a tabulated dataset.
In this section, you will examine the two most important variable types, numerical and categorical, and learn to distinguish between them. Categorical variables are discrete and usually represent a classification property of an item. Numerical variables are continuous and describe quantities. Descriptive statistics that apply to one kind of variable may not apply to the other, so distinguishing between them must precede the analysis.
Distinguishing between numerical and categorical variables
To understand the differences between the two types of variables, I will use the example of the population estimates dataset released by the United States Department of Agriculture. It contains the estimated population data at county level for the United States from 2010 to 2018. You can obtain the data from the official website, https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/.
The following Python code snippet loads the data and examines the first several rows:
import pandas as pd
df = pd.read_excel("PopulationEstimates.xls",skiprows=2)
df.head(8)
The output is a table with more than 140 columns; the accompanying screenshot shows the first few of them.
The dataset contains a variable called Rural-urban_Continuum Code_2013, which takes integer values. As a result, pandas auto-interprets it as numerical. In fact, however, the variable is categorical.
Should you always trust libraries?
Don’t blindly trust the defaults that Python libraries give you. They may be wrong, and you, the developer, have to make the final call.
After some research, we found the variable description on this page: https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/.
According to the code standard published in 2013, the Rural-urban_Continuum_Code_2013 variable indicates how urbanized an area is.
The meaning of Rural-urban_Continuum Code_2013 is shown in Figure 2.
Note
Pandas makes intelligent auto-interpretations of variable types, but it is sometimes wrong. It is up to the data scientist to investigate the exact meaning of a variable and correct its type when necessary.
Many datasets use integers to represent categorical variables. Treating them as numerical values can have serious consequences for downstream tasks such as machine learning, mainly because it introduces artificial distances between the codes.
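If we want pandas to treat the code as categorical, a minimal fix, sketched below under the assumption that the column name matches the spreadsheet header exactly, is to cast the column to pandas’ category dtype:

# Cast the integer code to the category dtype so that downstream
# tools stop treating the codes as ordinary numbers
df["Rural-urban_Continuum Code_2013"] = df["Rural-urban_Continuum Code_2013"].astype("category")
print(df["Rural-urban_Continuum Code_2013"].dtype)  # category

The values themselves are unchanged; only the dtype, and therefore the way pandas and downstream libraries interpret the column, is different.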
On the other hand, numerical variables usually have a direct quantitative meaning. For example, R_NET_MIG_2013 is the rate of net immigration in 2013 for a specific area. A histogram of this numerical variable gives a useful summary of immigration trends across the States, whereas plotting the continuum code makes little sense beyond simple counting.
Let’s check the net immigration rate for the year 2013 with the following code snippet:
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(8,6))
plt.rcParams.update({'font.size': 22})
plt.hist(df["R_NET_MIG_2013"],
         bins=np.linspace(np.nanmin(df["R_NET_MIG_2013"]),
                          np.nanmax(df["R_NET_MIG_2013"]),
                          num=100))
plt.title("Rate of Net Immigration Distribution for All Records, 2013");
The result appears as follows.
Here are the observations drawn from Figure 3:
- Both categorical and numerical variables can carry additional structure that creates special cases. A typical example is a date or time, which, depending on the scenario, can be treated either as a categorical variable or as a numerical variable with a semi-continuous structure.
- It is common to convert numerical variables into categorical ones by applying a set of rules; the rural-urban continuum code is a typical example. Such a conversion makes it easy to convey a first impression of the data.
Now that we have learned how to distinguish between numerical and categorical variables, let’s move on to understanding a few essential concepts of statistics, namely mean, median, and mode.
Understanding mean, median, and mode
Mean, median, and mode describe aspects of the central tendency. Mean and median are only applicable to numerical variables, whereas mode is applicable to both categorical and numerical variables. In this section, we will focus on mean, median, and mode for numerical variables, as their numerical interactions usually convey interesting information.
Mean
The mean, or arithmetic mean, measures the weighted center of a variable. Let’s use n to denote the total number of entries and i as the index of each entry. The mean is given by the following expression:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The mean is influenced by the value of every entry in the population.
Let me give an example. The following code generates 1,000 random numbers from 0 to 1 uniformly, plots them, and calculates their mean:
import random
random.seed(2019)
plt.figure(figsize=(8,6))
rvs = [random.random() for _ in range(1000)]
plt.hist(rvs, bins=50)
plt.title("Histogram of Uniformly Distributed RV");
The resulting histogram plot appears as follows:
The mean is around 0.505477, close to the expected value of 0.5 for a uniform distribution on [0, 1].
Median
The median measures the unweighted center of a variable. If there is an odd number of entries, the median is the value of the central one; if there is an even number of entries, the median is the mean of the central two entries. Unlike the mean, the median is not necessarily influenced by every entry’s value, which makes it more robust to extreme values. I will use the same set of entries as in the previous section as an example.
The following code calculates the median:
np.median(rvs)
The result is 0.5136755026003803. Now, I will be changing one entry to 1,000, which is 1,000 times larger than the maximal possible value in the dataset, and repeat the calculation:
rvs[-1]=1000
print(np.mean(rvs))
print(np.median(rvs))
The results are 1.5054701085937803 and 0.5150437661964872. The mean increased by roughly 1, while the median barely moved.
The relationship between the mean and the median is often worth investigating. A median that is larger than the mean usually indicates that most points sit on the higher-value side while a few extremely small values drag the mean down; the reverse holds when the median is smaller than the mean. A quick illustration follows, and we will see more examples later.
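As a hedged illustration with synthetic data (not the population dataset), reusing the random and numpy imports from earlier, a heavy right tail pulls the mean well above the median:

# Mostly moderate values plus a small cluster of much larger values
data = [random.normalvariate(1, 0.2) for _ in range(9500)] + \
       [random.normalvariate(10, 1) for _ in range(500)]
print(np.mean(data))    # pulled up by the tail, roughly 1.45
print(np.median(data))  # stays close to 1

Here the few large values barely register in the median but shift the mean noticeably, which is exactly the mean-greater-than-median signature of a long right tail.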
Mode
The mode of a set of values is its most frequent element. In a histogram, the mode shows up as the peak (or peaks). If a distribution has only one mode, we call it unimodal; a distribution with two peaks, which need not have equal heights, is commonly referred to as bimodal.
Bimodals and bimodal distribution
The term bimodal is sometimes used loosely. Strictly speaking, being bimodal means having two modes, which, by the definition of a mode, requires peaks of equal height. However, the term bimodal distribution often refers to any distribution with two local maxima. Double-check your distribution and state the modes clearly.
The following code snippet demonstrates two distributions with unimodal and bimodal shapes respectively:
r1 = [random.normalvariate(0.5,0.2) for _ in range(10000)]
r2 = [random.normalvariate(0.2,0.1) for _ in range(5000)]
r3 = [random.normalvariate(0.8,0.2) for _ in range(5000)]
fig, axes = plt.subplots(1,2,figsize=(12,5))
axes[0].hist(r1,bins=100)
axes[0].set_title("Unimodal")
axes[1].hist(r2+r3,bins=100)
axes[1].set_title("Bimodal");
The resulting two subplots look as follows:
So far, we have talked about mean, median, and mode, which are the first three statistics of a dataset. They are the start of almost all exploratory data analysis.
Learning about variance, standard deviation, quartiles, percentiles, and skewness
In the previous section, we studied the mean, median, and mode. They all describe, to a certain degree, the properties of the central part of the dataset. In this section, we will learn how to describe the spreading behavior of data.
Variance
Using the same notation as before, the variance of a population is defined as follows:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
Intuitively, the further away the elements are from the mean, the larger the variance. Below, I’ve plotted the histograms of two datasets with different variances. The one in the left sub-plot has a variance of 0.09 (standard deviation 0.3) and the one in the right sub-plot has a variance of 0.01 (standard deviation 0.1), nine times smaller.
The following code snippet generates samples from the two distributions and plots them:
r1 = [random.normalvariate(0.5,0.3) for _ in range(10000)]
r2 = [random.normalvariate(0.5,0.1) for _ in range(10000)]
fig, axes = plt.subplots(1,2,figsize=(12,5))
axes[0].hist(r1,bins=100)
axes[0].set_xlim([-1,2])
axes[0].set_title("Big Variance")
axes[1].hist(r2,bins=100)
axes[1].set_title("Small Variance")
axes[1].set_xlim([-1,2]);
The results are as follows:
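To confirm the stated variances rather than reading them off the plot, we can compute them directly from the samples generated above; this is just a sanity check:

print(np.var(r1))  # roughly 0.09, since r1 was drawn with standard deviation 0.3
print(np.var(r2))  # roughly 0.01, since r2 was drawn with standard deviation 0.1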
The following code snippet generates a scatter plot that will demonstrate the difference more clearly. The variable on the x axis spreads more widely:
plt.figure(figsize=(8,8))
plt.scatter(r1,r2,alpha=0.2)
plt.xlim(-1,2)
plt.ylim(-1,2)
plt.xlabel("Big Variance Variable")
plt.ylabel("Small Variance Variable")
plt.title("Variables With Different Variances");
The result looks as follows:
The spread along the x axis is significantly larger than the spread along the y axis, which reflects the difference in variance magnitude. A common mistake is not setting the axis ranges consistently: Matplotlib determines the ranges automatically by default, so you need calls such as plt.xlim() to force a common scale, otherwise the comparison can be misleading.
Standard deviation
Standard deviation is the square root of the variance. It is used more commonly than the variance to measure the level of dispersion because it has the same unit as the original data. The formula for the standard deviation of a population is:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
Standard deviation is extremely important in scientific graphing. A standard deviation is often plotted together with the data, and represents an estimate of variability.
For this article, I will be using the net immigration rate for Texas from 2011 to 2018 as an example. In the following code snippet, I will first extract the county-level data, append the means and standard deviations to a list, and then plot them at the end. The standard deviation is obtained using numpy.std() and the error bar is plotted using matplotlib.pyplot.errorbar():
dfTX = df[df["State"]=="TX"].tail(-1)YEARS = [year for year in range(2011,2019)]
MEANS = []
STDS = []
for i in range(2011,2019):
year = "R_NET_MIG_"+str(i)
MEANS.append(np.mean(dfTX[year]))
STDS.append(np.std(dfTX[year]))plt.figure(figsize=(10,8))
plt.errorbar(YEARS,MEANS,yerr=STDS)
plt.xlabel("Year")
plt.ylabel("Net Immigration Rate");
The output appears as shown in the following figure:
We can see in Figure 8 that although the net immigration in Texas is only slightly positive, the standard deviation is huge. Some counties may have a big positive net rate, while others may potentially suffer from the loss of human resources.
Quartiles
Quartiles are a special kind of quantile. Quantiles divide data into a number of equal portions: quartiles divide data into four equal parts, with the second quartile being the median, while deciles and percentiles divide data into 10 and 100 equal parts, respectively.
The first quartile, Q1, also known as the lower quartile, is the value below which 25% of the data lies. The second quartile is the median. The third quartile, Q3, also known as the upper quartile, is the value above which 25% of the data lies.
Quartiles are probably the most commonly used quantiles because they are associated with a statistical graph called a boxplot. Let’s use the same set of Texas net immigration data to study it.
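A boxplot itself takes a single Matplotlib call. The following sketch, assuming the dfTX DataFrame from the standard deviation example is still in memory, draws one for the 2013 net immigration rate; the box edges sit at Q1 and Q3 and the line inside the box marks the median:

plt.figure(figsize=(8,6))
plt.boxplot(dfTX["R_NET_MIG_2013"].dropna(), vert=False)  # horizontal boxplot
plt.xlabel("Net Immigration Rate, 2013")
plt.title("Boxplot of county-level net immigration, Texas");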
The corresponding function in NumPy is quantile(), which accepts a list of the quantiles we want to calculate as an argument, as in the following single-line code snippet:
np.quantile(dfTX["R_NET_MIG_2013"],[0.25,0.5,0.75])
The output contains the three quartile values. The following code snippet overlays them on a histogram:
plt.figure(figsize=(12,5))
plt.hist(dfTX["R_NET_MIG_2013"],bins=50,alpha=0.6)
for quartile in np.quantile(dfTX["R_NET_MIG_2013"],[0.25,0.5,0.75]):
    plt.axvline(quartile,linestyle=':',linewidth=4)
As you can see from the following output, the vertical dotted lines indicate the three quartiles:
Exactly 50% of the data values lie between the lower and upper quartiles. The difference Q3 - Q1 is referred to as the Interquartile Range (IQR), and it plays an important role in outlier detection, as sketched below and revisited later.
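Although we revisit outlier detection with the z-score later, a common quartile-based rule, sketched here, flags values that lie more than 1.5 times the IQR below Q1 or above Q3:

q1, q3 = np.quantile(dfTX["R_NET_MIG_2013"].dropna(), [0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the usual 1.5 * IQR fences
outliers = dfTX[(dfTX["R_NET_MIG_2013"] < lower) |
                (dfTX["R_NET_MIG_2013"] > upper)]
print(len(outliers), "counties fall outside the fences")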
Skewness
Skewness differs from the measures of variability discussed in the previous subsections: it captures the direction in which the data distribution tilts and by how much. Skewness is given by the following equation:

$$\text{skewness} = \frac{\bar{x} - \text{mode}}{\sigma}$$
Various definitions of skewness
The skewness defined above is, precisely speaking, Pearson’s first skewness coefficient. It is defined through the mode, but there are other definitions of skewness; for example, skewness can also be defined through the median.
Skewness is unit-less. If the mean is larger than the mode, skewness is positive, and we say the data is skewed to the right. Otherwise, the data is skewed to the left.
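Because the mode of a continuous sample is awkward to estimate, the median-based alternative, Pearson’s second skewness coefficient, 3 * (mean - median) / standard deviation, is often more practical in code. Here is a small sketch, reusing the numpy and random imports from earlier:

def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    values = np.asarray(values)
    return 3 * (np.mean(values) - np.median(values)) / np.std(values)

# An exponential sample has a long right tail, so the coefficient is positive
print(pearson_median_skewness([random.expovariate(1) for _ in range(10000)]))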
Here is a code snippet that generates two sets of skewed data and plots them:
r1 = [random.normalvariate(0.5,0.4) for _ in range(10000)]
r2 = [random.normalvariate(0.1,0.2) for _ in range(10000)]
r3 = [random.normalvariate(1.1,0.2) for _ in range(10000)]
fig, axes = plt.subplots(1,2,figsize=(12,5))
axes[0].hist(r1+r2,bins=100,alpha=0.5)
axes[0].axvline(np.mean(r1+r2), linestyle=':',linewidth=4)
axes[0].set_title("Skewed To Right")
axes[1].hist(r1+r3,bins=100,alpha=0.5)
axes[1].axvline(np.mean(r1+r3),linestyle=':',linewidth=4)
axes[1].set_title("Skewed to Left");
The vertical dotted line indicates the position of the mean as follows:
Think about the problem of income inequality. Suppose you plot a histogram of the population against wealth, where the x axis indicates the amount of wealth and the y axis indicates the portion of the population that falls into that wealth range: a larger x value means more wealth, and a larger y value means a greater share of the population in that range. Positive skewness (the left subplot in Figure 10) means that even though the average income looks good, it may be driven up by a very small number of super-rich individuals while the majority of people earn relatively little. Negative skewness (the right subplot in Figure 10) indicates that the majority may have an income above the mean, while a small number of very poor people may need help.
Revisiting outlier detection
Now let’s use what we have learned to revisit the outlier detection problem.
The z-score, also known as the standard score, is a good criterion for detecting outliers. It measures the distance between an entry and the population mean in units of the population standard deviation:

$$z = \frac{x - \bar{x}}{\sigma}$$
If the underlying distribution is normal, a z-score greater than 3 or less than -3 occurs with a probability of only roughly 0.27%. Even if the underlying distribution is not normal, Chebyshev’s theorem guarantees that at most 1/k^2 of the population can fall more than k standard deviations away from the mean, for any k greater than 1.
As an example, the following code snippet generates 10,000 data points that follow a normal distribution:
random.seed(2020)
x = [random.normalvariate(1, 0.5) for _ in range(10000)]
plt.figure(figsize=(10,8))
plt.hist(x,bins=100,alpha=0.5);
styles = [":","--","-."]
for i in range(3):
    plt.axvline(np.mean(x) + (i+1)*np.std(x),
                linestyle=styles[i],
                linewidth=4)
    plt.axvline(np.mean(x) - (i+1)*np.std(x),
                linestyle=styles[i],
                linewidth=4)
plt.title("Integer Z values for symmetric distributions");
In the generated histogram plot, the dotted lines indicate the locations where z = ±1, the dashed lines indicate z = ±2, and the dash-dotted lines indicate z = ±3:
If we change the data points, the distribution changes, but the z-score criterion remains valid. In the following code snippet, an asymmetric distribution is generated instead of a normal one:
x = [random.normalvariate(1, 0.5) + random.expovariate(2) for _ in range(10000)]
This produces the following output:
Note on the influence of extreme outliers
A drawback of the z-score is that the mean itself is influenced by extreme outliers. The median can replace the mean to reduce this effect, as sketched below.
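As a sketch of how the criterion is applied in practice, and of the median-centered variant just mentioned, the following flags entries of the asymmetric sample x whose score exceeds 3 in absolute value:

z = (np.array(x) - np.mean(x)) / np.std(x)
print(np.sum(np.abs(z) > 3), "points flagged by the mean-based z-score")

# Hedged variant: center on the median instead of the mean so that the
# reference point itself is less affected by extreme values
z_med = (np.array(x) - np.median(x)) / np.std(x)
print(np.sum(np.abs(z_med) > 3), "points flagged by the median-centered score")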
We have covered several of the most important statistics to model variances in a dataset. In the next section, let’s work on the data types of features.
Knowing how to handle categorical variables and mixed data types
Categorical variables usually have simpler structures or descriptive statistics than continuous variables. Here, we introduce frequencies and proportions and talk about some interesting descriptive statistics examples when converting continuous variables to categorical ones.
Frequencies and proportions
When we discussed the mode for categorical variables, we introduced Counter from Python’s collections module, which outputs a dictionary-like structure whose key-value pairs are element-count pairs. The following is an example of a Counter:
Counter({2.0: 394, 3.0: 369, 6.0: 597, 1.0: 472, 9.0: 425, 7.0: 434, 8.0: 220, 4.0: 217, 5.0: 92})
The following code snippet visualizes the frequencies as a bar plot, where the absolute counts become easy to compare:
from collections import Counter

counter = Counter(df["Rural-urban_Continuum Code_2013"].dropna())
labels = []
x = []
for key, val in counter.items():
    labels.append(str(key))
    x.append(val)
plt.figure(figsize=(10,8))
plt.bar(labels,x)
plt.title("Bar plot of frequency");
This produces a bar plot like the following:
For proportions, simply divide each count by the total count, as shown in the following code snippet:
x = np.array(x)/sum(x)
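An alternative worth knowing is pandas’ own value_counts method, which, with normalize=True, returns the proportions directly without the manual division:

# Equivalent proportions straight from pandas (NaN values are dropped by default)
df["Rural-urban_Continuum Code_2013"].value_counts(normalize=True)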
The shape of the bar plot remains the same, but the y axis ticks change. To better check the relative size of components, I have plotted a pie plot with the help of the following code snippet:
plt.figure(figsize=(10,10))
plt.pie(x=x,labels=labels,)
plt.title("Pie plot for rural-urban continuum code");
This creates a nice pie chart as follows:
It becomes evident that code 2.0 contains about twice as many samples as code 8.0 does.
The mean and median are not defined for categorical data, but the mode is. We are going to reuse the same data:
Counter(df["Rural-urban_Continuum Code_2013"].dropna())
The output reads as follows:
Counter({2.0: 394, 3.0: 369, 6.0: 597, 1.0: 472, 9.0: 425, 7.0: 434, 8.0: 220, 4.0: 217, 5.0: 92})
The mode is 6.0.
Note
The mode tells us that counties with an urban population of 2,500 to 19,999, adjacent to a metro area, are the most prevalent type in the United States; the meaningful result is that category, not the number 6.0 itself.
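Rather than reading the mode off the dictionary by eye, Counter can report it directly: most_common(1) returns the most frequent element together with its count.

counts = Counter(df["Rural-urban_Continuum Code_2013"].dropna())
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 6.0 appears 597 times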
Transforming a continuous variable to a categorical one
Occasionally, we may need to convert a continuous variable into a categorical one. Take lifespan as an example: the 80+ age group is typically very small, so each individual age in that range contributes only a negligible number of data points to a classification task. Grouping those ages together reduces the noise introduced by the sparsity of this age range.
A common way to perform the categorization is to cut at quantiles. For example, quartiles divide the dataset into four parts with an equal number of entries, which avoids issues such as data imbalance.
For example, the following code computes quintile cut-offs for categorizing the net immigration rate, a continuous variable:
series = df["R_NET_MIG_2013"].dropna()
quantiles = np.quantile(series,[0.2*i for i in range(1,5)])
plt.figure(figsize=(10,8))
plt.hist(series,bins=100,alpha=0.5)
plt.xlim(-50,50)
for i in range(len(quantiles)):
    plt.axvline(quantiles[i],linestyle=":", linewidth=4)
plt.title("Quantiles for net immigration data");
As you can see in the following output, the dotted vertical lines split the data into five equally sized sets, boundaries that would be hard to place by eye. I truncated the x axis to the range between -50 and 50. The result looks as follows:
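The cut-offs above only describe the categorization; to actually carry it out, pandas provides qcut, which bins a continuous series by quantiles. This is a sketch with five equally populated bins matching the quintiles plotted above, and the bin labels are purely illustrative:

binned = pd.qcut(series, q=5, labels=["very low", "low", "medium", "high", "very high"])
print(binned.value_counts())  # roughly equal counts in each bin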
Note on the loss of information
Categorization destroys the rich structure in continuous variables. Only use it when you absolutely need to.
Using bivariate and multivariate descriptive statistics
In this section, we briefly discuss bivariate descriptive statistics, which involve two variables rather than one. We are going to focus on correlation for continuous variables and cross-tabulation for categorical variables.
Covariance
The word covariance is often used interchangeably with correlation, but there are fundamental differences. Covariance measures the joint variability of two variables in their original units, while correlation measures the strength of that joint variability on a normalized scale. Correlation coefficients have several definitions for different use cases; the most common is the Pearson correlation coefficient, which we will also use here to describe how two variables co-vary. The Pearson correlation coefficient for variables x and y from a population is defined as follows:

$$r_{xy} = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}}$$
Let’s first examine the expression’s sign. Each product in the numerator is positive when x is above its mean and y is above its mean at the same time, or when both are below their respective means. These products sum together and are then normalized by the spread of each variable, so a positive coefficient indicates that x and y tend to vary jointly in the same direction. You can make a similar argument for negative coefficients.
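Before applying corr() to the full dataset, it can help to compute the coefficient by hand once and check it against NumPy. The following sketch uses two small made-up arrays:

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
num = np.sum((a - a.mean()) * (b - b.mean()))
den = np.sqrt(np.sum((a - a.mean())**2)) * np.sqrt(np.sum((b - b.mean())**2))
print(num / den)                # Pearson coefficient from the formula
print(np.corrcoef(a, b)[0, 1])  # NumPy's value, which should agree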
In the following code snippet, we select the net immigration rates for counties in Texas as our datasets and use the corr() function to inspect the correlation coefficient across years:
corrs = dfTX[['R_NET_MIG_2011', 'R_NET_MIG_2012', 'R_NET_MIG_2013', 'R_NET_MIG_2014',
              'R_NET_MIG_2015', 'R_NET_MIG_2016', 'R_NET_MIG_2017', 'R_NET_MIG_2018']].corr()
The output is a so-called correlation matrix, whose diagonal elements are the self-correlation coefficients, which are all exactly 1:
A good way to visualize this matrix is to use the heatmap() function from the Seaborn library. The following code snippet generates a nice heatmap:
import seaborn as sns
plt.figure(figsize=(10,8))
plt.rcParams.update({'font.size': 12})
sns.heatmap(corrs,cmap="YlGnBu");
The result is as follows:
We do see an interesting pattern: odd years correlate more strongly with one another, and even years correlate more strongly with one another, but the correlations between even- and odd-numbered years are weaker. Perhaps there is a 2-year cyclic pattern, and the heatmap of the correlation matrix has just helped us discover it.
Cross-tabulation
Cross-tabulation can be treated as a discrete analog of correlation detection for categorical variables. It helps reveal relationships between categorical variables and can inform the design of downstream tasks.
Here is an example. I am creating a list of weather information and another list of a golfer’s decisions on whether to go golfing. The crosstab() function generates the following table:
weather = ["rainy","sunny","rainy","windy","windy",
"sunny","rainy","windy","sunny","rainy",
"sunny","windy","windy"]
golfing = ["Yes","Yes","No","No","Yes","Yes","No","No",
"Yes","No","Yes","No","No"]
dfGolf = pd.DataFrame({"weather":weather,"golfing":golfing})
pd.crosstab(dfGolf.weather, dfGolf.golfing, margins=True)
As you can see, the columns and rows give the exact counts, which are identified by the column name and row name. For a dataset with a limited number of features, this is a handy way to inspect imbalance or bias.
We can tell that the golfer goes golfing if the weather is sunny, and that they seldom go golfing on rainy or windy days.
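Normalizing the table makes the pattern even clearer. pd.crosstab accepts a normalize argument, and normalize="index" converts each row of counts into proportions:

# Each row (weather type) now sums to 1, showing the golfing rate per weather
pd.crosstab(dfGolf.weather, dfGolf.golfing, normalize="index")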
In conclusion
In this article, we’ve explained what descriptive statistics are and what they can tell you. You have seen how to extract information from your datasets and use descriptive statistics to make quantitative judgments. Hopefully, it is now clear that an understanding of the different descriptive statistics, knowledge of when to use them, and the practical ability to derive them from your data, are indispensable items in the data scientist’s toolkit.
Equipped with this knowledge, you can reinforce your understanding of data science and data analysis from a statistical perspective and extract meaningful insights from your data using Python programming. Find out how with Rongpeng Li’s book Essential Statistics for Non-STEM Data Analysts.