Introduction to Statistics for Data Science
Basic Level — The Fundamentals of Descriptive Statistics
Statistics is a big part of a Data Scientist’s daily life. Each time you start an analysis, your first step, before applying fancy algorithms and making predictions, is to do some exploratory data analysis (EDA) and try to read and understand the data by applying statistical techniques. With this first analysis, you are able to understand what type of distribution the data presents.
At the end of this brief introduction, we will use the Lego dataset to make sense of these concepts.
What is Descriptive Statistics?
Descriptive statistics is the analysis of data that helps to describe, show or summarize information in a meaningful way, so that whoever is looking at it can detect relevant patterns.
When looking at data, the first step of your statistical analysis is to determine whether the dataset you’re dealing with is a population or a sample.
A population is the collection of all items of interest in your study and it’s generally denoted with the capital letter N. The calculated values when analysing a population are known as parameters. On the other hand, a sample is a subset of a population and it’s usually denoted by the letter n. The values calculated when using a sample are known as statistics.
This is why this field is known as Statistics! Shocker! 🤯
Populations are hard to define and analyze in real life. It is easy to miss values when studying a population, which biases the analysis, and analyzing a whole population is very expensive and time-consuming.
Therefore, you normally hear about samples. In contrast to a population, a sample is not expected to account for all the data and is easier to analyze, since its smaller size makes the analysis less time-consuming, less costly and less prone to error. A sample must be both random and representative of the population. With such a sample, you can make inferences about the population.
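As a minimal sketch of the population/sample distinction, the snippet below builds a made-up population of 10,000 values (the numbers are invented purely for illustration), draws a simple random sample from it, and compares the population parameter with the sample statistic.

```python
import random
import statistics

random.seed(42)  # fixed seed so the example is reproducible

# Hypothetical population: N = 10,000 made-up measurements
population = [random.gauss(170, 10) for _ in range(10_000)]

# Simple random sample without replacement: n = 200
sample = random.sample(population, k=200)

mu = statistics.mean(population)  # parameter (population mean)
x_bar = statistics.mean(sample)   # statistic (sample mean)

print(f"N = {len(population)}, population mean = {mu:.2f}")
print(f"n = {len(sample)}, sample mean = {x_bar:.2f}")
```

The two means should land close to each other, which is what makes a random, representative sample useful for reasoning about the population.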
Types of data
In a dataset, data can be either Categorical or Numerical.
Categorical data describes groups or categories such as car brands, gender, age groups, names, etc. On the other hand, numerical data, just as the name reveals, represents numbers. Within this category, you can have Discrete and Continuous numbers.
- Discrete - data which can only take certain values, typically counts. You only have a fixed set of values you have access to. For example, age in whole years, number of cars in a street, number of fingers.
- Continuous - data which can take any real or fractional value within a certain range (e.g. weight, balance in a bank account, value spent on a purchase, grade on an exam, foot size)
Levels of Measurement
Data can have two levels of measurement : Qualitative and Quantitative.
Qualitative Data is information that characterizes attributes in data but does not measure them. It can be divided into two types: Nominal or Ordinal.
- Nominal : They are not numbers and cannot be put in any order;
Example : names
- Ordinal : Consists of groups and categories that follow a strict order.
Example : Grades (e.g. Bad, Satisfactory, Good)
Quantitative Data measures attributes in the data. It can be divided into two groups : Interval and Ratio
- Interval : Represented by numbers, without having a true zero. In this case, the zero value is an arbitrary reference point, so differences are meaningful but ratios are not.
- Ratio: Represented by numbers and has a true zero.
Whether quantitative data is regarded as interval or ratio depends on the context and the scale we use. For example, think about temperature. A reading of 0 °C or 0 °F does not mean there is no temperature, since neither is a true zero: absolute zero is -273.15 °C, or -459.67 °F. Therefore, temperature in Celsius or Fahrenheit has to be considered Interval data, since its zero point is arbitrary.
However, if you analyse temperature in kelvins, absolute zero is 0 K, so the temperature is now Ratio data since the scale has a true zero.
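A small sketch of why the distinction matters: ratios computed on an interval scale are not meaningful, while ratios on a ratio scale are. The two temperatures below are arbitrary example values.

```python
def celsius_to_kelvin(c):
    """Convert a temperature from degrees Celsius to kelvins."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0  # arbitrary example temperatures in Celsius

# On the Celsius scale the ratio is 2, but 20 °C is not "twice as hot"
naive_ratio = t2_c / t1_c

# On the Kelvin scale, which has a true zero, the ratio is meaningful
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(naive_ratio)
print(round(kelvin_ratio, 3))
```

The Kelvin ratio is only slightly above 1, which reflects the real physical difference between the two temperatures.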
The Normal Distribution is one of the most important concepts in Statistics, since many common statistical tests assume normally distributed data. A normal distribution describes data that is spread symmetrically around the mean, with most values accumulating towards the center and progressively fewer towards the tails. This distribution is also known as the Gaussian distribution, and its graph as the bell curve.
A Normal Distribution exists if your data is symmetrical, bell-shaped, centered and unimodal.
Measure of Central Tendency
The measure of central tendency refers to the idea there’s one number that best summarizes the entire set. The most popular are mean, median and mode.
The mean is considered the most reliable measure of central tendency for making assumptions about a population from a single sample. The symbol μ is used to describe the population mean, whereas x̅ describes the sample mean.
We can find the mean by summing all the components and then dividing the sum by the number of components. As already said, it’s the most common measure of central tendency, but it has the downside of being easily affected by outliers. Sometimes, due to outliers, the mean might not be enough to make conclusions.
The median is the midpoint or “middle” value of your dataset once it is sorted in ascending order. It is also known as the 50th percentile. Because the median is far less sensitive to outliers than the mean, it is usually a good idea to calculate it as well.
But what about the representation that most values give?
The mode shows us the value that occurs most often. It can be used for numerical as well as categorical variables. If no value appears more than once, you say there is no mode.
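To make the three measures concrete, here is a short sketch using Python's built-in statistics module on a made-up list of values. Adding a single extreme value shows how strongly the mean reacts to an outlier while the median and the mode barely move.

```python
import statistics

parts = [10, 12, 12, 15, 18, 20, 25]  # made-up values
with_outlier = parts + [1000]         # same data plus one extreme value

for label, values in (("original", parts), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", statistics.mean(values),
        "median:", statistics.median(values),
        "mode:", statistics.mode(values),
    )
```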
Which is the best measure?
The measures should be used together instead of independently. There is no best and using only one is not advisable. Moreover, in a normal distribution, these measures all fall at the same midline point. This means that the mean, mode and median are all equal!
Measures of variability
Measures of variability quantify the dispersion of our data, usually around the mean value. The best-known measures of variability are the range, interquartile range (IQR), variance and standard deviation.
The range is the most obvious measure of dispersion and describes the difference between the largest and the smallest points in your data.
For example, if the largest value is 99 and the smallest is 12, the range is 99 − 12 = 87.
- Interquartile range (IQR)
The IQR is a measure of variability between the upper (75th) and lower (25th) quartiles. The data is sorted into ascending order and divided into four quarters.
While the range measures the full spread of values in our dataset, the interquartile range measures the interval in which the middle 50% of the values lie.
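The sketch below computes both measures on a made-up dataset, chosen so that its extremes are 12 and 99. Note that tools use slightly different conventions for quartiles; the "inclusive" method used here is one of them, so other libraries may return slightly different values on other data.

```python
import statistics

# Made-up data whose smallest and largest values are 12 and 99
data = [12, 25, 31, 40, 47, 58, 63, 75, 99]

data_range = max(data) - min(data)

# "inclusive" treats the data as the whole population and
# interpolates linearly between neighbouring points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # width of the interval holding the middle 50% of values

print("range:", data_range)
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
```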
The variance and the standard deviation are more elaborate ways of measuring how far the data is dispersed from the mean value of the dataset.
The variance is found by computing the difference between every data point and the mean, squaring that difference and summing over all available data points. Finally, the sum is divided by the total number of data points (for a population; for a sample, the usual convention is to divide by n − 1 instead).
Squaring the difference has two main purposes:
- Dispersion is non-negative: by raising the difference to the power of 2 we ensure there are no negative values, so positive and negative deviations cannot cancel out.
- It amplifies the effect of large differences.
The problem with Variance is that because of the squaring, it is not in the same unit of measurement as the original data. This is why the Standard Deviation is used more often because it is in the original unit. Squared dollars means nothing in statistics.
- Standard Deviation
Usually standard deviation is much more meaningful than variance. It is the preferred measure of variability as it is directly interpretable.
Standard deviation is basically the square root of our variance.
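The steps described for the variance and the standard deviation can be sketched by hand and checked against Python's statistics module. The data is made up, and the division by n − 1 for the sample variance is the usual convention rather than something specific to this article.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]

pop_var = sum(squared_diffs) / len(data)           # population variance
pop_std = math.sqrt(pop_var)                       # population standard deviation
sample_var = sum(squared_diffs) / (len(data) - 1)  # sample variance (n - 1)

print("population variance:", pop_var, "std:", pop_std)
print("sample variance:", round(sample_var, 3))

# The library functions follow the same definitions
print(statistics.pvariance(data), statistics.pstdev(data))
print(round(statistics.variance(data), 3))
```

Keep the two conventions in mind when comparing tools: pandas' var() and std() divide by n − 1 by default, while NumPy's np.var() and np.std() divide by n.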
Standard deviation is best used when the data is unimodal. In a normal distribution, approximately 34.1% of data points fall between the mean and one standard deviation above it. Since a normal distribution is symmetrical, about 68.2% of data points fall within one standard deviation of the mean. Around 95% of points fall within two standard deviations of the mean, and about 99.7% within three.
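The percentages quoted for the normal distribution can be checked numerically. This is a small sketch using NormalDist from Python's standard library (available from Python 3.8), applied to the standard normal distribution.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Share of the distribution within k standard deviations of the mean
shares = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```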
With the Z-Score, you can check how many standard deviations below (or above) the mean a specific data point is.
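In code, the Z-Score is the distance from the mean divided by the standard deviation. The mean and standard deviation in this sketch are invented exam-score values used only for illustration.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std

# Hypothetical exam scores with mean 70 and standard deviation 10
mean_score, std_score = 70, 10

z_high = z_score(85, mean_score, std_score)  # above the mean
z_low = z_score(55, mean_score, std_score)   # below the mean

print(z_high, z_low)
```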
Measure of Asymmetry
The modality of a distribution is determined by the number of peaks the data presents. Most distributions are unimodal, meaning they have a single peak around one frequently occurring value, while a bimodal distribution has two frequently occurring values.
Skewness is the most common tool to measure asymmetry. It indicates on which side the tail of the distribution, and therefore the outliers, lies. If the data is left skewed, the outliers are to the left and most of the data is concentrated on the right. As a rule of thumb, when the mean is higher than the median we have a right skew; if it is lower, we have a left skew.
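As a rough sketch, skewness can be computed from the moments of the data: the third central moment divided by the variance raised to the power 1.5. This is the simple population formula, so libraries that apply a bias correction may report slightly different values. The data is made up.

```python
import statistics

def skewness(values):
    """Moment-based skewness: third central moment / variance ** 1.5."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 50]    # one large outlier on the right
left_skewed = [-v for v in right_skewed]       # mirrored: outlier on the left

for label, values in (("right", right_skewed), ("left", left_skewed)):
    print(
        label,
        "mean:", statistics.mean(values),
        "median:", statistics.median(values),
        "skewness:", round(skewness(values), 2),
    )
```

A positive value indicates a right skew (mean above the median) and a negative value a left skew (mean below the median).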
Measures of asymmetry are the link between Central Tendency Measures and Probability theory, which will ultimately allow us to obtain more accurate knowledge of the data we are working with.
Let’s take a look at the LEGO Parts/Sets/Colors and Inventories of every official LEGO set, so that we can better understand the basic statistics we’ve seen so far.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Main files
lego = pd.read_csv('sets.csv')
lego.shape
Looking at the variables available to us, we can start by making the following assumptions:
set_num — Categorical|Ordinal
name — Categorical|Nominal
year — Numerical|Discrete|Interval
theme_id — Numerical|Discrete|Interval
num_parts — Numerical|Discrete|Ratio
Do you know how to answer the following question: what is the average number of parts in a LEGO set? Do you remember which measure of central tendency you can use? What about mean and median?
Here we see a visual distribution from this variable. In blue we have the mean value whereas in green is the median.
We can clearly see this is not a normal distribution. The distribution for this variable presents a right skew, with most of the outliers to the right of the graph. Since it is not a normal distribution, the median (green) and the mean (blue) take different values. Remember the mean is affected by outliers.
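The comparison between mean and median can be sketched in pandas as follows. Since sets.csv is not bundled here, the DataFrame below is a hypothetical stand-in that reuses the num_parts column name with made-up rows, so the numbers will not match the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for sets.csv: real column name, made-up values
lego = pd.DataFrame({"num_parts": [10, 25, 40, 60, 120, 2500]})

mean_parts = lego["num_parts"].mean()
median_parts = lego["num_parts"].median()

print("mean:", round(mean_parts, 1))
print("median:", median_parts)

# A mean well above the median hints at a right-skewed distribution
print("right skew suspected:", mean_parts > median_parts)
```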
Which year had the most sets published? Do you remember which measure of central tendency you can use? What about the mode?
What about non numerical variables? Can Pandas also calculate the mode?
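Both questions can be answered with the mode() method of a pandas Series, which works on numerical and non-numerical columns alike. The DataFrame below is a hypothetical stand-in for sets.csv, with made-up rows under the year and name column names.

```python
import pandas as pd

# Hypothetical stand-in for sets.csv: real column names, made-up rows
lego = pd.DataFrame({
    "name": ["Castle", "Fire Station", "Castle", "Race Car", "Castle", "Race Car"],
    "year": [1999, 2005, 2005, 2012, 2012, 2012],
})

# mode() returns a Series because several values can tie for first place
top_year = lego["year"].mode().iloc[0]
top_name = lego["name"].mode().iloc[0]

print("year with the most sets:", top_year)
print("most frequent set name:", top_name)
```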
Until now we’ve focused on the centrality measures. What about dispersion? How is our data spread out?
- Standard Deviation
So we can see that our data is widely dispersed around the mean: the standard deviation is about 330 parts and the variance is about 109027 parts squared.
quartiles = [.25, .5, .75]
From the quartiles we can see that 75% of our observations have fewer than 172 parts. Using our mean we see that on average a set has 162 parts, so it is often interesting to compare the quartiles with the mean. Moreover, you also see that the 0.50 quartile matches the median, as expected.
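A sketch of how the dispersion figures and quartiles can be obtained in pandas. The DataFrame is again a hypothetical stand-in for sets.csv with made-up num_parts values, so the results differ from the figures quoted for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for sets.csv: real column name, made-up values
lego = pd.DataFrame({"num_parts": [10, 25, 40, 60, 120, 2500]})

quartiles = [.25, .5, .75]

std_parts = lego["num_parts"].std()  # sample standard deviation (divides by n - 1)
var_parts = lego["num_parts"].var()  # sample variance, in parts squared
q = lego["num_parts"].quantile(quartiles)

print("standard deviation:", round(std_parts, 1))
print("variance:", round(var_parts, 1))
print(q)

# The 0.50 quantile is the median
print(q.loc[0.5] == lego["num_parts"].median())
```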
In conclusion, you can get a summary of most of the measures we’ve seen so far by using the Pandas describe() method.
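For a numerical column, describe() returns the count, mean, standard deviation, minimum, quartiles and maximum in one call. The sketch below uses a hypothetical stand-in for sets.csv with made-up num_parts values.

```python
import pandas as pd

# Hypothetical stand-in for sets.csv: real column name, made-up values
lego = pd.DataFrame({"num_parts": [10, 25, 40, 60, 120, 2500]})

summary = lego["num_parts"].describe()
print(summary)

# Individual entries can be read by label; "50%" is the median
print("median:", summary["50%"])
```

Note that the mode and the variance are not part of this summary and still need mode() and var().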
Read the following post for more on Statistics.
Introduction to Statistics for Data Science
Intermediate Level — The Fundamentals of Descriptive Statistics