
Descriptive Statistics for Data Science

Sai krishna
22 min read · Oct 4, 2019


Familiarize yourself with these statistical concepts, which are fundamental to getting "a feel" for your data at a quantitative level.

After going through a good number of blogs by experts in the field (data analytics, data science, machine learning), it became clear that asking the right set of questions is of utmost importance when it comes to understanding and discovering insights from the data; moreover, it drives the steps we take during EDA (Exploratory Data Analysis) and modeling as well. Therefore I shall try my best to present you with relevant questions which the corresponding concepts help answer. I will mention them in quotes

“Like this”

I like to consider statistics an ancestor of machine learning, which results in some subtle similarities in the way we frame and solve problems. I will also mention these in quotes.

Contents

Intro to Statistics

  • Terminology
  • Descriptive vs Inferential Statistics
  • Different Datatypes
  • Sampling Techniques

Descriptive Statistics

  • Measure of Central Tendency
  • Measure of Spread and Variability
  • Normal Distribution and Z-scores
  • Confidence Intervals
  • Correlation
  • Variance Explained

Introduction to Statistics

Context

  • Usually when a machine learning concept is taught, its context or setup is centered around modeling (collecting the data -> preprocessing -> modeling -> prediction).
  • But concepts in statistics are not designed around that kind of setup.
  • We will frame our “Problem” as trying to gain insights, verify our assumptions/hypotheses about the data.
  • Having said that, these concepts and tools are equally capable of fitting into the machine learning pipeline.

Terminology

Let’s get started with Nomenclature:

Population (not to be confused with the literal meaning of population)

  • Set of all possible Entities in our group of interest
  • E.g. if we are conducting a study about the "behavior of school children between 8th and 10th grade", then our population is all students between 8th and 10th grade around the world.

The population is the entire data of the problem domain (not only the train and test sets but also everything our model would be predicting on). E.g. suppose we wanted to train a model which classifies between cat and dog pictures; the population would then be all possible dog and cat pictures.

Parameter

  • The statistical measures we apply over the entire population
  • Eg: Mean, Mode, correlation, etc.

Sample

  • A subset of the population
  • Our training or test dataset

Statistic (not to be confused with Statistics)

  • The statistical measure we apply over a sample

Note:

  • It is practically impossible (in most, if not all, cases) to work over an entire population.
  • The population does not necessarily have to be large; it depends on the problem statement of the study. You either explicitly define the size of the population by saying "I am going to conduct a study to understand the behavior of school students in Delhi", or you implicitly define it by conducting your experiment in a way where you only study students from Delhi.
  • Your sample may not necessarily represent the entire population.

Descriptive vs Inferential Statistics

Descriptive Statistics

  • Branch of statistics which deals with quantitatively analyzing or summarizing the data we have and drawing conclusions/insights about that same data.
  • All of our conclusions and results are confined to the data we have performed the analysis on (meaning we cannot yet generalize our results to the entire population the sample is a part of).

Inferential Statistics

  • Branch of statistics which provides us with frameworks and a suite of tests helping us infer (project) our conclusions onto the entire population, even though only a subset of that population has been analyzed.
  • This ability to infer onto unseen data opens the door to new prospects such as "prediction", "estimation", etc.
  • The concept of probability mostly converges with statistics in this area of inferential statistics.

As you might have guessed, machine learning seems to be a descendant of inferential statistics (that's one way to look at it). Training our model on train data and making predictions on test data (and data in production) is similar to how inference in inferential statistics works.

Data Types

Data types can be segregated based on the granularity of the data and further based on its scale.

Discrete Variables (AKA Categorical Variables )

  • A type of variable whose domain is a finite set of values (i.e. variables which can take only a finite number of distinct values)
  • I like to refer to them as atomic values (as in atoms), because there is no such thing as 1.5 atoms; it is either one atom or two.
  • Examples do a better job of building your intuition as to what these are and how they differ from others
  • Number of cars is a discrete value (one car or two cars); different colors are also discrete values (red, orange), as is gender (Male, Female)
  • There are 2 scales of discrete data types
  • Nominal: data which has no order to it and can be considered as a label, e.g. Blue, Black, Brown
  • Ordinal: data which has an order to it, but where we cannot say exactly by how much one value is greater than another, e.g. Low < Medium < High; in this example we cannot quantify by how much High is greater than Medium (a code sketch of the two scales follows below)
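
As a rough code sketch of the two scales (assuming pandas is available), an ordered categorical captures the ordering of ordinal data, while a plain categorical treats the values as unordered labels:

import pandas as pd
# Nominal: labels with no inherent order
colors = pd.Series(["Blue", "Black", "Brown", "Blue"], dtype="category")
# Ordinal: labels with an order, but without a fixed distance between them
size_type = pd.CategoricalDtype(categories=["Low", "Medium", "High"], ordered=True)
sizes = pd.Series(["Low", "High", "Medium", "Low"], dtype=size_type)
print(colors.value_counts())   # counting labels is fine for nominal data
print(sizes.min())             # "Low" -- comparisons only make sense for ordinal data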

Continuous variables

  • A type of variable that can take an infinite number of values.
  • E.g. height of a person, 6.1 feet or even 6.000…1 feet, weight of a person, the temperature of the room, etc.
  • Meaning these variables leverage the fact that there are infinitely many values between any two numbers on the number line
  • There are 2 scales of continuous data types
  • Interval: data which has an order to it, where we can quantify by how much one value is greater than another, e.g. temperature (0 degrees < 10 degrees < 20 degrees < 30 degrees)
  • Ratio: along with all the properties of interval data, it also establishes the meaning of a true zero (zero = non-existent), e.g. number of sales this week (10 < 20 < 30); here 0 indicates no sales at all, unlike the case above where zero actually indicates a degree of temperature.

Note:

  • Though at a glance distinguishing discrete and continuous variables looks fairly objective (obvious), when working with real-world data the decision to treat a variable as discrete or continuous becomes contextual. E.g. time of day seems continuous, but when the values in the dataset only record hours (and thus take only 24 values) it becomes discrete; moreover, for algorithmic reasons we often transform one type into the other (continuous <==> discrete), as in the sketch after this note.
  • Ratios give intrinsic meaning to measures like central tendency and spread
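
As a minimal sketch of the continuous-to-discrete transformation mentioned in the note (assuming pandas), pd.cut bins a continuous variable such as the hour of the day into a handful of categories:

import pandas as pd
# continuous-looking readings: hour of the day as a float
hours = pd.Series([0.5, 6.2, 9.8, 13.4, 18.9, 23.1])
# bin them into four coarse categories, turning a continuous variable into a discrete one
periods = pd.cut(hours, bins=[0, 6, 12, 18, 24], labels=["night", "morning", "afternoon", "evening"])
print(periods)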

Methods of Sampling

As we have already discussed, we will mostly be working with samples of the population, so let's now discuss different methods of sampling.

Random Sampling

  • A type of sampling where each entity in the population has an equal probability to be a part of the sample
  • There should be no pattern or systematic bias among the entities selected from the population

E.g. if we are performing an experiment in which the population is all the students in the world, then to get a random sample each and every student in the world should have an equal probability of being part of the sample (irrespective of how convenient it is to select students near your office).

Representative Sampling

  • A type of sampling where different known and important properties in your population are to be preserved in your sample, such that your sample best represents the population
  • E.g. Maintaining the sex ratio, racial ratio, age, etc. All these factors are contextual to our study

Convenience Sampling

  • A type of sampling method which is done on the basis of the proximity/easiest available part of the population to sample
  • Here easiest indicates ease of access; see the sketch below
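
A rough sketch of the difference, using only numpy: random sampling draws uniformly from the whole population, while convenience sampling just takes whatever is nearest at hand (here, the first few entries):

import numpy as np
population = np.arange(1000)   # stand-in for our population of interest
# Random sampling: every entity has an equal chance of being picked
random_sample = np.random.choice(population, size=50, replace=False)
# Convenience sampling: just grab the entities that are easiest to reach
convenience_sample = population[:50]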

Descriptive Statistics

Measure of Central Tendency

  • A statistical measure which indicates the central value of the data distribution
  • A numerical value, which provides us with a summary of the data or a number that is supposed to best represent the given data.
  • E.g. say I have two sections of a particular grade in a school, both of which have just received their results, and I want to compare the marks of these two classes. I could compare the marks of each student in one section with the other (which is inefficient and naive). So what we usually do is compare the means of the two classes. This is what I mean when I say it best represents the dataset: here the marks of a class are the dataset, and the mean of that class best represents it (the mean is one of the measures of central tendency, as you will see below).

Central tendency of data can be obtained using three measures

Mean:

  • It is the arithmetic average of all the values in a set
  • Calculated as a sum of all numbers divided by the total number of data points

We often compare our machine learning model's predictions with a baseline function which predicts the mean of the data when our dependent variable is continuous; we also use the mean of a feature to replace any null values present.

import numpy as np
array = [1,2,3,4,5,6,7,8,9,10]
np.mean(array)

Median:

  • Literally the middle value of the dataset when arranged in ascending order; in case we have an even number of entries, the average of the two middle numbers is taken as the median.
  • We use it as an alternative to the mean (for example, to replace null values of continuous variables) when we have a lot of outliers in the data

Outliers: entries/numbers which have extremely low or high magnitude (relative to the other numbers in the group) are called outliers, e.g. in [10,20,30,40,9654,46,80,0.001], 0.001 and 9654 are outliers.
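
A quick illustration of why the median is preferred here: on the outlier-heavy list above, the mean is dragged toward the extremes while the median stays near the bulk of the values.

import numpy as np
data = [10, 20, 30, 40, 9654, 46, 80, 0.001]
print(np.mean(data))    # ~1235, pulled up by the outlier 9654
print(np.median(data))  # 35.0, stays close to the typical values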

import numpy as np
array = [1,2,3,4,5,6,7,8,9,10]
np.median(array)

Mode:

  • The value which occurs most often in the data is called Mode of the data

We often compare our machine learning model's predictions with a baseline function which predicts the mode of the data when our dependent variable is categorical; we also use the mode to replace null values of categorical variables.

from scipy import stats
array = [1,2,3,4,5,6,7,8,9,10]
stats.mode(array)

Measure of Spread and Variability

Central tendency alone cannot do a great job of describing the data. This is where measures of spread and variability come into play, giving us a feel for how numerically diverse our data is.

Range:

  • A statistical measure which quantifies the stretch of the number line that our data covers.
  • Difference between the maximum and the minimum value of the dataset

image credits: https://sites.google.com/site/piggraphy/home/statistical-skills/interquartile-range-range

import numpy as np
array = [1,2,3,4,5,6,7,8,9,10]
np.max(array) - np.min(array)

Interquartile Range

  • Since the range is highly sensitive to outliers in the data, we introduce a more robust measure of spread: the interquartile range
  • The difference between the 75th percentile and the 25th percentile of the data is called the interquartile range

Here Q1 is the 25th percentile, Q2 the 50th, and Q3 the 75th.

  • This is (relatively) insensitive to outliers, as we are considering the spread from the 25th to the 75th percentile, ignoring the lower and higher extremes

image credits: https://www.mathsisfun.com/definitions/interquartile-range.html

import numpy as np
array = [1,2,3,4,5,6,7,8,9,10]
q75, q25 = np.percentile(array, [75, 25])
iqr = q75 - q25

Measure of variability

  • A quintessential statistical measure; the concept of variance bleeds into (almost) every concept we discuss from here on

Variance:

  • It is the measure of how much, on average, each point in the dataset varies from the mean of the data (more precisely, the average of the squared deviations from the mean).
  • Higher variance in a feature suggests that it is more discriminative in nature, making it potentially a good feature

In cases like ensemble learning, where we aggregate predictions made by individual weak models, a higher variance among the different predictions makes the overall prediction less confident.

Variance

import numpy as np
array = [1,2,3,4,5,6,7,8,9,10]
np.var(array)

Standard deviation

  • The concept is similar to variance with a slight numerical alteration for convenience in quantifying the variability
  • It is the square root of the variance
import numpy as np
array = [1,2,3,4,5,6,7,8,9,10]
np.std(array)

Distribution of Data

Whenever I refer to the distribution of data, simply picture the following:

  • If we take a survey of a bunch of people, asking them their age (I know, a pretty boring survey, but anyhow), and store the answers as a list of numbers as below.
ages = [25,27,28,29,30,25,32,35,28,26,25,20,39,45,43,29,45,62,8,12,15,16,14,23,21]
  • Now make a plot out of it, with ages on the x-axis and the number of times each age occurs (the frequency of the corresponding age) on the y-axis; that is a visual representation of the distribution of the data

Distribution of ages

  • We can also represent it mathematically as a function y = f(x), where x is a data point (in the above case the age of a person) and y is the frequency of that age
  • If we perform a simple operation where we divide the function by the total number of data points, we get the probability of that data point
  • probability = number of favorable outcomes / total number of outcomes
  • probability = frequency of the favorable outcome / total number of outcomes
prob_ages = [0.04,0.04,0.04,0.04,0.04,0.04,0.08,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.12,0.04,0.04,0.08,0.08,0.04]
  • We call this new, altered version of the distribution the probability distribution p(x): if we plug in a data point, it spits out the probability of occurrence of that data point in the set. The following is a visual representation of that, and a small code sketch of the calculation appears below.
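
As a small sketch of that operation, we can count the frequency of each age and divide by the total number of data points to get the empirical probability distribution:

from collections import Counter
ages = [25,27,28,29,30,25,32,35,28,26,25,20,39,45,43,29,45,62,8,12,15,16,14,23,21]
counts = Counter(ages)           # frequency of each age, f(x)
total = len(ages)
prob = {age: count / total for age, count in counts.items()}   # probability p(x)
print(prob[25])   # 0.12, since 25 occurs three times out of 25 readings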

Normal Distribution

  • Normal Distribution refers to one particularly interesting type of distribution that we study throughout statistics
  • The normal distribution is as significant to statistics as pi is to geometry, not only in the context of its academic importance but also because it represents a pattern in how natural systems (demographics) work; it models many naturally occurring phenomena
  • Most (if not all) demographic data (like weight, height, marks) of the human population are normally distributed, which makes studying it quintessential, especially for fields like social research.
  • It is also known as a bell curve or Gaussian distribution

It has certain characteristics such as the following

  • Symmetry about the center (where the mean lies)
  • One bump in the middle (unimodal)
  • Asymptotic (the tails never touch the x-axis), telling us rhetorically that nothing is impossible :)
  • Mean, median and mode are all equal
  • Most values are concentrated around the mean
  • A higher standard deviation makes the distribution wider and flatter (a lower one makes it taller and narrower), while the mean determines where the center of the curve lies, as you can see in the image below

Image Credits: https://www.varsitytutors.com/hotmath/hotmath_help/topics/normal-distribution-of-data

A normal distribution with mean 0 and standard deviation 1 is called the standard normal distribution.

Mathematically, the normal distribution can be parameterized as a function of the mean μ and standard deviation σ: f(x) = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²)) (the derivation of this is out of scope (my scope 😄))

image credits: https://www.thoughtco.com/normal-distribution-bell-curve-formula-3126278

Note: the normal distribution is not only philosophically a special concept but also heavily used in practice, because even distributions that are only vaguely close to normal can be approximated by it; as a result they possess similar characteristics and give us (with slight modification) the same outputs we would get if they were indeed normal. I guess it's this flexible nature of molding other distributions into itself which makes it special.

Z-Score and P-value

Since the normal distribution is such a recurring pattern in the realm of statistics, it has been carefully and thoroughly studied, as a result of which we have some literature/conventions about it at our disposal, which make it easier for us to study any system which possesses a normal distribution.

As previously discussed, if the distribution of the data is given we can generate probabilities easily. So statisticians thoroughly studied the standard normal distribution, mapped each position in the distribution to its corresponding probability (this helps us in a lot of ways, as we will discuss later), and provided us with a table to look up the probability for each point (instead of computing it). This probability is called the p-value.

The p-value we obtain indicates the following: given a population with a certain mean and standard deviation, how likely (what probability) am I to find a particular value in that distribution?

A use case: say I have BMI (Body Mass Index) readings of professional athletes that form a normal distribution. We have an individual whose demographic we are not sure of, and we need to decide, solely based on his/her BMI, whether that person belongs to the professional athletes. In that case we can use the p-value, which gives us the likelihood (as a probability) that he/she belongs to the athletes.

But there is a problem: though many datasets may be normally distributed, every dataset is different (different mean, different standard deviation). So instead of giving probabilities based on the raw numerical values of the dataset, we have to come up with a standardized measure, which is the "relative position" of the points within the normal distribution.

Let us consider the data below; here normal_10 is an array containing numbers with mean 10, possessing the following distribution.

import numpy as np
import seaborn as sns
normal_10 = np.random.normal(10, 1, 1000000)
sns.distplot(normal_10);

normal 10

Let us consider another dataset; normal_100 is another array of numbers which has mean 100 and possesses the following normal distribution.

import numpy as np
import seaborn as sns
normal_100 = np.random.normal(100, 1, 1000000)
sns.distplot(normal_100);

normal_100

As we can see, normal_10 and normal_100 are different datasets, but both are normally distributed. If we observe the two distributions, the probability for corresponding positions in both is the same (the positions of 6 and 96 are equivalent, and we can see that their corresponding probabilities are also the same).

Hence we can conclude that, irrespective of which dataset we are dealing with, if it is normally distributed then we can predict the probability of each point based on its position in the distribution.

So now our task is to devise a technique that gives us a numerical value representing this relative position. For this we leverage one of the characteristics discussed above, the symmetry of the normal distribution about the mean: the relative position of a point is found by measuring how far it is from the mean. But we have another problem: depending on the scale of each dataset, this distance will vary, while the probability values we have are for a standard normal distribution (a normal distribution with mean 0 and standard deviation 1).

The normal distribution we are dealing with may not be standard. To fix this we use the same old unitary rule (if 5 apples cost 100 rupees, then how much is 1 apple? 100/5 = 20): we divide the distance from the mean by the standard deviation of the distribution, giving us what the distance from the mean would have been if the standard deviation were 1. This relative position of a point with respect to the mean, divided by the standard deviation, gives us a score called the Z-score: z = (x − μ) / σ.

Z-Score

image credits: http://www.z-table.com/z-score-formula.html

Let us see this in action. Say you are a statistics teacher and you have just finished grading, and the next day a student comes in person and asks for his marks. You cannot reveal them, as they will officially be given the day after tomorrow, but since you do not want to disappoint him, you instead decide to tell him how well he performed relative to the other students in the class. Statistically, this means you want to find the position of his marks in the normal distribution of the entire class's marks. So you calculate the Z-score of his marks and find the p-value to be 0.65, with his marks above the mean; in that case you would tell him he scored better than 65% of the class.

Now, to solve our initial problem of determining a person's demographic (athlete or not), we can calculate that individual's Z-score based on the mean and standard deviation of a group of athletes, and if the resulting probability is above a threshold (say 95%) we can consider him/her to be an athlete.
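
A rough sketch of both examples (assuming the marks, or BMIs, are roughly normally distributed; the numbers here are illustrative): compute the z-score from the group's mean and standard deviation, then turn it into a probability with scipy.stats.norm.cdf, the programmatic equivalent of a z-table lookup.

from scipy.stats import norm
class_mean, class_std = 60, 10   # illustrative class statistics
student_mark = 64                # illustrative student's mark
z = (student_mark - class_mean) / class_std   # relative position with respect to the mean
p = norm.cdf(z)                               # fraction of the class scoring below this mark
print(z, p)   # 0.4, ~0.66 -> "you scored better than about 65% of the class"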

Questions the Z-score helps us answer

Given a data point or a value, how likely are we to find it in the distribution we have?

Or, having a data point and a bunch of distributions, in which distribution are we more likely to find it?

from scipy import stats
array = [1,2,3,4,5,6,7,8,9,10]
# returns the Z-score of every element in the list
stats.zscore(array)

from scipy.stats import ttest_1samp
# marks of all students in the class (illustrative values)
sample = [55, 60, 62, 64, 70, 75, 80]
# the student's mark (illustrative value)
test_num = 64
# tests how the sample mean compares with the given value
tstat, pval = ttest_1samp(sample, test_num)

Confidence Interval

As we have seen, given a value of a distribution we can find out how likely it is to occur. Now say I reframe my problem: given a p-value of, say, 95%, along with the mean and standard deviation, what is the range of values I am likely to find?

Let us see a case where we would want to use this concept

Let’s say we have a population whose BMI is 21, We might want to know what is the range of BMI values which are 95% likely to occur if we take a sample from that population.

The calculation is pretty straightforward; it is just a rearrangement of the formula above (for obtaining the Z-score), with x now being the unknown.

First we look up the Z-score for the corresponding probability (here 95%), multiply it by the standard deviation, and add the result to the population mean to get one limit of the range; subtracting it from the mean gives the other limit.
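
A minimal sketch of that rearrangement, assuming the population mean and standard deviation are known (the BMI numbers here are illustrative): norm.ppf gives the z-score for the desired probability, and the two limits follow directly. The snippet below it, from the linked source, does the same job for a small sample using the t-distribution.

from scipy.stats import norm
mean_bmi, std_bmi = 21, 2              # illustrative population parameters
confidence = 0.95
z = norm.ppf((1 + confidence) / 2)     # z-score that leaves 95% in the middle
lower = mean_bmi - z * std_bmi
upper = mean_bmi + z * std_bmi
print(lower, upper)                    # roughly 17.08 to 24.92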

Confidence Intervals formula
from scipy.stats import sem, t
from numpy import mean
confidence = 0.95
data = [1, 2, 3, 4, 5]
n = len(data)
m = mean(data)
std_err = sem(data)   # standard error of the mean
# t-distribution is used since the sample is small and the population std is unknown
h = std_err * t.ppf((1 + confidence) / 2, n - 1)
start = m - h
print(start)
# OUTPUT 1.03675683852
end = m + h
print(end)
# OUTPUT 4.96324316148

Source for the code

Correlation

It’s a statistical measure of how linearly correlated two groups of data are

  • Till now we have only dealt with single variables
  • Here we will be dealing with two; in fact, correlation is a measure of how these two distributions or groups are related to each other
  • In this post we will use a correlation coefficient named Pearson's r, which is used when the data types of the two distributions are interval or ratio in nature

Let us look at the following data

Age = [43,21,25,42,57,59]
Glucose_level = [99,65,79,75,87,81]

Here we can see that the older an individual is, the higher the glucose level.

Correlation is a measure which helps us quantify the above statements

r: correlation coefficient/Pearson’s r

-1<=r<=1

-1: Indicates two features are highly negatively correlated, inversely proportional

0: Indicates no correlation at all

1: Indicates two features are highly positively correlated, directly proportional

image credits: http://faculty.cas.usf.edu/mbrannick/regression/corr1.html
Age = [43,21,25,42,57,59]
Z_Age = [0.12748537, -1.40233902, -1.12418913, 0.05794789, 1.10100998, 1.24008492]
Glucose_level = [99,65,79,75,87,81]
Z_Glucose_level = [1.72145713, -1.53018411, -0.19127301, -0.57381904, 0.57381904, 0]

Intuition

  • If we convert the values in each of the distributions to Z-scores, then, as we know, the Z-score of a number gives us the relative position of each point with respect to the mean. So we have data points transformed into a bunch of negative and positive numbers, with magnitude depending on how far from the mean they are and sign depending on which side of the mean they are on.
  • If we have two groups which are positively correlated (directly proportional), then a negative Z-score in one of the groups would match in sign (meaning the same side of the mean) with the corresponding value in the second group; similarly, all corresponding elements would have the same sign and more or less the same magnitude.
  • If we have a negative correlation (inversely proportional), the converse is true: a value at the left extreme of one group would have its corresponding element at the right extreme of the other group.
  • If we multiply corresponding elements of equal magnitude and the same sign, we get a number, say x, which is a (relatively) high positive number; this is the case of positive correlation.
  • If we multiply corresponding elements of equal magnitude and opposite sign, we get a number, say y, which is a (relatively) high negative number; this is the case of negative correlation.
  • If we multiply two sets of unrelated numbers, we get a number close to zero, as positive and negative products cancel each other out; this is the case of no correlation (see the sketch after this list).
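
The intuition above can be written down directly: Pearson's r is the average of the products of the paired z-scores. A minimal sketch using the Age/Glucose data from above; the manual value should match numpy's np.corrcoef:

import numpy as np
from scipy import stats
age = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])
z_age = stats.zscore(age)            # relative positions of the ages
z_glucose = stats.zscore(glucose)    # relative positions of the glucose levels
r_manual = np.mean(z_age * z_glucose)        # average product of paired z-scores
r_numpy = np.corrcoef(age, glucose)[0, 1]    # library version for comparison
print(r_manual, r_numpy)                     # both ~0.53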

When correlation does not give us the full picture

Case-1(Truncated range problem)

  • When the two sets of numbers we are comparing have little variance, meaning they do not vary much among themselves, we get a correlation coefficient of lower magnitude, which might actually be misleading
  • The two variables might in fact be highly correlated, but because of the restricted range the values deviate little from their means, giving low Z-scores and thus a low correlation coefficient

Case-2(Non-linear relation)

  • When two sets of numbers are non-linearly related, like below
Image Credits: https://www.emathzone.com/tutorials/basic-statistics/linear-and-non-linear-correlation.html
  • Here, in the second case, we can see that the y and x variables are related, but not in a linear fashion; our method of calculation cannot capture this type of relationship
  • So it is better to visualize the two lists of numbers on a scatter plot before concluding there is no relationship just because the correlation coefficient has a low magnitude (a small sketch of this pitfall follows below)
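
A quick sketch of this failure mode: a relationship that is perfectly deterministic but non-linear and symmetric (here y = x squared) produces a Pearson's r of roughly zero, which is exactly why the scatter plot should be checked first.

import numpy as np
x = np.arange(-10, 11)   # symmetric range around zero
y = x ** 2               # perfectly determined by x, but not linearly
print(np.corrcoef(x, y)[0, 1])   # ~0.0 despite the strong relationship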

Note:

Perhaps the most misused tool/measure in statistics is the correlation coefficient: people show a correlation between two events and try to infer causation, which may or may not hold.

For example,

Image Credits: http://davidaking.blogspot.com/2012/09/we-need-to-import-more-lemons-from.html

Here we can see that as more lemons are imported, the highway fatality rate decreases, so there exists a correlation. But does that mean that lemon imports cause the fatality rate to decrease, or that a decrease in highway fatalities causes more lemon imports? No (right..?)

Example 2: if we take data on the homeless population and the crime rate, we may observe a correlation, i.e. as the homeless population increases the crime rate increases. But based on this observation alone we cannot statistically assert or conclude either that homelessness causes crime or that more crime causes homelessness.

In the first example it was intuitive to reject causation, but in cases like the second one it may be intuitively tempting to infer causation; we shouldn't, without digging deeper.

“Correlation does not mean causation”

Note-2

Other correlation coefficients

  • As we have seen, a pair of variables must satisfy certain conditions regarding their type (interval or ratio) for us to calculate Pearson's coefficient. But what if our variables are ordinal or something else? For those cases we have other coefficients, although almost all of them rely on the same underlying principle of working with the variances
  • Point Biserial
  • When one of our variables is a continuous variable (i.e., measured on an interval or ratio scale) and the other is a two-level categorical (a.k.a. nominal) variable (also known as a dichotomous variable), we need to calculate a point-biserial correlation coefficient
  • Phi
  • Sometimes researchers want to know whether two dichotomous variables are correlated. In this case, we would calculate a phi coefficient (Φ)
  • Spearman rho
  • Sometimes data are recorded as ranks. Because ranks are a form of ordinal data, and the other correlation coefficients discussed so far involve either continuous (interval, ratio) or dichotomous variables, we need a different type of statistic to calculate the correlation between two variables that use ranked data. In this case we use Spearman's rho (a quick sketch follows below)
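
A minimal sketch of the rank-based alternative using scipy.stats.spearmanr; because it works on ranks, it also picks up monotonic but non-linear relationships that Pearson's r underestimates:

from scipy.stats import spearmanr, pearsonr
x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 36]    # monotonic but non-linear (y = x squared)
print(pearsonr(x, y)[0])     # below 1, since the relationship is not perfectly linear
print(spearmanr(x, y)[0])    # 1.0, the ranks agree perfectly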

Questions which correlation helps us answer

  • Are the two variables related to each other? If yes, then how?
  • Does a change in the value of one variable show a consistent change in the other too?
  • Is there a consistent pattern of how the two variables change?
  • What is the strength of the relationship?
  • How are they related?
  • Is the relationship found between these two variables significant enough for us to take it into account? (We will get to this in the next post on inferential statistics)
  • How much(percentage) variance of one variable is explained by the other variable?
import numpy as np
array_1 = [1,2,3,4,5]
array_2 = [10,9,8,7,6]
np.corrcoef(array_1 , array_2)[0, 1]

Source for the code

Variance Explained

People who intuitively understand what it means can skip this. The idea of one variable explaining the variance of another did not strike me intuitively the first time I read it, so having had that experience I will try my best to give you a feel for what it means.

Here variance just means change in the value of a variable. Let us take an example: say we take readings of three quantities, and let us assume none of them is constant over that period of time, or that the scale we measure them on allows the numerical values to show significant change, i.e. significant variance (you get the point). So when I refer to the variance of Dist-1, I mean the change in the values of Dist-1.

  • Your weight for a period of time(Dist-1)
  • The number of calories you are burning per day(Dist-2)
  • Your grades for the same period of time(Dist-3)

Here the variance of Dist-1 is not explained or justified by Dist-3. Meaning: the way the variances of the two distributions behave does not demonstrate a pattern, and we cannot justify the change in weight this month (relative to the previous month) by observing the grades for this month. So we say grades do not explain the variance of weight, and vice versa.

Whereas if we take Dist-2 and Dist-1, we clearly can. Say we observe a substantial change (up or down) in weight this month; we can look at the calories burned this month and say, here, this is the reason. So if we can at least mathematically account for the change/variance of one variable with the help of another, then we say one variable explains the variance of the other. I say "at least mathematically" because in our case it also makes intuitive sense that an increase in weight would be due to fewer calories burned, but that might not always be the case; even then, if the numbers behave that way, it is good enough for us to state that the variance is explained.
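
One of the questions from the correlation section, "how much (percentage) variance of one variable is explained by the other", has a direct numeric answer once we have Pearson's r: the square of the correlation coefficient (r squared, the coefficient of determination) is the proportion of variance explained. A minimal sketch with the weight/calories idea, using made-up illustrative numbers:

import numpy as np
# illustrative readings over ten months
calories_burned = [2000, 2200, 1800, 2500, 2100, 1900, 2400, 2300, 2000, 2600]
weight = [72, 71, 74, 69, 72, 73, 70, 70, 72, 68]
r = np.corrcoef(calories_burned, weight)[0, 1]
print(r)        # strongly negative: more calories burned, lower weight
print(r ** 2)   # proportion of the variance in weight "explained" by calories burned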

Check out my next post on Inferential Statistics

Feel free to follow me on Linkedin, Twitter, GitHub
