Practical Statistics with Python (1): Distributions, the Central Limit Theorem and Confidence Intervals

I wrote this set of articles to further solidify my understanding of these concepts on my path to learning Data Analytics. It is not meant for everyone, but I’m making it public anyway so I can be corrected where necessary and learn from it. The concepts presented here are concise and direct. Since this is written from a Data Analytics perspective, you won’t find most of the actual functions and equations behind the concepts, and some prerequisite knowledge of Statistics, Python and other related concepts is assumed; think of it as a simple cheat sheet.


A common distribution that we work with is the Normal or Gaussian distribution, commonly manifested as the bell curve. Specifically, this distribution has:

  • mean = median = mode
  • Symmetry about the center
  • Fairly equally distributed values on both sides of the mean
Figure: Normal distribution for a coffee-drinking user dataset
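As a quick illustration, here is a minimal sketch (the height numbers are made up) that simulates a normal distribution with NumPy and checks that its mean and median essentially coincide:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Simulate 10,000 made-up "heights" from a normal distribution
heights = np.random.normal(loc=66, scale=3, size=10000)
plt.hist(heights, bins=50);

# The mean and median land almost exactly on top of each other
heights.mean(), np.median(heights)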

Notation

  • Mean: μ
  • Std. Deviation: σ
  • Variance: σ²
  • Regression Coefficient: β

Sampling Distributions

Sampling distributions describe the distribution of a specific statistic. Sampling distributions help us draw conclusions about a population using a statistic. For an excellent, basic example that ‘cracked the code’ for me, check out Khan Academy’s video on sampling distributions.

Sampling Distributions with Python

Bootstrapping is a method of sampling in which each sampled item is put back into the dataset before the next draw, so it can be picked again; in other words, sampling with replacement.

import numpy as np

# A small dataset: 1 = coffee drinker, 0 = non-coffee drinker
students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])

# Perform some sampling for a specific statistic, in this case the mean
sample_props = []
for _ in range(10000):
    # Notice we are using replace=True to put samples back
    sample = np.random.choice(students, 5, replace=True)
    sample_props.append(sample.mean())

# Do something with sample_props
...
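For example, one thing you might do with sample_props is plot it and look at its center and spread; a minimal sketch continuing from the block above:

import matplotlib.pyplot as plt
%matplotlib inline

# The sampling distribution of the mean (here, a proportion of coffee drinkers)
plt.hist(sample_props);

# Its center sits near the mean of the original students array, with a smaller spread
np.mean(sample_props), np.std(sample_props)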

Law of Large Numbers

The larger our sample size, the closer our statistic gets to our parameter.

That is, as our samples get larger, the statistic we compute from them gets closer to a certain value: the parameter. If this is hard to picture, see the real-world examples presented here and here.

Law of Large Numbers with Python

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# This is just setting up some random data in pop_data
# The specifics of the gamma distribution are not relevant here
pop_data = np.random.gamma(1, 100, 3000)
plt.hist(pop_data);

# Population mean
pop_data.mean()
> 100.35978700795846

# Simulate draws of size 5, 20 and 100 and take the mean of each
np.random.choice(pop_data, size=5).mean()
> 179.26530134867892
np.random.choice(pop_data, size=20).mean()
> 136.39781572551306
np.random.choice(pop_data, size=100).mean()
> 93.421142691106638

# As you can see, the larger the sample size, the closer we get to the population mean.
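Single draws like the ones above are noisy, so another way to see the law in action is to sweep over increasing sample sizes; a rough sketch reusing pop_data from the block above (the exact numbers will vary from run to run):

# Sample means for increasingly large sample sizes
for size in [5, 20, 100, 500, 2000]:
    print(size, np.random.choice(pop_data, size=size).mean())
# The printed means drift toward the population mean (~100) as size grows.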

The Central Limit Theorem

With a large enough sample size, the sampling distribution of the mean will be normally distributed.

There are a few well-known statistics this applies to (but not all):

  1. Sample means
  2. Sample proportions
  3. Difference in sample means
  4. Difference in sample proportions

Central Limit Theorem with Python

# First we'll look at what our data's graph looks like
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)
pop_data = np.random.gamma(1, 100, 3000)
plt.hist(pop_data);

# That doesn't look normally distributed.
# Let's try a sample size of 3, and simulate that 10000 times
means_size_3 = []
for x in range(10000):
    mean = np.random.choice(pop_data, size=3).mean()
    means_size_3.append(mean)
means_size_3 = np.array(means_size_3)
plt.hist(means_size_3);

Not quite a normal distribution. The central limit theorem hasn’t kicked in yet.

# Let's try a sample size of 100
means = []
for x in range(10000):
    mean = np.random.choice(pop_data, size=100).mean()
    means.append(mean)
means = np.array(means)
plt.hist(means);

Looks like a sample size of 100 is big enough to show the CLT for this data set.
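To see how well the theorem holds here, we can compare the simulated means against what the CLT predicts: a distribution centered at the population mean with a spread of σ/√n, the standard error. A sketch continuing from the block above:

# CLT prediction for sample means of size 100:
# centered at the population mean, with spread equal to the standard error sigma/sqrt(n)
pop_data.mean(), pop_data.std() / np.sqrt(100)

# Center and spread of the simulated sampling distribution;
# these should come out close to the values above
means.mean(), means.std()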

Confidence Intervals

Confidence intervals are ‘intervals’ that we can build to estimate, with a certain degree of confidence, where a parameter lies. In the real world, working with an entire population’s data can be slow and expensive, but we can use sampling distributions to estimate what a population parameter most probably is. The general process is as follows:

  • Get your data
  • Figure out what you want to estimate (ex. average height in the USA)
  • Bootstrap that parameter
  • Create confidence intervals.

Confidence Interval Width: The distance between the upper and lower bounds of the confidence interval

Margin of Error: Confidence Interval Width / 2.

You will often see examples that say something like this:

Group A scored 50 (+/- 3) points.

The +/- is the margin of error.
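In code, the width and margin of error are just simple arithmetic on the interval bounds; a tiny sketch using the made-up “50 (+/- 3)” example:

# Hypothetical interval for "Group A scored 50 (+/- 3) points"
lower, upper = 47, 53
width = upper - lower        # confidence interval width: 6
margin_of_error = width / 2  # margin of error: 3.0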

Note: To create confidence intervals, you need to “cut off” parts of the sampling distribution at two points. For example, for a 95% confidence interval, you “cut off” 2.5% on the right and 2.5% on the left; for 99%, you “cut off” 0.5% on each side. The example below might help.

Confidence Intervals with Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

coffee_full = pd.read_csv('somerandomcoffeedataset.csv')

# This sample is the only data you might actually get or be able to use in the real world
coffee_red = coffee_full.sample(200)

# Creating a sampling distribution for the average height of non-coffee drinkers
means = []
for x in range(10000):
    boot = coffee_red.sample(200, replace=True)
    no_coffee = boot[boot["drinks_coffee"] == False]
    means.append(no_coffee["height"].mean())
plt.hist(means);

# Creating a 95% confidence interval
np.percentile(means, 2.5), np.percentile(means, 97.5)
> (66.002975632695495, 67.583393357864637)

# This result says that the population average height of non-coffee drinkers
# is probably between 66.00 and 67.58. If we check the population average:
coffee_full[coffee_full["drinks_coffee"] == False]["height"].mean()
> 66.443407762147004

# 66.44 is indeed inside that interval, which means we correctly estimated
# that the population average falls between those two numbers.
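Following the note above about cutting off 0.5% on each side, a 99% interval from the same bootstrapped means would look like this (a sketch continuing from the block above; the exact numbers depend on the data):

# Creating a 99% confidence interval from the same bootstrapped means
np.percentile(means, 0.5), np.percentile(means, 99.5)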

Up next is Hypothesis Testing, A/B Testing, and Regression Analysis.

Resources for Further Study