Practical Statistics with Python: Distributions, the Central Limit Theorem and Confidence Intervals

A quick intro to working with statistics from a data analysis perspective.

Dhruv B
5 min read · Jan 1, 2019

I wrote this set of articles (Practical Statistics with Python) to further solidify my understanding of these concepts on my path to learning Data Analytics. It isn't meant for everyone, but I'm making it public anyway so I can be corrected where necessary and learn from it. Concepts presented here are concise and direct. Since this is written from a Data Analytics perspective, you won't find most of the actual functions/equations behind the concepts here, and the article assumes some prerequisite knowledge of Statistics, Python and related concepts. Think of it as a simple cheat sheet.

The Normal Distribution

A common distribution that we work with is the Normal or Gaussian distribution, commonly visualized as the bell curve. Specifically, this distribution has:

  • mean = median = mode (verified in the sketch below)
  • Symmetry about the center
  • Fairly equally distributed values on both sides of the mean

[Figure: Normal distribution for a coffee-drinking user data set]
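
As a quick check, here is a minimal sketch showing that the mean and median of normally distributed data coincide. The mean of 66 and standard deviation of 3 are made-up values for illustration:

import numpy as np

# Draw 100000 samples from a hypothetical normal distribution.
# The mean (66) and std. deviation (3) are made up for illustration.
data = np.random.normal(66, 3, 100000)

data.mean()      # ~66
np.median(data)  # ~66, matching the mean as expected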

Common Notation

  • Mean: μ
  • Std. Deviation: σ
  • Variance: σ²
  • Regression Coefficient: β
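
For reference, here is a minimal sketch (with made-up height data) of how the first three map to NumPy calls:

import numpy as np

heights = np.array([64.0, 66.5, 67.2, 65.8, 70.1])  # made-up data

heights.mean()  # mean (μ)
heights.std()   # standard deviation (σ)
heights.var()   # variance (σ²)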

Sampling Distributions

A sampling distribution is the distribution of a specific statistic, computed across many samples drawn from the full data set. That is, you repeatedly take a subset (sample) of the data, calculate a statistic like the average, variance or skew for each one, and explore how those values are distributed.

Sampling distributions help us draw conclusions about a population using a statistic. For an excellent, basic example that 'cracked the code' for me, check out Khan Academy's video on the topic.

Sampling Distributions with Python

Bootstrapping is a method of sampling in which each observation drawn is put back into the data set before the next draw, so it can be picked again. This is also described as "sampling with replacement".

import numpy as np

# Binary data for 21 students (1/0 responses).
students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])

# Perform some sampling for a specific statistic, in this case, the mean.
sample_props = []
for _ in range(10000):
    # Notice we are using "replace=True" to put samples back.
    sample = np.random.choice(students, 5, replace=True)
    sample_props.append(sample.mean())

# Do something with sample_props
...

Law of Large Numbers

The larger our sample size, the closer our statistic gets to our parameter.

That is, the larger the sample we draw, the closer its statistic gets to a certain value: the parameter. If this is hard to picture, see the real-world examples presented here and here. In addition, the example below might help.

Law of Large Numbers with Python

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# This is just setting up some random data in pop_data.
# The functionality of the gamma distribution is not relevant.
pop_data = np.random.gamma(1, 100, 3000)
plt.hist(pop_data);

# Population mean, used as a baseline.
pop_data.mean()
> 100.35978700795846

# Simulate draws (samples) of size 5, 20 and 100.
np.random.choice(pop_data, size=5).mean()
> 179.26530134867892
np.random.choice(pop_data, size=20).mean()
> 136.39781572551306
np.random.choice(pop_data, size=100).mean()
> 93.421142691106638

# As you can see, the larger the sample size, the closer the
# sample mean tends to get to the population mean.
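
One-off draws can be noisy, so the trend is easier to see if we average the error over many repeated draws per sample size. A minimal sketch (the repeat count of 1000 is an arbitrary choice):

# The average distance from the population mean shrinks as sample size grows.
for size in (5, 20, 100, 1000):
    draws = [np.random.choice(pop_data, size=size).mean() for _ in range(1000)]
    print(size, np.mean(np.abs(np.array(draws) - pop_data.mean())))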

The Central Limit Theorem

With a large enough sample size, the sampling distribution of the mean will be normally distributed.

This applies to a few well-known statistics (but not all of them):

  1. Sample means
  2. Sample proportions
  3. Difference in sample means
  4. Difference in sample proportions

Central Limit Theorem with Python

# First we'll look at what our data's graph looks like.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

pop_data = np.random.gamma(1, 100, 3000)
plt.hist(pop_data);

# That doesn't look normally distributed. Let's try a sample size
# of 3, and simulate that 10000 times.
means_size_3 = []
for x in range(10000):
    mean = np.random.choice(pop_data, size=3).mean()
    means_size_3.append(mean)
means_size_3 = np.array(means_size_3)
plt.hist(means_size_3);
Not quite a normal distribution. The central limit theorem hasn't kicked in yet. Let's try another sample, perhaps a bigger one?

# Let's try a sample size of 100, again simulated 10000 times.
means = []
for x in range(10000):
    mean = np.random.choice(pop_data, size=100).mean()
    means.append(mean)
means = np.array(means)
plt.hist(means);

Looks like a sample size of 100 (simulated 10000 times) is big enough to show the CLT for this data set.
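
The CLT applies to sample proportions as well. As a quick sketch, here is the same experiment run on the binary students array from earlier (the sample size of 100 is an arbitrary choice); the bootstrapped proportions also pile up into a bell shape:

students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])

# The mean of 0/1 data is the proportion of 1s in the sample.
props = []
for _ in range(10000):
    sample = np.random.choice(students, size=100, replace=True)
    props.append(sample.mean())
plt.hist(np.array(props));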

Confidence Intervals

Confidence intervals are 'intervals' that we can build to estimate, with a certain degree of confidence, where a parameter of interest lies. In the real world, working with an entire population's data can be slow and heavy, but we can use sampling distributions to estimate what a population parameter most probably is. The general process is as follows:

  • Get your data
  • Figure out what you want to estimate (e.g. average height in the USA)
  • Bootstrap a sampling distribution for that statistic
  • Create confidence intervals

Confidence Interval Width: The distance between the upper and lower bounds of the confidence interval.

Margin of Error: Confidence Interval Width / 2.

You will often see examples that say something like this:

Group A will probably score 50 (+/- 3) points.

The +/- is the margin of error.
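
In code, both definitions fall straight out of the interval bounds. A minimal sketch, using made-up bounds matching the example above:

# Hypothetical bounds for the "50 (+/- 3)" example above.
lower, upper = 47.0, 53.0

width = upper - lower        # confidence interval width: 6.0
margin_of_error = width / 2  # margin of error: 3.0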

Note: To create confidence intervals, you need to “cut off” parts of the graph at two points. For example, for a 95% confidence interval, you “cut off” 2.5% on the right and 2.5% on the left. For 99%, you “cut off” 0.5% on each side. The below example might help.

Confidence Intervals with Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

coffee_full = pd.read_csv('somerandomcoffeedataset.csv')

# In the real world, a sample like this might be the only data
# you actually get or are able to use.
coffee_red = coffee_full.sample(200)

# Creating a sampling distribution (by bootstrapping) for the
# average height of non-coffee drinkers.
means = []
for x in range(10000):
    boot = coffee_red.sample(200, replace=True)
    no_coffee = boot[boot["drinks_coffee"] == False]
    means.append(no_coffee["height"].mean())
plt.hist(means);

# Creating a 95% confidence interval.
np.percentile(means, 2.5), np.percentile(means, 97.5)
> (66.002975632695495, 67.583393357864637)

# That result says the population average height of non-coffee
# drinkers is probably between 66.00 and 67.58.

# If we check the actual population average:
coffee_full[coffee_full["drinks_coffee"] == False]["height"].mean()
> 66.443407762147004

# 66.44 is indeed within that interval, which means we correctly
# estimated the range the population average falls in.
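
Per the note above, only the percentile cutoffs change for other confidence levels. For example, a 99% interval over the same bootstrapped means:

# For 99% confidence, cut off 0.5% on each side.
np.percentile(means, 0.5), np.percentile(means, 99.5)

# The higher the confidence level, the wider the interval.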

See also: Hypothesis Testing. I will later write articles on A/B Testing and basic Regression Analysis.
