# Practical Statistics with Python (1): Distributions, the Central Limit Theorem, and Confidence Intervals

*I wrote this series of articles to further solidify my understanding of these concepts on my path to learning Data Analytics. It is not meant for everyone, but I’m making it public anyway so that I can be corrected if necessary and learn from it. The concepts presented here are concise and direct. Since this is written from a Data Analytics perspective, you won’t find most of the actual functions/equations behind the concepts here, and some prerequisite knowledge of Statistics, Python, and related concepts is assumed; think of it as a simple cheat sheet.*

A common distribution that we work with is the **Normal** or **Gaussian distribution**, commonly manifested as the bell curve. Specifically, this distribution has:

- mean = median = mode
- Symmetry about the center
- Fairly equally distributed values on both sides of the mean
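These properties are easy to verify empirically. As a minimal sketch (my own example, using NumPy's random generator), drawing many values from a normal distribution shows that the mean and median nearly coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# For a symmetric bell curve, the mean and median nearly coincide.
print(np.mean(samples), np.median(samples))
```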

**Notation**

- Mean: μ
- Std. Deviation: σ
- Variance: σ²
- Regression Coefficient: β

**Sampling Distributions**

Sampling distributions describe the distribution of a specific statistic. They let us draw conclusions about a population using that statistic. For an excellent, basic example that ‘cracked the code’ for me, check out Khan Academy’s video:

**Sampling distributions | Statistics and probability | Math | Khan Academy** (www.khanacademy.org)

*A sampling distribution shows every possible result a statistic can take in every possible sample from a population and…*

**Sampling Distributions with Python**

*Bootstrapping* is a method of performing sampling in which samples are drawn for experimentation and then put back into the dataset so they can be picked again: sampling with replacement.

```python
import numpy as np

students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])

# Perform some sampling for a specific statistic, in this case, the mean.
sample_props = []
for _ in range(10000):
    # Notice we are using "replace=True" to put samples back.
    sample = np.random.choice(students, 5, replace=True)
    sample_props.append(sample.mean())

# Do something with sample_props
...
```

**Law of Large Numbers**

The larger our sample size, the closer our statistic gets to our parameter.

That is, as the sample size grows, the sample statistic gets closer and closer to a certain value: the parameter. If this is hard to picture, see the real-world examples presented here and here.

**Law of Large Numbers with Python**

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# This is just setting up some random data in pop_data.
# The functionality of the gamma distribution is not relevant.
pop_data = np.random.gamma(1, 100, 3000)
plt.hist(pop_data);

# Population mean
pop_data.mean()
# > 100.35978700795846

# Simulate 5, 20 and 100 draws
np.random.choice(pop_data, size=5).mean()
# > 179.26530134867892

np.random.choice(pop_data, size=20).mean()
# > 136.39781572551306

np.random.choice(pop_data, size=100).mean()
# > 93.421142691106638

# As you can see, the larger the sample size, the closer we get to the population mean.
```

**The Central Limit Theorem**

With a large enough sample size, the sampling distribution of the mean will be normally distributed.

There are a few well-known statistics this applies to (but not all):

- Sample means
- Sample proportions
- Difference in sample means
- Difference in sample proportions
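The code in the next section demonstrates this for sample means. As a complementary sketch (my own example, not from the original), the same normality shows up for sample proportions of a binary population:

```python
import numpy as np

np.random.seed(42)
# A binary population: 1 = success, 0 = failure, with a true proportion near 0.3.
population = np.random.binomial(1, 0.3, 10000)

props = []
for _ in range(10000):
    sample = np.random.choice(population, size=100, replace=True)
    props.append(sample.mean())  # the mean of 0/1 data is a proportion
props = np.array(props)

# The sampling distribution of the proportion is roughly bell-shaped,
# centered near the population proportion.
print(props.mean(), props.std())
```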

**Central Limit Theorem with Python**

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# First we'll look at what our data's graph looks like
np.random.seed(42)
pop_data = np.random.gamma(1, 100, 3000)
plt.hist(pop_data);

# That doesn't look normally distributed.
# Let's try a sample size of 3, and simulate that 10000 times.
means_size_3 = []
for _ in range(10000):
    means_size_3.append(np.random.choice(pop_data, size=3).mean())
means_size_3 = np.array(means_size_3)
plt.hist(means_size_3);

# Let's try a sample size of 100
means = []
for _ in range(10000):
    means.append(np.random.choice(pop_data, size=100).mean())
means = np.array(means)
plt.hist(means);
```

**Confidence Intervals**

Confidence intervals are ‘intervals’ that we can construct to estimate, with a certain degree of confidence, where a parameter lies. In the real world, working with an entire population’s data can be slow and heavy, but we can use sampling distributions to estimate what a population parameter most probably is. The general process is as follows:

- Get your data
- Figure out what you want to estimate (ex. average height in the USA)
- Bootstrap that parameter
- Create confidence intervals.

**Confidence Interval Width**: The distance between the upper and lower bounds of the confidence interval.

**Margin of Error**: Confidence Interval Width / 2.

You will often see examples that say something like this:

*Group A scored 50 (+/- 3) points.*

The +/- is the margin of error.
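This arithmetic can be sketched in a few lines (the interval bounds here are hypothetical, chosen to match the example above):

```python
lower, upper = 47.0, 53.0  # hypothetical confidence interval bounds

width = upper - lower          # confidence interval width
margin_of_error = width / 2    # the "+/-" part
center = lower + margin_of_error

print(f"{center} (+/- {margin_of_error})")  # 50.0 (+/- 3.0)
```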

Note: To create confidence intervals, you need to “cut off” parts of the graph at two points. For example, for a 95% confidence interval, you “cut off” 2.5% on the right and 2.5% on the left. For 99%, you “cut off” 0.5% on each side. The below example might help.

**Confidence Intervals with Python**

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

coffee_full = pd.read_csv('somerandomcoffeedataset.csv')

# This is the only data you might actually get or be able to use in the real world.
coffee_red = coffee_full.sample(200)

# Creating a sampling distribution for the average height of non-coffee drinkers
means = []
for _ in range(10000):
    boot = coffee_red.sample(200, replace=True)
    no_coffee = boot[boot["drinks_coffee"] == False]
    means.append(no_coffee["height"].mean())

plt.hist(means);

# Creating a 95% confidence interval
np.percentile(means, 2.5), np.percentile(means, 97.5)
# > (66.002975632695495, 67.583393357864637)

# That result says the population average height of non-coffee drinkers
# is probably between 66.00 and 67.58.

# If we check the population average:
coffee_full[coffee_full["drinks_coffee"] == False]["height"].mean()
# > 66.443407762147004

# 66.44 is indeed inside that interval, so we correctly estimated that the
# population average lies between those two bounds.
```

Up next is Hypothesis Testing, A/B Testing, and Regression Analysis.

**Resources for Further Study**