Descriptive Statistics Final Project — Python
Overview
Welcome to the Descriptive Statistics Final Project! In this project, you will demonstrate what you have learned in this course by conducting an experiment dealing with drawing from a deck of playing cards and creating a writeup containing your findings.
Be sure to check through the project rubric to self-assess and share with others who will give you feedback.
Questions for Investigation
This experiment will require the use of a standard deck of playing cards. This is a deck of fifty-two cards divided into four suits (spades (♠), hearts (♥), diamonds (♦), and clubs (♣)), each suit containing thirteen cards (Ace, numbers 2–10, and face cards Jack, Queen, and King). You can use either a physical deck of cards for this experiment or you may use a virtual deck of cards such as that found on random.org (http://www.random.org/playing-cards/).
For the purposes of this task, assign each card a value: The Ace takes a value of 1, numbered cards take the value printed on the card, and the Jack, Queen, and King each take a value of 10.
1. First, create a histogram depicting the relative frequencies of the card values.
2. Now, we will get samples for a new distribution. To obtain a single sample, shuffle your deck of cards and draw three cards from it. (You will be sampling from the deck without replacement.) Record the cards that you have drawn and the sum of the three cards’ values. Replace the drawn cards back into the deck and repeat this sampling procedure a total of at least thirty times.
3. Let’s take a look at the distribution of the card sums. Report descriptive statistics for the samples you have drawn. Include at least two measures of central tendency and two measures of variability.
4. Create a histogram of the sampled card sums you have recorded. Compare its shape to that of the original distribution. How are they different, and can you explain why this is the case?
5. Make some estimates about values you will get on future draws. Within what range will you expect approximately 90% of your draw values to fall? What is the approximate probability that you will get a draw value of at least 20? Make sure you justify how you obtained your values.
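The sampling procedure in step 2 can be sketched with the standard library alone. This is a minimal sketch rather than the approach used in the investigation below: the function name draw_sums, the fixed seed, and the default of 30 draws are assumptions made for illustration.

```python
import random

# Card values for one suit: Ace = 1, 2-10 at face value, Jack/Queen/King = 10
SUIT_VALUES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
DECK_VALUES = SUIT_VALUES * 4  # four suits, 52 cards


def draw_sums(n_draws=30, seed=0):
    """Return the sum of three card values for each of n_draws draws."""
    rng = random.Random(seed)  # hypothetical fixed seed, only for repeatability
    sums = []
    for _ in range(n_draws):
        # sample() draws without replacement within one hand; the full deck
        # is used again on the next iteration, like putting the cards back
        hand = rng.sample(DECK_VALUES, 3)
        sums.append(sum(hand))
    return sums


sums = draw_sums()
print(len(sums), min(sums), max(sums))
```

Every sum lies between 3 (three Aces) and 30 (three ten-valued cards), which is a quick sanity check on any recorded sample.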
Investigation
1. Histogram depicting the relative frequencies of the card values.
Deck_Cards.csv contains a deck of fifty-two cards divided into four suits, with a value assigned to each card: the Ace takes a value of 1, numbered cards take the value printed on the card, and the Jack, Queen, and King each take a value of 10.
Below is the format:
cards,suits,value
A,S,1
2,S,2
3,S,3
4,S,4
5,S,5
6,S,6
7,S,7
8,S,8
9,S,9
10,S,10
J,S,10
Q,S,10
K,S,10
A,H,1
2,H,2
3,H,3
4,H,4
5,H,5
6,H,6
7,H,7
8,H,8
9,H,9
10,H,10
J,H,10
Q,H,10
K,H,10
A,D,1
2,D,2
3,D,3
4,D,4
5,D,5
6,D,6
7,D,7
8,D,8
9,D,9
10,D,10
J,D,10
Q,D,10
K,D,10
A,C,1
2,C,2
3,C,3
4,C,4
5,C,5
6,C,6
7,C,7
8,C,8
9,C,9
10,C,10
J,C,10
Q,C,10
K,C,10
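The listing above follows a regular pattern, so it could be generated instead of typed by hand. The sketch below is one possible way to do that; the helper names card_value, deck_rows, and deck_csv are hypothetical and are not part of the original project.

```python
import csv
import io

CARDS = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
SUITS = ['S', 'H', 'D', 'C']  # same suit order as the listing above


def card_value(card):
    """Ace counts as 1, face cards as 10, number cards as printed."""
    if card == 'A':
        return 1
    if card in ('J', 'Q', 'K'):
        return 10
    return int(card)


def deck_rows():
    """One (card, suit, value) tuple per card, suit by suit."""
    return [(card, suit, card_value(card)) for suit in SUITS for card in CARDS]


def deck_csv():
    """Return the deck as CSV text with the header cards,suits,value."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(['cards', 'suits', 'value'])
    writer.writerows(deck_rows())
    return buf.getvalue()


print(deck_csv().splitlines()[:3])
```

Writing the returned text to a file named Deck_Cards.csv would give the input expected by the loading step.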
Load Data
import pandas as pd
import numpy as np

# Load the deck; the third column (index 2) holds the card values
df = pd.read_csv("/Users/....../Documents/Deck_Cards.csv", header='infer')
print(df.head(3))
print(df.iloc[:, 2].describe())
Output:
cards suits value
0 A S 1
1 2 S 2
2 3 S 3
count 52.000000
mean 6.538462
std 3.183669
min 1.000000
25% 4.000000
50% 7.000000
75% 10.000000
max 10.000000
Name: value, dtype: float64
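The summary above can be cross-checked without pandas. The sketch below rebuilds the 52 card values and uses the standard statistics module; it assumes that the std reported by describe() is the sample standard deviation (dividing by n - 1), which is what statistics.stdev computes.

```python
import statistics

# 52 card values: each suit contributes 1..10 plus three face cards worth 10
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10] * 4

mean = statistics.mean(values)
median = statistics.median(values)
sample_sd = statistics.stdev(values)  # sample SD, divides by n - 1

print(round(mean, 6), median, round(sample_sd, 6))
```

The mean works out to 340/52, since the values 1 to 9 sum to 45 per suit and the four ten-valued cards add 40 more.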
Plot Histogram
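import matplotlib.pyplot as plt

fig = plt.figure()
# Top panel: histogram with one bin per card value (bins centered on 1..10).
# The y-axis shows counts; dividing each count by 52 gives relative frequencies.
ax = fig.add_subplot(211)
ax.hist(df['value'], bins=10, range=[0.5, 10.5], facecolor='g', align='mid')
ax.xaxis.set_ticks(np.arange(0, 12, 1))
plt.title('Relative Frequencies')
plt.xlabel('Values')
plt.ylabel('Count')
plt.grid(True)
# Bottom panel: box plot of the same values
ay = fig.add_subplot(212)
ay.boxplot(df['value'])
plt.show()
Output: (figure: histogram of the card values above a box plot of the same values)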
Show Frequency Table
freq_table = df.groupby(['value'])
print(freq_table.size())
Output:
value
1 4
2 4
3 4
4 4
5 4
6 4
7 4
8 4
9 4
10 16
dtype: int64
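The counts in the table above become relative frequencies when divided by the deck size. A minimal sketch using collections.Counter on a rebuilt list of card values (the variable names are illustrative):

```python
from collections import Counter

# 52 card values: each suit contributes 1..10 plus three face cards worth 10
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10] * 4

counts = Counter(values)
total = len(values)

# Relative frequency = count of a value / number of cards in the deck
rel_freq = {v: counts[v] / total for v in sorted(counts)}

for v, f in rel_freq.items():
    print(v, round(f, 4))
```

Values 1 through 9 each account for 4/52 of the deck, while the value 10 accounts for 16/52, which is why the histogram is flat with a single tall bar at 10.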
2. Sampling and sampling distribution for the sum and average of the three cards’ values
from random import sample

c1 = []
c2 = []
c3 = []
sum_s = []
average_s = []

# Draw 1000 samples of three cards each, without replacement within a sample
for i in range(1000):
    rindex = np.array(sample(range(len(df)), 3))
    dfr = df.iloc[rindex]
    c1.append(dfr.iloc[0, 2])
    c2.append(dfr.iloc[1, 2])
    c3.append(dfr.iloc[2, 2])
    sum_s.append(dfr.iloc[0, 2] + dfr.iloc[1, 2] + dfr.iloc[2, 2])
    average_s.append((dfr.iloc[0, 2] + dfr.iloc[1, 2] + dfr.iloc[2, 2]) / 3.0)

sampling_df = pd.DataFrame({'card1': c1, 'card2': c2, 'card3': c3, 'sum_col': sum_s, 'average_col': average_s})
print(sampling_df.head(3))
Output:
average_col card1 card2 card3 sum_col
0 7.000000 7 10 4 21
1 5.333333 4 10 2 16
2 7.000000 5 7 9 21
3. Distribution of the card sums. Descriptive statistics for the samples drawn (at least two measures of central tendency and two measures of variability).
print(sampling_df['sum_col'].describe())
Output:
count 1000.000000
mean 19.646000
std 5.328703
min 5.000000
25% 16.000000
50% 20.000000
75% 23.000000
max 30.000000
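Name: sum_col, dtype: float64
The mean (19.646) and the median (20) are the measures of central tendency; the standard deviation (5.328703) and the interquartile range (23 - 16 = 7) are the measures of variability.
4. Histogram of the sampled card sums and comparison of its shape to the original distribution
fig = plt.figure()
# Top panel: one bin per possible integer sum (3 to 30)
ay = fig.add_subplot(211)
ay.hist(sampling_df['sum_col'], bins=28, range=[2.5, 30.5], facecolor='g', align='mid')
#ay.xaxis.set_ticks(np.arange(0, 40, 5))
plt.title('Sampling Distribution - SUM')
plt.xlabel('Values(SUM)')
plt.ylabel('Count')
# Bottom panel: box plot of the sampled sums
ay = fig.add_subplot(212)
ay.boxplot(sampling_df['sum_col'])
plt.show()
Output: (figure: histogram of the sampled card sums above a box plot of the same sums)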
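The shape of the sampled sums can also be compared with the exact distribution of a three-card sum. Because every three-card hand is equally likely, enumerating all hands gives that distribution directly. The sketch below is an illustrative check and not part of the original analysis; the text histogram scale (one # per 50 hands) is an arbitrary choice.

```python
import math
from collections import Counter
from itertools import combinations

# 52 card values: each suit contributes 1..10 plus three face cards worth 10
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10] * 4

# combinations() treats each list position as a distinct card, so this counts
# every unordered hand of three cards exactly once
sum_counts = Counter(sum(hand) for hand in combinations(values, 3))
total = sum(sum_counts.values())

exact_mean = sum(s * c for s, c in sum_counts.items()) / total
exact_var = sum(c * (s - exact_mean) ** 2 for s, c in sum_counts.items()) / total
exact_sd = math.sqrt(exact_var)

print(total, round(exact_mean, 3), round(exact_sd, 3))

# Crude text histogram of the exact distribution
for s in sorted(sum_counts):
    print('{:2d} {}'.format(s, '#' * (sum_counts[s] // 50)))
```

The exact mean is three times the mean card value, so a sample mean near 19.6 is what the deck predicts.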
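The sampled card sums are roughly bell-shaped and close to a normal distribution, whereas the original distribution of card values is flat from 1 to 9 with a spike at 10. This is in accordance with the Central Limit Theorem: sums of several draws tend toward a normal shape even when the underlying distribution is not normal. With only three cards per draw, the match is approximate.
5. Descriptive statistics and histogram of the sampling distribution for the average of the three cards’ values
print(sampling_df['average_col'].describe())
Output:
count 1000.000000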
mean 6.548667
std 1.776234
min 1.666667
25% 5.333333
50% 6.666667
75% 7.666667
max 10.000000
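Name: average_col, dtype: float64
fig = plt.figure()
# Top panel: averages are multiples of 1/3 between 1 and 10, so use bins of
# width 1/3 centered on those values
ay = fig.add_subplot(211)
ay.hist(sampling_df['average_col'], bins=28, range=[1 - 1 / 6.0, 10 + 1 / 6.0], facecolor='g', align='mid')
#ay.xaxis.set_ticks(np.arange(0, 40, 5))
plt.title('Sampling Distribution - AVG')
plt.xlabel('Values(AVG)')
plt.ylabel('Count')
# Bottom panel: box plot of the sampled averages
ay = fig.add_subplot(212)
ay.boxplot(sampling_df['average_col'])
plt.show()
Output: (figure: histogram of the sampled card averages above a box plot of the same averages)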
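The standard deviation of the averages reported above (1.776234) can be checked against what the deck itself predicts. This is a minimal sketch under two assumptions that are not in the original write-up: the population SD is computed by dividing by N, and a finite population correction is applied because the three cards are drawn without replacement. The variable names are illustrative.

```python
import math

# 52 card values: each suit contributes 1..10 plus three face cards worth 10
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10] * 4
N = len(values)  # population size
n = 3            # cards per draw

mean = sum(values) / N
pop_sd = math.sqrt(sum((v - mean) ** 2 for v in values) / N)

# Standard error of the mean if the three cards were independent draws
se_independent = pop_sd / math.sqrt(n)

# Finite population correction for drawing without replacement
fpc = math.sqrt((N - n) / (N - 1))
se_without_replacement = se_independent * fpc

print(round(se_independent, 4), round(se_without_replacement, 4))
```

Both predictions sit close to the observed 1.776234, with the corrected value slightly smaller because cards within a hand cannot repeat.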
The sampling distribution for the average of the three cards’ values is also approximately normal, in accordance with the Central Limit Theorem.
The Central Limit Theorem also predicts that the population SD divided by the SD of the sampling distribution for the average is the square root of the sample size:
3.183669/1.776234 = 1.7923702620262871 ≈ √3 (about 1.732)
6. Within what range will you expect approximately 90% of your draw values to fall? What is the approximate probability that you will get a draw value of at least 20? (For the sampled card sums distribution)
Under a normal approximation, 90% of values fall between the points corresponding to 5% and 95%.
import scipy.stats as sp

# Mean and standard deviation of the sampled card sums.
# Note: these differ slightly from the describe() output shown earlier
# (19.646 and 5.328703), presumably because they come from a different run
# of the random sampling.
mean_sum_distribution = 19.481000
std_sum_distribution = 5.435364
p1 = sp.norm.ppf(0.05) * std_sum_distribution + mean_sum_distribution
p2 = sp.norm.ppf(0.95) * std_sum_distribution + mean_sum_distribution
print('It is expected approximately 90% of the draw values will fall between {} and {}'.format(p1, p2))
Output:
It is expected approximately 90% of the draw values will fall between 10.5406218108 and 28.4213781892
Manual calculation using the Z table (for the sampled card sums distribution):
Z at 95% is 1.64
Z at 5% is -1.65
90% lies between the 5% and 95% points
Mean = 19.481
Standard deviation = 5.435364
Converting Z values to X:
1.64*5.435364+19.481 = 28.39499696 ≈ 28.4
-1.65*5.435364+19.481 = 10.512649400000003 ≈ 10.51
The approximate probability of a draw value of at least 20 is one minus the probability of a draw value of at most 20:
# z-score of 20: (X - mean) / standard deviation
z_20 = (20 - mean_sum_distribution) / std_sum_distribution
p_20_atmost = sp.norm.cdf(z_20)
print(z_20)
print('The approximate probability that we get a draw value of at least 20 is {}'.format(1 - p_20_atmost))
Output:
0.0954857853126
The approximate probability that we get a draw value of at least 20 is 0.461964490173
Manual calculation using the Z table (for the sampled card sums distribution):
Z-score = (X-Mu)/Sigma
Z-score for 20 = (20-19.481)/5.435364 = 0.09548578531263009
The Z table gives 0.5359 for Z = 0.09, which is the probability to the left of 20.
To get at least 20: 1 - 0.5359 = 0.46409999999999996 ≈ 0.464
References:
Descriptive Statistics Final Project - https://docs.google.com/document/d/1059JMJ9C5dn7vKUrmfWYle57Ai3Uk9PzxPQBGj5drjE/pub?embedded=true
Python For Data Science Cheat Sheet - https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf
R-bloggers - https://www.r-bloggers.com/descriptive-statistics-final-project-with-python-r/
Cheat Sheet for Exploratory Data Analysis in Python - https://www.analyticsvidhya.com/blog/2015/06/infographic-cheat-sheet-data-exploration-python/
Z Table - https://s3.amazonaws.com/udacity-hosted-downloads/ZTable.jpg
Cheat sheet: Data Visualisation in Python - https://www.analyticsvidhya.com/blog/2015/06/data-visualization-in-python-cheat-sheet/