Brief Introduction to Statistical Analysis using Python

Ayşe Bat
My Data Science Journey
Mar 8, 2019 · 8 min read

Statistical thinking is essential for data analysis. In this article I will cover the core concepts of statistical analysis.

Graphical Exploratory Data Analysis

  • Exploratory Data Analysis (EDA)
  • Empirical cumulative distribution functions (ECDF)

Quantitative Exploratory Data Analysis

  • Mean
  • Median
  • Variance and standard deviation
  • Covariance and the Pearson correlation coefficient

Graphical Exploratory Data Analysis

Exploring your data (EDA) is a crucial step in your analysis.

  • Organizing
  • Plotting
  • Computing a few numerical summaries

For EDA, we are going to explore the 2008 US swing-state election results, which are at the county level, in each of three major swing states: Pennsylvania, Ohio, and Florida.

We take the Democratic share (dem_share) of the vote in the counties of all three swing states and plot it as a histogram. We are interested in the fraction of the vote that went to Barack Obama in each county.

import matplotlib.pyplot as plt
import numpy as np

# We can specify where the edges of the histogram bars are using the bins keyword
# bins=20 creates 20 evenly spaced bins
_ = plt.hist(swing_state['dem_share'], bins=20)
# Labeling the plot
_ = plt.xlabel('Percent of vote for Obama')
_ = plt.ylabel('Number of counties')
plt.show()

With a bee swarm plot, we can plot the vote shares in the three swing states. Each point in the plot represents the share of the vote Obama got in a single county. The y-axis gives the quantitative information.

import seaborn as sns

sns.swarmplot(x='state', y='dem_share', data=swing_state)
plt.xlabel('state')
plt.ylabel('percent of vote for Obama')
plt.show()
  • We can clearly see that Obama got less than 50% of the vote in the majority of counties in each of the three swing states.

Empirical cumulative distribution functions (ECDF): We are going to plot the ECDF of the percentage of the vote for Obama. The x-axis is the percent of the vote for Obama. The y-axis is the fraction of data points that have a value smaller than the corresponding x-value.

# We can use np.sort() to generate our x-data
x = np.sort(swing_state['dem_share'])
# The y-axis is evenly spaced data points with a maximum of 1
y = np.arange(1, len(x) + 1) / len(x)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('Percent of Vote for Obama')
plt.ylabel('ECDF')
# Keep the data off the plot edges
plt.margins(0.02)
plt.show()

75% of counties in the swing states had 50% or less of their vote go to Obama.
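The ECDF construction above can be wrapped in a small reusable helper. This is just a sketch: the `ecdf` function name and the toy data are mine, not from the original post.

```python
import numpy as np

def ecdf(data):
    """Compute x and y values for plotting an empirical CDF."""
    x = np.sort(data)                      # sorted data on the x-axis
    y = np.arange(1, len(x) + 1) / len(x)  # cumulative fraction on the y-axis
    return x, y

# Example: the ECDF of five values
x, y = ecdf([10, 20, 30, 40, 50])
print(y)  # [0.2 0.4 0.6 0.8 1. ]
```

With a helper like this, the same two lines do not need to be repeated before every ECDF plot.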

Quantitative Exploratory Data Analysis

Mean: If you have a sample of n values, xᵢ, the mean, μ, is the sum of the values divided by the number of values: μ = (1/n) Σᵢ xᵢ

Outliers: A data point whose value is far greater or less than most of the rest of the data.

  • The mean is highly affected by outliers.

The median is the middle value of the dataset and is immune to extreme values. The median is the special name for the 50th percentile. Similarly, the 25th percentile is the value of the data point that is greater than 25% of the sorted data.

print('Mean: ', np.mean(swing_state['dem_share']))
print('Median: ',np.median(swing_state['dem_share']))
Mean: 43.76441441441444
Median: 43.185
np.percentile(swing_state['dem_share'], [25,50,75])
array([37.3025, 43.185 , 49.925 ])

The Box plot was invented by John Tukey to display some of the statistical features of the data set based on percentiles.

Here we look at the box plot for percent of the vote for Obama for different states.

  • The center line of the box shows the median (the 50th percentile).
  • The edges of the box show the 25th and 75th percentiles.
  • The total height of the box contains the middle 50% of the data and is called the interquartile range, or IQR.
  • The whiskers extend up to 1.5 times the IQR beyond the box, or to the extent of the data, whichever is smaller.
  • All points outside of the whiskers are called outliers.
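A box plot of the same quantity can be drawn with seaborn. This is a sketch: the small DataFrame below is a made-up stand-in for the county-level `swing_state` data used elsewhere in this article.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the swing_state DataFrame (real data is one row per county)
swing_state = pd.DataFrame({
    'state': ['PA', 'PA', 'PA', 'OH', 'OH', 'OH', 'FL', 'FL', 'FL'],
    'dem_share': [45.5, 60.1, 38.0, 40.2, 52.3, 44.8, 38.9, 49.0, 55.2],
})

# One box per state: median line, 25th/75th percentile box edges, whiskers
ax = sns.boxplot(x='state', y='dem_share', data=swing_state)
ax.set_xlabel('state')
ax.set_ylabel('percent of vote for Obama')
plt.show()
```

Replacing the toy DataFrame with the real `swing_state` data gives one box per swing state, showing exactly the percentile features listed above.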

Let’s plot the ECDF with the percentile points overlaid on it.

x = np.sort(swing_state['dem_share'])
y = np.arange(1, len(x) + 1) / len(x)
percentiles = np.array([2.5, 25, 50, 75, 97.5])
# Compute the dem_share values at these percentiles
per_vers_x = np.percentile(swing_state['dem_share'], percentiles)
_ = plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('percent of vote for Obama')
plt.ylabel('ECDF')
# Overlay the percentile points as red diamonds
_ = plt.plot(per_vers_x, percentiles/100, marker='D', color='red', linestyle='none')
plt.show()

Variance and standard deviation

Variance: describes the spread of the data. The variance is calculated as the average of the squared distances from the mean; in Python, with NumPy, we can calculate it with np.var().

xᵢ − μ is called the deviation from the mean, so the variance is the mean squared deviation. The square root of the variance, σ, is called the standard deviation. The standard deviation (σ) is a reasonable metric for the spread of the data and is calculated with np.std().

print('Variance: ', np.var(swing_state['dem_share']))
print('Standard deviation: ', np.std(swing_state['dem_share']))
Variance: 114.24649492735986
Standard deviation: 10.68861520157592
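The definition above, the average of the squared distances from the mean, can be verified directly against np.var(). A minimal sketch with made-up numbers:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Variance as the mean squared deviation from the mean
deviations = data - np.mean(data)
var_manual = np.mean(deviations ** 2)

# np.var() computes the same quantity
print(var_manual, np.var(data))  # both 2.0
```

Both values agree, since np.var() divides by n by default.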

Let’s plot the normal distribution (also known as the Gaussian distribution) to see how different standard deviations affect distributions with the same mean.

#Draw 100,000 samples from a Normal distribution
#First sample has a mean of 20 and a standard deviation of 1
sample_std1 = np.random.normal(20, 1, size=100000)
#Second has a standard deviation of 3
sample_std3 = np.random.normal(20, 3, size=100000)
#Third has a standard deviation of 10
sample_std10 = np.random.normal(20, 10, size=100000)
#Plot a histogram of each sample; for each, use 100 bins
_ = plt.hist(sample_std1, density=True, histtype='step', bins=100)
_ = plt.hist(sample_std3, density=True, histtype='step', bins=100)
_ = plt.hist(sample_std10, density=True, histtype='step', bins=100)
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
_ = plt.title("The normal distribution")
plt.ylim(-0.01, 0.42)
plt.show()

Covariance and the Pearson Correlation Coefficient

Covariance is a measure of the tendency of two variables to vary together. If we have two series, X and Y, their deviations from the mean are

dxᵢ = xᵢ − μx and dyᵢ = yᵢ − μy

where μx is the mean of X and μy is the mean of Y. If X and Y vary together, their deviations tend to have the same sign.

If we multiply them together, the product is positive when the deviations have the same sign and negative when they have the opposite sign. So adding up the products gives a measure of the tendency to vary together.

Covariance is the mean of these products:

Cov(X, Y) = (1/n) Σᵢ dxᵢ dyᵢ

where n is the length of the two series (they have to be the same length).
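That formula translates directly into NumPy. A sketch with made-up numbers; note that np.cov() divides by n−1 by default, so bias=True is needed to match the divide-by-n definition above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the mean
dx = x - np.mean(x)
dy = y - np.mean(y)

# Covariance as the mean of the products of deviations (divide by n)
cov_manual = np.mean(dx * dy)

# np.cov with bias=True also divides by n, so the values match
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_manual, cov_numpy)  # both 1.2
```

The positive value here reflects that x and y tend to be above (or below) their means together.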

We are going to make a scatter plot of the percent of votes for Obama versus the total number of votes in each county.

mean_of_dem_share = np.mean(swing_state['dem_share'])
mean_of_total_vote = np.mean(swing_state['total_votes'])/1000
_ = plt.plot(swing_state['total_votes']/1000, swing_state['dem_share'], marker='.',
             linestyle='none')
# Draw the mean of total votes as a vertical line
_ = plt.axvline(x=mean_of_total_vote, color='red', zorder=1, alpha=0.8)
_ = plt.text(100, 10, 'Mean of total votes', color='r')
# Draw the mean of percent of votes for Obama as a horizontal line
_ = plt.axhline(y=mean_of_dem_share, color='red', zorder=1, alpha=0.8)
_ = plt.text(480, 45, 'Mean of percent for Obama', color='r')
# Distances from the means for one highlighted data point
_ = plt.vlines(x=220, ymin=44, ymax=65., color='g')
_ = plt.hlines(y=65, xmin=85, xmax=220, color='b')
_ = plt.text(100, 70, 'distance from mean of total votes', color='b')
_ = plt.text(205, 60, 'distance from mean of percent for Obama', color='g')
_ = plt.xlabel('Total Votes (thousands)')
_ = plt.ylabel('Percent of Votes for Obama')
plt.show()

If you look at the data point we highlighted: the covariance is the mean of the products of differences, calculated here from the distance from the mean of the percent for Obama and the distance from the mean of total votes.

Positively correlated: If x and y both tend to be above, or both below, their respective means together, then the covariance is positive.

Negatively correlated (anti-correlated): If x tends to be high while y is low, and vice versa, then the covariance is negative.

If we want a more generally applicable measure of how two variables depend on each other, we want it to be dimensionless. We can divide the covariance by the standard deviations (σ) of the x and y variables.

Pearson’s correlation (ρ): Pearson’s correlation is always between −1 and +1 (inclusive). The magnitude indicates the strength of the correlation. A correlation of 1 means the two variables are perfectly correlated; a correlation of −1 means they are perfectly negatively correlated, but for purposes of prediction, a negative correlation is just as good as a positive one. So if the correlation is 0, does that mean there is no relationship between the variables? Not necessarily: Pearson’s correlation only measures linear relationships.
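Dividing the covariance by both standard deviations gives ρ directly. A sketch with made-up data, showing the manual computation matches np.corrcoef():

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's rho = cov(x, y) / (std(x) * std(y))
cov_xy = np.mean((x - np.mean(x)) * (y - np.mean(y)))
rho_manual = cov_xy / (np.std(x) * np.std(y))

# Same value from NumPy's built-in correlation function
rho_numpy = np.corrcoef(x, y)[0, 1]

print(rho_manual, rho_numpy)
```

Because both the covariance and the standard deviations carry the units of the data, the units cancel and ρ is dimensionless.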

The covariance may be computed using the NumPy function np.cov(). For example, if we have two sets of data x and y, np.cov(x, y) returns a 2D array where entries [0,1] and [1,0] are the covariances. Entry [0,0] is the variance of the data in x, and entry [1,1] is the variance of the data in y. This 2D output array is called the covariance matrix, since it organizes the variances and covariances.

Covariance_matrix = np.cov(swing_state['dem_share'], swing_state['total_votes'])
print('Covariance matrix between percent of vote for Obama and total votes is\n', Covariance_matrix)
Covariance matrix between percent of vote for Obama and total votes is
[[1.14763447e+02 8.17309362e+05]
[8.17309362e+05 2.02451039e+10]]
# Let's compare the variance of x from np.var() and from the covariance matrix
# Note: np.cov divides by n-1 by default, while np.var divides by n,
# hence the small difference between the two values
print('1: Variance of X from np.var():', np.var(swing_state['dem_share']))
print('2: Variance of X from covariance matrix:', Covariance_matrix[0,0])
1: Variance of X from np.var(): 114.24649492735986
2: Variance of X from covariance matrix: 114.763447
pearson_coefficient = np.corrcoef(swing_state['dem_share'], swing_state['total_votes'])
print(pearson_coefficient[0,1])
0.536197364958678

Final Thought

Statistical analysis goes far deeper than this and is used to solve much more complex problems; here we have covered just the basic concepts. The terms we mentioned in this article are ones every data analyst should know. I will continue to write about statistical analysis.

You may find this study in my GitHub account as part of my DataCamp repository.

I wrote this article to improve my data analytics skills, so I am still a learner. Please let me know of any additional information or comments on this article.

Follow me on Twitter, LinkedIn, or Medium.
