Data Analytics Using Python (Part_3)

Teena Mary
Budding Data Scientist
16 min read · Apr 17, 2020
Photo by Campaign Creators on Unsplash

This is the third post in a 12-part series in which we will learn about Data Analytics using Python. In this post, we will look at Python demos for distributions, sampling and sampling distributions, the distributions of sample means, proportions and variances, and confidence interval estimation for a single population.

Index

  1. Python Demo for Distributions
  2. Sampling and Sampling Distributions
  • Sampling Techniques
  • Sampling Distributions
  • Central Limit Theorem
  3. Confidence Interval Estimation: Single Population
  4. Finite Population

Python Demo for Distributions

We are going to explore the distribution by using examples. First we need to import the required libraries.

import scipy
import numpy as np

Binomial Distribution

from scipy.stats import binom

Q. A survey found that 65% of all financial consumers were very satisfied with their primary financial institution. Suppose that 25 financial consumers are sampled. If the survey result still holds true today, what is the probability that exactly 19 of them are very satisfied with their primary financial institution?

Here, we can use binomial distribution since the outcome is either a success or failure. We use the following code for that:

print(binom.pmf(k=19,n=25,p=0.65))
#0.09077799859322791

Here, k is the x value, n is the sample size and p is the probability of success. We use binom.pmf because the probability of an exact value is asked. We see that there is about a 9% probability that exactly 19 customers are very satisfied with their primary financial institution.

Q. According to the US Census Bureau, approximately 6% of all workers in Jackson, Mississippi, are unemployed. In conducting a random telephone survey in Jackson, what is the probability of getting two or fewer unemployed workers in a sample of 20?

Here, we need to find the probability of getting two or fewer unemployed workers in the sample. So, in this case we use binom.cdf, which is the cumulative distribution function.

binom.cdf(2,20,0.06)
#0.8850275957378545

Here, 2 is the x value, 20 is the sample size and 0.06 is the probability. So, there is about an 88.5% chance of getting two or fewer unemployed workers in a sample of 20.

Poisson Distribution

from scipy.stats import poisson

Q. Suppose bank customers arrive randomly on weekday afternoons at an average of 3.2 customers every 4 minutes. What is the probability of exactly 5 customers arriving in a 4 minute interval on a weekday afternoon?

This is a case where the arrival rate is Poisson distributed. Here, mean is 3.2 customers and the x value is 5.

poisson.pmf(5,3.2)
#0.11397938346351824

So, there is about an 11% chance that exactly 5 customers will arrive in a 4 minute interval on a weekday afternoon.

Q. Bank customers arrive randomly on weekday afternoons at an average of 3.2 customers every 4 minutes. What is the probability of having more than 7 customers arriving in a 4 minute interval on a weekday afternoon?

Since we need to find the probability of having more than 7 customers in a 4 minute interval with an average of 3.2 customers per 4 minute interval, we use poisson.cdf, which gives the probability of 7 or fewer customers. We subtract that value from 1 to get the value we need.

prob=poisson.cdf(7,3.2)
prob_more_than7=1-prob
prob_more_than7
#0.01682984174895752

So, we see that there is only about a 1.7% probability of having more than 7 customers arriving in a 4 minute interval on a weekday afternoon.

Q. A bank has an average arrival rate of 3.2 customers every 4 minutes. What is the probability of getting exactly 10 customers in an 8 minute interval?

Here, for a 4 minute interval, the average is 3.2 customers. So, for an 8 minute interval, the average will be twice that, ie, 6.4 customers.

poisson.pmf(10,6.4)
#0.052790043854115495

So, there is about a 5% chance that exactly 10 customers will arrive in an 8 minute interval.

Uniform Distribution

from scipy.stats import uniform

Q. Suppose the amount of time it takes to assemble a plastic module ranges from 27 to 39 seconds and the assembly time is uniformly distributed. Describe the distribution. What is the probability that a given assembly will take between 30 and 35 seconds?

Here the assembly time is uniformly distributed over the range 27 to 39 seconds. To list the integer values in this range, we use np.arange(start_value, end_value+1, increment_value) to create the array.

U=np.arange(27,40,1)
U
#array([27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39])

To describe the distribution, we first find its mean. So, we use the following code:

uniform.mean(loc=27,scale=12)
#33.0

Here, loc is the starting point of the distribution and scale is the range, 39 - 27 = 12. Now, to find the probability that the assembly will take 30 to 35 seconds, we use the following code:

uniform.cdf(np.arange(30,36,1),loc=27,scale=12)
#array([0.25, 0.33333333, 0.41666667, 0.5, 0.58333333,0.66666667])

Here, 0.25 is the cumulative probability at 30 and 0.66666667 is the cumulative probability at 35. So, to find the probability of the interval, we subtract the former from the latter.

Prob=0.66666667 - 0.25
Prob
#0.41666667

Hence, there is about a 41.7% probability that a given assembly will take between 30 and 35 seconds.

Q. According to the National Association of Insurance Commissioners, the average annual cost for automobile insurance in the United States in recent years was $691. Suppose automobile insurance costs are uniformly distributed across the US with a range of $200 to $1,182. What is the standard deviation of this uniform distribution?

uniform.mean(loc=200,scale=982)
#691.0
uniform.std(loc=200,scale=982)
#283.4789821721062

Hence, the standard deviation of the uniform distribution is approximately 283.48.

Normal Distribution

from scipy.stats import norm

If the x value is 68, the mean is 65.5 and the standard deviation is 2.5, then the probability of obtaining a value less than or equal to 68 is:

val,m,s=68,65.5,2.5
print(norm.cdf(val,m,s))
#0.8413447460685429

If we have to find the probability of x > value, ie, the probability of obtaining values above 68, then we subtract the above cdf value from one:

print(1-norm.cdf(val,m,s))
#0.15865525393145707

If we have to find the probability of values lying in an interval, ie, val1 < x < val2 (here, between 63 and 68), then:

print(norm.cdf(val,m,s)-(norm.cdf(63,m,s)))
#0.6826894921370859

Q. What is the probability of obtaining a score greater than 700 on a GMAT test that has a mean of 494 and a standard deviation of 100? Assume that GMAT scores are normally distributed.

So, here, we have to find the value of P(x > 700), where mean = 494 and standard deviation = 100.

print(1-norm.cdf(700,494,100))
#0.019699270409376912

The probability of obtaining a score greater than 700 on a GMAT test is 1.9%.

Q. For the same GMAT examination, what is the probability of randomly drawing a score that is 550 or less?

print(norm.cdf(550,494,100))
#0.712260281150973

Q. What is the probability of randomly obtaining a score between 300 and 600 on a GMAT exam?

print(norm.cdf(600,494,100)-norm.cdf(300,494,100))
#0.8292378553956377

Q. What is the probability of randomly obtaining a score between 350 and 450 on a GMAT exam?

print(norm.cdf(450,494,100)-norm.cdf(350,494,100))
#0.2550348541262666

If we are given the area under the curve (the cumulative probability), we can find the corresponding z value using the following code:

norm.ppf(0.95)
#1.6448536269514722

Using this z value, we can find the x value using the formula given in the previous post.
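For example, using the GMAT parameters above (mean 494 and standard deviation 100), the x value corresponding to this z value is x = μ + zσ:

print(494 + norm.ppf(0.95)*100)
#658.49 (approx.)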

Hypergeometric Distribution

from scipy.stats import hypergeom

Q. Suppose 18 major computer companies operate in the US and 12 are located in California's Silicon Valley. If three computer companies are selected randomly, what is the probability that one or more of the selected companies are located in Silicon Valley?

Here, we can use the hypergeometric distribution. N (the population size) is 18, n (the sample size) is 3, A (the number of successes in the population) is 12 and x is 1.

pval = hypergeom.sf(0,18,3,12)   #sf = 1-cdf, so this gives P(X >= 1)
pval
#0.9754901960784306

Hence, there is about a 97.5% chance that one or more of the selected companies are located in Silicon Valley.

Q. A western city has 18 officers eligible for promotion. 11 of 18 are Hispanic. Suppose only 5 of the police officers are chosen for promotion. If the officers for promotion had been chosen by chance alone, what is the probability that one or fewer of the five police officers would have been Hispanic?

pval = hypergeom.cdf(1,18,5,11)
pval
#0.04738562091503275

Exponential Distribution

from scipy.stats import expon

Q. A manufacturing company has been involved in statistical quality control for several years. As part of the production process, parts are randomly selected and tested. From the records of these tests, it has been established that a defective part occurs in a pattern that is Poisson distributed on the average of 1.38 defects every 20 minutes during production runs. Use this information to determine the probability that less than 15 minutes will elapse between any two defects.

Here, the time between defects is exponentially distributed. Working in units of the 20 minute interval, the time of interest is 15/20 = 0.75 and the mean time between defects is 1/1.38 intervals, which is the scale parameter of the exponential distribution.

expon.cdf(0.75, scale=1/1.38)
#0.6448 (approximately)

So, there is about a 64% chance that less than 15 minutes will elapse between any two defects.

Sampling and Sampling Distributions

Descriptive statistics uses the data to provide descriptions of the population, either through numerical calculations or graphs or tables. Inferential statistics makes inferences and predictions about a population based on a sample of data taken from the population in question.

Random Sampling:

  • Every unit of the population has the same probability of being included in the sample.
  • A chance mechanism is used in the selection process.
  • Eliminates bias in the selection process
  • Also known as probability sampling

Non-random Sampling:

  • Every unit of the population does not have the same probability of being included in the sample.
  • Opens the door to selection bias
  • Not an appropriate data collection method for most statistical methods
  • Also known as non-probability sampling

Random Sampling Techniques

  1. Simple Random Samples: Every object in the population has an equal chance of being selected. The objects are selected independently. Samples can be obtained from a table of random numbers or computer random number generators. A simple random sample is the ideal against which other sample methods are compared.
  2. Stratified Random Sample: The population is divided into non-overlapping sub-populations called strata. A random sample is selected from each stratum. It has the potential for reducing sampling error. It can be proportionate, ie, the percentage of the sample taken from each stratum is proportionate to the percentage that each stratum is within the population, or disproportionate, ie, the proportions of the strata within the sample are different from the proportions of the strata within the population.
  3. Systematic Sampling: It is convenient and relatively easy to administer. The population elements are an ordered sequence (at least, conceptually). The first sample element is selected randomly from the first k population elements. Thereafter, sample elements are selected at a constant interval, k, from the ordered sequence frame, where k=N/n, and N is population size and n is sample size.
  4. Cluster Sampling: Here, the population is divided into non-overlapping clusters or areas. Each cluster is a miniature of the population. A subset of the clusters is selected randomly for the sample. If the number of elements in the subset of clusters is larger than the desired value of n, these clusters may be subdivided to form a new set of clusters and subjected to a random selection process. The advantages are that it is more convenient for geographically dispersed populations, it reduces travel costs to contact sample elements, it simplifies administration of the survey, and it can be used when the unavailability of a sampling frame prohibits other random sampling methods. The disadvantages are that it is statistically less efficient when the cluster elements are similar, and the costs and problems of statistical analysis are greater than for simple random sampling.

Nonrandom Sampling Techniques

Convenience Sampling: Sample elements are selected for the convenience of the researcher

Judgment Sampling: Sample elements are selected by the judgment of the researcher

Quota Sampling: Sample elements are selected until the quota controls are satisfied

Snowball Sampling: Survey subjects are selected based on referral from other survey respondents

Errors

Data from non-random samples are not appropriate for analysis by inferential statistical methods. Sampling error occurs when the sample is not representative of the population. Non-sampling errors include missing data, recording, data entry, and analysis errors. Poorly conceived concepts, unclear definitions, and defective questionnaires also fall into this category. Response errors occur when people do not know, will not say, or overstate in their answers.

Sampling Distribution

Proper analysis and interpretation of a sample statistic requires knowledge of its distribution. Inferential Statistics is making statements about a population by examining sample results. A sampling distribution is a distribution of all of the possible values of a statistic for a given size sample selected from a population.

Sampling Distribution of Sample Mean

Expected Value of Sample Mean: Let X1, X2, . . . Xn represent a random sample from a population. The sample mean value of these observations is defined as:
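x̄ = (X1 + X2 + … + Xn) / n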

Sample mean is an unbiased estimator of population mean.

Standard Error of the Mean: Different samples of the same size from the same population will yield different sample means. A measure of the variability in the mean from sample to sample is given by the Standard Error of the Mean:
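σ_x̄ = σ / √n

where σ is the population standard deviation and n is the sample size.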

Note that the standard error of the mean decreases as the sample size increases.

If sample values are not independent: If the sample size n is not a small fraction of the population size N, then individual sample members are not distributed independently of one another. Thus, observations are not selected independently. A correction is made to account for this:
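σ_x̄ = (σ / √n) · √((N - n) / (N - 1))

where N is the population size.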

If the Population is Normal:

  • If a population is normal with mean μ and standard deviation σ, the sampling distribution of x̄ is also normally distributed, with mean μ and standard error σ_x̄ = σ/√n.
  • If the sample size n is not large relative to the population size N, no finite population correction is needed and the z value for the sampling distribution of x̄ is z = (x̄ - μ)/(σ/√n).

Central Limit Theorem

If the population is not normal, then we can apply the Central Limit Theorem: even if the population is not normal, sample means from the population will be approximately normal as long as the sample size is large enough. This means that as the sample size increases, ie, n > 25, the distribution of the sample mean approaches the normal distribution, irrespective of the distribution of the population from which the sample is taken.
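As a quick illustration (not one of the original examples), we can simulate the theorem with numpy: draw many samples from a clearly non-normal population and look at how the sample means behave. The exponential population and the sizes used below are arbitrary choices.

np.random.seed(0)
# 10,000 samples of size 40 from a skewed (exponential) population with mean 2
sample_means = np.random.exponential(scale=2.0, size=(10000, 40)).mean(axis=1)
print(sample_means.mean())   # close to the population mean, 2
print(sample_means.std())    # close to sigma/sqrt(n) = 2/sqrt(40), about 0.316

A histogram of sample_means would look approximately bell-shaped even though the underlying population is heavily skewed.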

Sampling Distribution of Sample Proportion

Let P be the proportion of the population having some characteristic. The sample proportion (p̂ ) provides an estimate of P.

p̂ has a binomial distribution, but it can be approximated by a normal distribution when nP(1 - P) > 5 and 0 < p̂ < 1.

The sampling distribution of p̂ is then approximately normal. Also, E(p̂) = P and the variance is:
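Var(p̂) = P(1 - P) / n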

We can standardize the p̂ to a Z value with the formula:
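Z = (p̂ - P) / √(P(1 - P) / n)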

Sampling Distributions of Sample Variance

Let x1, x2, . . . , xn be a random sample from a population. The sample variance is
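s² = Σ(xi - x̄)² / (n - 1)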

The sample variance is different for different random samples from the same population.

The sampling distribution of s² has mean σ² , ie, E(s²)=σ².

The Chi-square Distribution

A chi-square distribution is a continuous distribution with k = n - 1 degrees of freedom. Degrees of freedom (df) is the number of observations that are free to vary after the sample mean has been calculated. The distribution is used to describe the distribution of a sum of squared random variables. It is also used to test the goodness of fit of a distribution of data, to test whether data series are independent, and to estimate confidence intervals for the variance and standard deviation of a random variable from a normal distribution.

χ² statistic
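χ² = (n - 1)s² / σ²

which follows a chi-square distribution with (n - 1) degrees of freedom when the population is normally distributed.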

Confidence Interval Estimation: Single Population

An estimator of a population parameter is a random variable that depends on sample information and whose value provides an approximation to this unknown parameter. A specific value of that random variable is called an estimate.

A point estimate is a single number, whereas a confidence interval provides additional information about variability.

A point estimator θ^ is said to be an unbiased estimator of the parameter θ if the expected value, or mean, of the sampling distribution of θ^ is θ, ie, E(θ^) = θ.

• Examples:

– The sample mean x̅ is an unbiased estimator of μ.

– The sample variance s² is an unbiased estimator of σ².

– The sample proportion p̂ is an unbiased estimator of P

The bias in θ^ is defined as the difference between its mean and θ, ie, Bias(θ^) = E(θ^) - θ.

Most Efficient Estimator:

Suppose there are several unbiased estimators of θ. The most efficient estimator, or the minimum variance unbiased estimator, of θ is the unbiased estimator with the smallest variance.

Confidence Interval Estimate

An interval estimate provides more information about a population characteristic than a point estimate does. Such interval estimates are called confidence intervals.

If P(a < θ < b) = 1 - α, then the interval from a to b is called a 100(1 - α)% confidence interval for θ. The quantity (1 - α) is called the confidence level of the interval (α is between 0 and 1). In repeated samples from the population, the true value of the parameter θ would be contained in 100(1 - α)% of intervals calculated this way. The confidence interval calculated in this manner is written as a < θ < b with 100(1 - α)% confidence.

The general formula for all confidence intervals is:

Point Estimate ±(Reliability Factor)(Standard Error)

The value of the reliability factor depends on the desired level of confidence.

Finding the Reliability Factor, z(α/2)
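For example, for a 95% confidence level we have α = 0.05, and the reliability factor z(0.025) can be obtained with the norm object imported earlier:

norm.ppf(1 - 0.05/2)
#1.959964 (approx.)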

Confidence Interval for μ (σ² Known)

Assumptions:

  • Population variance σ² is known
  • Population is normally distributed
  • If population is not normal, use large sample

So the Confidence interval estimate is given by:
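x̄ ± z(α/2) · σ/√n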

where z(α/2) is the normal distribution value for a probability of α/2 in each tail

Margin of Error:

The confidence interval can be written as x̅ ± ME. So, margin of error is:
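ME = z(α/2) · σ/√n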

The margin of error can be reduced if:

  • The population standard deviation can be reduced (σ↓)
  • The sample size is increased (n↑)
  • The confidence level is decreased, (1 — α) ↓

Student’s t Distribution

Consider a random sample of n observations with mean x̅ and standard deviation s from a normally distributed population with mean μ. Then the variable:
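t = (x̄ - μ) / (s / √n)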

follows the Student’s t distribution with (n — 1) degrees of freedom.

Confidence Interval for μ (σ² Unknown)

If the population standard deviation σ is unknown, we can substitute the sample standard deviation, s. This introduces extra uncertainty, since s is variable from sample to sample. So we use the t distribution instead of the normal distribution.

Assumptions:

  • Population standard deviation is unknown
  • Population is normally distributed
  • If population is not normal, use large sample

Using the Student’s t Distribution, the Confidence Interval Estimate is:
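x̄ ± t(n-1, α/2) · s/√n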

where t(n-1,α/2) is the critical value of the t distribution with n-1 d.f. and an area of α/2 in each tail
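As an illustration (the sample size of 25 here is just an assumed value), the critical value for 95% confidence with n = 25, ie 24 degrees of freedom, can be computed with scipy:

from scipy.stats import t
t.ppf(1 - 0.05/2, df=24)
#2.0639 (approx.)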

The margin of error here is:
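ME = t(n-1, α/2) · s/√n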

As n increases, the t distribution tends toward the normal distribution.

Confidence Intervals for the Population Proportion

An interval estimate for the population proportion (P) can be calculated by adding an allowance for uncertainty to the sample proportion (p̂). Recall that the distribution of the sample proportion is approximately normal if the sample size is large, with standard deviation:
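σ_p̂ = √(P(1 - P) / n)

In practice, P is replaced by the sample proportion p̂ when computing the interval.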

Upper and lower confidence limits for the population proportion are calculated with the formula:
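p̂ ± z(α/2) · √(p̂(1 - p̂) / n)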

Confidence Intervals for the Population Variance

The confidence interval is based on the sample variance s² and it is assumed that the population is normally distributed. The random variable
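(n - 1)s² / σ²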

follows a chi-square distribution with (n - 1) degrees of freedom. The 100(1 - α)% confidence interval for the population variance is:
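(n - 1)s² / χ²(n-1, α/2) < σ² < (n - 1)s² / χ²(n-1, 1-α/2)

where χ²(n-1, α/2) is the value of the chi-square distribution with (n - 1) degrees of freedom that has probability α/2 in the upper tail, and χ²(n-1, 1-α/2) is the value with probability 1 - α/2 in the upper tail.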

Finite Populations

If the sample size is more than 5% of the population size (and sampling is without replacement), then a finite population correction factor must be used when calculating the standard error. Suppose sampling is without replacement and the sample size is large relative to the population size. Assume the population size is large enough to apply the central limit theorem. Apply the finite population correction factor when estimating the variance of the sample mean or the sample proportion.

Estimating Population Mean:

Let a simple random sample of size n be taken from a population of N members with mean μ. The sample mean is an unbiased estimator of the population mean μ. If the sample size is more than 5% of the population size, an unbiased estimator for the variance of the sample mean is:
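σ̂²_x̄ = (s²/n) · (N - n)/N

where the factor (N - n)/N is the finite population correction.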

Estimating the Population Proportion

Let the true population proportion be P. Let p̂ be the sample proportion from n observations from a simple random sample. The sample proportion, p̂, is an unbiased estimator of the population proportion, P. If the sample size is more than 5% of the population size, an unbiased estimator for the variance of the sample proportion is:
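σ̂²_p̂ = (p̂(1 - p̂) / (n - 1)) · (N - n)/N

where, again, (N - n)/N is the finite population correction.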

Summary

• Introduced sampling distributions
• Described the sampling distribution of sample means
  – For normal populations
  – Using the Central Limit Theorem
• Described the sampling distribution of sample proportions
• Introduced the chi-square distribution
• Examined sampling distributions for sample variances
• Calculated probabilities using sampling distributions
• Discussed confidence interval estimation for a single population mean, proportion and variance, including finite population corrections
