Probability and Statistics

Population and Sample

Gokulakrishnan
Analytics Vidhya
Sep 27, 2020

To understand population and sample, let’s take the example of the average height of humans. While writing this post I googled the world population, which is estimated to have reached 7.8 billion people as of March 2020. Let’s take it as about 7 billion. I also googled the average human height and learned that it is about “1.7 m” for men and “1.6 m” for women.

Do you know how to find the average?

It’s simple: just add all the numbers and divide by how many numbers there are. The result is also called the “mean”.
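As a minimal sketch of that rule, the snippet below computes the mean of a handful of made-up heights in metres. The values and the names `heights` and `mean_height` are illustrative assumptions, not real measurements.

```python
# Hypothetical heights in metres (illustrative values only, not real data)
heights = [1.62, 1.75, 1.68, 1.80, 1.55]

# Mean = sum of the values divided by the number of values
mean_height = sum(heights) / len(heights)
print("mean height:", round(mean_height, 2))
```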

In the case of human height, do you think we can calculate the average by adding up the heights of all the people in the world and dividing by the total number of people (about 7 billion)? Is it even possible to obtain the heights of 7 billion people?

This is where Probability and Statistics come in handy. The concept of population and sample can help solve the problem in simple steps.

So what are Population and Sample in Statistics?

Population: A population includes all of the elements from a set of data. In the above case, it is the heights of all the people in the world.

Sample: A sample is a subset of the population. When its elements are picked at random, it is called a random sample. In the above case, a small sample of heights is drawn at random in equal proportion. Equal proportion here means that, while randomly selecting heights, people from different countries should be included, and every country should be represented in the same proportion.

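To compute the average height of people exactly, we would have to average the heights of all the (roughly 7 billion) people in the population. This can be written as:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} h_i$$

Here $\mu$ is the mean of the total population, $N$ is the number of people in the population, and $h_i$ is the height of the $i$-th person.

The mean of a sample can be obtained by:

$$\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h_i$$

Here $\bar{h}$ is the mean height of the sample, $n$ is the number of people in the sample, and $h_i$ are the heights of the people in the sample.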

To make this concrete, I have created a simple script that calculates the average of 1000 numbers using the population and sample concept. The population contains the numbers from 1 to 1000. I split the 1000 numbers into 10 categories of 100 numbers each. From each of the 10 categories I randomly draw 10 numbers, so that the categories are represented in equal proportion. When we calculate the mean for both the population and the sample, we can see that the sample mean is close to the population mean.

import numpy as np

# Population: the integers 1 to 1000
lis = list(range(1, 1001))
res = []

# Split the population into 10 groups of 100 and draw 10 values from each
for start in range(0, 1000, 100):
    l = np.random.choice(lis[start:start + 100], 10, replace=False)
    res.extend(l)

print("mean of sample:", np.mean(res))
print("mean of population:", np.mean(lis))
mean of sample: 498.81
mean of population: 500.5

Using this concept, we can estimate the average human height just by randomly picking sample heights from different parts of the world in equal proportions, so as to obtain a value close to the mean height of the population.

The number of elements in a sample is not fixed; we can pick any number of elements for a sample. As the number of elements in the sample increases, the sample mean tends to get closer to the population mean. The script below uses a different split: it draws 10 numbers from each of 5 categories of 200 numbers, for a sample of 50.

import numpy as np

# Population: the integers 1 to 1000
lis = list(range(1, 1001))
res = []

# Split the population into 5 groups of 200 and draw 10 values from each
for start in range(0, 1000, 200):
    l = np.random.choice(lis[start:start + 200], 10, replace=False)
    res.extend(l)

print("mean of sample:", np.mean(res))
print("mean of population:", np.mean(lis))
mean of sample: 499.36
mean of population: 500.5
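To see the effect of sample size more directly, here is a rough sketch that repeats the draw many times for a few sample sizes and reports how far the sample mean lands from the population mean on average. The seed, the sizes 10, 50 and 250, the repeat count, and the helper name `mean_abs_error` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable (illustrative choice)
population = np.arange(1, 1001)  # the same population of 1..1000 used above

def mean_abs_error(sample_size, repeats=500):
    """Average distance between the sample mean and the population mean."""
    errors = []
    for _ in range(repeats):
        sample = rng.choice(population, size=sample_size, replace=False)
        errors.append(abs(sample.mean() - population.mean()))
    return float(np.mean(errors))

# Hypothetical sample sizes to compare
errors_by_size = {n: mean_abs_error(n) for n in (10, 50, 250)}
for n, err in errors_by_size.items():
    print(f"sample size {n}: average error {err:.2f}")
```

A single small sample can still land close to 500.5 by luck, which is why the sketch averages over many repeated draws rather than comparing one run per size.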

One important thing to consider while taking samples from different groups (in this case, from different countries) is that the number of elements collected from every country should be in equal proportion. This process of selecting a sample from a population is called sampling.

Two common sampling techniques in Statistics are:

  • Simple Random Sampling
  • Stratified Sampling

a) Simple Random Sampling:

In this technique, every data point/record in the sample is picked randomly. Every data point/record in the population has an equal chance of being selected, and selecting one data point for the sample does not favor or exclude any other particular point in the population.

b) Stratified Sampling:

In this sampling technique, if we have different groups in the population, then the sample will be selected in such a way that the count of each group in the sample is proportional to the count of those groups in the population.

I can explain both techniques with an example.

Problem Statement: We have to estimate the average score of 1000 students in a Machine Learning examination. (Note: In universities, ML is a subject, and students from different streams can choose it as an elective.)

Out of the 1000 students, 400 are from an Engineering background, 300 are from a Statistics background, 200 are from a Pharmacy background (Data Science is also used in the Pharmacy domain), and 100 are from a Business background.

Here, Population = total number of students who appeared for the ML examination = 1000.

Let us pick a sample of 250 students, so the sample size is 250.

a) In simple random sampling, the sample may consist entirely of 250 students from an Engineering background, or entirely of 250 from a Statistics background, or a combination of Engineering and Pharmacy backgrounds, or smaller subsets of each background, or students from only Engineering and Statistics, and so on. Any combination is possible, including one that matches the population proportions. A sample whose proportions differ from those of the population is called a BIASED SAMPLE, and using such a sample leads to biased estimates.
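As a small sketch of this point, the snippet below draws one simple random sample of 250 from a population labelled with the four backgrounds and counts how many students come from each. The seed and the variable names are assumptions for illustration. In practice the counts usually fall near, but not exactly on, the population proportions; the extreme single-background samples are possible in principle but very unlikely.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(7)  # illustrative seed

# Population of 1000 students labelled by background, as in the problem statement
backgrounds = np.array(["Engineering"] * 400 + ["Statistics"] * 300
                       + ["Pharmacy"] * 200 + ["Business"] * 100)

# Simple random sample: 250 students drawn without looking at their background
srs_sample = rng.choice(backgrounds, size=250, replace=False)
srs_counts = Counter(srs_sample.tolist())

for group in ("Engineering", "Statistics", "Pharmacy", "Business"):
    print(group, srs_counts[group])
```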

b) In stratified sampling, the sample is picked in such a way that all the category sizes in the sample are proportional to the sizes in the population.

In the total population of 1000:

  • Engineering: 400 (40% of 1000)
  • Statistics: 300 (30% of 1000)
  • Pharmacy: 200 (20% of 1000)
  • Business: 100 (10% of 1000)

So in stratified sampling, the sample of 250 members is selected as below:

  • Engineering: 100 (40% of 250)
  • Statistics: 75 (30% of 250)
  • Pharmacy: 50 (20% of 250)
  • Business: 25 (10% of 250)
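The proportional allocation above can be sketched in a few lines. This is only one possible way to write it: the student IDs are hypothetical placeholders, and the names `strata_sizes`, `stratified_sample` and `allocation` are assumptions introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(7)  # illustrative seed

# Stratum sizes from the problem statement
strata_sizes = {"Engineering": 400, "Statistics": 300, "Pharmacy": 200, "Business": 100}
population_size = sum(strata_sizes.values())
sample_size = 250

stratified_sample = {}
next_id = 0
for group, size in strata_sizes.items():
    ids = np.arange(next_id, next_id + size)  # hypothetical student IDs for this group
    next_id += size
    # Proportional allocation: the group's share of the sample matches its share of the population
    n_group = sample_size * size // population_size
    stratified_sample[group] = rng.choice(ids, size=n_group, replace=False)

allocation = {group: len(ids) for group, ids in stratified_sample.items()}
print(allocation)
```

The integer division works cleanly here because every share divides evenly; with other sizes, some rounding rule would be needed so that the group counts still add up to the sample size.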

The sample selected in stratified sampling is an UNBIASED SAMPLE. Using an unbiased sample leads to more accurate estimates.
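One hedged way to check this claim is a small simulation. The exam scores below are entirely made up: each background gets an assumed average score and the same assumed spread, purely so that the groups differ. The sketch then repeats both sampling schemes many times and compares how much the estimated average score varies from sample to sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

# Made-up score distributions per background (assumed means, purely illustrative)
sizes = {"Engineering": 400, "Statistics": 300, "Pharmacy": 200, "Business": 100}
assumed_means = {"Engineering": 75, "Statistics": 80, "Pharmacy": 60, "Business": 55}
scores = {g: rng.normal(assumed_means[g], 5, size=n) for g, n in sizes.items()}
all_scores = np.concatenate(list(scores.values()))
true_mean = all_scores.mean()

def srs_estimate():
    # Simple random sample of 250 students, ignoring background
    return rng.choice(all_scores, size=250, replace=False).mean()

def stratified_estimate():
    # 25% of each group, so the pooled sample keeps the population proportions
    parts = [rng.choice(s, size=len(s) // 4, replace=False) for s in scores.values()]
    return np.concatenate(parts).mean()

srs_means = np.array([srs_estimate() for _ in range(500)])
strat_means = np.array([stratified_estimate() for _ in range(500)])

print("population mean:", round(true_mean, 2))
print("spread of simple random estimates:", round(srs_means.std(), 3))
print("spread of stratified estimates:", round(strat_means.std(), 3))
```

In this setup the stratified estimates vary less from sample to sample, because the differences between the groups no longer contribute to the sampling error. How large the gain is depends on how different the groups really are, which is an assumption of the sketch.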

Note:

  1. If there are no sub-groups in the population, then it is best to go for Simple random sampling as there are no chances for the samples to be BIASED. Samples may become BIASED in Simple random sampling only when there are sub-groups in the population.
  2. If there are sub-groups in the population, it is recommended to go for Stratified Sampling as the samples need to be UNBIASED.

I hope you found this post interesting. Thank you, and have a nice machine learning journey.
