Probability Sampling: Implementation in Python

Prateek Majumder
Published in Analytics Vidhya
3 min read · Feb 8, 2021

Data are facts or statistics that can be used for reference and analysis. The complete set of available data is called the population. The portion we actually use for analytics, machine learning and so on is called the sample. A sample is a subset of the population.

A real-life example of a population would be all the colleges in India; a sample would be the engineering colleges in Kolkata, India.

Population vs Sample.

What is probability sampling?

Probability sampling is a sampling method in which samples are chosen from a larger population using a procedure based on probability theory. Its most important requirement is that every unit in the population has a known, non-zero chance of being selected. In the simplest case, simple random sampling, that chance is also equal for every unit.
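To make the idea of a "known chance" concrete, here is a small simulation sketch. It is not part of the original notebook: the population size `N`, sample size `n` and number of `trials` are made-up values chosen only for illustration. Under simple random sampling without replacement, each unit should appear in a sample with probability n / N.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

N = 10         # hypothetical population size
n = 3          # hypothetical sample size
trials = 5000  # number of repeated samples to draw

# Count how often each population unit ends up in a sample.
counts = np.zeros(N, dtype=int)
for _ in range(trials):
    # simple random sample of n distinct units out of N
    chosen = rng.choice(N, size=n, replace=False)
    counts[chosen] += 1

inclusion_rate = counts / trials
print(inclusion_rate)  # every entry should be close to n / N
```

Every entry of `inclusion_rate` should come out near 0.3, which is the known selection probability n / N for this design.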

Random Sampling

In this type of sampling, each member of the population has an equal chance of being in the sample, and members are selected at random. A randomly chosen sample is meant to be an unbiased representation of the total population. Let us implement random sampling in Python, using the Employee’s Monthly Salary Sample Dataset.

Let us start by reading the data.

import numpy as np
import pandas as pd

# read the data
df = pd.read_csv("/kaggle/input/sample-employees-monthly-salary/Employee_monthly_salary.csv")
df.head()
Data Overview.

In Python, random sampling is very easy to implement with the pandas DataFrame .sample() method.

# take 200 rows at random
df.sample(200)
Random Sampling.

Random sampling is one of the easiest ways of collecting data from the total population, because every member has the same opportunity of being chosen.
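As a slightly fuller sketch of the same idea, the snippet below draws a fixed number of rows and a fixed fraction of rows, and passes `random_state` so the draw can be reproduced. The `demo` DataFrame and its column names are hypothetical stand-ins, since the Kaggle file is not assumed to be available here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the salary data (column names are assumptions).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Gender": rng.choice(["M", "F"], size=1000),
    "Salary": rng.integers(20_000, 100_000, size=1000),
})

# Fixed number of rows; random_state makes the draw repeatable.
sample_n = demo.sample(n=200, random_state=1)

# Fixed fraction of rows (10% of the population).
sample_frac = demo.sample(frac=0.1, random_state=1)

print(len(sample_n), len(sample_frac))
```

By default `.sample()` draws without replacement, so no row appears twice in either sample.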

Systematic Sampling

Systematic sampling is a probability sampling method in which elements are chosen from a target population by picking a random starting point and then selecting members at a fixed ‘sampling interval’. It is easily implemented by selecting every k-th element. Because the selected elements are spread evenly across the ordered population, the whole list is covered, provided the ordering has no periodic pattern that lines up with the interval.

Here, we do systematic sampling by choosing every 10th row. For simplicity, the slice starts at the first row (index 0) instead of a random starting point.

df.iloc[0:1802:10]
Systematic Sampling.
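The definition of systematic sampling calls for a random starting point, so a variant with a random start might look like the sketch below. The `demo` DataFrame, the `EmployeeID` column and the interval `k` are illustrative assumptions, not taken from the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical population of 1000 rows (not the original dataset).
demo = pd.DataFrame({"EmployeeID": np.arange(1000)})

k = 10                           # sampling interval
rng = np.random.default_rng(7)   # fixed seed so the run is repeatable
start = int(rng.integers(0, k))  # random start among the first k rows

# Take every k-th row, beginning at the random start.
systematic = demo.iloc[start::k]

print(start, len(systematic))
```

Choosing the start at random within the first `k` rows gives each row the same 1-in-k chance of being selected, while keeping the fixed spacing between sampled rows.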

Stratified Sampling

A stratum is a subset of the population whose members share at least one common characteristic. In stratified sampling, the population is first divided into strata, and a sample of sufficient size is then drawn from each stratum. It is a common technique when we want to draw conclusions about different sub-groups, or strata.

Here, we do stratified sampling with gender as the stratifying variable. First we keep only the male data points, then we draw a random sample from them. The same can be done for the female data points, and finally the two samples can be combined.

df_male = df[df["Gender"] == "M"]
df_male.head()
Male Data points.
df_male.sample(100)
Random Sampling on the Male Only data.

Similar steps can be executed for the female data.
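The filter-then-sample steps for each gender can also be done in one pass. The sketch below is one possible approach, not the original notebook's code: it uses pandas `groupby(...).sample(...)`, available from pandas 1.1 onward, on a hypothetical `demo` DataFrame with made-up stratum sizes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 600 male and 400 female rows (sizes are assumptions).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Gender": ["M"] * 600 + ["F"] * 400,
    "Salary": rng.integers(20_000, 100_000, size=1000),
})

# Proportional allocation: sample 10% of each stratum.
proportional = demo.groupby("Gender").sample(frac=0.1, random_state=1)

# Equal allocation: sample the same number of rows from each stratum.
equal = demo.groupby("Gender").sample(n=50, random_state=1)

print(proportional["Gender"].value_counts())
print(equal["Gender"].value_counts())
```

Proportional allocation keeps the male-to-female ratio of the population in the sample, whereas equal allocation gives both strata the same weight regardless of their size. Which one is appropriate depends on the analysis.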


So, sampling is an important concept in analytics and data science. It gives us a manageable subset of the data to work with.
