Probability Sampling: Implementation in Python

Published in Analytics Vidhya · 3 min read · Feb 8, 2021

Data are facts or statistics that can be used for reference and analysis. The entire set of available data is called the population. The part of it that we actually use for analytics, machine learning, and so on is known as the sample. A sample is a subset of the population.

A real-life example: the population could be all the colleges in India, and a sample could be the engineering colleges in Kolkata, India.

What is probability sampling?

Probability sampling is a sampling method in which the sample is chosen from a larger population using a procedure based on probability theory. Its most important requirement is that every unit in the population has a known, non-zero chance of being selected. In simple random sampling that chance is also equal for every unit.
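To make the idea of a "known chance of selection" concrete, here is a minimal sketch that simulates drawing n units out of N without replacement many times and compares how often each unit is picked with the theoretical value n/N. The population size, sample size, number of trials, and seed are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed fixed only so the run is reproducible

N = 10           # hypothetical population size
n = 3            # hypothetical sample size
trials = 20_000  # number of repeated draws

counts = np.zeros(N, dtype=int)
for _ in range(trials):
    # draw n distinct unit indices from 0..N-1
    chosen = rng.choice(N, size=n, replace=False)
    counts[chosen] += 1

empirical = counts / trials
print("theoretical inclusion probability:", n / N)
print("empirical inclusion probabilities:", empirical.round(3))
```

Each unit should be selected in roughly n/N of the trials, which is what "known and equal chance" means for this particular design.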

Random Sampling

In this type of sampling, each member of the population has an equal chance of being in the sample, because members are selected at random. A randomly chosen sample is meant to be an unbiased representation of the total population. Let us implement random sampling in Python using the Employee's Monthly Salary sample dataset.

Dataset: Sample - Employee's Monthly Salary (monthly salary of employees in Company "ABC"), www.kaggle.com

Let us start by reading the data.

import numpy as np
import pandas as pd

# read the data
df = pd.read_csv("/kaggle/input/sample-employees-monthly-salary/Employee_monthly_salary.csv")
df.head()

In Python, random sampling is easy to implement with the pandas .sample() method.

# draw 200 rows at random (without replacement)
df.sample(200)

Random sampling is one of the simplest ways to draw a sample, because every member of the population has the same chance of being chosen.
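A plain call to .sample() returns different rows on every run. The sketch below shows one way to make the draw reproducible with random_state, and to sample by fraction instead of by count. It uses a small synthetic DataFrame as a stand-in for the salary data, so the column names "EmpID" and "Salary" and all values are assumptions, not the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical stand-in for the salary dataset (1000 synthetic employees)
population = pd.DataFrame({
    "EmpID": np.arange(1, 1001),
    "Salary": rng.normal(loc=50_000, scale=8_000, size=1000).round(2),
})

# 200 rows, reproducible because random_state is fixed
sample_n = population.sample(n=200, random_state=1)

# 10% of the rows instead of a fixed count
sample_frac = population.sample(frac=0.1, random_state=1)

print(len(sample_n), len(sample_frac))
print("population mean salary:", round(population["Salary"].mean(), 2))
print("sample mean salary:    ", round(sample_n["Salary"].mean(), 2))
```

If the sample is representative, its mean salary should land close to the population mean, though the two will rarely match exactly.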

Systematic Sampling

Systematic sampling is a probability sampling method in which elements are chosen from a target population by picking a random starting point and then selecting members at a fixed 'sampling interval'. It is easy to implement by taking every k-th element. Because the selections are evenly spaced, the sample is spread across the whole population, although it can be biased if the data has a periodic pattern that lines up with the interval.

Here, we do systematic sampling by choosing every 10th row. For simplicity, the slice starts at the first row rather than at a random position.

df.iloc[0:1802:10]
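The slice above always begins at row 0. A minimal sketch of the random starting point mentioned in the definition could look like the following. The helper name systematic_sample, its parameters, and the synthetic DataFrame are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd


def systematic_sample(frame, step, seed=None):
    """Return every `step`-th row of `frame`, starting at a random offset in [0, step)."""
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, step))  # random starting row, upper bound exclusive
    return frame.iloc[start::step]


# Hypothetical stand-in for the salary dataset
population = pd.DataFrame({"EmpID": np.arange(1, 1001)})

sample = systematic_sample(population, step=10, seed=7)
print(len(sample))
print(sample.head())
```

Slicing to the end of the frame with start::step also avoids hardcoding the number of rows.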

Stratified Sampling

A stratum is a subset of the population whose members share at least one common characteristic. In stratified sampling, the population is first divided into strata, and a sufficient number of samples is then drawn at random from each stratum. It is a common technique when trying to draw conclusions about different sub-groups, or strata.

Here, we do stratified sampling using gender as the stratum. First we keep only the male data points, then we draw a random sample from them. The same can be done for the female data points, and finally the two samples can be combined.

# keep only the male stratum
df_male = df[df["Gender"] == "M"]
df_male.head()

# draw 100 rows at random from the male stratum
df_male.sample(100)

Similar steps can be executed for the female data.
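The per-stratum steps can be sketched end to end as follows. The first part samples each gender separately and concatenates the results, and the second part uses pandas' groupby(...).sample(...) (available from pandas 1.1) to draw the same fraction from every stratum in one call. The data is synthetic, and the "F" label, the "EmpID" column, and the 600/400 split are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the salary dataset: 600 male and 400 female rows
population = pd.DataFrame({
    "EmpID": np.arange(1, 1001),
    "Gender": ["M"] * 600 + ["F"] * 400,
})

# Equal allocation: 100 rows from each stratum, then combine
male_sample = population[population["Gender"] == "M"].sample(n=100, random_state=0)
female_sample = population[population["Gender"] == "F"].sample(n=100, random_state=0)
equal_sample = pd.concat([male_sample, female_sample])

# Proportional allocation: 20% of each stratum in a single call
proportional_sample = population.groupby("Gender").sample(frac=0.2, random_state=0)

print(equal_sample["Gender"].value_counts().to_dict())
print(proportional_sample["Gender"].value_counts().to_dict())
```

Equal allocation gives each stratum the same weight in the sample, while proportional allocation keeps the gender ratio of the population. Which one fits depends on the question being asked.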

For the entire code, see the Kaggle notebook "Statistics for DS", which uses data from Sample - Employee's Monthly Salary.