Sampling techniques and their implementation in Python
When conducting research or drawing a conclusion about a group of people, it is usually impractical to collect data from every member of that group. Instead, one selects a subset of the data that is representative of the whole group. This method is known as sampling. Two key terms come up here: population and sample. The population is the full group of data from which the sample is drawn; the sample is the subset that is actually selected for study.
There are two kinds of sampling: i) Probability Sampling ii) Non-Probability Sampling
Probability Sampling
Probability sampling is a method of selecting a sample from a population in such a way that each member of the population has a known, non-zero chance of being selected; in the simplest designs that chance is the same for everyone. Because selection is left to chance rather than to the researcher's judgement, selection bias is reduced, which increases the reliability and validity of the results obtained from the sample.
There are different types of probability sampling; a few of them are described below:
i) Simple Random Sampling:
Theory: In this technique, each member of the population is assigned a unique number, and a sample is selected by using a random number generator or a table of random numbers. One big advantage of this technique is that it is the most direct method of probability sampling.
Implementation in python:
#importing the random module
import random
#defining the population from where sample will be created (the integers 1 to 99)
population = list(range(1, 100))
#defining the size of sample
sample_size = 10
#perform simple random sampling (without replacement) by using the random.sample() function
sample = random.sample(population, sample_size)
#it will print 10 random numbers within the range provided; the values differ on every run
print("Simple random sample of 10 numbers:", sample)
##output (one possible run):
##Simple random sample of 10 numbers: [40, 34, 60, 96, 8, 95, 94, 73, 93, 26]
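The idea that simple random sampling gives every member the same chance of selection can be checked empirically. The following is a minimal sketch, assuming the same population of 1 to 99 and a sample size of 10; the seed and the number of repeated draws (`n_trials`) are arbitrary choices made only for illustration.

```python
import random
from collections import Counter

random.seed(42)                      # fixed seed so the run is repeatable (illustrative choice)

population = list(range(1, 100))     # 99 members: the integers 1 to 99
sample_size = 10
n_trials = 20_000                    # hypothetical number of repeated draws

# Count how often each member ends up in a sample
counts = Counter()
for _ in range(n_trials):
    counts.update(random.sample(population, sample_size))

expected = sample_size / len(population)                 # theoretical inclusion probability n/N
observed = [counts[member] / n_trials for member in population]

print(f"expected inclusion probability: {expected:.4f}")
print(f"observed min / max: {min(observed):.4f} / {max(observed):.4f}")
```

If the sampling is fair, the observed minimum and maximum should both sit close to the expected value of 10/99.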
ii) Systematic Sampling:
Theory: Systematic sampling or interval sampling is a probability sampling technique used to select a sample from a population in a systematic way. In systematic sampling, the population is first ordered in some way, and then every nth individual is selected for inclusion in the sample, where “n” is a predetermined sampling interval. For example, if the sampling interval is 5, then every 5th individual in the population will be selected for inclusion in the sample.
Implementation in python
import numpy as np
# Define the population
population = np.arange(1, 100)
# Define the sample size and sampling interval
# Both values can be chosen by the user; here the interval is derived from the sample size
# Here sampling interval is 9 (99//10 = 9)
sample_size = 10
sampling_interval = len(population) // sample_size
# Define the starting point of the sample (a random index from 0 to sampling_interval - 1)
# refer to https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html for random.randint()
start_point = np.random.randint(0, sampling_interval)
# Perform systematic sampling
# 99 elements with an interval of 9 give 11 values, so keep only the first sample_size of them
sample = population[start_point::sampling_interval][:sample_size]
# Print the sample
print("Systematic or interval sample of 10 numbers:", sample)
##Output (for start_point = 5):
##Systematic or interval sample of 10 numbers: [ 6 15 24 33 42 51 60 69 78 87]
##The sample starts at 6 and then takes every 9th number after it (15, 24, ...)
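When the population size is not a clean multiple of the desired sample size (99 and 10 here), an integer interval cannot cover the population evenly with exactly the requested number of items. One common variant uses a fractional interval instead. The sketch below is only an illustration of that idea, not part of the example above; the function name `systematic_sample` and its parameters are hypothetical.

```python
import random

def systematic_sample(population, sample_size, rng=random):
    """Return sample_size items using a fractional sampling interval."""
    if not 0 < sample_size <= len(population):
        raise ValueError("sample_size must be between 1 and len(population)")
    interval = len(population) / sample_size     # may be fractional, e.g. 99 / 10 = 9.9
    start = rng.random() * interval              # random start in [0, interval)
    # Step through the list in fractional strides and round each position down
    indices = [int(start + i * interval) for i in range(sample_size)]
    return [population[i] for i in indices]

population = list(range(1, 100))                 # 99 values, 1 to 99
sample = systematic_sample(population, 10)
print(len(sample), sample)
```

Because the stride is at least 1, every position is distinct, and the whole list (including its tail) stays reachable from some starting point.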
iii) Stratified Sampling:
Theory: When the population has distinct categories or sub-groups, the sampling frame can be divided into independent ‘strata’. Each stratum is then sampled as an independent sub-population, from which individual elements are randomly selected. This procedure is called stratified sampling. A meaningful real-world example is a political survey: the population of a town differs by race, religion, gender, education, culture and so on. A stratified sample based on these factors can therefore claim to be more representative of the population than a simple random sample or a systematic sample.
Implementation in python
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the data into a Pandas DataFrame
data=pd.read_csv("https://raw.githubusercontent.com/ayan-zz/Statistics_python/main/titanic.csv")
# Specify the stratification variable
stratify_by = 'Sex'
# Split the data into training and testing sets, with stratification
train, test = train_test_split(data, test_size=0.3, stratify=data[stratify_by])
# Check the distribution of the stratification variable in the training and testing sets
print("Train dataset:\n", train[stratify_by].value_counts())
print("Test dataset:\n", test[stratify_by].value_counts())
##Output:
##Train dataset:
##male 403
##female 220
##Name: Sex, dtype: int64
##Test dataset:
##male 174
##female 94
##Name: Sex, dtype: int64
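The train/test split above keeps the proportions of each stratum, but a stratified sample can also be drawn directly. Below is a minimal sketch using only the standard library; the helper `stratified_sample`, its parameters, and the made-up population of 60 male and 40 female records are all assumptions for illustration, not the Titanic data.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Draw the same fraction of records at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)       # group the records by stratum
    sample = []
    for members in strata.values():
        n = round(len(members) * fraction)       # proportional allocation
        sample.extend(rng.sample(members, n))
    return sample

# Made-up population: 60 "male" and 40 "female" records (hypothetical data)
people = [{"id": i, "Sex": "male" if i < 60 else "female"} for i in range(100)]

sample = stratified_sample(people, key="Sex", fraction=0.2, seed=0)
print(Counter(p["Sex"] for p in sample))
```

With a fraction of 0.2 the sample holds 12 male and 8 female records, the same 60/40 split as the population. On a pandas DataFrame (version 1.1 or later), `data.groupby('Sex').sample(frac=0.2)` should give a comparable result.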
iv) Cluster Sampling:
Theory: Cluster sampling is used when the population naturally divides into groups (clusters), each of which is internally heterogeneous and therefore resembles a small-scale version of the whole population. The clusters must be mutually exclusive and collectively exhaustive. A number of clusters is selected by simple random sampling, and all the elements within the selected clusters are included in the study. This is called one-stage (single-stage) cluster sampling. If, in addition, a random sample of elements is drawn from each of the selected clusters, it is called two-stage sampling. For example, consider a customer survey for a multinational company (MNC). The company wants to gather feedback from its customers. It has a database of customer addresses, and it divides the database into clusters based on geographic location. Then a random sample of these clusters is selected, and all the customers within the selected clusters are surveyed.
Implementation in python
import random
# create a list of population data
population_data = [10, 15, 20, 26, 29, 33, 35, 37, 41, 42, 46, 48, 50, 52, 55, 58, 61, 64, 68, 72]
# set the desired cluster size and the number of clusters to select
cluster_size = 5
num_clusters = 2
# divide the population into consecutive, non-overlapping clusters of size cluster_size
clusters = [population_data[i:i + cluster_size] for i in range(0, len(population_data), cluster_size)]
# randomly select whole clusters; every element of a selected cluster is kept
sampled_data = random.sample(clusters, num_clusters)
# print all the clusters and the sampled data
print(clusters)
print(sampled_data)
## Output (one possible run)
# [[10, 15, 20, 26, 29], [33, 35, 37, 41, 42], [46, 48, 50, 52, 55], [58, 61, 64, 68, 72]]
# [[46, 48, 50, 52, 55], [10, 15, 20, 26, 29]]
Multi-stage Sampling: It is an extension of cluster sampling in which several rounds of sampling are performed one after another. Once the clusters have been selected, one can sample again within each of them, and repeat until the required sample size is reached. Any sampling technique, not only clustering, can be used at each later stage. For example, we can perform simple random sampling within each cluster of the sampled data above, 'sampled_data'.
# from previous example
sample_size = 3
# second stage: draw sample_size elements at random from each selected cluster
group_sample = [random.sample(cluster, sample_size) for cluster in sampled_data]
print(group_sample)
## output (one possible run)
# [[46, 50, 55], [10, 20, 29]]
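The customer survey described in the theory above can be sketched in the same way, with each geographic region acting as one cluster. This is only an illustration under stated assumptions: the region names, the customer IDs and the seed are all made up.

```python
import random

rng = random.Random(7)               # fixed seed, an arbitrary illustrative choice

# Hypothetical customer IDs grouped by region; each region is one cluster
customers_by_region = {
    "north": [101, 102, 103, 104],
    "south": [201, 202, 203],
    "east": [301, 302, 303, 304, 305],
    "west": [401, 402, 403],
}

# Stage 1: pick whole regions at random
selected_regions = rng.sample(sorted(customers_by_region), 2)

# One-stage cluster sampling: survey every customer in the selected regions
one_stage = [c for region in selected_regions for c in customers_by_region[region]]

# Two-stage sampling: survey a random subset of customers within each selected region
two_stage = [c for region in selected_regions
             for c in rng.sample(customers_by_region[region], 2)]

print(selected_regions)
print(one_stage)
print(two_stage)
```

The one-stage list contains every customer of the chosen regions, while the two-stage list contains two customers from each of them.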
Non-probability Sampling
It is a type of sampling in which the selection is not random, so the sample may not be directly representative of the population. In non-probability sampling, the chance of any particular member of the population being selected is not known; selection depends instead on factors such as geographical proximity or availability.
It can be divided into several types, such as:
i) Convenience Sampling: The sample is determined by the convenience of the researcher, which includes geographical proximity and availability. These data points can be collected easily and quickly for study.
ii) Quota Sampling: This is a type of non-probability sampling where participants are selected based on pre-specified quotas. Quota sampling is often used in market research studies where the researcher wants to ensure that the sample is representative of the population in terms of certain characteristics, such as age, gender, or income.
iii) Purposive (judgemental) Sampling: This is a type of sampling where participants are selected based on specific criteria. Purposive sampling is often used in qualitative research studies where the researcher wants to gather in-depth information from individuals who have a particular perspective or experience. For example: A researcher is interested in studying the experiences of individuals who have been diagnosed with a rare genetic disorder. The researcher identifies a clinic that specializes in treating the disorder and recruits participants from the clinic. The researcher may also use other inclusion criteria, such as age or gender, to further refine the sample.
iv) Snowball Sampling: This type of sampling is used where the population is hard to identify. An initial criterion is determined, the first participants are selected using that criterion, and they are then asked to refer other individuals who also meet it. For example: referrals in a job search.
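The four non-probability techniques above can also be sketched in Python. The code below is a rough illustration only: the respondent records, the referral network and all four helper functions are hypothetical, and real studies would apply these ideas during data collection rather than to a ready-made list.

```python
from collections import deque

# Hypothetical respondents, listed in the order the researcher met them
respondents = [
    {"name": "A", "gender": "F", "age": 34},
    {"name": "B", "gender": "M", "age": 41},
    {"name": "C", "gender": "M", "age": 29},
    {"name": "D", "gender": "F", "age": 52},
    {"name": "E", "gender": "M", "age": 38},
    {"name": "F", "gender": "F", "age": 27},
]

def convenience_sample(people, n):
    """Take whoever is easiest to reach: here, the first n on the list."""
    return people[:n]

def quota_sample(people, key, quotas):
    """Accept people in arrival order until each category's quota is full."""
    remaining = dict(quotas)
    chosen = []
    for person in people:
        if remaining.get(person[key], 0) > 0:
            chosen.append(person)
            remaining[person[key]] -= 1
    return chosen

def purposive_sample(people, criterion):
    """Keep only the people who meet the researcher's criterion."""
    return [p for p in people if criterion(p)]

def snowball_sample(seeds, referrals, max_size):
    """Start from seed participants and follow their referrals breadth-first."""
    chosen, queue, seen = [], deque(seeds), set(seeds)
    while queue and len(chosen) < max_size:
        person = queue.popleft()
        chosen.append(person)
        for referred in referrals.get(person, []):
            if referred not in seen:
                seen.add(referred)
                queue.append(referred)
    return chosen

# Hypothetical referral network: who recommends whom
referrals = {"A": ["C", "D"], "C": ["F"], "D": ["F", "H"], "F": ["A"]}

print([p["name"] for p in convenience_sample(respondents, 3)])
print([p["name"] for p in quota_sample(respondents, "gender", {"F": 2, "M": 2})])
print([p["name"] for p in purposive_sample(respondents, lambda p: p["age"] >= 40)])
print(snowball_sample(["A"], referrals, max_size=5))
```

None of these functions uses a random number generator, which reflects the defining feature of non-probability sampling: who ends up in the sample depends on order, criteria or referrals rather than on chance.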
For more information and knowledge on non-probability sampling, visit: https://www.scribbr.com/methodology/non-probability-sampling/
I hope you have enjoyed reading this blog. Please don’t forget to look into the following for further study: