ML Series: Day 36 — Types of Sampling in Statistics

Ebrahim Mousavi
5 min read · Jun 30, 2024

Types of Sampling in Statistics

Sampling is a fundamental concept in statistics that involves selecting a subset of individuals or observations from a larger population to estimate characteristics of the whole population. Effective sampling methods ensure that the selected subset accurately represents the population, reducing bias and improving the reliability of statistical inferences. Here are some of the most important sampling methods:

1. Simple Random Sampling

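Simple Random Sampling (SRS) is a method where every individual in the population has an equal chance of being selected, and every possible sample of a given size is equally likely. Because selection does not depend on any characteristic of the individuals, the sample is representative of the population on average, although any single sample can still differ from it by chance.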

Python Code Example:

import numpy as np

# Population
population = np.arange(1, 101) # A population of 100 individuals

# Simple Random Sampling
sample_size = 10
simple_random_sample = np.random.choice(population, size=sample_size, replace=False)

print("Simple Random Sample:", simple_random_sample)

# Example output (varies from run to run):
# Simple Random Sample: [ 3 38 97 94 92 76 71 48 89 49]

Advantages:
- Minimizes selection bias.
- Simple to implement and understand.

Disadvantages:
- Requires a complete list of the population (a sampling frame), which may not be feasible for very large populations.
- Does not ensure that subgroups are proportionately represented.
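
The second disadvantage can be seen in a small simulation. The sketch below assumes a hypothetical population in which 10 of 100 individuals belong to a minority group (the group labels, the 10% share, and the number of trials are illustrative choices, not real data). It draws many simple random samples of size 10 and counts how often the minority group is missed entirely.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable

# Hypothetical population: 90 individuals in group "major", 10 in group "minor"
groups = np.array(["major"] * 90 + ["minor"] * 10)

sample_size = 10
n_trials = 1000

# Count how many "minor" members land in each simple random sample
minor_counts = np.array([
    np.sum(rng.choice(groups, size=sample_size, replace=False) == "minor")
    for _ in range(n_trials)
])

# Share of samples that contain no one from the minority group
share_without_minor = np.mean(minor_counts == 0)

print("Average 'minor' members per sample:", minor_counts.mean())
print("Share of samples with no 'minor' member:", share_without_minor)
```

On average a sample contains about one minority member, which matches the 10% share, but a sizable fraction of individual samples (roughly a third in expectation) contain none at all. This is the gap that stratified sampling is designed to close.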

2. Stratified Sampling

Stratified Sampling involves dividing the population into distinct subgroups, or strata, that share similar characteristics. Samples are then drawn from each stratum, often in proportion to the stratum’s size relative to the population.

Steps for Stratified Sampling:
1. Identify Strata: Determine the characteristics that define each stratum (e.g., age, gender, income level).
2. Divide Population: Split the population into these strata.
3. Random Sampling within Strata: Perform simple random sampling within each stratum.

Python Code Example:

import numpy as np
import pandas as pd


# Create hypothetical population data with two strata
data = {
    'Stratum': ['A']*50 + ['B']*50,
    'Value': np.random.randint(1, 100, 100)
}
df = pd.DataFrame(data)

# Stratified Sampling: draw 10% of each stratum at random
stratified_sample = df.groupby('Stratum').sample(frac=0.1)

print("Stratified Sample:\n", stratified_sample)

Advantages:
- Ensures representation of all subgroups.
- Can give more precise estimates than SRS of the same size when individuals within each stratum are similar to one another.

Disadvantages:
- Requires detailed knowledge of population characteristics.
- Can be more complex and time-consuming to administer.
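
When strata differ in size, the choice between sampling "in proportion to the stratum's size" and sampling a fixed number per stratum matters. The following is a minimal sketch under assumed conditions: a hypothetical population with an 800/200 split and made-up value distributions for each stratum. It compares proportional allocation with equal allocation, and shows that equal allocation needs weighting by population share to estimate the overall mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # seeded only so the sketch is repeatable

# Hypothetical population with unequal strata: 800 in 'A', 200 in 'B'
df = pd.DataFrame({
    "Stratum": ["A"] * 800 + ["B"] * 200,
    "Value": np.concatenate([
        rng.normal(loc=50, scale=5, size=800),  # stratum A values
        rng.normal(loc=80, scale=5, size=200),  # stratum B values
    ]),
})

# Proportional allocation: the same fraction from every stratum,
# so the sample keeps the population's 80/20 split
proportional = df.groupby("Stratum").sample(frac=0.05, random_state=1)

# Equal allocation: the same number from every stratum,
# which over-represents the smaller stratum 'B'
equal = df.groupby("Stratum").sample(n=25, random_state=1)

# With equal allocation, weight each stratum mean by its population share
weights = df["Stratum"].value_counts(normalize=True)
weighted_mean = (equal.groupby("Stratum")["Value"].mean() * weights).sum()

print("Population mean:           ", round(df["Value"].mean(), 2))
print("Proportional sample mean:  ", round(proportional["Value"].mean(), 2))
print("Equal allocation, naive:   ", round(equal["Value"].mean(), 2))
print("Equal allocation, weighted:", round(weighted_mean, 2))
```

The naive mean of the equally allocated sample is pulled toward stratum B, while the weighted mean and the proportional sample mean both stay close to the population mean.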

3. Cluster Sampling

Cluster Sampling involves dividing the population into clusters, usually based on geographic or other natural groupings. A random sample of clusters is then selected, and all or a random sample of members within chosen clusters are studied.

Steps for Cluster Sampling:
1. Define Clusters: Identify clusters within the population (e.g., schools, neighborhoods).
2. Random Selection of Clusters: Use simple random sampling to select which clusters to include.
3. Sampling within Clusters: Conduct a census or a random sample within the selected clusters.

Python Code Example:

import numpy as np
import pandas as pd

# Create hypothetical population data with four clusters of five members each
data = {
    'Cluster': ['A']*5 + ['B']*5 + ['C']*5 + ['D']*5,
    'Value': np.random.randint(1, 20, 20)
}
df = pd.DataFrame(data)

# Cluster Sampling
clusters = df['Cluster'].unique()
chosen_clusters = np.random.choice(clusters, size=2, replace=False)
cluster_sample = df[df['Cluster'].isin(chosen_clusters)]

print("Cluster Sample:\n", cluster_sample)

Advantages:
- Cost-effective and practical for geographically dispersed populations.
- Reduces travel and administrative costs.

Disadvantages:
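- Usually has higher sampling error than SRS or stratified sampling of the same sample size.
- Less representative if members within a cluster are very similar to each other while the clusters differ from one another, because a few clusters then cannot reflect the whole population.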
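
Step 3 allows either a census or a random sample within the selected clusters. A census of the chosen clusters is one-stage cluster sampling; the sketch below illustrates the two-stage variant instead. The cluster names, the cluster sizes, and the choice of 3 clusters with 5 members each are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)  # seeded only so the sketch is repeatable

# Hypothetical population: 6 clusters (e.g., schools) with 20 members each
cluster_names = ["A", "B", "C", "D", "E", "F"]
cluster_labels = np.repeat(cluster_names, 20)
df = pd.DataFrame({
    "Cluster": cluster_labels,
    "Value": rng.integers(1, 100, size=cluster_labels.size),
})

# Stage 1: randomly select 3 of the 6 clusters
chosen_clusters = rng.choice(cluster_names, size=3, replace=False)

# Stage 2: simple random sample of 5 members within each chosen cluster
in_chosen = df[df["Cluster"].isin(chosen_clusters)]
two_stage_sample = in_chosen.groupby("Cluster").sample(n=5, random_state=2)

print("Chosen clusters:", sorted(chosen_clusters))
print(two_stage_sample.groupby("Cluster").size())
```

Sampling within clusters lowers the cost per cluster, so more clusters can be visited for the same budget, at the price of an extra source of sampling variation.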

4. Systematic Sampling

Systematic Sampling involves selecting every k-th member of the population after a random starting point. This method is useful when the population is logically ordered or when a complete list is available.

Steps for Systematic Sampling:
1. Determine Sampling Interval (k): Calculate k by dividing the population size by the desired sample size.
2. Random Start: Choose a random starting point between 1 and k.
3. Select Sample: Select every k-th individual after the starting point.

Python Code Example:

import numpy as np

# Population
population = np.arange(1, 101)

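# Systematic Sampling
sample_size = 10
k = len(population) // sample_size # Sampling interval
start = np.random.randint(0, k) # Random start: an index in [0, k)
systematic_sample = population[start::k]

print("Systematic Sample:", systematic_sample)

# Example output when start = 0 (varies from run to run):
# Systematic Sample: [ 1 11 21 31 41 51 61 71 81 91]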

Advantages:
- Simple and quick to implement.
- Ensures even coverage of the population.

Disadvantages:
- Can introduce bias if there is a hidden pattern in the population.
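
To make the hidden-pattern problem concrete, here is a small sketch with an invented population: 100 daily sales figures in which every 7th day is assumed to have double the usual sales. The figures and the weekly cycle are hypothetical. With an interval of k = 7, each possible systematic sample lands on the same day of the week every time.

```python
import numpy as np

# Hypothetical population of 100 daily sales figures with a weekly cycle:
# every 7th day has sales of 200, all other days have sales of 100
days = np.arange(100)
sales = np.where(days % 7 == 0, 200, 100)

# Interval k = 7 matches the cycle length
k = 7

# Mean of the systematic sample for every possible starting index
sample_means = {start: sales[start::k].mean() for start in range(k)}

print("Population mean:", sales.mean())
for start, mean in sample_means.items():
    print(f"start={start}: sample mean = {mean}")
```

The population mean is 115, but every possible sample reports either 200 (start 0) or 100 (any other start), so no starting point gives an estimate near the true value. Choosing an interval that does not line up with the cycle avoids this.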

5. Convenience Sampling

Convenience Sampling (also known as availability sampling) involves selecting individuals who are easiest to reach. This method is often used in exploratory research where random sampling is impractical.

Python Code Example:

import numpy as np

# Population
population = np.arange(1, 101)

# Convenience Sampling
convenience_sample = population[:10]

print("Convenience Sample:", convenience_sample)
# Convenience Sample: [ 1 2 3 4 5 6 7 8 9 10]

Advantages:
- Easy and quick to administer.
- Cost-effective.

Disadvantages:
- High risk of bias.
- Results may not be generalizable to the entire population.
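
The bias is easy to demonstrate when the order of the list is related to the quantity being measured. The sketch below assumes a hypothetical roster of 100 ages sorted in ascending order, so the "easiest to reach" individuals at the top of the list are the youngest. The age range and sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # seeded only so the sketch is repeatable

# Hypothetical population: 100 ages between 18 and 79, sorted ascending
ages = np.sort(rng.integers(18, 80, size=100))

# Convenience sample: the first 10 people on the list
convenience_sample = ages[:10]

# Simple random sample of the same size, for comparison
random_sample = rng.choice(ages, size=10, replace=False)

print("Population mean age:    ", ages.mean())
print("Convenience sample mean:", convenience_sample.mean())
print("Random sample mean:     ", random_sample.mean())
```

The convenience sample always underestimates the mean age here because it contains only the youngest individuals, whereas the random sample can land on either side of the population mean.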

In conclusion, Simple Random, Stratified, Cluster, and Systematic Sampling are probability methods that trade off simplicity, precision, and cost in different ways, while Convenience Sampling is a non-probability method that is cheap but prone to bias. Each method has its own strengths and limitations, so the right choice depends on the research question, the available sampling frame, and practical constraints.

In this part of our Machine Learning journey, we explored different types of sampling with Python code. On Day 37, we will cover Point Estimation, including properties of estimators such as unbiasedness and consistency, as well as methods like Maximum Likelihood Estimation.


