What Is Data Sampling? Statistical Techniques for Effective Sampling in Machine Learning

Sindhu Seelam · Geek Culture · May 14, 2021

Data is the currency of the applied AI landscape, so it is of utmost importance that we make the best use of the data available and draw practical, effective conclusions from it to solve real-world problems.

One of the biggest problems we face in applied machine learning is dealing with huge amounts of data on machines with limited computational power. Our machines are all too excited to throw the dreaded “out of memory” exception at us even when dealing with *slightly* large data sets.

So how do we overcome this persisting issue? Is there a way to select and analyze a subset of data that can be a good representation and then extrapolate our conclusions to the entire data set?

Let me introduce some basic terminology before diving into the topic:

Population: All possible outcomes, measurements, or data points that are of interest.

Sample: A subset of observations drawn from the population.

Sampling: The process of selecting such a sample.

“Sampling is a statistical method that allows us to select a subset of data points from the population to analyze and characterize the whole population.”


Different types of sampling techniques:

There are two types of sampling techniques most commonly used in machine learning. Choosing a correct and effective sampling technique is vital in determining the success or failure of a study.

  1. Probability Sampling — This sampling method is based on probability theory: every element of the population has a known, non-zero chance of being selected. Hence, probability sampling gives us the best chance to create a truly representative sample of the whole population.
  2. Non-Probability Sampling — In non-probability sampling, elements are selected by non-random criteria, so not all elements have a chance of being included. Consequently, there is a significant risk of ending up with a non-representative sample that does not produce generalizable results.

For now, we’ll just look at probability sampling as it has the least selection bias out of the two and is the most used sampling technique in applied machine learning.

Types of Probability Sampling:

There are four types of probability sampling.

1. Simple Random Sampling

In simple random sampling, each individual is chosen entirely by chance, such that every individual has the same probability of being chosen at any stage during the sampling process. Simple random sampling reduces selection bias.
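As a minimal sketch of what this looks like in practice (the DataFrame and sample size below are made up for illustration), pandas can draw a simple random sample in one call:

```python
import pandas as pd

# Hypothetical population of 10,000 observations
population = pd.DataFrame({"id": range(10_000)})

# Draw 500 rows so that every row has the same chance of selection;
# random_state makes the draw reproducible.
sample = population.sample(n=500, random_state=42)
```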

2. Systematic Sampling

This technique is also random, but the randomness follows a fixed system or formula: subjects are selected at regular intervals from the entire population, usually starting from a randomly chosen point.

Systematic sampling is more convenient than simple random sampling. However, it can introduce bias if there is an underlying pattern in the order of the items we are selecting from the population.
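A minimal sketch, again with a made-up population and a target sample size of roughly 500: pick a sampling interval k, a random start within the first interval, and then take every k-th row after it.

```python
import numpy as np
import pandas as pd

population = pd.DataFrame({"id": range(10_000)})  # hypothetical data

k = len(population) // 500                         # sampling interval
start = np.random.default_rng(42).integers(0, k)   # random starting offset
systematic_sample = population.iloc[start::k]      # every k-th row from start
```

The random start is what keeps this a probability sample; without it, the first element would always be included.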

3. Stratified Sampling

In stratified random sampling, the entire population is divided into multiple non-overlapping, homogeneous groups (strata), and final members are randomly chosen from the various strata. The strata should be mutually exclusive so that every member of the population belongs to exactly one group and gets an equal opportunity to be selected. We use this type of sampling when we want representation from all the subgroups of the population.
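As a sketch, assuming a hypothetical population with an imbalanced "segment" column, pandas (1.1+) can sample a fixed fraction from each stratum so the sample mirrors the population's group proportions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical population where segment "C" is a small minority
population = pd.DataFrame({
    "id": range(10_000),
    "segment": rng.choice(["A", "B", "C"], size=10_000, p=[0.6, 0.3, 0.1]),
})

# Sample 5% of each stratum; every segment appears in the sample
# in the same proportion as in the population.
stratified_sample = population.groupby("segment").sample(frac=0.05, random_state=42)
```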

4. Cluster Sampling

In a clustered sample, we use subgroups of the population as the sampling unit rather than individuals. The population is divided into subgroups, known as clusters, and whole clusters are randomly selected for inclusion in the study. This type of sampling is used when we focus on a specific region or domain.
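A minimal sketch under the same made-up setup, this time with the population spread across 50 clusters (say, stores or cities): we randomly select 5 whole clusters and keep every member of each.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical population assigned to 50 clusters
population = pd.DataFrame({
    "id": range(10_000),
    "cluster": rng.integers(0, 50, size=10_000),
})

# Randomly pick 5 whole clusters, then keep all of their members.
chosen = rng.choice(population["cluster"].unique(), size=5, replace=False)
cluster_sample = population[population["cluster"].isin(chosen)]
```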

Selection Bias

One of the obvious inhibitors of an effective sampling technique is selection bias. Selection bias is usually introduced when the selection of data for analysis is not properly randomized.

It is defined as a systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others and resulting in a biased sample.

So how do we mitigate selection bias?

  1. Use random methods when selecting subgroups from the population.
  2. Ensure that the subgroups selected are equivalent to the population at large in terms of their key characteristics (this method is less of a protection than the first, since the key characteristics are typically not known).
  3. Use probability sampling techniques when sampling data.


Connect with me on LinkedIn and Medium for more articles on statistics and machine learning.
