What Is Data Sampling and Statistical Techniques for Effective Sampling in Machine Learning

May 14 · 4 min read

Data is the currency of the applied AI landscape, so it is of utmost importance that we make the best use of the data available to draw practical and effective conclusions that solve real-world problems.

One of the biggest problems we face in applied machine learning is dealing with huge amounts of data on machines with limited computational power. Our machines are all too excited to throw the dreaded “out of memory” exception at us even when dealing with *slightly* large data sets.

So how do we overcome this persistent issue? Is there a way to select and analyze a subset of the data that represents the whole well, and then extrapolate our conclusions to the entire data set?

Let me introduce some basic terminology before diving into the topic:

Population: Includes ALL possible outcomes or measurements or data points that are of interest.

Sample: A subset of observations drawn from the population.

Sampling: The process of selecting a sample from the population.

“Sampling is a statistical method that allows us to select a subset of data points from the population to analyze and characterize the whole population.”

Different types of sampling techniques:

1. Probability Sampling: Selection is random, and every element of the population has a known, non-zero chance of being selected (an equal chance in the simplest case). Hence, probability sampling gives us the best chance of creating a truly representative sample of the whole population.
2. Non-Probability Sampling: Selection is not random, so some elements have an unknown chance, or no chance at all, of being selected. Consequently, there is a significant risk of ending up with a non-representative sample that does not produce generalizable results.

For now, we’ll just look at probability sampling, as it carries less selection bias than non-probability sampling and is the more widely used of the two in applied machine learning.

Types of Probability Sampling:

1. Simple Random Sampling

In simple random sampling, each individual is chosen entirely by chance, so every individual has the same probability of being chosen at any stage of the sampling process. Simple random sampling reduces selection bias.
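
A minimal sketch of simple random sampling, using only Python's standard library. The population of record IDs and the helper name simple_random_sample are assumptions made for illustration; they are not from any particular dataset or library.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Return k items drawn without replacement; every item is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, k)


# Hypothetical population: record IDs 0..9999.
population = list(range(10_000))
sample = simple_random_sample(population, k=100, seed=0)

print(len(sample), "records sampled")
```

With a tabular data set, the same idea usually means sampling row indices and then selecting those rows.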

2. Systematic Sampling

This technique also involves randomness, but selection follows a fixed rule: after a random starting point, subjects are selected at regular intervals (for example, every 10th record) from the entire population.

Systematic sampling is more convenient than simple random sampling. However, it can introduce bias if the ordering of the population has an underlying pattern that coincides with the sampling interval.
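
Systematic sampling can be sketched as follows, assuming the population is held in a list and the interval is the population size divided by the sample size. The helper name systematic_sample is hypothetical.

```python
import random


def systematic_sample(population, k, seed=None):
    """Pick every step-th item after a random start inside the first interval."""
    if not 1 <= k <= len(population):
        raise ValueError("k must be between 1 and the population size")
    step = len(population) // k                   # sampling interval
    start = random.Random(seed).randrange(step)   # random offset in [0, step)
    return population[start::step][:k]


# Hypothetical population: record IDs 0..999.
population = list(range(1_000))
sample = systematic_sample(population, k=50, seed=1)

print(sample[:3])
```

If the records were ordered with a cycle whose length matches the interval (say, one record per weekday and an interval of 7), every selected record would fall on the same position in the cycle, which is the pattern-related bias described above.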

3. Stratified Sampling

In stratified random sampling, the entire population is divided into multiple non-overlapping, homogeneous groups (strata), and the final members are then chosen randomly from each stratum. The strata must not overlap, so that every member belongs to exactly one stratum and has an equal chance of being selected by simple random sampling within it. We use this type of sampling when we want representation from all the subgroups of the population.
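
Here is a rough sketch of stratified sampling with proportional allocation, meaning the same fraction is drawn from every stratum. Proportional allocation, the helper name stratified_sample, and the imbalanced toy data set are all assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=None):
    """Draw the same fraction from every stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for label in sorted(strata):  # sorted so results are repeatable
        members = strata[label]
        n = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical imbalanced data set: 900 "negative" rows and 100 "positive" rows.
records = [("negative", i) for i in range(900)] + [("positive", i) for i in range(100)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1, seed=7)

counts = Counter(label for label, _ in sample)
print(counts)
```

Because each stratum is sampled separately, the minority class keeps its share of the sample instead of depending on the luck of a single random draw.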

4. Cluster Sampling

In cluster sampling, we use subgroups of the population, rather than individuals, as the sampling unit. The population is divided into subgroups, known as clusters, and one or more whole clusters are randomly selected to be included in the study. This type of sampling is used when we focus on a specific region or domain.
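
Cluster sampling might look like the sketch below, where whole clusters are drawn at random and every record inside a chosen cluster is kept. The city-based clusters and the helper name cluster_sample are hypothetical.

```python
import random
from collections import defaultdict


def cluster_sample(records, cluster_of, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every record inside them."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for record in records:
        clusters[cluster_of(record)].append(record)

    # The sampling unit is the cluster label, not the individual record.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [record for label in chosen for record in clusters[label]]
    return sample, chosen


# Hypothetical data: 10 cities with 50 customers each.
records = [(f"city_{c}", i) for c in range(10) for i in range(50)]
sample, chosen = cluster_sample(records, cluster_of=lambda r: r[0], n_clusters=3, seed=3)

print(chosen, len(sample))
```

Note the contrast with stratified sampling: there, every group contributes some members; here, only the selected groups contribute, but they contribute all of their members.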

Selection Bias

Selection bias is a systematic error caused by non-random sampling of a population: some members are less likely to be included than others, which results in a biased sample.
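
To make the effect concrete, the following sketch compares a random sample with a deliberately biased one. The synthetic log-normal "income" population and the rule of keeping only the top 500 values are assumptions chosen purely for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 synthetic incomes.
population = [rng.lognormvariate(10, 0.5) for _ in range(10_000)]

# Random sample: every member is equally likely to be picked.
random_sample = rng.sample(population, 500)

# Biased sample: only the 500 largest values can be picked,
# e.g. a survey that reaches only the highest earners.
biased_sample = sorted(population)[-500:]

population_mean = statistics.fmean(population)
random_error = abs(statistics.fmean(random_sample) - population_mean)
biased_error = abs(statistics.fmean(biased_sample) - population_mean)

print(f"error of random sample mean: {random_error:.0f}")
print(f"error of biased sample mean: {biased_error:.0f}")
```

Both samples have the same size, yet the biased one misses the population mean by a much wider margin, because part of the population had no chance of being included.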

So how do we mitigate selection bias?

1. Use random methods when selecting subgroups from populations.
2. Ensure that the subgroups selected are equivalent to the population at large in terms of their key characteristics (this method offers less protection than the first, since the key characteristics are typically not known).
3. Use probability sampling techniques when sampling data.

Connect with me on LinkedIn and Medium for more articles on statistics and machine learning.

Geek Culture

A new tech publication by Start it up (https://medium.com/swlh).

Written by

Sindhu Seelam

Transitioning ML/AI Engineer. I’m passionate about learning & writing about my journey into the AI world. https://www.linkedin.com/in/sindhuseelam/
