Sampling Techniques— Statistical approach in Machine learning

Published in

Analytics Vidhya

10 min readJan 16, 2021

When we have a big dataset and excited to get started with analyzing it and building your machine learning model. Our machine gives an “out of memory” error while trying to load the dataset.

It’s happened to us most of the time when we have big dataset. Big dataset is one of the biggest hurdles we face in data science — dealing with massive amounts of data on computationally limited machines (of course we can resolve it with additional resource power).

So how can we overcome this problem? Is there a way to pick a subset of the data and analyze that — and that can be a good representation of the entire dataset?

Here comes the statistical approach to deal with bigger dataset called “Sampling”.

“Sampling is a method that allows us to get information about the population based on the statistics from a subset of the population (sample), without having to investigate every individual”

Example: When you conduct research about a group of people, it’s rarely possible to collect data from every person in that group. Instead, you select a sample. The sample is the group of individuals who will actually participate in the research.

Advantages and Challenges of Data Sampling

Sampling can be particularly useful with data sets that are too large to efficiently analyze in full — for example, in big data analytics applications or surveys. Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population.

There are many benefits to sampling compared to working with fuller or complete datasets, including reduced cost and greater speed.

An important consideration, though, is the size of the required data sample and the possibility of introducing a sampling error. In some cases, a small sample can reveal the most important information about a data set. In others, using a larger sample can increase the likelihood of accurately representing the data as a whole, even though the increased size of the sample may impede ease of manipulation and interpretation.

Sampling Framework (Steps):

In order to perform sampling, it requires that you carefully define your population and the method by which you will select (and possibly reject) observations to be a part of your data sample. This may very well be defined by the population parameters that you wish to estimate using the sample.

Some aspects to consider prior to collecting a data sample include:

Sample Goal. The population property that you wish to estimate using the sample.
Population. The scope or domain from which observations could theoretically be made.
Selection Criteria. The methodology that will be used to accept or reject observations in your sample.
Sample Size. The number of observations that will constitute the sample.

Steps involved in sampling framework:

Step 1: The first stage in the sampling process is to clearly define the target population.

Step 2: Sampling Frame — It is a list of items or people forming a population from which the sample is taken.

Step3: Generally, probability sampling methods are used because every vote has equal value and any person can be included in the sample irrespective of his caste, community, or religion. Different samples are taken from different regions all over the country.

Step 4: Sample Size — It is the number of individuals or items to be taken in a sample that would be enough to make inferences about the population with the desired level of accuracy and precision. Larger the sample size, more accurate our inference about the population would be. For the polls, agencies try to get as many people as possible of diverse backgrounds to be included in the sample as it would help in predicting the number of seats a political party can win.

Step 5: Once the target population, sampling frame, sampling technique, and sample size have been established, the next step is to collect data from the sample. In opinion polls, agencies generally put questions to the people, like which political party are they going to vote for or has the previous party done any work, etc. Based on the answers, agencies try to interpret who the people of a constituency are going to vote for and approximately how many seats is a political party going to win. Pretty exciting work, right?!

Different Types of Sampling Techniques

Probability Sampling: In probability sampling, every element of the population has an equal chance of being selected. Probability sampling gives us the best chance to create a sample that is truly representative of the population
Non-Probability Sampling: In non-probability sampling, all elements do not have an equal chance of being selected. Consequently, there is a significant risk of ending up with a non-representative sample which does not produce generalizable results

Types of Probability Sampling

1. Simple Random Sampling

This is a type of sampling technique you must have come across at some point. Here, every individual is chosen entirely by chance and each member of the population has an equal chance of being selected. Simple random sampling reduces selection bias. Simple random sampling reduces the chances of sampling error. Sampling error is lowest in this method out of all the methods.

2. Cluster Sampling

In a clustered sample, we use the subgroups of the population as the sampling unit rather than individuals. The population is divided into subgroups, known as clusters, and a whole cluster is randomly selected to be included in the study. This type of sampling is used when we focus on a specific region or area.

There are different types of cluster sampling — single stage, two stage and multi stage cluster sampling methods.

Example : In population survey clusters are identified and included in a sample based on demographic parameters like age, sex, location, etc. This makes it very simple for a survey creator to derive effective inference from the feedback.

3. Systematic Sampling

Systematic sampling is a statistical method that researchers use to zero down on the desired population they want to research. Researchers calculate the sampling interval by dividing the entire population size by the desired sample size. Systematic sampling is an extended implementation of probability sampling in which each member of the group is selected at regular periods to form a sample.

Systematic sampling is defined as a probability sampling method where the researcher chooses elements from a target population by selecting a random starting point and selects sample members after a fixed ‘sampling interval.’

For example, in school, while selecting the captain of a sports team, most of our coaches asked us to call out numbers such as 1–5 (1-n) and the students with a random number decided by the coach. For instance, three would be called out to be the captains of different teams. It is a non-stressful selection process for both the coach and the players. There’s an equal opportunity for every member of a population to be selected using this sampling technique.

For example, a researcher wants to choose 2000 people amongst the population of 10,000 people with the help of systematic sampling. He must enlist all the potential participants, and accordingly, a starting point will be selected. As soon as this list gets formed, every 5th person from the list would be selected as a participant, as 10,000/2000 = 5

Here are the types of systematic sampling:

Systematic random sampling
Linear systematic sampling
Circular systematic sampling

4. Stratified random Sampling

In this type of sampling, we divide the population into subgroups (called strata) based on different traits like gender, category, etc. And then we select the sample(s) from these subgroups. We use this type of sampling when we want representation from all the subgroups of the population. However, stratified sampling requires proper knowledge of the characteristics of the population. For example, a researcher looking to analyze the characteristics of people belonging to different annual income divisions will create strata (groups) according to the annual family income. Eg — less than $20,000, $21,000 — $30,000, $31,000 to $40,000, $41,000 to $50,000, etc. By doing this, the researcher concludes the characteristics of people belonging to different income groups. Marketers can analyze which income groups to target and which ones to eliminate to create a roadmap that would bear fruitful results.

Stratified random Sampling. Ref Image link

Types of Non Probability Sampling

1. Convenience Sampling

Convenience sampling (also known as availability sampling) method that relies on data collection from population members who are conveniently available to participate in study. Facebook polls or questions can be mentioned as a popular example for convenience sampling. Convenience sampling is a type of sampling where the first available primary data source will be used for the research without additional requirements. In other words, this sampling method involves getting participants wherever you can find them and typically wherever is convenient. In convenience sampling no inclusion criteria identified prior to the selection of subjects. All subjects are invited to participate. In business studies this method can be applied in order to gain initial primary data regarding specific issues such as perception of image of a particular brand or collecting opinions of perspective customers in relation to a new design of a product.

This method of data sampling is typically used when the availability of a sample is rare and expensive. It’s also prone to bias, since the sample may not always represent the specific characteristics needed to be studied.

This method has the advantage of being easy to carry out at a relatively low cost in a timely manner. It also allows for gathering useful data and information from a less formal list, like the methods used in probability sampling. Convenience sampling is the preferred method for pilot studies and hypothesis generation.

2. Judgmental or Purposive or selective Sampling

Judgment sampling, which is also known as purposive or selective sampling, is based on the assessment of experts in the field when choosing who to ask to be included in the sample.

In this case, let’s say you are selecting from a group of women aged 30–35, and the experts decide that only the women who have a college degree will be best suited to be included in the sample. This would be judgment sampling.

Judgment sampling takes less time than other methods, and since there’s a smaller data set, researchers should conduct interviews and other hands-on collection techniques to ensure the right type of focus group. Since judgment sampling means researchers can go directly to the target population, there’s an increased relevance of the entirety of the sample.

Judgmental or Purposive or selective Sampling. Ref Link

3. Snow ball Sampling

Existing people are asked to nominate further people known to them so that the sample increases in size like a rolling snowball. This method of sampling is effective when a sampling frame is difficult to identify. There is a significant risk of selection bias in snowball sampling, as the referenced individuals will share common traits with the person who recommends them.

4. Quota Sampling

In this type of sampling, we choose items based on predetermined characteristics of the population. For example, it could be all women in a company, or it could be voters in the age range of 18–22 years in a region, etc. This is a way of collecting samples in a fast way but leaves space for bias.

Understanding data sampling errors

When data sampling occurs, it requires those involved to make statistical conclusions about the population from a series of observations.

Because these observations often come from estimations or generalizations, errors are bound to occur. The two main types of errors that occur when performing data sampling are:

Selection bias: The bias that’s introduced by the selection of individuals to be part of the sample that isn’t random. Therefore, the sample cannot be representative of the population that is looking to be analyzed.
Sampling error: The statistical error that occurs when the researcher doesn’t select a sample that represents the entire population of data. When this happens, the results found in the sample don’t represent the results that would have been obtained from the entire population.

The only way to 100% eliminate the chance of a sampling error is to test 100% of the population. Of course, this is usually impossible. However, the larger the sample size in your data, the less extreme the margin of error will be.

Conclusion

This article will be helpful to understand different sampling methods in machine learning which will save time, reduce cost, convenient, easy to manage and helpful to understand patterns from the data.

Reference: https://www.analyticsvidhya.com/blog/2019/09/data-scientists-guide-8-types-of-sampling-techniques/

Connect with me through Linkedin and Medium for new articles and blogs.

— — — * — — — * — — — * — — — * — — — * — — — * — — — * — — —

“Develop a passion for learning. If you do, you will never cease to grow” Anthony J. D’Angelo