Statistically Speaking: Various Sampling Methods

Antarip Giri
Published in CodeX · 7 min read · Jun 1, 2024

This is part 2 of my series Statistically Speaking (Part 1), where I talk about statistics and the important role it plays in understanding Data Science topics.

In this tutorial we will talk about the following:

  1. A case study where a simple random sample is not feasible.
  2. The different sampling methods apart from simple random sampling.
  3. Sample of convenience.
  4. Quota sampling.
  5. Subjective sampling.
  6. Snowball sampling.
  7. Stratified sampling.
  8. Systematic sampling.
  9. Clustered sampling.
  10. What is bias in sampling?

Is simple random sampling always feasible?

Let’s understand sampling methods other than simple random sampling by first looking at case studies and then defining the method each one illustrates.

Case Study 1: An electric vehicle (EV) mass-manufacturing company receives a single delivery of 100,000 battery packs for its four-wheeler electric vehicles. The supplier contracted for the order states in its specification that every battery pack in the batch has a capacity of 24 kWh ± 0.5%. At unloading, the battery packs are stacked in the company’s storage facility as a row of 1,000 vertical columns of 100 packs each. Suppose a quality control engineer at the vehicle company wants to verify the quality of the batteries. Is it possible to pick individual battery packs at random for checking? No: it would be impossible to access the packs in the middle of the pile. Hence, the engineer takes the 1,000 battery packs at the top of the columns and tests their quality.

The sampling described above is called a sample of convenience. It is often the easiest way to sample, and how it is carried out is at the discretion of the team responsible for sampling. However, it may lead to significant bias in many cases. How? Let’s look at another example.
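To see how a sample of convenience can mislead, here is a minimal simulation sketch of the battery delivery. It assumes, purely for illustration, that packs higher up a stack have a slightly higher capacity; the `capacity` function, its drift value and the noise level are hypothetical and not part of the case study.

```python
import random
import statistics

random.seed(0)

N_STACKS, STACK_HEIGHT = 1000, 100  # 1,000 stacks of 100 packs = 100,000 packs

def capacity(level):
    """Hypothetical capacity (kWh) of a pack at a given level (0 = bottom).

    Assumption for illustration only: capacity rises slightly with height.
    """
    return 23.9 + 0.002 * level + random.gauss(0, 0.03)

# population[s][l] is the capacity of the pack at level l of stack s
population = [[capacity(level) for level in range(STACK_HEIGHT)]
              for _ in range(N_STACKS)]
all_packs = [c for stack in population for c in stack]

# Sample of convenience: the top pack of every stack (1,000 packs)
convenience = [stack[-1] for stack in population]

# Simple random sample of the same size, drawn from the whole delivery
simple_random = random.sample(all_packs, 1000)

true_mean = statistics.mean(all_packs)
print(f"population mean   : {true_mean:.3f} kWh")
print(f"convenience mean  : {statistics.mean(convenience):.3f} kWh")
print(f"random-sample mean: {statistics.mean(simple_random):.3f} kWh")
```

Under this assumption the top-of-stack sample overstates the average capacity, while the simple random sample stays close to it. If capacity did not depend on position at all, the two samples would behave alike, which is exactly what the engineer cannot know in advance.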

Case Study 2: Instead of battery packs, let’s say we are doing a human study and need volunteers. Instead of looking for random volunteers, we pick and choose people we know who are willing to participate. This may lead to significant bias, as the people who participate in the study might have different characteristics than the people missing from it.

**This significant bias is always a risk in studies that use non-random (also called non-probabilistic) sampling methods.**

Case Study 3: A TV production company wants to interview 100 people about their responses to its show. The show is targeted primarily at young audiences. Hence, the company chooses its interviewees by age group: 80 people aged 18–30, 10 people aged 16–18 and 10 people aged 30–40. In this case it makes sense to sample this way rather than to take a simple random sample.

The study above predefines which parts of the population the sample should come from. This exposes it to significant bias, because the people within each group are not chosen at random. The type of sampling described in Case Study 3 is called quota sampling. Here, the team is given a predefined number of subjects to recruit from each group of the population.
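The quota logic of Case Study 3 can be sketched in a few lines. This is only an illustration: the stream of `viewers` is hypothetical, and because the age ranges in the case study overlap at 18 and 30, the group boundaries used here (16-17, 18-30, 31-40) are an assumption.

```python
import random

random.seed(1)

# Quotas from Case Study 3. The source ranges overlap at 18 and 30, so the
# boundaries below (16-17, 18-30, 31-40) are an assumption.
QUOTAS = {"16-17": 10, "18-30": 80, "31-40": 10}

def age_group(age):
    """Map an age to its quota group, or None if it is outside all groups."""
    if 16 <= age <= 17:
        return "16-17"
    if 18 <= age <= 30:
        return "18-30"
    if 31 <= age <= 40:
        return "31-40"
    return None

def quota_sample(respondents, quotas):
    """Accept respondents in arrival order until every quota is filled."""
    selected = {group: [] for group in quotas}
    for person in respondents:
        group = age_group(person["age"])
        if group is not None and len(selected[group]) < quotas[group]:
            selected[group].append(person)
        if all(len(selected[g]) == quotas[g] for g in quotas):
            break
    return selected

# Hypothetical stream of viewers who happen to be available for an interview
viewers = [{"id": i, "age": random.randint(12, 60)} for i in range(5000)]
sample = quota_sample(viewers, QUOTAS)
print({group: len(people) for group, people in sample.items()})
```

Note that nothing here is random from the sampler's side: whoever turns up first fills the quota, which is where the bias comes from.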

Case Study 4: A deep learning research team is trying to detect a lung disease, let’s say emphysema, from high-resolution CT (HRCT) images. The team decides to take its CT images from a specialty lung disease hospital, as it feels this would yield the largest number of diseased-lung CTs. This is an educated decision for teaching the deep learning model to detect a particular disease. However, as the hospital only deals with critical lung cases, the lack of healthy or mild CT images causes a huge selection bias in what the model learns. On the other hand, it is quite difficult to obtain CT scans of healthy lungs for various reasons, financial or otherwise.

This type of sampling is called subjective (or selective) sampling, where the researcher or team decides which data to sample: either a sample they judge representative enough for the study’s needs or, as here, CT images that have certain useful characteristics. Such a study is prone to errors from the team’s judgement bias, and the sample is not necessarily representative. This also falls under non-random sampling techniques.

Case Study 5: A fish-food company wants to test its product on a variety of fish to study its market and viability. The team contacts a fish farmer who has a farm nearby. He takes part in the study, and the team asks him to recommend fellow fish farmers. The next batch of fish farmers is in turn asked to recommend fellow farmers. This seems to be a very useful way of sampling for the study. However, the farmers recommended by the previous sample might farm similar kinds of fish, which leads to selection bias. So why sample this way? Sometimes it is difficult to reach farmers and get them to agree to take part in the study, so personal recommendations are a safe bet.

This type of sampling is called snowball sampling, where the sample keeps growing in the same direction and lacks randomness. It is also a non-random sampling technique.
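As a rough sketch of how such a sample grows, the snippet below follows recommendations wave by wave through a made-up referral network. The network, the number of recommendations per farmer and the function name `snowball_sample` are all assumptions for illustration.

```python
import random

random.seed(2)

# Hypothetical referral network: each of 200 farmers recommends 3 others.
farmers = list(range(200))
referrals = {f: random.sample([g for g in farmers if g != f], 3) for f in farmers}

def snowball_sample(seed_farmer, referrals, target_size):
    """Grow the sample wave by wave by following recommendations."""
    sampled = [seed_farmer]
    seen = {seed_farmer}
    wave = [seed_farmer]
    while wave and len(sampled) < target_size:
        next_wave = []
        for farmer in wave:
            for recommended in referrals[farmer]:
                if recommended not in seen and len(sampled) < target_size:
                    seen.add(recommended)
                    sampled.append(recommended)
                    next_wave.append(recommended)
        wave = next_wave
    return sampled

sample = snowball_sample(seed_farmer=0, referrals=referrals, target_size=30)
print(sample)
```

Every farmer in the sample is reachable from the first one, so the result depends heavily on who that first farmer is and whom he knows.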

Case Study 6: A cement factory has 10 different units, each producing 100 bags of cement an hour. Instead of choosing 100 bags for quality assurance from the 1,000 bags produced each hour, the quality assurance engineer groups 10 bags produced in the same hour (chosen at random from the 100 bags of each unit) into one cluster. So if the factory runs for 10 hours, the whole day’s production is represented by 10 clusters, and testing is done per cluster rather than per bag. This way the cement manufacturer knows its production quality at different times of the day and can make the necessary changes.

This type of sampling is called single-stage cluster sampling, where each cluster’s outcome is treated as one sample. A slight variation is two-stage cluster sampling, where items are in turn randomly sampled from within the chosen clusters. Cluster sampling becomes quite important when the population is too large and too spread out for simple random sampling to be feasible.
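For comparison, here is a minimal sketch of cluster sampling in its usual textbook form, which differs slightly from the factory setup above: whole clusters are picked at random (one stage), and optionally items are then sampled at random inside each picked cluster (two stages). The hourly clusters, bag IDs and function names are hypothetical.

```python
import random

random.seed(3)

# Hypothetical day of production: 10 hourly clusters of 1,000 bag IDs each.
clusters = {hour: [f"h{hour}-bag{i}" for i in range(1000)] for hour in range(10)}

def one_stage_cluster_sample(clusters, n_clusters):
    """Pick whole clusters at random and keep every item in them."""
    chosen = random.sample(sorted(clusters), n_clusters)
    return {c: list(clusters[c]) for c in chosen}

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster):
    """Pick clusters at random, then sample items at random within each one."""
    chosen = random.sample(sorted(clusters), n_clusters)
    return {c: random.sample(clusters[c], n_per_cluster) for c in chosen}

one_stage = one_stage_cluster_sample(clusters, n_clusters=3)
two_stage = two_stage_cluster_sample(clusters, n_clusters=3, n_per_cluster=10)
print({hour: len(bags) for hour, bags in one_stage.items()})
print({hour: len(bags) for hour, bags in two_stage.items()})
```

The second stage is what keeps the testing effort manageable when each cluster is itself large.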

Case Study 7: A battery pack manufacturer has 3 plants that manufacture the same model (model IS-246). The plant in city A manufactures 100 battery packs, the plant in city B 200 and the plant in city C 300. Now suppose the company headquarters, situated in city D, wants to check quality. It would be beneficial to randomly select battery packs from each plant in proportion to its production, as this gives a better representation of how the plants are working. The team can choose 10 packs from city A, 20 from city B and 30 from city C (maintaining the same proportions as production). This prevents any plant from being over- or under-represented.

This type of sampling is called stratified sampling. It reduces sampling bias and increases sampling accuracy. However, the team must know in advance how the population divides into strata (here, each plant’s share of production).
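The proportional allocation from Case Study 7 can be sketched as follows. The plant sizes come from the case study; the pack IDs and the function name `stratified_sample` are made up for the example.

```python
import random

random.seed(4)

# Production per plant from Case Study 7 (pack IDs are hypothetical).
strata = {
    "A": [f"A-{i}" for i in range(100)],
    "B": [f"B-{i}" for i in range(200)],
    "C": [f"C-{i}" for i in range(300)],
}

def stratified_sample(strata, total_sample_size):
    """Draw a simple random sample from each stratum, sized proportionally."""
    population_size = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        # Proportional allocation; rounding may need adjusting when the
        # shares do not divide evenly (they do in this example).
        n = round(total_sample_size * len(items) / population_size)
        sample[name] = random.sample(items, n)
    return sample

sample = stratified_sample(strata, total_sample_size=60)
print({plant: len(packs) for plant, packs in sample.items()})
```

With a total sample of 60, the allocation works out to 10, 20 and 30 packs for plants A, B and C, matching the case study. Within each plant the packs are still chosen at random, which is what separates this from quota sampling.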

Case Study 8: A factory produces 1,000 pieces of the same resistor (200 kΩ) a day in two separate batches. The resistors are then mixed and stacked together, and once they are mixed it is impossible to know which resistor came from which batch. The team needs 100 resistors a day to check quality. Thus the team decides to pick every 10th resistor and check its thermal properties and resistance.

This type of sampling is called systematic sampling. To select a sample of z members from a population of n, the team selects every (n/z)th member of the population; here 1,000/100 = 10, so every 10th resistor. This is one of the easiest and most straightforward methods of probability sampling. However, it can lead to inadequate representation of certain characteristics. It may also be biased by underlying patterns in the population if a pattern repeats with the same period as the sampling interval.
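Here is a small sketch of systematic sampling for the resistor example, where the sampling interval is the population size divided by the sample size. The random starting point is a common refinement that the case study does not mention, and the "every 10th resistor is faulty" pattern is a made-up scenario to show the periodicity pitfall.

```python
import random

random.seed(5)

def systematic_sample(population, sample_size):
    """Take every k-th item, k = n // z, starting from a random offset."""
    k = len(population) // sample_size  # sampling interval
    start = random.randrange(k)         # random start within the first interval
    return population[start::k][:sample_size]

resistors = list(range(1, 1001))  # hypothetical IDs for the day's 1,000 resistors
sample = systematic_sample(resistors, 100)
print(len(sample), sample[:5])

# Pitfall: a pattern whose period equals the interval. Suppose (hypothetically)
# every 10th resistor is faulty; the sample then holds all faulty ones or none.
faulty = {r for r in resistors if r % 10 == 0}
share_faulty = sum(r in faulty for r in sample) / len(sample)
print(f"share of faulty resistors in the sample: {share_faulty:.0%}")
```

Because the sample steps through the stack in lockstep with the fault pattern, the estimated fault rate can only be 0% or 100%, while the true rate in this scenario is 10%.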

In the past eight case studies we have been talking about a phenomenon called bias.

What is bias in sampling?

Bias can be defined as the difference between the expected value (mean) of a statistic used as an estimator and the true value of the parameter it estimates. In short, it is the systematic error that creeps in when our estimates come from a sample that was not selected purely by chance. Bias is often discussed alongside precision, which measures how closely repeated estimates agree with one another; an estimator that is both unbiased and precise lands close to the real parameter value. There are many sources of bias in sampling methods:

  • Replacing selected members for convenience or because they are damaged, or omitting members altogether.
  • A low number of participants in the study.
  • An outdated list of the members of the population.
  • Predefined sampling rules that are set at the discretion of the team.
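To make the idea of bias concrete, the sketch below estimates it by simulation: the same sampling procedure is repeated many times, and the average of the sample means is compared with the true mean. The population, the "outdated list" that misses the largest 20% of values, and the function name `bias_and_spread` are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(6)

# Hypothetical population: 10,000 measurements, sorted from smallest to largest.
population = sorted(random.gauss(50, 10) for _ in range(10_000))
true_mean = statistics.mean(population)

# Assumption for illustration: an outdated member list misses the 2,000
# largest values, so samples drawn from it can never contain them.
outdated_list = population[:8_000]

def bias_and_spread(frame, sample_size=50, repetitions=2_000):
    """Repeat the sampling many times and summarise the sample means."""
    estimates = [statistics.mean(random.sample(frame, sample_size))
                 for _ in range(repetitions)]
    bias = statistics.mean(estimates) - true_mean  # systematic error
    spread = statistics.stdev(estimates)           # smaller spread = higher precision
    return bias, spread

bias_full, spread_full = bias_and_spread(population)
bias_outdated, spread_outdated = bias_and_spread(outdated_list)
print(f"complete list: bias {bias_full:+.2f}, spread {spread_full:.2f}")
print(f"outdated list: bias {bias_outdated:+.2f}, spread {spread_outdated:.2f}")
```

In this setup the outdated list gives estimates that agree with each other more closely, yet they sit well below the true mean. That is a useful reminder that a precise estimate can still be a biased one.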

In this tutorial, we looked at various case studies corresponding to sampling methods, both random (probabilistic) and non-random (non-probabilistic). Then we briefly discussed bias and precision to understand the meaning of the terms.

In the next tutorial, we will discuss the types of population and then some ways to understand our sampled data a little better.

Also, thank you for reading so far. 💪 Have a small tea or coffee break. 😃 If you like 👍 the article, do like it, and comment 👀 if you need a specific blog post on this topic. Also, feel free to share it with friends, colleagues and students who you feel might benefit from this article.


A Machine Learning and Data Science practitioner who talks about all things data.