Photo by Franki Chamaki on Unsplash

Ace Statistics Step by Step for Data Science

Intro to Statistics; Sampling Techniques — with Interview Q&A

Afrin Sultana
Published in
9 min read · May 21, 2021


Why do we need to learn statistics for machine learning?

Statistics helps us analyze data and draw inferences from it, which in turn helps us understand the data. For example, statistics can tell us whether our data is skewed or normally distributed and whether it contains outliers. It lets us compute the mean, median, and mode of our data and see the range within which most data points lie. In short, it supports the exploratory data analysis (EDA) stage of machine learning, which involves a lot of data cleaning, and it also helps with feature engineering.

Statistics can be divided into two parts:

a) Descriptive Statistics: This allows us to analyze and summarize the data with the help of different plots/graphs and tables.

Graphs:

· Box plot

· Histogram

Tabular representation:

· Central Tendency (mean, median, mode)

· Standard Deviation

· Variance

· Range of data
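The summary measures listed above can be computed with Python's standard library. This is a minimal sketch: the list heights_cm holds made-up values chosen only for illustration, with one unusually tall person included to show how an outlier pulls the mean above the median.

```python
import statistics

# Hypothetical heights in centimetres (invented values, not real data).
heights_cm = [158, 162, 165, 165, 170, 172, 175, 180, 165, 198]

mean = statistics.mean(heights_cm)          # arithmetic average
median = statistics.median(heights_cm)      # middle value of the sorted data
mode = statistics.mode(heights_cm)          # most frequent value
variance = statistics.variance(heights_cm)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(heights_cm)      # sample standard deviation
data_range = max(heights_cm) - min(heights_cm)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, std_dev={std_dev:.2f}, range={data_range}")
```

A mean noticeably larger than the median, as here, is a quick hint that the data is skewed to the right or contains a high outlier.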

b) Inferential Statistics: This helps us draw conclusions about the population from sample data, after performing descriptive statistical analysis on the sample.

It helps us judge whether the sample represents the whole population well and, with the help of a confidence interval, how confident we can be in that claim.

It is also useful for comparing multiple samples from the same population to decide which one describes the population more accurately.
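To make the idea of a confidence interval concrete, here is a rough sketch that builds a 95% interval for a mean using the normal approximation (mean ± z × standard error). The simulated heights, the seed, and the sample size of 200 are all assumptions made for the example, not values from this article.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical sample: 200 simulated heights in cm (not real data).
sample = [rng.gauss(170, 10) for _ in range(200)]

n = len(sample)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# z-value that leaves 2.5% in each tail of the standard normal distribution.
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * standard_error

ci_low, ci_high = sample_mean - margin, sample_mean + margin
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval gets narrower as the sample grows, because the standard error shrinks with the square root of the sample size.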

Hypothesis testing provides several concepts and tests that help us draw such conclusions about a population from sample data:

· Null and alternative hypotheses

· Z-test

· T-test

· Chi-square test

· ANOVA and ANCOVA tests

🎯 What is Population

Population: A population is the complete set of entities (data points) that we intend to analyze.

Ex: If we want to find the average height of all the people in a country, then the heights of all the people in that country make up the population.

Image: https://www.omniconvert.com/what-is/sample-size/

🎯 What is Sample

Sample: A sample is a small collection of data points picked from the population. A good sample is a close representation of the population. A sample always contains fewer data points than the population.

Ex: Suppose we choose 1000 people from a country, compute their average height, and use that value to estimate the average height of all the people in the country.
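The height example can be simulated in a few lines. In this sketch the "country" is a list of 100,000 randomly generated heights; the distribution parameters and the seed are assumptions chosen only to make the example runnable.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated heights in cm (not real data).
population = [rng.gauss(165, 8) for _ in range(100_000)]

# Sample: 1,000 people picked at random, without replacement.
sample = rng.sample(population, 1000)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f} cm")
print(f"sample mean:     {sample_mean:.2f} cm")
```

Even though the sample holds only 1% of the population, its mean should land close to the population mean.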

🎯 Why is Sampling Required:

A population usually contains a huge volume of data, and collecting all of it is often practically impossible. Even when it is possible, it is time-consuming. Sampling makes the work easier, faster, and practical because we do not examine the whole population. Instead, we pick a reasonable number of elements from the population that can summarize it.

Note: Sample should be a close representation of the population.

🎯 How does sampling affect the analysis if it is not done properly or the wrong number of elements is chosen from the population?

As we saw, we cannot analyze the whole country’s data, so we choose a small group of people within the country who can more or less represent the country’s overall population. But we need to be sure that the sample we have chosen is not biased and correctly represents the population; otherwise, the sample will produce an incorrect result. Sample size (the number of data points in the sample) also plays a vital role in how well the sample performs.
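Both effects, bias and sample size, can be seen in a small simulation. The sketch below is illustrative only: the population is simulated, and the "biased" sample is an assumed worst case that surveys only the tallest people.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of heights in cm, simulated for illustration.
population = [rng.gauss(165, 8) for _ in range(50_000)]
true_mean = statistics.mean(population)

# Biased sample: only the 1,000 tallest people are surveyed.
biased_sample = sorted(population)[-1000:]
biased_error = abs(statistics.mean(biased_sample) - true_mean)

# Unbiased random samples of increasing size.
errors_by_size = {}
for size in (10, 100, 1000, 10_000):
    sample = rng.sample(population, size)
    errors_by_size[size] = abs(statistics.mean(sample) - true_mean)

print(f"error of biased sample (n=1000): {biased_error:.2f} cm")
for size, error in errors_by_size.items():
    print(f"error of random sample (n={size}): {error:.2f} cm")
```

The biased sample stays far from the true mean no matter how large it is, while the error of a random sample tends to shrink as the sample size grows.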

We can follow various sampling techniques to reduce bias and increase accuracy.

🎯 Different Sampling Techniques:

1) Probabilistic Sampling Techniques

2) Non-Probabilistic Sampling Techniques

1. Probabilistic Sampling Techniques

Image: https://www.questionpro.com/blog/probability-sampling/

i) Simple random sampling: Here we choose data points from the population purely at random to create the sample, so every data point has an equal chance of being selected. If we have no prior knowledge of the population, this is an easy yet effective way to create a sample.

Simple random sampling (Image: By Author)

Here I am choosing randomly and creating a sample that can more or less summarize the population.

Ex: Suppose you have one minute to grab as many candies as you can from a shop. You would pick candies at random without following any rule, and that is how random sampling works.

Problem: If the population comprises heterogeneous groups (e.g., male, female, old, young, students, professionals) and one group (say, male) far outnumbers the others, a random sample, especially a small one, may be dominated by that group while smaller groups are under-represented or missed entirely, so the sample may not represent the population well. In the candy example, if there are far more red candies than any other color, the probability of picking a red candy is much higher than that of the others, and the rarer colors may not appear in the sample at all.
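Here is a small sketch of simple random sampling and of the problem described above. The shop with 90 red and 10 blue candies is a hypothetical setup; the last lines compute the exact chance that a random sample of 10 contains no blue candy at all.

```python
import math
import random
from collections import Counter

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical candy shop: 90 red candies and 10 blue ones.
population = ["red"] * 90 + ["blue"] * 10

# Simple random sampling: every candy has the same chance of being picked.
sample = rng.sample(population, 10)
print(Counter(sample))

# Probability that a sample of 10 contains no blue candy:
# ways to choose 10 from the 90 red / ways to choose 10 from all 100.
p_no_blue = math.comb(90, 10) / math.comb(100, 10)
print(f"chance of missing blue entirely: {p_no_blue:.2f}")
```

With these numbers, roughly one sample in three would contain no blue candy, which is the kind of under-representation that stratified sampling is designed to avoid.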

ii) Systematic Sampling: This technique is similar to random sampling, except that only the first data point is selected at random; the remaining ones are then chosen at regular intervals. In other words, the data points are selected in a systematic pattern.

Systematic Sampling (Image: By Author)

Ex: Suppose we are choosing a few letters from A to Z at an interval of 5. If we randomly start from the letter “B” and then pick every fifth letter, we get B, G, L, Q, V. This kind of sampling follows a pattern, which is why it is called systematic sampling.

Problem: Suppose a population contains both male and female records, with all the female records at even positions and all the male records at odd positions. If we apply systematic sampling and pick every even-positioned data point, the sample will be highly biased and will not summarize the overall population.
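Systematic sampling is easy to express with list slicing. In the sketch below, systematic_sample is a hypothetical helper written for this example; the start index is fixed in the two calls so the results are repeatable, whereas in practice it would be chosen at random.

```python
import random
import string

def systematic_sample(items, step, start=None, rng=random):
    """Pick every `step`-th item, beginning at a random (or given) start index."""
    if start is None:
        start = rng.randrange(step)  # random start within the first interval
    return items[start::step]

# Alphabet example: start at "B" (index 1) and take every fifth letter.
letters = list(string.ascii_uppercase)
every_fifth = systematic_sample(letters, step=5, start=1)
print(every_fifth)

# Periodic population: male at odd positions, female at even positions.
# With a step of 2, the sample contains only one gender.
alternating = ["M", "F"] * 10
periodic_sample = systematic_sample(alternating, step=2, start=1)
print(set(periodic_sample))
```

The second call shows why the sampling interval should not line up with a repeating pattern in the data.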

iii) Stratified Sampling: This technique is very effective when the population contains distinct groups. We divide the population into smaller groups called strata based on categories (e.g., age, gender, qualification, hobbies), then apply random or systematic sampling within each stratum to pick some of its elements. The number of elements picked from each stratum is proportional to that stratum's share of the population. Finally, we combine the elements collected from the different strata into a sample that reflects all the variety in the population.

Stratified Sampling (Image: By Author)

Problem: This process is time-consuming and challenging, but it can provide precision. We need to decide how many strata we want, and if the data is hugely diverse, this can become a tedious job.
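Proportional stratified sampling can be sketched as follows. The function stratified_sample and the population of students and professionals are hypothetical, written only to illustrate the idea; a production version would also need to handle rounding so that the quotas always add up to the requested size.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, sample_size, rng):
    """Allocate `sample_size` across strata in proportion to their size."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        # Each stratum's quota matches its share of the population.
        # Note: rounding can make the total differ slightly from sample_size.
        quota = round(sample_size * len(members) / len(records))
        sample.extend(rng.sample(members, quota))  # random pick inside the stratum
    return sample

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical population: 700 students and 300 professionals.
population = [("student", i) for i in range(700)] + [("professional", i) for i in range(300)]

sample = stratified_sample(population, key=lambda r: r[0], sample_size=100, rng=rng)
students_in_sample = sum(1 for group, _ in sample if group == "student")
print(len(sample), students_in_sample)
```

Because each stratum is sampled separately, the 70/30 split of the population is preserved in the sample by construction rather than by chance.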

iv) Cluster Sampling: It can be of two types:

a) Single-Stage Cluster Sampling: Here the population is divided into subgroups (clusters), each ideally as diverse as the population itself. Unlike strata, which are internally homogeneous, clusters are heterogeneous and each should resemble the population in miniature. Once the clusters are formed, we pick some of them using random or systematic sampling, and every element of the selected clusters goes into the sample. Because the clusters are chosen at random, it is still worth checking, for example with hypothesis testing techniques, how closely the selected clusters represent the population.

Single staged Cluster Sampling (Image: By Author)

b) Double-Stage Cluster Sampling: In this technique, after shortlisting clusters by random or systematic sampling, we do not take whole clusters as the sample; instead, we randomly choose a few elements from each shortlisted cluster and combine them to form the sample.

Double Staged Clustering (Image: By Author)
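The two variants can be compared side by side. This sketch follows the common textbook definitions, which is an assumption on my part: single-stage keeps every element of the randomly chosen clusters, while double-stage draws a random subset from each chosen cluster. The city blocks and households are invented for the example.

```python
import random

rng = random.Random(5)  # fixed seed so the sketch is repeatable

# Hypothetical population: 20 city blocks (clusters) of 50 households each.
clusters = {
    block: [f"block{block}-house{house}" for house in range(50)]
    for block in range(20)
}

# Stage 1 (both variants): randomly pick 4 of the 20 clusters.
chosen_blocks = rng.sample(sorted(clusters), 4)

# Single-stage: every household in the chosen clusters joins the sample.
single_stage = [house for block in chosen_blocks for house in clusters[block]]

# Double-stage: randomly pick 10 households inside each chosen cluster.
double_stage = [
    house
    for block in chosen_blocks
    for house in rng.sample(clusters[block], 10)
]

print(len(single_stage), len(double_stage))
```

Double-stage sampling trades some coverage within each cluster for a smaller, cheaper sample.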

2. Non-Probabilistic Sampling Techniques

Image: https://www.canstockphoto.com/nonprobability-sampling-11370870.html

i) Convenience Sampling: This type of sampling is based on convenience: we rely on whatever data is easily accessible for our study. It is easy but not reliable, and it often fails to represent the whole population.

Ex: Suppose we are going to open a dance academy in an area, and we want local people to help us decide which types of dance the academy should offer. We ask only the people sitting in the front row and take their opinion. This data is convenient for a data collector to gather, but it does not represent the whole population.
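The front-row example can be sketched with a tiny invented audience. The rows and dance preferences below are hypothetical and deliberately arranged so that the front row differs from everyone else.

```python
from collections import Counter

# Hypothetical audience as (row, preferred dance form) pairs; values are invented.
audience = (
    [(1, "classical")] * 8
    + [(2, "hip-hop")] * 8
    + [(3, "hip-hop")] * 8
    + [(4, "contemporary")] * 8
)

# Convenience sample: only the people who are easiest to reach (the front row).
convenience_sample = [person for person in audience if person[0] == 1]

population_view = Counter(form for _, form in audience)
sample_view = Counter(form for _, form in convenience_sample)

print("whole audience:", population_view)
print("front row only:", sample_view)
```

The front row suggests that everyone wants classical dance, while the full audience actually prefers hip-hop.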

ii) Voluntary Sampling: This is similar to convenience sampling, but here we do not choose the elements ourselves; instead, individuals volunteer to take part in the sample.

Ex: For the same dance academy, if we try voluntary sampling, a few enthusiasts will come forward and share their opinions on certain dance forms, and we summarize what people in the area want based on those opinions. The problem is that this technique can be highly biased and show only part of what the actual population wants: if non-volunteers outnumber volunteers in the population, we miss the majority opinion, which can distort the result.
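Self-selection bias can be simulated by giving different groups different chances of volunteering. Everything in this sketch is assumed for illustration: the 20/80 split of residents, the response rates, and the seed.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is repeatable

# Hypothetical residents: 20% are dance enthusiasts, 80% are casual.
residents = ["enthusiast"] * 200 + ["casual"] * 800

# Assumed response rates: enthusiasts volunteer far more often.
response_rate = {"enthusiast": 0.60, "casual": 0.05}

# Each resident decides independently whether to volunteer.
volunteers = [r for r in residents if rng.random() < response_rate[r]]

share_in_population = residents.count("enthusiast") / len(residents)
share_in_sample = volunteers.count("enthusiast") / len(volunteers)

print(f"enthusiasts in population: {share_in_population:.0%}")
print(f"enthusiasts among volunteers: {share_in_sample:.0%}")
```

Enthusiasts are a minority of the residents but make up most of the volunteers, so the sample overstates their views.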

iii) Purposive or Judgmental Sampling: Here the sample is chosen by the people performing the analysis, so there is a high chance that their own preferences bias it. This kind of sampling is also very study-specific.

Ex: Again, for the same dance academy, if we take the purposive or judgmental approach, we might ask people only about their affinity for contemporary dance. Only those who like contemporary dance are selected and the rest are excluded from the sample, so it may not give real insight into what people think about other dance forms.
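In code, purposive sampling amounts to a filter chosen by the researcher. The respondents below are made-up names and preferences used only to show what gets left out.

```python
# Hypothetical survey respondents as (name, favourite dance form) pairs.
respondents = [
    ("Asha", "contemporary"),
    ("Ben", "salsa"),
    ("Chen", "contemporary"),
    ("Dia", "hip-hop"),
    ("Eli", "salsa"),
]

# Purposive sample: the researcher deliberately keeps only contemporary fans.
purposive_sample = [r for r in respondents if r[1] == "contemporary"]

# Dance forms that no longer appear anywhere in the sample.
all_forms = {form for _, form in respondents}
sampled_forms = {form for _, form in purposive_sample}
excluded_forms = all_forms - sampled_forms

print(purposive_sample)
print("forms with no voice in the sample:", excluded_forms)
```

The sample answers the researcher's specific question well but says nothing about salsa or hip-hop.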

iv) Snowball Sampling: Here we first pick an element from the population at random and then let that element nominate the next element to join the sample. The sample grows like a rolling snowball, which is why it is called snowball sampling. This technique can be used when we cannot determine any pattern within the population, and it is typically applied when members of the population are hard to identify or reach directly.

Ex: For the same dance academy, if we choose the snowball sampling approach, we can randomly pick a first person who favors contemporary dance and ask them to nominate the next person who shares the same interest. This works as a chain of referrals, and the sample size increases with every referral.
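The referral chain can be modelled as a walk over a "who refers whom" mapping. In this sketch, snowball_sample is a hypothetical helper and the referral network is invented; the first person is passed in explicitly so the result is repeatable, whereas in practice they would be picked at random.

```python
from collections import deque

def snowball_sample(referrals, seed, max_size):
    """Grow a sample by following referrals breadth-first from one seed person."""
    sample, queue = [seed], deque([seed])
    while queue and len(sample) < max_size:
        person = queue.popleft()
        for referred in referrals.get(person, []):
            if referred not in sample and len(sample) < max_size:
                sample.append(referred)
                queue.append(referred)
    return sample

# Hypothetical referral network: each person names others who like contemporary dance.
referrals = {
    "Asha": ["Ben", "Chen"],
    "Ben": ["Dia"],
    "Chen": ["Dia", "Eli"],
    "Dia": ["Farah"],
    "Gus": [],  # nobody refers Gus, so the chain can never reach him
}

first_five = snowball_sample(referrals, seed="Asha", max_size=5)
everyone_reachable = snowball_sample(referrals, seed="Asha", max_size=10)
print(first_five)
print(everyone_reachable)
```

People outside the referral network, like Gus here, can never enter the sample, which is the main source of bias in this technique.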

Key Outcomes

For every aspiring data scientist, statistics is the first stepping stone. To gain a clear and in-depth understanding of statistics, one must explore its various concepts. Every detailed model in machine learning or data science begins with a suitable sample. There is no right or wrong sampling technique per se; it always depends on the kind of problem we are dealing with. Understanding the problem statement and the given data, and then choosing the right sampling technique for the situation, gives us an edge in building an efficient model. In this article, we have explored various sampling techniques and their pros and cons.

Upcoming Article:

Confidence Interval; Outliers; Central Limit Theorem; Normal Distribution; Standard Normal Distribution

Keep learning, Keep growing! 😃

Connect with me — https://www.linkedin.com/in/afrinsultana2404/


Afrin Sultana
Data Arena

Passionate and self-driven working professional with a keen interest in Data Science. Trying to contribute in my own way while still diving deep into the world of data.