Statistics For Data Science with Python — Sampling (1/10)

Andre Vianna

Published in

My Data Science Journey

9 min read · Nov 6, 2021

Let’s Show Sampling Techniques

A little bit of the Statistical Data Science Journey

source: https://ashutoshtripathi.com/statistics-for-data-science/

Below are some statistical sampling techniques

Steps involved in Sampling

I firmly believe visualizing a concept is a great way to ingrain it in your mind. So here’s a step-by-step process of how sampling is typically done, in flowchart form!

Let’s take an interesting case study and apply these steps to perform sampling. India held its General Elections a few months ago, and you must have seen the public opinion polls every news channel was running at the time:

Step 1

The first stage in the sampling process is to clearly define the target population.

So, to carry out opinion polls, polling agencies include only people who are at least 18 years old and eligible to vote.

Step 2

Sampling Frame — It is a list of items or people forming a population from which the sample is taken.

So, the sampling frame would be the list of all the people whose names appear on the voter list of a constituency.

Step 3

The third step is to choose a sampling technique. Generally, probability sampling methods are used because every vote has equal value and any person can be included in the sample irrespective of caste, community, or religion. Different samples are taken from different regions all over the country.

Step 4

Sample Size — It is the number of individuals or items to be taken in a sample that would be enough to make inferences about the population with the desired level of accuracy and precision.

All else being equal, the larger the sample size, the more precise our inference about the population will be.

For the polls, agencies try to get as many people as possible of diverse backgrounds to be included in the sample as it would help in predicting the number of seats a political party can win.
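To make the link between sample size and precision concrete, here is a minimal sketch. It assumes a simple random sample and the usual normal approximation for a proportion; the function name `margin_of_error` and the 50% vote share are illustrative choices, not something the polling agencies above are known to use.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# Assumed vote share of 50%, which gives the widest margin for a given n.
for n in (100, 1_000, 10_000):
    print(n, round(margin_of_error(0.5, n), 4))
```

Because the margin shrinks with the square root of n, quadrupling the sample size only halves it, so the gain from each extra respondent gets smaller as the sample grows.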

Step 5

Once the target population, sampling frame, sampling technique, and sample size have been established, the next step is to collect data from the sample.

In opinion polls, agencies generally ask people questions such as which political party they are going to vote for, or whether the previous party has done any work.

Based on the answers, agencies try to interpret who the people of a constituency are going to vote for and approximately how many seats a political party is going to win. Pretty exciting work, right?!

Different Types of Sampling Techniques

Here comes another diagrammatic illustration! This one talks about the different types of sampling techniques available to us:

I) Probability Sampling: In probability sampling, every element of the population has a known, non-zero chance of being selected, and in the simplest designs that chance is equal for everyone. Probability sampling gives us the best chance to create a sample that is truly representative of the population.

II) Non-Probability Sampling: In non-probability sampling, selection does not rely on random chance, so some elements may have little or no chance of being selected. Consequently, there is a significant risk of ending up with a non-representative sample which does not produce generalizable results.

For example, let’s say our population consists of 20 individuals. Each individual is numbered from 1 to 20 and is represented by a specific color (red, blue, green, or yellow). Each person would have odds of 1 out of 20 of being chosen in probability sampling.

With non-probability sampling, these odds are not equal. A person might have a better chance of being chosen than others. So now that we have an idea of these two sampling types, let’s dive into each and understand the different types of sampling under each section.

I) Types of Probability Sampling

1) Simple Random Sampling

This is a type of sampling technique you must have come across at some point. Here, every individual is chosen entirely by chance and each member of the population has an equal chance of being selected.

Simple random sampling reduces selection bias.

One big advantage of this technique is that it is the most direct method of probability sampling. But it comes with a caveat: it may not select enough individuals with the characteristics we are interested in, especially when those characteristics are rare. Monte Carlo methods use repeated random sampling for the estimation of unknown parameters.
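As a small illustration, the sketch below draws a simple random sample of 5 from the 20 numbered individuals used in the earlier example. The seed value 42 is an arbitrary assumption, there only to make the draw repeatable.

```python
import random

population = list(range(1, 21))        # individuals numbered 1 to 20

rng = random.Random(42)                # arbitrary fixed seed for reproducibility
sample = rng.sample(population, k=5)   # 5 individuals, drawn without replacement

print(sorted(sample))
```

`random.sample` draws without replacement, so no individual can appear twice in the sample.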

2) Systematic Sampling

In this type of sampling, the first individual is selected randomly and others are selected using a fixed ‘sampling interval’. Let’s take a simple example to understand this.

Say our population size is x and we need a sample of size n. The sampling interval is then x/n, so the next individual we select is x/n positions after the first one. We select the rest in the same way.

Suppose we began with person number 3 and we want a sample size of 5. The next individual we select would be at an interval of (20/5) = 4 from the 3rd person, i.e. 7 (3+4), and so on.

3, 3+4=7, 7+4=11, 11+4=15, 15+4=19, which gives the sample 3, 7, 11, 15, 19

Systematic sampling is more convenient than simple random sampling. However, it can lead to bias if the list we select from has an underlying periodic pattern that lines up with the sampling interval (though that is fairly uncommon).
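The worked example with 20 individuals, a start at person 3 and a sample size of 5 can be sketched as follows. The helper name `systematic_sample` is hypothetical, and the start is passed in explicitly so the result matches the example.

```python
def systematic_sample(population, start, n):
    """Pick n items: the one at 1-based position `start`, then every k-th after it."""
    k = len(population) // n                 # sampling interval, here 20 // 5 = 4
    if not 1 <= start <= k:
        raise ValueError("start must fall inside the first interval")
    return [population[start - 1 + i * k] for i in range(n)]

population = list(range(1, 21))
print(systematic_sample(population, start=3, n=5))   # [3, 7, 11, 15, 19]
```

In practice the start would itself be drawn at random from the first interval, for example with `random.randint(1, k)`.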

3) Cluster Sampling

In a clustered sample, we use the subgroups of the population as the sampling unit rather than individuals. The population is divided into subgroups, known as clusters, and a whole cluster is randomly selected to be included in the study:

In the above example, we have divided our population into 5 clusters. Each cluster consists of 4 individuals and we have taken the 4th cluster in our sample. We can include more clusters as per our sample size.
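Here is a minimal sketch of the same idea in code. It assumes the 5 clusters are formed from consecutive individuals (1 to 4, 5 to 8, and so on), which the text does not state, and the seed is arbitrary.

```python
import random

population = list(range(1, 21))
cluster_size = 4

# Split the population into 5 consecutive clusters of 4 individuals each.
clusters = [population[i:i + cluster_size]
            for i in range(0, len(population), cluster_size)]

print(clusters[3])               # the 4th cluster: [13, 14, 15, 16]

rng = random.Random(0)           # arbitrary seed
chosen = rng.choice(clusters)    # one whole cluster, picked at random, is the sample
print(chosen)
```

The unit being randomized is the cluster, not the individual: once a cluster is chosen, all of its members enter the sample.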

II) Types of Non-Probability Sampling

1) Convenience Sampling

This is perhaps the easiest method of sampling because individuals are selected based on their availability and willingness to take part.

Here, let’s say individuals numbered 4, 7, 12, 15 and 20 want to be part of our sample, and hence, we will include them in the sample.

Convenience sampling is prone to significant bias, because the people who happen to be available may not reflect characteristics of the population such as religion or gender.

2) Quota Sampling

In this type of sampling, we choose items based on predetermined characteristics of the population. Consider that we have to select individuals having a number in multiples of four for our sample:

Therefore, the individuals numbered 4, 8, 12, 16, and 20 are already reserved for our sample.

In quota sampling, the chosen sample might not be the best representation of the characteristics of the population that weren’t considered.

3) Judgment Sampling

It is also known as selective sampling. It depends on the judgment of the experts when choosing whom to ask to participate.

Suppose our experts believe that people numbered 1, 7, 10, 15, and 19 should be considered for our sample as they may help us to infer the population in a better way. As you can imagine, judgment sampling is also prone to bias by the experts and may not necessarily be representative.

4) Snowball Sampling

I quite like this sampling technique. Existing people are asked to nominate further people known to them so that the sample increases in size like a rolling snowball. This method of sampling is effective when a sampling frame is difficult to identify.

Here, we had randomly chosen person 1 for our sample, and then he/she recommended person 6, and person 6 recommended person 11, and so on.

1->6->11->14->19

There is a significant risk of selection bias in snowball sampling, as the referred individuals tend to share common traits with the person who recommends them.
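The referral chain above can be sketched as a walk over a referral map. The `referrals` dictionary and the helper name `snowball_sample` are hypothetical and simply encode the chain from the example.

```python
# Hypothetical referral map: who each participant nominates next.
referrals = {1: 6, 6: 11, 11: 14, 14: 19}

def snowball_sample(seed, referrals, size):
    """Follow referrals from the seed participant until `size` people are recruited."""
    sample = [seed]
    while len(sample) < size and sample[-1] in referrals:
        nominee = referrals[sample[-1]]
        if nominee in sample:        # stop if a referral loops back to someone recruited
            break
        sample.append(nominee)
    return sample

print(snowball_sample(1, referrals, size=5))   # [1, 6, 11, 14, 19]
```

Everyone in the sample is reachable from the first participant, which is exactly why the sample tends to inherit that person's traits.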

Now let’s code…

1. Loading a Data Set

import pandas as pd
import random
import numpy as np

dataset = pd.read_csv('census.csv')
dataset.shape
dataset.head()

Data Set Repository:

My-Data-Science-Journey/census.csv at main · viannaandreBR/My-Data-Science-Journey

2. Simple Random Sampling

# amostra aleatoria simples = simple random sample
df_amostra_aleatoria_simples = dataset.sample(n = 100, random_state = 1)
df_amostra_aleatoria_simples.shape
df_amostra_aleatoria_simples.head()

def amostragem_aleatoria_simples(dataset, amostras):
    # draw `amostras` rows at random, without replacement
    return dataset.sample(n = amostras, random_state=1)

df_amostra_aleatoria_simples = amostragem_aleatoria_simples(dataset, 100)
df_amostra_aleatoria_simples.shape
df_amostra_aleatoria_simples.head()

3. Systematic Sampling

dataset.shape
len(dataset) // 100
random.seed(1)
random.randint(0, 170)
np.arange(68, len(dataset), step = 325)

def amostragem_sistematica(dataset, amostras):
    # intervalo = sampling interval, inicio = random start inside the first interval
    intervalo = len(dataset) // amostras
    random.seed(1)
    inicio = random.randint(0, intervalo - 1)  # randint includes both endpoints
    # keep at most `amostras` positions, one every `intervalo` rows
    indices = np.arange(inicio, len(dataset), step = intervalo)[:amostras]
    amostra_sistematica = dataset.iloc[indices]
    return amostra_sistematica

df_amostra_sistematica = amostragem_sistematica(dataset, 100)
df_amostra_sistematica.shape
df_amostra_sistematica.head()

4. Cluster Sampling

len(dataset) / 10
dataset.shape

# assign consecutive rows to groups (grupos) of roughly equal size
grupos = []
id_grupo = 0
contagem = 0
for _ in dataset.iterrows():
    grupos.append(id_grupo)
    contagem += 1
    # if contagem > 3256:
    if contagem > 853:
        contagem = 0
        id_grupo += 1

print(grupos)
np.unique(grupos, return_counts=True)
np.shape(grupos), dataset.shape

dataset['grupo'] = grupos
dataset.head()

random.randint(0, 9)
df_agrupamento = dataset[dataset['grupo'] == 7]
df_agrupamento.shape
df_agrupamento['grupo'].value_counts()

def amostragem_agrupamento(dataset, numero_grupos):
    intervalo = len(dataset) / numero_grupos
    grupos = []
    id_grupo = 0
    contagem = 0
    for _ in dataset.iterrows():
        grupos.append(id_grupo)
        contagem += 1
        if contagem > intervalo:
            contagem = 0
            id_grupo += 1
    dataset['grupo'] = grupos
    random.seed(1)
    # pick among the group ids that were actually created (randint includes both endpoints)
    grupo_selecionado = random.randint(0, max(grupos))
    return dataset[dataset['grupo'] == grupo_selecionado]

#len(dataset) / 325
len(dataset) // 325
df_amostra_agrupamento = amostragem_agrupamento(dataset, 325)
df_amostra_agrupamento.shape, df_amostra_agrupamento['grupo'].value_counts()
df_amostra_agrupamento.head()

5. Stratified Sampling

from sklearn.model_selection import StratifiedShuffleSplit

dataset.shape
dataset['income'].value_counts()
dataset['income'].head()
#7841 / len(dataset), 24720 / len(dataset)
6495 / len(dataset), 2034 / len(dataset)

split = StratifiedShuffleSplit(test_size=0.1)
#for x,y in split.split(dataset, 100):
for x, y in split.split(dataset, dataset['income']):
    df_x = dataset.iloc[x]
    df_y = dataset.iloc[y]
df_x.shape, df_y.shape
df_y['income'].value_counts()

def amostragem_estratificada(dataset, percentual):
    # n_splits=1: a single stratified split is enough to draw one sample
    split = StratifiedShuffleSplit(n_splits=1, test_size=percentual, random_state=1)
    for _, y in split.split(dataset, dataset['income']):
        df_y = dataset.iloc[y]
    return df_y

# 0.0030711587481956942 equals 100 / 32561
df_amostra_estratificada = amostragem_estratificada(dataset, 0.0030711587481956942)
df_amostra_estratificada.shape
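Stratified sampling splits the population into groups (strata) and samples within each one, so the sample keeps the same class proportions as the population. The sketch below shows the idea with the standard library only. The helper name `stratified_sample` and the toy records are assumptions for illustration, not the census data.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, rng):
    """Sample the same fraction from every stratum, where key(record) names the stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))   # at least one record per stratum
        sample.extend(rng.sample(members, n))
    return sample

# Toy population: 75 records in one income class and 25 in the other.
records = [("<=50K", i) for i in range(75)] + [(">50K", i) for i in range(75, 100)]

rng = random.Random(1)                               # arbitrary seed
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.2, rng=rng)
print(Counter(label for label, _ in sample))
```

With a 20% fraction the sample holds 15 and 5 records from the two classes, preserving the 3:1 ratio of the toy population.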

6. Reservoir Sampling

stream = []
for i in range(len(dataset)):
    stream.append(i)

def amostragem_reservatorio(dataset, amostras):
    stream = []
    for i in range(len(dataset)):
        stream.append(i)

    tamanho = len(dataset)

    # fill the reservoir with the first `amostras` row positions
    reservatorio = [0] * amostras
    for i in range(amostras):
        reservatorio[i] = stream[i]

    # continue with the first position that is not yet in the reservoir
    i = amostras
    while i < tamanho:
        j = random.randrange(i + 1)
        if j < amostras:
            reservatorio[j] = stream[i]
        i += 1

    return dataset.iloc[reservatorio]

df_amostragem_reservatorio = amostragem_reservatorio(dataset, 100)
df_amostragem_reservatorio.shape
df_amostragem_reservatorio.head()
dataset['age'].mean()
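Reservoir sampling keeps a uniform random sample of k items from a stream in a single pass, without knowing the stream's length in advance. Here is a minimal sketch of the classic version, often called Algorithm R, that works on any iterable. The function name `reservoir_sample` and the seed are assumptions for illustration.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # the first k items fill the reservoir
        else:
            j = rng.randrange(i + 1)      # uniform position in 0..i
            if j < k:                     # happens with probability k / (i + 1)
                reservoir[j] = item       # replace a random slot
    return reservoir

rng = random.Random(1)                    # arbitrary seed
sample = reservoir_sample(iter(range(1_000)), k=10, rng=rng)
print(sample)
```

Because only the reservoir is held in memory, the same function works for data that is too large to load at once, such as a file read line by line.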

7. Compare all Samples

dataset['age'].mean()
df_amostra_aleatoria_simples['age'].mean()
df_amostra_sistematica['age'].mean()
df_amostra_agrupamento['age'].mean()
df_amostra_estratificada['age'].mean()
df_amostragem_reservatorio['age'].mean()
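One way to read such a comparison is to look at how far each sample mean lands from the population mean. The sketch below does that on synthetic ages, which are an assumption standing in for the 'age' column, so the numbers it prints say nothing about the census data itself.

```python
import random
from statistics import mean

rng = random.Random(1)                                # arbitrary seed
ages = [rng.randint(17, 90) for _ in range(10_000)]   # synthetic stand-in for 'age'

samples = {
    "simple random": rng.sample(ages, 100),
    "systematic": ages[::100],                        # every 100th value, 100 in total
    "first 100 rows": ages[:100],                     # a convenience-style sample
}

population_mean = mean(ages)
errors = {name: abs(mean(values) - population_mean) for name, values in samples.items()}
for name, error in errors.items():
    print(f"{name:15s} absolute error in mean age: {error:.2f}")
```

A single draw like this is only one realization; repeating it with different seeds gives a better picture of how much each technique varies.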

Full Code with Python

Another Code with Python

Random Sampling
Stratified Sampling
Stratified Sampling
Reservoir Sampling