
# Introduction

The poverty line can be defined as the minimum income a person needs to meet their basic necessities in a particular region or country. The number of people whose income falls below this threshold, and who are therefore living under the poverty line, is a very useful indicator of how well the economy is doing. When a region or country is found to have a high percentage of people living under the poverty line, the government should act quickly to address the issue. This figure should also be re-evaluated frequently to observe the overall trend: whether it is consistently increasing, decreasing, or fluctuating.

The main challenge in obtaining this data comes from the sheer size of the population. In a country like Indonesia, for example, with a total population of more than 270 million spread across several thousand islands, it is not practical to collect data from every citizen. The solution is to use a sampling method: data is collected from just a fraction of the citizens, chosen so that it can represent the whole population.

There are several common sampling methods, such as simple random sampling, stratified sampling, cluster sampling, and systematic sampling. Choosing the appropriate method is crucial to ensure that the samples are as accurate and representative as possible while minimizing the cost and time needed to obtain them. During the sampling process, care should also be taken to prevent bias. This can be achieved by fully understanding the main purpose of the research and the factors that could possibly affect it.

# Sampling Process

For this case study, let’s assume we are trying to estimate the proportion of people living under the poverty line in the province of North Sumatra in Indonesia. The dataset used can be obtained here.

The population can be divided into 2 levels:

• Level 1: The cities that are located within the North Sumatra province
• Level 2: The people living in those cities

Since the poverty line is defined based on income, the variable we are interested in is the income of each person. We then calculate the proportion of people found to be living under the poverty line relative to the total population.

In this case, we are going to use a combination of stratified sampling and simple random sampling. Stratified sampling is the first stage, where we divide the population (the province) into individual cities. This method is chosen because we would like to make sure that each city is properly represented in the study. Because each city has a different population, we will use proportional allocation to define the sample size taken from each city.

The second stage is simple random sampling, where we randomly select people from each city, with each person within the same city having the same probability of being selected. This method is chosen to avoid bias in the demographics of the people chosen as the sample. Alternatively, if we were to use cluster sampling based on housing complexes, the sample could be skewed, because people living in the same housing complex usually have similar economic status. Therefore, even though cluster sampling might be easier and cheaper, it is not suitable for this case, as it might not represent the whole demographic accurately.
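The concern about housing complexes can be illustrated with a small simulation. The sketch below is not based on the North Sumatra data: it builds a purely synthetic population in which each complex has its own poverty rate (the Beta distribution parameters, the number of complexes, and their size are all assumptions made for illustration). It then compares how much the estimated poverty proportion varies from draw to draw when 500 people are chosen individually and when they come from 10 whole complexes.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic population: 1,000 housing complexes of 50 people each.
# Each complex gets its own poverty rate, so neighbours resemble each other.
N_CLUSTERS, CLUSTER_SIZE = 1000, 50
clusters = []
for _ in range(N_CLUSTERS):
    rate = rng.betavariate(0.5, 5)  # assumed spread of complex-level rates
    clusters.append([1 if rng.random() < rate else 0 for _ in range(CLUSTER_SIZE)])
people = [person for members in clusters for person in members]


def srs_estimate(n=500):
    """Poverty proportion from a simple random sample of n individuals."""
    return sum(rng.sample(people, n)) / n


def cluster_estimate(n_clusters=10):
    """Poverty proportion from everyone in n_clusters randomly chosen complexes."""
    chosen = rng.sample(clusters, n_clusters)
    return sum(sum(members) for members in chosen) / (n_clusters * CLUSTER_SIZE)


# Repeat each design 200 times and compare the spread of the estimates
srs_sd = statistics.stdev([srs_estimate() for _ in range(200)])
cluster_sd = statistics.stdev([cluster_estimate() for _ in range(200)])
print(f"SRS spread: {srs_sd:.4f}, cluster spread: {cluster_sd:.4f}")
```

Under these assumptions, both designs survey 500 people, but the cluster-based estimates scatter much more widely, because each complex contributes many people with a similar economic status.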

After deciding the appropriate sampling method, the next step is to calculate the sample size needed. This can be calculated using the following formula:

where:

n = sample size

d = margin of error

N = population size

Z = Z score

σ² = variance
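One common form of this formula that is consistent with the symbols listed above is sketched here. It assumes simple random sampling from a finite population of size N, and it is an assumption about which variant is intended, since several closely related versions exist:

$$n = \frac{N Z^2 \sigma^2}{(N-1)\, d^2 + Z^2 \sigma^2}$$

For a very large population, this approaches the simpler expression $n \approx Z^2 \sigma^2 / d^2$.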

Alternatively, if the variance is unknown, Cochran’s sample size formula can be used:

$$n = \frac{Z^2 \, p(1-p)}{\varepsilon^2}$$

where:

n = sample size

Z = Z score

p = population proportion estimate

ε = margin of error

Suppose it is known from the previous year’s data that an estimated 8.75% of the population lives under the poverty line, and we would like a 95% confidence level and a 5% margin of error. The sample size can be calculated with the code below:

```python
from scipy import stats

# Z score for a two-sided 95% confidence level
stat_z = stats.norm.ppf(q=1 - 0.05 / 2)
margin_of_error = 0.05
proportion = 0.0875

# Cochran's formula: n = Z^2 * p * (1 - p) / e^2
n = (stat_z**2) * proportion * (1 - proportion) / (margin_of_error**2)
```

The result is approximately 122.7, so the minimum sample size is 123 people. We also know that there are 33 cities and approximately 15 million people in total in our province of interest. With this in mind, it is immediately clear that 123 people is too small a sample: spread across 33 cities, most cities would be represented by fewer than 5 people. Such a small sample per city would be far too unreliable to represent that city.

In this case, it is a good idea to increase the sample size, since 123 is only the minimum. Increasing the sample size improves the precision of our estimate: in real-world applications, the larger the sample, the more confident we can be in the results. The practical limit is set by the cost compared to the available budget. Continuing our case study, let’s increase the sample size to 10,000 so that each city has, on average, a few hundred samples.
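To see what a larger sample buys, Cochran’s formula can be solved for the margin of error instead of the sample size. The sketch below does this for a few sample sizes; the helper name `margin_of_error_for` is hypothetical, and it uses `statistics.NormalDist` from the Python standard library as a stand-in for `scipy.stats.norm.ppf` so that it runs without extra packages.

```python
from statistics import NormalDist


def margin_of_error_for(n, proportion=0.0875, confidence=0.95):
    """Margin of error implied by Cochran's formula for a sample of size n."""
    # Two-sided Z score, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * (proportion * (1 - proportion) / n) ** 0.5


# Compare the minimum sample size with larger alternatives
for n in (123, 1000, 10000):
    print(n, round(margin_of_error_for(n), 4))
```

With the same 8.75% estimate and 95% confidence level, 123 samples correspond to a margin of error just under 5%, while 10,000 samples bring it down to well under 1% for the province as a whole.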

Since the population of each city is different, we need to use proportional allocation to determine the sample size for each city. Cities with larger populations should contribute more samples than those with smaller populations. The sample size for each city can be calculated using the following formula:

$$n_h = n \cdot \frac{N_h}{N}$$

where:

nₕ = stratum sample size

n = total sample size

Nₕ = stratum population size

N = total population size

The sample size for each stratum (city) can be calculated using the following code:

```python
import pandas as pd
import numpy as np

dataset = pd.read_csv("poverty dataset.csv")
total_pop = dataset["Total population"].sum()
tot_sample_size = 10000

# calculate sample size for each stratum
dataset['sample size'] = tot_sample_size * dataset['Total population'] / total_pop

# round up the sample size to the nearest integer
dataset['sample size'] = dataset['sample size'].apply(np.ceil)
```
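To make the allocation concrete without the CSV file, here is a minimal sketch on three made-up cities. The city names and population figures are hypothetical and are not taken from the North Sumatra dataset; only the `Total population` column name follows the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical city populations, for illustration only (not the real dataset)
cities = pd.DataFrame({
    "City": ["City A", "City B", "City C"],
    "Total population": [2_400_000, 900_000, 150_000],
})

total_pop = cities["Total population"].sum()
tot_sample_size = 10_000

# Proportional allocation: n_h = n * N_h / N, rounded up to a whole person
cities["sample size"] = np.ceil(
    tot_sample_size * cities["Total population"] / total_pop
).astype(int)

print(cities)
print("Total allocated:", cities["sample size"].sum())
```

Because every stratum is rounded up, the allocated total can end up slightly above the target (here 10,001 instead of 10,000). This is usually harmless, but it is worth keeping in mind when budgeting.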

Now that the sample size needed from each city has been calculated, the actual data gathering can begin. Remember that the people in the sample must be chosen completely at random. They should not be a group of people who work in the same place or live in the same area, as this might result in bias. One way to achieve this is to make random selections from a list of the ID numbers of people living in each city.

From those randomly selected ID numbers, the addresses and contact information of the chosen people can then be obtained. The income data can be gathered by visiting them in person, sending a survey by mail, or calling them by phone. The most suitable method may differ from person to person depending on their lifestyle. In some cases, more than one method might be needed for the same person to make sure we obtain the required information.

From the income data gathered, the proportion of people living under the poverty line can be calculated for each city. The table below shows an example of what the data will look like for the first few cities:
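A random selection from a list of ID numbers could be sketched as follows. The ID format, the registry of 1,000 entries, and the helper name `draw_sample_ids` are all invented for illustration; in practice the list would come from whatever population register is available for the city.

```python
import random


def draw_sample_ids(id_numbers, sample_size, seed=None):
    """Simple random sample without replacement from a list of ID numbers."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(id_numbers, sample_size)


# Hypothetical registry: 1,000 made-up ID numbers for one city
city_ids = [f"ID-{i:06d}" for i in range(1, 1001)]

chosen = draw_sample_ids(city_ids, sample_size=25, seed=42)
print(chosen[:5])
```

Sampling without replacement gives every ID the same chance of being selected and guarantees that no person is picked twice.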

Afterwards, the poverty proportion for the whole province can be calculated as shown with the following code:

```python
import pandas as pd

dataset = pd.read_csv("poverty dataset.csv")
total_pop = dataset["Total population"].sum()

# estimate the number of people in poverty in each stratum from the sample proportion
dataset['num_of_poverty'] = dataset['Percentage of poverty'] / 100 * dataset['Total population']

# calculate the total number of people in poverty in the whole population
total_poverty = dataset['num_of_poverty'].sum()
poverty_proportion = total_poverty / total_pop * 100
```
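The reason for weighting by population, rather than simply averaging the city percentages, can be seen in a small sketch. The three cities and their poverty percentages below are hypothetical values chosen for illustration; only the column names follow the code above.

```python
import pandas as pd

# Hypothetical survey results for three cities (illustrative values only)
survey = pd.DataFrame({
    "City": ["City A", "City B", "City C"],
    "Total population": [2_400_000, 900_000, 150_000],
    "Percentage of poverty": [8.0, 10.0, 15.0],
})

# Estimated number of people in poverty per city = sample proportion * city population
survey["num_of_poverty"] = (
    survey["Percentage of poverty"] / 100 * survey["Total population"]
)

# Population-weighted estimate for the whole province, in percent
weighted = survey["num_of_poverty"].sum() / survey["Total population"].sum() * 100

# Naive average of city percentages, which ignores city size
unweighted = survey["Percentage of poverty"].mean()

print(f"weighted: {weighted:.2f}%, unweighted: {unweighted:.2f}%")
```

In this made-up example the small city has the highest poverty rate, so the naive average (11%) overstates the provincial figure, while the weighted estimate (about 8.8%) reflects where most people actually live.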

# Conclusion

A combination of stratified and simple random sampling can be used to estimate the economic status of people within a specific region. Stratifying by city ensures that every city is well represented. Sample sizes are allocated following the proportional allocation principle, so that the influence of the data obtained from each city is proportional to the size of its population.

When selecting the people we’d like to get information from, it is best to avoid cluster sampling and use simple random sampling instead. This is because people within the same cluster, whether the cluster is based on place of residence or place of work, might have similar economic status. This can skew the sample and consequently reduce the accuracy of the research. On the other hand, simple random sampling might be more costly and time consuming, as the targets will be more spread out, with some possibly living in rural, difficult-to-reach areas. In this case, however, it is necessary to ensure the most accurate result.

When it comes to determining sample size, judgment is needed in addition to the standard formula. In our case, for example, even though Cochran’s sample size formula indicates that 123 samples are enough, we also know that the samples must be divided among 33 cities. Having such a small number of samples for each city is not a good idea, since it will not be representative of the whole city. Therefore, we can choose to increase the sample size beyond what was calculated.

Lastly, based on the poverty proportion obtained from the samples, we can estimate the total number of people in poverty in each city by multiplying the proportion by the total population of the city. The total number of people in poverty, or the poverty proportion, for the whole province can then be calculated.

# Reference:

https://www.statisticshowto.com/probability-and-statistics/find-sample-size/