Sampling
In machine learning we often need to work with very large datasets, and training on all of the data can be computationally expensive. In such cases it makes more sense to draw a smaller sample from the large dataset and train our models on that sample. While doing this, it is important that we do not lose statistical information about the population: the sample must be unbiased and representative of the population. We explore some methods to ensure this.
For the purpose of this notebook we will work with the California Housing dataset.
We will look at two approaches:
1. Simple Random Sampling
- This is fairly easy to achieve and is the most direct method of probability sampling.
- There is a risk of introducing sampling bias.
- To be more confident of the sample, statistical tests may be performed on each of the features of the dataset.
2. Stratified Random Sampling
- Ensures the sample is representative of the whole population with respect to the chosen strata.
- Subpopulations or strata are defined and simple random samples are generated from each subpopulation.
- This approach reduces the sampling error.
Simple Random Sampling
We use pandas.DataFrame.sample to draw a simple random sample. It returns a random sample of items from an axis of the object (by default, rows of the DataFrame).
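As a minimal sketch of this call, the snippet below draws a 10% sample from a small synthetic DataFrame. The data is a made-up stand-in, not the actual housing dataset, and the sampling fraction and the `housing_median_age` column are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the population (hypothetical values, not the real dataset).
rng = np.random.default_rng(42)
population = pd.DataFrame({
    "median_income": rng.lognormal(mean=1.2, sigma=0.5, size=20_000),
    "housing_median_age": rng.uniform(1, 52, size=20_000),
})

# Draw 10% of the rows without replacement.
# random_state makes the draw repeatable; frac=0.1 is an arbitrary choice.
sample = population.sample(frac=0.1, random_state=42)

print(len(population), len(sample))
```

Passing `n=...` instead of `frac=...` would request a fixed number of rows rather than a fraction.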
To check that our sample remains representative of the population, we conduct some statistical tests. For an easier implementation, we make a simplifying assumption: each variable (feature/column) is considered independently of the others.
For each feature we compare the probability distribution of the sample with that of the population. If none of the features shows a statistically significant difference, the sample "passes our test"; otherwise we retry with another sample.
We use the Kolmogorov-Smirnov test, which measures the largest distance between the empirical cumulative distribution functions of the two sets of values.
To conduct these tests we use SciPy, an open-source Python library for mathematical, scientific, engineering, and technical computing.
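The per-feature check and the retry step could be sketched roughly as follows, using `scipy.stats.ks_2samp`. The helper names `ks_pvalues` and `draw_passing_sample`, the 0.05 threshold as a default, the retry limit, and the synthetic data are all assumptions for illustration, not the notebook's original code.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_pvalues(population: pd.DataFrame, sample: pd.DataFrame) -> pd.Series:
    """Two-sample KS p-value for every column, each column tested on its own."""
    return pd.Series(
        {col: ks_2samp(population[col], sample[col]).pvalue
         for col in population.columns}
    )


def draw_passing_sample(population, frac=0.1, alpha=0.05, max_tries=10, seed=0):
    """Redraw until no column differs significantly; None if every try fails."""
    for attempt in range(max_tries):
        candidate = population.sample(frac=frac, random_state=seed + attempt)
        if (ks_pvalues(population, candidate) > alpha).all():
            return candidate
    return None


# Synthetic stand-in data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "median_income": rng.lognormal(1.2, 0.5, size=5_000),
    "housing_median_age": rng.normal(28.0, 12.0, size=5_000),
})

sample = draw_passing_sample(population)

# A deliberately distorted sample, to show what a failing check looks like.
shifted = population.sample(n=500, random_state=1) + 10.0
print(ks_pvalues(population, shifted))
```

Testing many columns at the same threshold raises the chance that at least one fails by luck alone, which is one reason the retry loop is capped here.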
We see that all the columns have a p-value > 0.05, so we cannot reject the null hypothesis that the sample and the population come from the same distribution. This suggests that the sample is representative of the population.
Stratified Random Sampling
In stratified random sampling it is important to choose the strata (subpopulations) well. A common approach is to pick the feature that is most important (highest correlation with the target variable) and stratify the population on the basis of this feature.
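One way to rank features by their correlation with the target is sketched below. The target column name `median_house_value` and the synthetic data are assumptions; the data is constructed so that income drives the target, purely to make the example readable.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 5_000
income = rng.lognormal(1.2, 0.5, size=n)
df = pd.DataFrame({
    "median_income": income,
    "housing_median_age": rng.uniform(1, 52, size=n),
    # The target is built from income plus noise, so income should rank first.
    "median_house_value": 40_000 * income + rng.normal(0, 30_000, size=n),
})

# Pearson correlation of every feature with the target, strongest first.
correlations = (
    df.corr()["median_house_value"]
    .drop("median_house_value")
    .abs()
    .sort_values(ascending=False)
)
print(correlations)
```

Taking the absolute value treats strong negative correlations as equally informative as strong positive ones.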
In this example, median_income has the highest correlation with the target, so we choose this feature to stratify the dataset. To do so, we first need to add a new column that defines the strata.
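A strata column of this kind might be built with `pandas.cut`, which bins a continuous feature into labelled intervals. The column name `income_cat`, the bin edges, and the synthetic data below are assumptions chosen for illustration; they are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({"median_income": rng.lognormal(1.2, 0.5, size=20_000)})

# Bin the continuous feature into five labelled intervals.
# The edges are illustrative; the last bin is open-ended to catch high incomes.
df["income_cat"] = pd.cut(
    df["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Number of rows that fall into each stratum.
print(df["income_cat"].value_counts().sort_index())
```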
All we did above is create 5 strata (or subpopulations) on the basis of which we will sample.
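Drawing the same fraction from every stratum can be sketched with `DataFrameGroupBy.sample`. As before, the synthetic data, the `income_cat` column name, the bin edges, and the 10% fraction are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "median_income": rng.lognormal(1.2, 0.5, size=20_000),
    "housing_median_age": rng.uniform(1, 52, size=20_000),
})

# Strata column with illustrative bin edges, stored as plain integers.
population["income_cat"] = pd.cut(
    population["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
).astype(int)

# Simple random sample of 10% drawn inside each stratum.
stratified = population.groupby("income_cat").sample(frac=0.1, random_state=42)

# Share of each stratum in the population and in the sample, side by side.
pop_share = population["income_cat"].value_counts(normalize=True).sort_index()
sample_share = stratified["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"population": pop_share, "sample": sample_share}))
```

Because each stratum contributes the same fraction of its rows, the stratum shares in the sample match those in the population up to rounding.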
We again see that all the columns have a p-value > 0.05, so we cannot reject the null hypothesis that the stratified sample and the population come from the same distribution. This suggests that the stratified sample is representative of the population.