# What Is Sampling Bias?

Drawing a valid conclusion depends heavily on how you collect data.

In many statistical analyses and data-driven decisions, we need to draw an actionable conclusion supported by data that has already been collected or still needs to be collected. **However, the quality of a conclusion derived from data depends heavily on the quality of the data you collected.** During data collection, you usually sample from a larger data set. In most cases, you do not have the luxury of collecting data from every case and thereby measuring the intended metrics exactly. However, you can infer some characteristics of the larger data set (which is sometimes theoretical or imaginary), usually known as the *population*, via a smaller subset of the data, known as a *sample*. Samples are also used in tests of various sorts (e.g., comparing the effect of web page designs on clicks). The population usually follows an unknown distribution. **Using the empirical distribution inferred from the sample, we aim to estimate the population distribution or related statistics.**

# Sampling

A sample is a subset of data drawn from a larger data set, the population. Data quality often matters more than data quantity. Acquiring quality data is therefore a critical step in drawing a valid conclusion, since many errors can creep into our statistical model and experiment and lead to an invalid result. In the following, we review some of the important and common biases in sampling.

## Statistical Bias

Statistical bias refers to measurement or sampling errors that are systematic and produced by the measurement or sampling process. **An important distinction should be made between errors due to random chance and errors due to bias.** Errors due to random chance are not skewed in a specific direction and are usually independent from one sample to the next, so they tend to average out. Errors due to bias, by contrast, are correlated across samples and cluster around a fixed nonzero value. For example, every measurement may carry the same constant offset. This type of bias can be an indicator that a statistical or machine learning model has been misspecified, or that an important variable has been left out [1].
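The difference between the two kinds of error can be illustrated with a small simulation. This is only a sketch under stated assumptions: the helper `simulate_errors`, the Gaussian noise model, and the offset of 0.5 are hypothetical choices made for illustration and do not come from [1].

```python
import random
import statistics

def simulate_errors(n, offset, noise_sd, seed=0):
    """Return n measurement errors: a constant offset plus Gaussian noise.

    offset = 0 models purely random error; offset != 0 models a biased process.
    (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)
    return [offset + rng.gauss(0.0, noise_sd) for _ in range(n)]

# Purely random error: no preferred direction.
random_errors = simulate_errors(10_000, offset=0.0, noise_sd=1.0, seed=1)
# Biased error: same noise, but every measurement is shifted by 0.5.
biased_errors = simulate_errors(10_000, offset=0.5, noise_sd=1.0, seed=2)

mean_random = statistics.fmean(random_errors)
mean_biased = statistics.fmean(biased_errors)

print(f"mean of random errors: {mean_random:.3f}")
print(f"mean of biased errors: {mean_biased:.3f}")
```

With many measurements, the mean of the random errors should sit close to zero, while the mean of the biased errors should stay close to the offset. Collecting more data shrinks the first kind of error but not the second.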

To combat this type of bias, many sampling methods have been proposed, but at the heart of all of them lies *random sampling*. **Random sampling is the process in which each available member of the population has an equal chance of being selected at each draw.** Sampling can be done *with replacement*, in which each selected member is returned to the population before the next draw, or *without replacement*, in which a member that has been selected is no longer available for later draws.
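As a minimal sketch, both variants can be tried with Python's standard `random` module. The toy population of ten numbered members and the seed are assumptions made for illustration.

```python
import random

rng = random.Random(42)          # seeded so the example is repeatable
population = list(range(1, 11))  # toy population: members numbered 1..10

# With replacement: a member can be drawn more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: each member can appear at most once.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

`random.choices` may return duplicates, whereas `random.sample` never does and raises `ValueError` if more members are requested than the population holds.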

Sometimes, even though the sampling process is random, the sample might not reflect the true population. Imagine that the population consists of two types of members, type I (99%) and type II (1%). If we draw a small random sample, it may contain very few type II members, or none at all. One way to preserve the proportions in the sample is *stratified sampling*. In stratified sampling, the population is divided into *strata*, and random samples are taken from each stratum. By doing this, each type is represented in the sample in the same proportion as in the population.
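The 99% / 1% example can be sketched as follows. The helper `stratified_sample`, the population size of 1,000, and the rule that every stratum contributes at least one member are assumptions made for this illustration, not a prescribed procedure.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(items, key, fraction, rng):
    """Proportionate stratified sample (hypothetical helper).

    Groups items into strata by key(item), then draws the same fraction
    from each stratum without replacement. Assumption: every stratum
    contributes at least one member.
    """
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)
population = ["I"] * 990 + ["II"] * 10   # 99% type I, 1% type II

simple = rng.sample(population, 100)     # simple random sample of 100
stratified = stratified_sample(population, key=lambda m: m, fraction=0.1, rng=rng)

print("simple random sample:", Counter(simple))
print("stratified sample:   ", Counter(stratified))
```

The stratified sample always contains 99 type I members and 1 type II member, while the count of type II members in the simple random sample varies from draw to draw and can be zero.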

## Selection Biases

Selection biases are forms of statistical bias in which members of the population are chosen selectively, whether consciously or unconsciously, which leads to misleading conclusions.

One form of selection bias is *data snooping*. Data snooping happens when you mistake what is actually noise for a real pattern. It is most frequent when you examine the data and try to discern patterns without checking whether they are repeatable. If a pattern you discovered cannot be reproduced in other cases or in repeated experiments, it should be treated as a random phenomenon, usually called noise. This is especially common when the data set is small. As Daniel Kahneman puts it in his readable book “Thinking, Fast and Slow”,

**“Random processes produce many sequences that convince people that the process is not random after all.”**
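One way to get a feel for this is to look at streaks in fair coin flips. The sketch below is an illustration only: the helper `longest_run`, the sequence length of 100, and the number of simulated sequences are hypothetical choices.

```python
import random
import statistics

def longest_run(sequence):
    """Length of the longest run of identical consecutive values."""
    best = current = 0
    previous = object()  # sentinel that equals nothing in the sequence
    for value in sequence:
        current = current + 1 if value == previous else 1
        best = max(best, current)
        previous = value
    return best

rng = random.Random(7)

# Simulate 1,000 sequences of 100 fair coin flips (0 = tails, 1 = heads).
runs = [
    longest_run([rng.randint(0, 1) for _ in range(100)])
    for _ in range(1_000)
]

average_longest_run = statistics.fmean(runs)
print(f"average longest streak in 100 fair flips: {average_longest_run:.2f}")
```

Streaks of six or more identical outcomes are typical in 100 fair flips, even though many people would read such a streak as evidence that the coin is not fair.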

Another form of selection bias is of particular interest in data analysis. It is what John Elder (founder of Elder Research, a respected data mining consultancy) calls the *vast search effect*. If you repeatedly run different models and ask different questions of a large data set, you are bound to find something interesting. But is the result you found truly something interesting, or is it a chance outlier? [1] We can minimise this kind of bias by using a holdout set to check whether the model's performance carries over to data it has not seen. Elder also suggests using what he calls *target shuffling*, which is in essence a permutation test.
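The idea behind a permutation test can be sketched in a few lines. This is a generic two-sample permutation test on the difference in means, written as an illustration and not as Elder's exact target-shuffling procedure. The function name `permutation_p_value` and the toy data are hypothetical.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels give a difference in means
    at least as large as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))

    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any real link between label and value
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Toy data (made up for illustration): two clearly separated groups.
group_a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.4, 9.7]
group_b = [12.0, 11.8, 12.3, 12.1, 11.9, 12.2, 12.4, 11.7]

p_different = permutation_p_value(group_a, group_b)
p_same = permutation_p_value(group_a, group_a)

print(f"p-value, separated groups: {p_different:.4f}")
print(f"p-value, identical groups: {p_same:.4f}")
```

If shuffling the labels rarely reproduces an effect as large as the observed one, the effect is unlikely to be a chance outlier. If shuffled data produce it routinely, the "interesting" result is probably noise.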

Other forms of selection bias can result from cherry-picking data, selecting time intervals that accentuate a particular statistical effect, and stopping an experiment when the results look “interesting.”
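The last point, stopping when the results look interesting, can be illustrated with a simulation in which there is no real effect at all. The setup is an assumption made for this sketch: a fair coin, a normal-approximation z-test at the conventional 1.96 threshold, and a peek every 50 flips. The helper `run_experiment` is hypothetical.

```python
import math
import random

def run_experiment(rng, n_flips=500, peek_every=50):
    """Flip a fair coin n_flips times and test for unfairness.

    Returns (peeking_flagged, final_flagged):
      peeking_flagged: |z| exceeded 1.96 at ANY peek, so an analyst who
                       stops at the first "interesting" result would stop.
      final_flagged:   |z| exceeded 1.96 at the single planned final look.
    n_flips is assumed to be a multiple of peek_every.
    """
    heads = 0
    z = 0.0
    peeking_flagged = False
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5
        if i % peek_every == 0:
            # z-score of the head count under the fair-coin hypothesis
            z = (heads - 0.5 * i) / math.sqrt(0.25 * i)
            if abs(z) > 1.96:
                peeking_flagged = True
    return peeking_flagged, abs(z) > 1.96

rng = random.Random(123)
results = [run_experiment(rng) for _ in range(1_000)]

peeking_rate = sum(peeked for peeked, _ in results) / len(results)
final_rate = sum(final for _, final in results) / len(results)

print(f"false positives when stopping at any peek: {peeking_rate:.3f}")
print(f"false positives with one final look:       {final_rate:.3f}")
```

Because the coin is fair, every flagged experiment is a false positive. Checking repeatedly and stopping at the first significant result flags noticeably more experiments than a single planned look at the end.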

# Conclusion

In this short article, we reviewed common statistical biases in sampling and statistical inference, along with some approaches for mitigating them and improving the quality of the conclusions drawn from data.

# Reference

[1] Bruce, Peter, Andrew Bruce, and Peter Gedeck. *Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python*. O’Reilly Media, 2020.