Do we care enough about bias?

- Introduction/What is bias?
- Survivorship bias — a brief story
- Sample bias
- Non-technical overview of Bias-Variance trade-off in Machine Learning models
Introduction/What is bias?
In short, we can define bias as a situation in which the information we receive does not reflect reality, but only a particular point of view. The problem is that it can steer us, without our realizing it, toward conclusions we might never have reached if we had had all the information.
It is important to remember that bias can be intentional or unintentional.
Survivorship bias — a brief story
To explain the first type of bias I am going to discuss, I will summarize a story from a long time ago.
During World War II, Abraham Wald, a statistician in the Statistical Research Group, applied statistical methods to a variety of war problems. One of the problems he faced was determining which parts of a plane should be reinforced in order to reduce the number of planes shot down.
After observing hundreds of returning warplanes, the researchers saw that most bullet holes were concentrated on the wings and fuselage, while the engines showed few or no impacts.
Because of this, those heavily hit areas of the aircraft were reinforced, but the losses stayed the same or even got worse.

At that moment they realized: it was survivorship bias! Contrary to what had been assumed, it was the engines that needed reinforcing, not the wings and fuselage. Why? The planes being analyzed were only those that made it back; the ones shot down were never examined. Because of survivorship bias, a large part of the aircraft population was missing from the sample, since the planes analyzed had gone through a ‘pre-selection’ process: returning from battle.
Sample bias
A historical example is the result of the 1936 American presidential election. Literary Digest magazine collected information on more than 2 million people through postal surveys (which is a HUGE sample).
The magazine confidently predicted that Alf Landon would win the election by a wide margin over President Franklin Roosevelt. What was the result? Exactly the opposite. Why? Because the sample was poorly selected! The sample was completely unbalanced, so it differed from the population it was intended to represent. The vast majority of the respondents were readers of the magazine, supplemented by records of registered automobile and telephone users.
Because of this, wealthy people were over-represented, and they, as a group, were more likely to vote for the Republican candidate.
Curiously, a later poll of fewer than 200 correctly sampled individuals predicted the right outcome of the election. Isn’t that remarkable? It gives us an idea of how important it is to select the data properly before doing anything else.
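The Literary Digest story can be sketched in a few lines of Python. The numbers here are made up for illustration only: a hypothetical population that actually favours candidate B, where supporters of candidate A are assumed to be three times as likely to answer the survey.

```python
import random

random.seed(42)

# Hypothetical population: 40% support candidate A, 60% support candidate B.
population = ["A"] * 40_000 + ["B"] * 60_000

# Biased frame: assume A supporters are three times as likely to respond.
weights = [3 if voter == "A" else 1 for voter in population]
big_biased_sample = random.choices(population, weights=weights, k=10_000)

# A small but properly randomised sample, like the <200-person poll above.
small_random_sample = random.sample(population, k=200)

biased_share = big_biased_sample.count("A") / len(big_biased_sample)
random_share = small_random_sample.count("A") / len(small_random_sample)

print(f"huge biased sample:  A = {biased_share:.0%}")   # calls the election for A
print(f"small random sample: A = {random_share:.0%}")   # should land near the true 40%
```

Even with a sample ten thousand strong, the biased frame calls the election for the wrong candidate, while the small randomised sample stays close to the truth: sample size cannot compensate for a bad sampling frame.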
Non-technical overview of Bias-Variance trade-off in Machine Learning models
What does the bias-variance trade-off actually mean?
Normally, as the flexibility of our function (f̂) increases, its variance increases and its bias decreases.
Imagine we run simple linear regression over a data set to predict a person’s height based on their weight. No matter how well we fit our line (Ordinary Least Squares), a straight line will never have the flexibility to capture the true relationship between the variables we are working with. This inability to capture the true relationship is what we call bias.

On the other hand, we could use a super flexible line that passes through almost all the data points. This would certainly reduce our bias considerably. But… what about the variance?

If we calculate the sum of squares for both models, the flexible line clearly wins. But hold on: that sum of squares is calculated on our training dataset. What we actually want to know is the error on a completely new dataset (the test dataset). Let’s see what happens when we feed in brand-new data:

With the new data, we can see that the simple model does a much better job than the flexible one, which completely misses on the test data.
This difference in fits between datasets is called the variance of the model. The squiggly line fitted the training data too closely.
We call this overfitting, and it leads to high variance in the model: if we took a new dataset and fitted a new function, it would look quite different from the function estimated for the first dataset. With a more rigid model such as simple linear regression, the function estimated for the new dataset would remain almost the same.
An overfitted model has high variance because, when given new data (where the same feature values can perfectly well map to a different output), its predictions will land farther from the real dependent value (increasing the error) than a simpler model’s would.
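The straight-line-versus-squiggly-line argument above can be reproduced numerically. This is a minimal sketch with made-up data: a roughly linear height-weight relationship plus noise, a rigid degree-1 fit, and a very flexible degree-9 polynomial standing in for the squiggly line, both fitted with NumPy’s `Polynomial.fit`.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Made-up data: height grows roughly linearly with weight, plus noise.
weight = np.linspace(50, 100, 20)
height = 100 + 0.9 * weight + rng.normal(0, 4, size=weight.size)

# Interleave the points into a training set and a test set.
w_train, h_train = weight[0::2], height[0::2]
w_test, h_test = weight[1::2], height[1::2]

def sse(model, w, h):
    """Sum of squared residuals of a fitted polynomial on (w, h)."""
    return float(np.sum((h - model(w)) ** 2))

# Rigid straight line (higher bias) vs a squiggly degree-9 curve that
# can pass through every training point (higher variance).
straight = Polynomial.fit(w_train, h_train, deg=1)
squiggly = Polynomial.fit(w_train, h_train, deg=9)

print(f"straight line: train SSE {sse(straight, w_train, h_train):10.2f}, "
      f"test SSE {sse(straight, w_test, h_test):10.2f}")
print(f"squiggly line: train SSE {sse(squiggly, w_train, h_train):10.2f}, "
      f"test SSE {sse(squiggly, w_test, h_test):10.2f}")
```

The squiggly curve wins on the training sum of squares, since it can thread through every training point, but the straight line wins on the held-out test points: exactly the trade-off described above.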
The ideal algorithm has low bias, so it can accurately approximate the true relationship between the dependent and independent variables, and at the same time low variance, meaning it gives us consistent predictions across different datasets. To get there, we need to find the “sweet spot” between a simple model and more complex ones.

“The most important aspect of statistical analysis is not what you do with the data, it’s what data you use”
