Common Mistakes in Data Analysis

Konstantina Slaveykova · Published in DataDotScience · Nov 21, 2017

“To err is human,” especially when you have to navigate the dark waters of data analysis. Here is a list of some of the most common mistakes along the way.

Having an ill-defined/vague hypothesis

Read up and build a clear picture of the predictions that different theories make. What are your assumptions? What is your hypothesis? How do they fit into existing theories? If you cannot define your hypothesis clearly, you will struggle with the analysis and interpretation of your results (trust me, I’ve been there).

Allowing your bias to affect data analysis

Granted that you do have a clear hypothesis with well-defined assumptions, you still need to do your best to draw a line between conceptual clarity and letting your personal bias affect the analysis. Issues like selective outcome reporting, confirmation and selection bias, outlier bias and Simpson’s Paradox are just a handful of the many pitfalls you might encounter if you let your intuitions lead the data instead of the other way around.
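Simpson’s Paradox in particular is easiest to grasp with concrete numbers. Here is a minimal sketch (assuming Python with pandas and entirely made-up recovery figures): a treatment that looks better within every subgroup, yet worse in the pooled totals.

```python
import pandas as pd

# Made-up recovery figures for two treatments across two severity groups.
data = pd.DataFrame({
    "severity":  ["mild", "mild", "severe", "severe"],
    "treatment": ["new", "old", "new", "old"],
    "recovered": [81, 234, 192, 55],
    "patients":  [87, 270, 263, 80],
})

# Within each severity group, the "new" treatment has the higher recovery rate...
data["rate"] = data["recovered"] / data["patients"]
print(data)

# ...yet the aggregated totals make the "old" treatment look better,
# because in this toy data the "new" treatment went mostly to severe cases.
overall = data.groupby("treatment")[["recovered", "patients"]].sum()
overall["rate"] = overall["recovered"] / overall["patients"]
print(overall)
```

The reversal happens because group membership (severity) is unevenly mixed with treatment, so the pooled rate says something different from every per-group rate.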

[Image: one of the spurious correlation charts from TylerVigen.com]

Inferring Causation from Correlation

If you find a correlation between two variables, it is tempting to assume that one of them causes the other. However, there are numerous potential pitfalls to take into account: A and B could both be consequences of a third, confounding factor C while having nothing to do with each other; A could cause C, which in turn causes B (or vice versa); the causation could be direct (A causes B), reverse (B causes A) or cyclical (A causes B, and B in turn causes A); or the correlation could be entirely coincidental, with no connection between A and B at all.

Tyler Vigen has famously exploited the latter fallacy in his blog Spurious Correlations, showing hilariously unrelated data sets, some of which exhibit striking correlations closely approaching a perfect r = 1.
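You do not need Vigen’s archives to reproduce the effect. A minimal sketch (assuming Python with NumPy and invented numbers): any two unrelated quantities that both happen to drift in the same direction over time will correlate strongly.

```python
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(2000, 2010)

# Two made-up, causally unrelated quantities that both trend upward over the decade.
cheese_consumption = 29 + 0.5 * (years - 2000) + rng.normal(0, 0.2, years.size)
doctorates_awarded = 480 + 12 * (years - 2000) + rng.normal(0, 5, years.size)

r = np.corrcoef(cheese_consumption, doctorates_awarded)[0, 1]
print(f"Pearson r = {r:.2f}")  # typically well above 0.9: strong correlation, no causal link
```

The shared upward trend (here, simply the passage of time) plays the role of the confounding factor C described above.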

Overfitting Data

In an ideal situation, you can train a model on existing data and then apply it to another data set with a similar distribution in order to make predictions.

A common error is to build an overly complex model that fits a limited set of data points so closely that, instead of “weeding out” the noise, you actually include it in the model. In other words, instead of treating the idiosyncrasies of your sample as noise, you erroneously build them into the model as if they were genuine structure in the data.


As a result, your model will exhibit excellent performance when you apply it to the initial data set, but it will be poorly suited to making predictions about new data.
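Here is a minimal sketch of the problem (assuming Python with scikit-learn and a made-up, sine-shaped data set): a very flexible polynomial fitted to a handful of noisy points chases the noise and then falls apart on fresh data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# A small, noisy training sample and a larger test set from the same distribution.
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

# A degree-12 polynomial has enough flexibility to chase the noise in 15 points.
model = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
model.fit(x_train.reshape(-1, 1), y_train)

train_mse = mean_squared_error(y_train, model.predict(x_train.reshape(-1, 1)))
test_mse = mean_squared_error(y_test, model.predict(x_test.reshape(-1, 1)))
print(f"train MSE: {train_mse:.4f}, test MSE: {test_mse:.4f}")
# Expect a near-zero training error but a much larger test error: the gap is the overfit.
```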

Underfitting Data

Underfitting is the reverse problem: a model with poor predictive value because it misses parameters it should have included. In other words, the hypothesis space explored by the learning algorithm is too small to represent the data. As a result, the model is highly biased and unable to infer valid knowledge even from the initial training data.
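A complementary sketch (same assumptions as above: Python, scikit-learn, made-up sine-shaped data): a plain straight line has too little capacity for the curved relationship, so the error stays high even on the data it was trained on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Sine-shaped data fitted with a straight line: the hypothesis space is too small.
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

model = LinearRegression().fit(x.reshape(-1, 1), y)
train_mse = mean_squared_error(y, model.predict(x.reshape(-1, 1)))
print(f"train MSE: {train_mse:.3f}")
# The training error stays well above the noise level: a symptom of underfitting
# (high bias), not of an unlucky test set.
```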

Sidebar | Overfitting & Underfitting in Machine Learning

Jesus Rodriguez: ML models can be judged by their ability to accomplish 2 fundamental objectives: reducing the training error and reducing the gap between training and test errors.

Type 1 Errors (False Positives) & Type 2 Errors (False Negatives)

In statistics and scientific research, you have a general default statement called the null hypothesis H0 (e.g. “There is no relationship between regular exercise and cardiorespiratory fitness, measured as VO2 max”), which you seek to reject in order to support your alternative hypothesis H1 (e.g. “Regular exercise increases VO2 max and hence improves cardiorespiratory fitness”).

Type 1 errors occur when you reject a null hypothesis that is in fact true: think of it as a “false alarm” (aka a false positive).

Conversely, if you fail to reject a null hypothesis which is in fact wrong, you commit a Type 2 error (false negative), e.g. a clinical trial failing to detect a real treatment effect.
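A small simulation makes the two error types concrete (assuming Python with NumPy and SciPy, an invented VO2 max data set and the conventional significance level alpha = 0.05):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_sims, n = 2000, 30

# Type 1 errors: H0 is actually true (exercise has no effect on VO2 max),
# yet by chance the test sometimes rejects it anyway.
false_positives = 0
for _ in range(n_sims):
    control = rng.normal(45, 5, n)    # hypothetical VO2 max values
    exercise = rng.normal(45, 5, n)   # same true mean: no real effect
    if stats.ttest_ind(control, exercise).pvalue < alpha:
        false_positives += 1
print(f"Type 1 error rate: {false_positives / n_sims:.3f}")  # roughly alpha

# Type 2 errors: H1 is actually true (a modest real improvement),
# but with a small sample the test often fails to reject H0.
false_negatives = 0
for _ in range(n_sims):
    control = rng.normal(45, 5, n)
    exercise = rng.normal(47, 5, n)   # a true improvement of 2 units
    if stats.ttest_ind(control, exercise).pvalue >= alpha:
        false_negatives += 1
print(f"Type 2 error rate: {false_negatives / n_sims:.3f}")
```

Under a true null hypothesis, roughly alpha of the tests raise a false alarm; under a real but modest effect with a small sample, a sizeable share of tests miss it.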
