Q-Q Plots: Guiding you to better results

Saiyam Bhatnagar
Analytics Vidhya
Published in
3 min read · Jul 23, 2020

Hello readers. In one of my previous blogs I endeavored to demonstrate how to effectively model a regression problem. That post was, by and large, concerned with how many features should actually be part of your model. This post, however, is concerned with improving a feature through Feature Engineering so that it fits the regression assumptions well.

Among the several assumptions on which Linear and Logistic Regression are based (no multicollinearity, no heteroscedasticity, and so on), one is the normality of the independent variables. The more normal a feature's distribution is, the better the results of the regression model tend to be. Readers who are not familiar with the reasoning behind these assumptions are encouraged to review it first. :-) To practically illustrate visualizing with Q-Q plots, I downloaded the Kaggle DataAnalyst dataset and applied a few transformations to a feature to make it as normal as possible. Before diving into that, though, let's see what a Q-Q plot actually demonstrates.

What is a Q-Q plot?

The Q-Q plot, or quantile-quantile plot, is a graphical tool that helps us assess whether a set of data plausibly came from some theoretical distribution such as the Normal or exponential. For example, if we run a statistical analysis that assumes our dependent variable is Normally distributed, we can use a Normal Q-Q plot to check that assumption. It is just a visual check, not an air-tight proof, so it is somewhat subjective. But it allows us to see at a glance whether our assumption is plausible and, if not, how the assumption is violated and which data points contribute to the violation. A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that is roughly straight.
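As a minimal sketch of this idea, `scipy.stats.probplot` computes the ordered sample quantiles and the matching theoretical Normal quantiles for us, along with the correlation `r` of the fitted straight line (the synthetic sample below is my own illustration, not the dataset from the post):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=500)  # synthetic, roughly Normal data

# probplot pairs each ordered sample value with the theoretical Normal
# quantile it should match; r close to 1 means a near-straight Q-Q line.
(theoretical_q, sample_q), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(round(r, 3))
```

Because the sample really is Normal here, `r` comes out very close to 1; for a skewed sample it would drop noticeably.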

This largely means: the closer the points lie to a straight line, the more closely the data sample follows the reference distribution (Normal, exponential, etc.). Now I'll apply a few transformations and plot the adjoining histogram and Q-Q plot, so the readers can decide which transformation would best suit the feature engineering. Please keep in mind this is not a conceptual analysis; it is a methodology demonstration. The feature chosen is the rating of Data Analysts on a scale of 1–5.

1. No Feature Engineering

Histogram and the corresponding Q-Q plot against the Normal distribution.
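A side-by-side histogram and Q-Q plot like the one above can be produced in a few lines with matplotlib; `probplot` draws the points and fitted line directly onto an axis when given `plot=`. The `rating` array below is a synthetic stand-in for the real column from the Kaggle dataset:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
# Stand-in for the 1-5 rating feature (the real column comes from the dataset)
rating = np.clip(rng.normal(3.8, 0.6, size=1000), 1, 5)

fig, (ax_hist, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(rating, bins=30)
ax_hist.set_title("Histogram of rating")
stats.probplot(rating, dist="norm", plot=ax_qq)  # points plus fitted line
ax_qq.set_title("Q-Q plot vs Normal")
fig.savefig("rating_qq.png")
```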

2. The feature is transformed using Box-Cox

Histogram and the corresponding Q-Q plot against the Normal distribution.
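The Box-Cox step can be sketched with `scipy.stats.boxcox`, which estimates the transformation parameter lambda by maximum likelihood (note the input must be strictly positive; the skewed sample here is synthetic, standing in for the feature):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
skewed = rng.lognormal(mean=0.0, sigma=0.6, size=1000)  # strictly positive, right-skewed

# boxcox picks the lambda that makes the result as Normal as possible
transformed, lam = stats.boxcox(skewed)
print(abs(stats.skew(transformed)) < abs(stats.skew(skewed)))
```

On heavily right-skewed data the transformed values end up far less skewed, which is exactly what shows up as a straighter Q-Q line.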

3. The feature is transformed using Logarithmic transformation.

Histogram and the corresponding Q-Q plot against the Normal distribution.
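The logarithmic transformation is even simpler: just `np.log` on a positive feature. Comparing the straight-line correlation `r` from `probplot` before and after gives a quick numeric check of the improvement the Q-Q plot shows visually (again on a synthetic skewed sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
skewed = rng.lognormal(0.0, 0.8, size=1000)  # strictly positive, right-skewed

logged = np.log(skewed)  # the log of a lognormal sample is exactly Normal

# r measures how well the Q-Q points fit a straight line
_, (_, _, r_before) = stats.probplot(skewed, dist="norm")
_, (_, _, r_after) = stats.probplot(logged, dist="norm")
print(r_after > r_before)
```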

There is not much difference between the plots here, as I was unable to choose a feature that demonstrates this well; judging how far a real-world dataset deviates from a distribution is hard to do by eye. However, I do recommend that readers try this check before using a regression model. Have a good day. Thank you. :-)
