Data Visualization with Python and Seaborn — Part 4: LM Plot & Reg Plot

Random Nerd
Aug 17, 2018

I hope you have already gone through the previous articles on Data Visualization with Seaborn; if not, please refer to the links at the bottom of this article! In this article, we shall be covering plotting for Linear Regression analysis, a very common method in the Business Intelligence and Data Science domains. To begin, we shall first try to gain a statistical overview of the concept of Linear Regression.

As our intention isn’t to dive deeply into each statistical concept, I shall instead pick a curated dataset (you may also refer here for a curated list of Resources for Data Science) and show different ways in which we can visualize whatever we deduce during our analysis. Seaborn offers two important figure types to fulfill our project needs: the ‘LM Plot’ and the ‘Reg Plot’. Visually they look much the same, but there are functional differences that I will highlight in detail for better understanding.

Linear Regression is a statistical concept for predictive analytics, where the core agenda is to examine three aspects:

  • Does a set of predictor variables do a good job of predicting an outcome (dependent) variable?
  • Which variables in particular are significant predictors of the outcome variable?
  • In what way do they impact the outcome variable (indicated by the magnitude and sign of the beta estimates)? These beta estimates are simply regression coefficients computed after the variables have been standardized so that the variances of the dependent and independent variables are 1.

Let us begin by importing the libraries that we might need on our journey. This is something we will be doing at the start of every article in this series so that we don’t have to bother about dependencies throughout:
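The original import cell isn’t preserved here, so below is a minimal sketch of the dependencies this series relies on:

```python
import numpy as np                 # numerical data generation
import pandas as pd                # tabular data handling
import matplotlib.pyplot as plt    # figure rendering
import seaborn as sns              # statistical visualization
```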

Let us now generate some data to play around with, using NumPy, for two imaginary classes of points. Please note that throughout the illustration I won’t be explaining data generation, as that is a component of Data Analysis (not visualization). With that being said, let us try to plot something here:
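The generation code itself isn’t preserved either, so here is a plausible sketch: two imaginary classes of points drawn as shifted Gaussian clusters (the names class_a and class_b and the cluster centers are my own illustrative choices):

```python
rng = np.random.RandomState(0)   # seeded so the figure is reproducible

# Two hypothetical classes of 2-D points, offset from each other
class_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
class_b = rng.normal(loc=3.0, scale=1.0, size=(50, 2))

plt.scatter(class_a[:, 0], class_a[:, 1], label='Class A')
plt.scatter(class_b[:, 0], class_b[:, 1], label='Class B')
plt.legend()
plt.show()
```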

Seaborn Lmplots:

Every plot in Seaborn has a set of fixed parameters. For sns.lmplot(), we have three mandatory parameters and the rest are optional to use as per our requirements: the values for the X-axis, the values for the Y-axis, and a reference to the dataset. These three are predominantly visible in almost all Seaborn plots. In addition, there is an optional parameter I would like you to memorize, as it comes in very handy: the hue parameter. It takes a categorical column and groups our plot by that column’s values. Let us see how it works:
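A minimal invocation on Seaborn’s built-in Tips dataset (which the rest of this section analyzes) would look like this:

```python
# Load the built-in Tips dataset and draw the default lmplot:
tips = sns.load_dataset('tips')
sns.lmplot(x='total_bill', y='tip', data=tips)
plt.show()
```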

Let us now understand what we see on the screen before we jump into adding parameters. The straight line across our plot is the best available fit for the trend of the tip customers usually give with respect to the total bill that gets generated. The data points at the extreme top right, far away from this line, are known as outliers in the dataset; we may think of outliers as exceptions.

The goal of Data Science here is to predict the best fit for understanding the trend in the behavior of visiting customers, and our algorithm shall always be designed accordingly. We find this a common scenario when applying generalized linear models in Machine Learning. If we look very closely, there is a shaded band around the line that narrows at the center, where the bulk of our data lies. This band is the confidence interval around the fit, and it is narrowest near the statistical mean, or in simpler words, near the generalized prediction of the tip value in this restaurant on a daily basis.

In this case, looking at this plot, we may say that if the total bill is around $20.00, then it shall get a tip of around $3.00. Let us refine this visualization even further by adding more features to the plot; for this purpose, let us try to understand whether smokers in general tip more or less:
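Adding hue='smoker' to the same call splits the regression by that categorical column:

```python
# One regression line (and color) per smoker category:
sns.lmplot(x='total_bill', y='tip', data=tips, hue='smoker')
plt.show()
```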

This reflects that smokers, whom we see in blue, are a little more generous but not so consistent with their tipping habit, as their data points are spread out rather vaguely. So, the addition of the hue parameter helped us visualize this difference with separate color plotting, and it also added a legend with Yes/No values for convenient interpretation. Let us look into other commonly used parameters to customize this plot further:

Here, we set the data point marker style, altered the coloring, and decided to remove the legend, which is there by default. Right now, whether for a smoker or a non-smoker, the representation is on the same plot, so let us get it onto separate facets:
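A sketch of both customizations follows; the specific markers and palette are illustrative choices, not necessarily the article’s originals:

```python
# Custom marker per hue level, an alternate palette, and no legend:
sns.lmplot(x='total_bill', y='tip', data=tips, hue='smoker',
           markers=['o', 'x'], palette='Set1', legend=False)

# The same comparison split onto separate facets, one column per category:
sns.lmplot(x='total_bill', y='tip', data=tips, hue='smoker', col='smoker')
plt.show()
```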

There is a lot we may experiment with using different optional parameters, but in a nutshell, the basic presentation with the mandatory arguments remains the same. Let us visualize one more on the built-in Tips dataset:
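Assuming the faceting columns implied by the description below (meal time and gender), the call would be:

```python
# Four facets: Lunch/Dinner across columns, gender across rows:
sns.lmplot(x='total_bill', y='tip', data=tips, col='time', row='sex')
plt.show()
```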

The above plot, across 4 separate facets, drills deeper into the data: we still show the tip given against the total bill, but it is now also segmented by whether it was Lunch time or not, along with a dependency on gender. There shall be multiple occasions where we would like to visualize such a broader segmentation. Currently we have a small dataset, so our hands are somewhat tied, but with real-world dataset exploration, this kind of visualization becomes limitless.

Now for a generic usage of lmplot(), where we shall generate random data points and then fit a regression line across them. I am showing this implementation just to give an overview of how it generally looks in a production environment; if you’re a beginner and not so proficient with Python programming, you don’t need to get stressed, because with time you shall gain command over it. Please note that our focus is just on visualization, so we won’t really get into NumPy module usage. Let’s get started:
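A minimal sketch of that workflow, with an assumed linear trend plus noise:

```python
rng = np.random.RandomState(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)     # linear trend plus Gaussian noise
df = pd.DataFrame({'x': x, 'y': y})    # lmplot() requires a DataFrame

sns.lmplot(x='x', y='y', data=df)
plt.show()
```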

Here we see jumbled-up data points on the plot with a linearly fitted line passing through, reflecting the best fit for the existing trend in the dataset. In general, I would always recommend keeping the sequence of parameters intact as per the sns.lmplot() official documentation, which looks pretty much like:

```python
seaborn.lmplot(x, y, data, hue=None, col=None, row=None, palette=None, col_wrap=None, size=5, aspect=1, markers='o', sharex=True, sharey=True, hue_order=None, col_order=None, row_order=None, legend=True, legend_out=True, x_estimator=None, x_bins=None, x_ci='ci', scatter=True, fit_reg=True, ci=95, n_boot=1000, units=None, order=1, logistic=False, lowess=False, robust=False, logx=False, x_partial=None, y_partial=None, truncate=False, x_jitter=None, y_jitter=None, scatter_kws=None, line_kws=None)
```

The values shown against the optional parameters here are the defaults, used unless we specifically alter them in our code. Also, we need to make sure that the x and y values are always strings (column names) to maintain the tidy data format. If you feel curious to know more about tidy data, I would suggest reading Hadley Wickham’s paper “Tidy Data”, published in the Journal of Statistical Software.

Seaborn Regplots:

In terms of core functionality, regplot() is pretty similar to lmplot() and serves the same purpose of visualizing a linear relationship as determined through regression. In the simplest invocation, both functions draw a scatterplot of two variables, x and y, then fit the regression model y ~ x and plot the resulting regression line and a 95% confidence interval for that regression. In fact, regplot() possesses a subset of lmplot()'s features.

It is important to note the differences between these two functions in order to choose the correct plot for your usage:

  • The most evident difference is the shape of the plot, which we shall observe shortly.
  • Secondly, regplot() is flexible about its mandatory input parameters. This means the x and y variables DO NOT necessarily require strings as input. Unlike lmplot(), these two parameters also accept other formats, like simple NumPy arrays or Pandas Series objects, or references to variables in a Pandas DataFrame object passed to data.

The signature of regplot(), as per its official documentation with all its parameters, looks like this:

```python
seaborn.regplot(x, y, data=None, x_estimator=None, x_bins=None, x_ci='ci', scatter=True, fit_reg=True, ci=95, n_boot=1000, units=None, order=1, logistic=False, lowess=False, robust=False, logx=False, x_partial=None, y_partial=None, truncate=False, dropna=True, x_jitter=None, y_jitter=None, label=None, color=None, marker='o', scatter_kws=None, line_kws=None, ax=None)
```

There isn’t much of a visual difference, so let’s quickly plot a regplot() to understand it better. But before we do that, I would like you to note that Seaborn’s regplot() and lmplot() do not support regression against date data, so if we’re dealing with time-series algorithms, please make a careful choice. Also note that lmplot(), which we just finished discussing, is just a wrapper around regplot() and FacetGrid.
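A basic regplot() on the same Tips data, mirroring our earlier lmplot(), would be:

```python
# Simplest regplot() invocation on the Tips dataset:
sns.regplot(x='total_bill', y='tip', data=tips)
plt.show()
```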

For a change, let us also try to plot with NumPy arrays:
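A minimal sketch, passing arrays straight in without building a DataFrame:

```python
# regplot() accepts raw NumPy arrays directly:
rng = np.random.RandomState(2)
x = rng.normal(size=80)
y = x + rng.normal(scale=0.5, size=80)

sns.regplot(x=x, y=y)
plt.show()
```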

The datasets we have dealt with till now have data points arranged pretty neatly, so presenting a linear fit isn’t that cumbersome; but let us now look at a few complex scenarios. The very first one we are going to deal with is fitting a nonparametric regression using a lowess smoother.
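On the Tips data, the lowess fit is a one-parameter change (note that lowess=True requires the statsmodels package to be installed):

```python
# Nonparametric regression with a lowess smoother:
sns.regplot(x='total_bill', y='tip', data=tips, lowess=True)
plt.show()
```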

This is a computationally intensive process, as the method is robust, and hence in the backend Seaborn does not take the ci parameter, i.e. the confidence interval, into consideration. Here the line bends around to give a more precise estimate as per the spread of the data points, as visible. Let us load another built-in dataset available with Seaborn to get a better view of nonlinear regression scenarios:
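The exact dataset used here isn’t preserved; Seaborn’s built-in Anscombe quartet is a natural fit for the description, since its second group follows a curved relationship that a second-order polynomial captures:

```python
# Anscombe's quartet: group II has a clear quadratic trend.
anscombe = sns.load_dataset('anscombe')
sns.regplot(x='x', y='y', data=anscombe.query("dataset == 'II'"),
            order=2,                   # fit a second-order polynomial
            ci=None, scatter_kws={'s': 80})
plt.show()
```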

This majorly helps us explore simple kinds of nonlinear trends by fitting a polynomial regression model: the summary statistics of the linear relationship remain the same, but a simple linear plot would not have been able to trace the curvature in the data. Let me show how it would have looked:
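For comparison, here are the same points with the default first-order (linear) fit, which misses the curve:

```python
# Default linear fit on the same curved data:
sns.regplot(x='x', y='y', data=anscombe.query("dataset == 'II'"),
            ci=None, scatter_kws={'s': 80})
plt.show()
```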

With all that understanding, the last thing we shall get acquainted with is the set of commonly used optional parameters, followed by a combined example after the list:

  • Parameters like x_jitter and y_jitter add random noise to the displayed data points (only in the scatter, not in the fit), which helps with discrete, overlapping values.
  • The color parameter accepts a Matplotlib-style color and applies it to all plot elements.
  • dropna helps to drop NaN (NULL) values before fitting.
  • The x_estimator parameter is useful with discrete variables: it applies a function such as np.mean to each unique value of x and plots the resulting estimate.
  • ci represents the size of the confidence interval drawn when plotting a central tendency for discrete values of x.
  • label assigns a suitable name to either our scatterplot (in legends) or the regression line.
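A hypothetical combination of a few of these, using the discrete size column (party size) from Tips:

```python
# Plot the mean tip per party size with a confidence interval:
sns.regplot(x='size', y='tip', data=tips,
            x_estimator=np.mean,   # central tendency per discrete x value
            ci=95,                 # confidence interval around each estimate
            color='g',             # Matplotlib-style color
            label='Mean tip by party size')
plt.legend()
plt.show()
```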

Thank you for your patience throughout this visually exhaustive article, and I hope to see you in the next one, where we shall delve into Scatterplots and Jointplots. We have gone through two important aspects of regression plotting at length; the third one, i.e. FacetGrid, is parked for now, and we shall take that up very soon as well. For previous articles of this series, please use the links below. Happy Visualizing!

EDIT: Here is Resource Content (List of all the Parts of this Series).

Data Visualization with Python and Seaborn — Part 3

Data Visualization with Python and Seaborn — Part 5
