Econometrics with Python pt. 3.3
Scatterplots
For this part of Econometrics with Python, we will focus on scatterplots. Scatterplots are used to view the relationship between two variables. Needless to say, scatterplots are important tools for business analysts, data analysts, and economists. With a little help from certain libraries, Python can produce beautiful and informative scatterplots with ease. Let’s take a look at how we can do that.
Don’t forget that you can download the data and code to follow along with this article here.
We should start with the simplest example. But as always, we are going to read in the data and import the necessary libraries first:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # needed for the sns.* calls later in this article

schools = pd.read_csv('~/Desktop/econometrics_w_python/caschool.csv')
A basic scatterplot with no extra flair can go a long way if you are trying to get a quick look at the relationship between two variables:
# The column is named 'str', so bracket access is clearer than schools.str
plt.plot(schools['str'], schools['testscr'], 'r*')
plt.show()
If you read my previous article on boxplots, you may recognize that ‘r*’ syntax. This format string lets you control the color and shape of the plotted points in plt.plot(): ‘r’ means red and ‘*’ means star-shaped markers. Matplotlib also has a plt.scatter() function, which works similarly to plt.plot():
plt.scatter(schools['str'], schools['testscr'])
plt.show()
The big difference between plt.plot() and plt.scatter() is that plt.plot() can plot a line graph as well as a scatterplot. If we had passed ‘r-’ instead of ‘r*’, matplotlib would have connected the points with a line, in the order they appear in the data, rather than drawing separate markers. Since line charts only really make sense when the x-axis is ordered, typically by some measure of time, we won’t do that. You can change the color of the coordinates by passing a string in the ‘color’ parameter:
plt.scatter(schools['str'], schools['testscr'], color='lightsalmon')
plt.title('Test Scores by Student Teacher Ratio')
plt.xlabel('Student Teacher Ratio')
plt.ylabel('Average Test Scores')
plt.show()
I encourage you all to check out matplotlib’s named color options here. We also added x- and y-axis labels and a title to our scatterplot. Since the goal is to get a sense of the relationship between average test scores and the student-teacher ratio, a regression line might be very helpful for this scatterplot. Matplotlib has no built-in function for this, but we can easily use numpy and plt.plot() to write our own custom function:
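One way to write such a helper is sketched below. The name `lines` matches how the function is referred to later in this article, but the signature and the `color` argument are assumptions made for illustration. It fits a straight line with `np.polyfit` (degree 1) and draws it with `plt.plot()`.

```python
import numpy as np
import matplotlib.pyplot as plt


def lines(x, y, color='black'):
    """Draw the least-squares line of y on x on the current axes.

    Sketch only: the signature and `color` argument are assumptions.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # A degree-1 polyfit returns the coefficients as (slope, intercept).
    slope, intercept = np.polyfit(x, y, 1)
    # Two endpoints are enough to draw a straight line.
    x_ends = np.array([x.min(), x.max()])
    plt.plot(x_ends, slope * x_ends + intercept, color=color)
    return slope, intercept
```

Calling `lines(schools['str'], schools['testscr'])` after `plt.scatter(...)` and before `plt.show()` draws the fitted line on top of the points.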
And now we have the best fit line! This is especially helpful when you are plotting hundreds of observations and it is hard to spot any specific trends. However, we cannot come to any serious conclusions from this graph.
The scatterplots above are great for our own use, but they are lacking flair. This is where seaborn can be very helpful. Let's make our scatterplot prettier with seaborn:
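A minimal sketch of that step could look like the following. The helper name `pretty_scatter` and its arguments are hypothetical. It relies only on `sns.set_style()`, `sns.scatterplot()`, and `sns.despine()`.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def pretty_scatter(data, x, y, color='lightsalmon'):
    """Seaborn scatterplot on a white background without top/right spines."""
    sns.set_style('white')  # plain white background, no grid lines
    ax = sns.scatterplot(x=x, y=y, data=data, color=color)
    sns.despine()           # remove the top and right borders
    return ax
```

With the data from this article, the call would be `pretty_scatter(schools, 'str', 'testscr')`, followed by the usual `plt.xlabel()`, `plt.ylabel()`, and `plt.show()` calls.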
If you read my last article (pt. 3.2) you know that I like to set the background of my plots to white and ‘despine’ my plots. I think it looks much cleaner and is easier on the eyes. Seaborn’s scatterplot also has a hue parameter, in which you can pass a categorical variable:
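As a sketch, passing a column name to `hue` is all it takes. The helper name `scatter_by_group` is hypothetical, and which categorical column to use is up to you. The `smallclass` column that appears later in this article is one candidate, assuming it already exists in `schools`.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def scatter_by_group(data, x, y, hue):
    """Scatterplot with points colored by the categorical column `hue`."""
    sns.set_style('white')
    # Each distinct value in the hue column gets its own color and legend entry.
    ax = sns.scatterplot(x=x, y=y, hue=hue, data=data)
    sns.despine()
    return ax
```

For example: `scatter_by_group(schools, 'str', 'testscr', 'smallclass')`.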
Now we’ve added a little more complexity to our scatterplot. We can see a legend was generated for us. In this case, the default location of the legend is actually not where we want it to be. It looks like there is a perfect spot for the legend in the lower right-hand corner, so let’s move it there:
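One way to do this, sketched under the assumption that the plot was drawn with `sns.scatterplot()` and a `hue` column, is to redraw the legend on the returned axes with matplotlib's `loc` argument. The helper name is hypothetical.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def scatter_legend_lower_right(data, x, y, hue):
    """Hue-colored scatterplot with its legend in the lower right corner."""
    ax = sns.scatterplot(x=x, y=y, hue=hue, data=data)
    # Redraw the automatically generated legend in the chosen corner;
    # the title is passed again so it is not lost when the legend is rebuilt.
    ax.legend(loc='lower right', title=hue)
    sns.despine()
    return ax
```

Other valid `loc` strings include 'upper left', 'upper right', 'lower left', and 'best'.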
Seaborn’s scatterplot has another parameter, ‘style’, that works in the same way as the hue parameter, only it sets the shape of the marker based on a categorical variable. You can use ‘hue’ and ‘style’ together and create a very informative scatterplot:
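A sketch of combining the two parameters follows. The helper name, the column arguments, and the default figure size of 10 by 7 inches are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def scatter_hue_style(data, x, y, hue, style, figsize=(10, 7)):
    """Scatterplot with color set by `hue` and marker shape set by `style`."""
    # A larger figure makes the different marker shapes easier to tell apart.
    plt.figure(figsize=figsize)
    ax = sns.scatterplot(x=x, y=y, hue=hue, style=style, data=data)
    sns.despine()
    return ax
```

The legend seaborn generates then has one section for the colors and one for the marker shapes.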
We also make the size of the graph larger so that the different shapes are more easily seen. Our scatterplot now gives us a lot of information, maybe too much information for one scatterplot. One way to do a similar analysis without having too much visual complexity is to build scatterplots based on a subset of the dataset:
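A subset like this can be built with `DataFrame.query()`. The sketch below assumes the frame has a `county` column, and the helper name `county_subset` is hypothetical.

```python
import pandas as pd


def county_subset(data, counties):
    """Return only the rows whose county is in the given list."""
    # Inside a query string, @name refers to a local Python variable.
    return data.query('county in @counties')
```

The result can be passed straight to seaborn, for example `sns.scatterplot(x='str', y='testscr', hue='county', data=county_subset(schools, ['Santa Clara', 'Los Angeles']))`.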
Now we are only looking at schools that are in Santa Clara or Los Angeles County. We do this by using Pandas’ query method. This is another way to compare how the relationship between two variables may change across different categories.
Regplots
In addition to scatterplots, seaborn also has a method called ‘regplot’. Regplots are essentially scatterplots with a regression line fitted in the plot as well. There are many cool features in regplots, so let's go over a few.
Let's start with a basic example:
sns.regplot(x='expn_stu', y='testscr', data=schools)
plt.xlabel('Expenditure Per Student')
plt.ylabel('Average Test Score')
sns.despine()
plt.show()
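As a default, regplot provides the 95% confidence interval of the regression estimate. We can easily remove the confidence interval by setting the ‘ci’ parameter to None. Passing False also hides the band, but seaborn’s documentation only lists an integer or None, and as far as I can tell False is treated like a 0% interval, so the bootstrap still runs. For that reason I use None here:
sns.regplot(x='expn_stu', y='testscr', data=schools,
            ci=None)
plt.xlabel('Expenditure Per Student')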
plt.ylabel('Average Test Score')
sns.despine()
plt.show()
The ci parameter takes any number between 0 and 100, so you could pass 99 for a 99% confidence interval of the regression estimate as well. I do not recommend showing the confidence interval if the dataset is very large: seaborn estimates it by bootstrapping, which can be time-consuming with lots of data.
We can use matplotlib keywords to change the color of the regression line:
sns.regplot(x='expn_stu', y='testscr', data=schools,
            ci=None, line_kws={'color': 'plum'})
plt.xlabel('Expenditure Per Student')
plt.ylabel('Average Test Score')
sns.despine()
plt.show()
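Another relevant feature of regplot is the ‘robust’ parameter. If set to True, seaborn uses statsmodels to fit a robust regression, which down-weights outliers when estimating the line. Note that this is not the same thing as heteroskedasticity-robust standard errors, which adjust the standard errors and leave the fitted line unchanged. Robust fitting is also noticeably slower, so it helps to keep the confidence interval switched off. So if you want to be extra robust (pardon the pun) you can use this parameter like so:
sns.regplot(x='expn_stu', y='testscr', data=schools,
            ci=None, line_kws={'color': 'plum'},
            robust=True)
plt.xlabel('Expenditure Per Student')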
plt.ylabel('Average Test Score')
sns.despine()
plt.show()
You can see that the line doesn’t look much different, which is what you should expect whenever the data has few influential outliers. However, I think it’s really cool that this feature is there.
Usually when I present a scatterplot with a regression line, I like to show the r-squared coefficient somewhere on the graph. This can be done easily with plt.annotate():
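A sketch of how this could be put together is below. The helper name `regplot_with_r2`, the label text, and the three-decimal rounding are assumptions. The model is fitted with statsmodels' formula API and the text is placed with `Axes.annotate()`, which is the axes-level equivalent of `plt.annotate()`.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols


def regplot_with_r2(data, x, y, xy):
    """Regplot of y on x with the adjusted R-squared written at `xy`."""
    # Fit the same simple regression that the plot shows.
    model = ols(f'{y} ~ {x}', data=data).fit(cov_type='HC3')
    ax = sns.regplot(x=x, y=y, data=data, ci=None)
    # `xy` is in data coordinates, so pick an empty region of the plot.
    ax.annotate(f'Adj. R-squared = {model.rsquared_adj:.3f}', xy)
    sns.despine()
    return ax, model.rsquared_adj
```

For the plot in this section, the call would be `regplot_with_r2(schools, 'expn_stu', 'testscr', (4000, 700))`.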
The tuple containing (4000, 700) represents the x and y coordinates of the text to annotate. I just eyeballed this in case anyone is wondering where those come from. I normally just look for some empty space on the graph and pass the xy coordinates of where that space might be.
Also, you may be wondering about this code here:
from statsmodels.formula.api import ols

model1 = ols('testscr ~ expn_stu', data=schools).fit(cov_type='HC3')
This is one way to create a regression model in Python. I will go over this syntax in much more detail in the next article of this series! For now just know that we access the r-squared (rather, the adjusted r-squared) after building the model like this:
model1.rsquared_adj
seaborn.regplot() does not have some of the features that I like from seaborn.scatterplot(). Sometimes I like to use seaborn.scatterplot() in conjunction with the ‘lines’ function we wrote earlier rather than using seaborn.regplot():
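Here is a sketch of that combination. So that it runs on its own, it repeats the `np.polyfit` step inline instead of calling the `lines` helper. The function name and the black line color are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def scatter_with_fit(data, x, y, hue):
    """Hue-colored scatterplot plus one least-squares line for all points."""
    ax = sns.scatterplot(x=x, y=y, hue=hue, data=data)
    # The line is fitted to the whole dataset, ignoring the hue groups.
    slope, intercept = np.polyfit(data[x], data[y], 1)
    x_ends = np.array([data[x].min(), data[x].max()])
    ax.plot(x_ends, slope * x_ends + intercept, color='black')
    sns.despine()
    return ax, slope, intercept
```

This keeps the `hue` and `style` options of seaborn.scatterplot() while still showing the overall trend.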
Plotting a regression line on a scatterplot that colors the coordinate points based on a category helps me see trends that might have otherwise gone unnoticed.
Bonus: LM plots
Seaborn.lmplot() provides a way to fit different regression lines across subsets of data. I just wanted to breeze over lmplots, because I know they can be useful in special situations. Let’s look at a basic example:
sns.lmplot(x='expn_stu', y='testscr',
           hue='smallclass',
           data=schools,
           ci=None)
plt.xlabel('Expenditure Per Student')
plt.ylabel('Average Test Score')
sns.despine()
plt.show()
This plot produces some interesting (and frankly unexpected) results. I would caution you not to make any strong inferences from this graph, however. If you really want to see the difference in slopes between two categories, I would recommend using an interaction term in a regression model and looking at the results.
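As a rough sketch of that suggestion, an interaction term can be written with `*` in a statsmodels formula. This assumes `smallclass` is coded as a 0/1 indicator, and the helper name `fit_interaction` is hypothetical.

```python
from statsmodels.formula.api import ols


def fit_interaction(data):
    """Regress testscr on expn_stu, smallclass, and their interaction."""
    # In a formula, `a * b` expands to a + b + a:b.
    formula = 'testscr ~ expn_stu * smallclass'
    return ols(formula, data=data).fit(cov_type='HC3')
```

With a 0/1 `smallclass`, the coefficient named `expn_stu:smallclass` in `fit_interaction(schools).params` estimates the difference in slopes between the two groups, and its standard error tells you how seriously to take that difference.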
We can also control the color of the lmplot like so:
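A sketch using the `palette` parameter of `sns.lmplot()` is shown below. The helper name is hypothetical, and the colors are whatever you pass in. A list of matplotlib color names such as `['plum', 'lightsalmon']` is one option, chosen here only as an example.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def colored_lmplot(data, x, y, hue, palette):
    """lmplot with one regression line per hue level, in the given colors."""
    # `palette` may be a list of colors, a dict mapping hue levels to colors,
    # or the name of a seaborn palette.
    g = sns.lmplot(x=x, y=y, hue=hue, data=data, ci=None, palette=palette)
    sns.despine()
    return g
```

For example: `colored_lmplot(schools, 'expn_stu', 'testscr', 'smallclass', ['plum', 'lightsalmon'])`, followed by the usual axis labels and `plt.show()`.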
Now we have an lmplot with some nicer looking colors! You may have noticed that in some of the code, there is a line like plt.savefig('scat17.png'). This is for me to save the graphs I made for this article as PNG files. If your only goal is to display the graph inline, you can remove that line of code.
That concludes this article on scatterplots! In fact, that concludes all the data visualization for this series. I may include an article that focuses specifically on visualizing regression residuals (for diagnostic purposes) but we will be moving on from general data viz now. I will surely write more about seaborn, matplotlib, and other graphing libraries in the future, just not in this tutorial series.
Thank you to everyone who has been following this series and to everyone who has read this article. I sincerely appreciate your support, and I really hope this article helped you in some way. If there are any questions, please leave them in the comments below. I will be sure to answer them. In the next article, we will finally discuss simple and multiple regression in Python. Stay tuned for that!