How To: Visualize Your Data in Python

Published in Analytics Vidhya · 8 min read · Jan 3, 2020

This is part 3 of 3 in the series working with the 2017 running backs, following on from part 2.

Any data-related problem can be split into three steps: getting and cleaning the data, visualizing and analyzing it, and then interpreting the analysis to verify (or refute) a claim.

So far, we’ve learned how to get data from the internet and use Pandas to clean our data and make it ready for analysis. Now, we’ll be learning how to do the next two steps! Let’s learn how to visualize the data so we can draw some meaning from it.

In this How To, we’re going to be learning how to take our already clean DataFrame and create different types of plots, using matplotlib and seaborn, that will give us insight into the 2017 NFL Draft and the running backs that came out of the draft.

Visualizing Our Data

We have our DataFrames, but we can’t just look at the numbers in them to learn about our running backs. This is where visualizing our data comes into play.

Plotting in Python

DataFrames are built to be usable by many of the graphing libraries in Python. These libraries help us take vast amounts of data and plot them in nice looking charts that can show us trends and insights. (And you can make pretty graphs!)

The graphing libraries we are going to use are Matplotlib and seaborn (which is built on top of Matplotlib). There are many other great plotting libraries out there, and I encourage you to explore them yourself.

We will only be going over some of the many features that Matplotlib and seaborn have. There is so much more you can do!

Start by importing the necessary libraries:
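A minimal sketch of the imports the rest of this post relies on. The plt and sns aliases match the calls used below; importing pandas as pd is an assumption, on the basis that the DataFrames from part 2 live in the same session.

```python
import matplotlib.pyplot as plt  # figure and axes handling
import pandas as pd              # DataFrames from the earlier parts (assumed)
import seaborn as sns            # statistical plots built on Matplotlib
```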

Let’s look into how plotting with Matplotlib works. The first thing we’re going to do is create the figure and axes by unpacking the return value of the plt.subplots function.

fig, ax = plt.subplots(figsize=(20,10)) # change the figure size

This gives us separate figure and axes objects that we can then add plots and text (title, labels, etc.) to.
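As a small sketch of what adding text looks like in practice, the axes object exposes setters for the title and the axis labels. The title and label strings here are placeholders, not the ones used for the figures in this post.

```python
import matplotlib.pyplot as plt

# One figure containing a single axes object
fig, ax = plt.subplots(figsize=(20, 10))

# Text is attached to the axes object (placeholder strings)
ax.set_title("Running backs drafted in 2017", fontsize=24)
ax.set_xlabel("Conference")
ax.set_ylabel("Number of players")

# In an interactive session, fig.show() would display the (still empty) plot
```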

Count Plots

We’ll start by looking at our 2017 running backs. Let’s say we are interested in which conferences were the best at getting running backs drafted. We can use seaborn’s count plot to count the unique values of a column and display a plot of those values.

To do so, we assign the result of sns.countplot to our axes variable ax, passing in the necessary arguments: which column we want to count, which DataFrame we are using, and optionally an order for the values being plotted. You can use the axes object to set a title if you want, and then finally call fig.show().
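A sketch of that call, assuming a column named conference. The DataFrame below is a made-up stand-in for the cleaned draft data from part 2, so the counts are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the cleaned draft DataFrame (hypothetical column names)
draft_df = pd.DataFrame({
    "player": [f"RB {i}" for i in range(10)],
    "conference": ["ACC"] * 4 + ["Big 12"] * 3 + ["MW"] * 2 + ["SEC"],
})

# Most common conference first, so the bars run from tallest to shortest
order = draft_df["conference"].value_counts().index

fig, ax = plt.subplots(figsize=(20, 10))
ax = sns.countplot(x="conference", data=draft_df, order=order, ax=ax)
ax.set_title("2017 running backs drafted, by conference")

# fig.show()  # display the figure in an interactive session
```

Passing ax=ax draws onto the axes created by plt.subplots, so the figure size set there is respected.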

The ACC, Big 12, and MW conferences are sending the most running-back talent into the NFL.

You can go crazy with these plots, and show whatever it is you want to show. As one more example, before you try it on your own, let’s look at the share of rounds with a similar count plot.

Nearly 1 in every 4 picks in the 4th round was a running back!

LM Plots and Violin Plots

We’ve looked at the draft data, but this doesn’t tell us how these running backs are doing now. Let’s jump back to the new DataFrames we have that deal with 2018 and 2019 data. Remember, these running backs are just in the second and third years of their careers, respectively, so if a majority of them are performing at a high level, we can conclude that the 2017 running backs are indeed highly skilled.

From the Football Outsiders website:

The simple version: DYAR means a running back with more total value. DVOA means a running back with more value per play.

This time, we’ll start by using lmplot, which plots the data points as well as a line of best fit. With lmplot, we can pass in a column as the hue, which sets the color based on the column values. We want to set the hue to be the drafted_2017 column so we can see the difference between players drafted in 2017 and those who were not. By changing which columns go on the x and y axes, we can compare the two groups on many different statistics.

To graph 2019 data, we can use the same format and change the column names if necessary (Success Rate has a slightly different name), and which DataFrame the data is coming from.
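One way to sketch this reuse is a small helper that wraps sns.lmplot, so that only the DataFrame and the column names change between seasons. The helper names, the Runs/Yards/Yds columns, and the randomly generated numbers are all assumptions for illustration, not the real Football Outsiders data.

```python
import numpy as np
import pandas as pd
import seaborn as sns


def make_toy_stats(seed, yards_col):
    """Build random, made-up rushing stats (not real Football Outsiders data)."""
    rng = np.random.default_rng(seed)
    runs = rng.integers(50, 300, size=40)
    return pd.DataFrame({
        "Runs": runs,
        yards_col: runs * 4 + rng.normal(0, 60, size=40),
        "drafted_2017": np.arange(40) % 3 == 0,  # every third row is "drafted"
    })


def plot_trend(df, x, y):
    """Scatter plot plus one best-fit line per drafted_2017 group."""
    return sns.lmplot(x=x, y=y, hue="drafted_2017", data=df)


# Same call for both seasons; only the DataFrame and a column name change
rb_2018 = make_toy_stats(seed=0, yards_col="Yards")
rb_2019 = make_toy_stats(seed=1, yards_col="Yds")
grid_2018 = plot_trend(rb_2018, x="Runs", y="Yards")
grid_2019 = plot_trend(rb_2019, x="Runs", y="Yds")
```

Note that lmplot is a figure-level function: it creates its own figure and returns a FacetGrid, rather than drawing onto an axes object you pass in.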

Left: On average, 2017 RBs tend to get more yards per run (for running backs that ran more than ~120 times, which is reasonable).

When looking at total value, 2017 RBs seem to be more efficient than non-2017 RBs. Their Yards:DYAR ratio has a higher slope, as does their DYAR:TD ratio, which begins to be better than non-2017 RBs after a running back’s 4th touchdown.

From these graphs, we can see that for the most part, 2017 backs are performing at a higher overall level than other running backs in the league in 2018! The most interesting graph is the DVOA vs. Success Rate graph. The Football Outsiders website tells us that “A player with higher DVOA and a low success rate mixes long runs with downs getting stuffed at the line of scrimmage. A player with lower DVOA and a high success rate generally gets the yards needed, but doesn’t often get more.” This means having both be high is the best-case scenario: your plays are successful, and each play is valuable. In our graph, we see that the 2017 RBs’ slope is higher, indicating a higher Success Rate associated with DVOA than for non-2017 RBs.

How about in 2019? We could assume that these running backs will get better because they’ll have more experience, but we could also assume that other running backs will get better and catch up to the level of ours.

It seems that other RBs are now performing at a very similar level to our 2017 RBs.

But interestingly enough, their value efficiency is still higher than that of other RBs. In fact, both the Yards:DYAR and DYAR:TD ratios are noticeably higher than in 2018.

Well this is interesting. At first glance, it looks like all of the running backs are performing at a very similar level. The estimated yards per run are virtually the same for both groups, and the DVOA vs Success Rate graph seems too clustered around the center. When we look at the value efficiency, however, we can see that 2017 running backs are still doing better. So maybe in 2019, the overall skill gap has decreased.

So sns.lmplot can tell us some cool things. Let’s explore one last type of graph: the violin plot.

A violin plot is a box plot that also shows density within the plot. What this means is not only will it tell us the median and quartiles of the data, but it will also tell us the distribution of the data; whether it is centered around the median, left-skewed vs. right-skewed, etc.

Note: Remember, in our case, the median is a good statistic to look at because it is robust. This means that it is not strongly affected by outliers or other imperfections in the distribution. (The mean, on the other hand, is not robust: a really high outlier could bring up the mean and cause us to think highly of an entire group when that is not the case.)
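To make the robustness point concrete, here is a small sketch with made-up yards-per-attempt values. Adding a single extreme player shifts the mean far more than the median.

```python
from statistics import mean, median

# Made-up yards-per-attempt values for a small group of running backs
ypa = [3.8, 4.0, 4.1, 4.3, 4.5]
ypa_with_outlier = ypa + [12.0]  # one player with an unusually high value

# How far does each statistic move when the outlier is added?
mean_shift = mean(ypa_with_outlier) - mean(ypa)
median_shift = median(ypa_with_outlier) - median(ypa)

print(f"mean moved by   {mean_shift:.2f} yards")
print(f"median moved by {median_shift:.2f} yards")
```

Here the mean moves by a bit over a yard, while the median moves by only about a tenth of a yard.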

You might think that implementing a violin plot is just as easy as the other plots we’ve looked at, and you’d be right! sns.violinplot has some optional arguments that we can pass in, but in our case, we’ll just tell it the data source, and the two axes.
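A sketch of that call, drawing one violin plot per statistic side by side. The YPA and EYPA column names and the randomly generated values are assumptions standing in for the 2018 running back DataFrame.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random, made-up numbers standing in for the 2018 running back DataFrame
rng = np.random.default_rng(42)
rb_2018 = pd.DataFrame({
    "drafted_2017": np.arange(60) % 3 == 0,
    "YPA": rng.normal(4.2, 0.6, size=60),
    "EYPA": rng.normal(4.0, 0.9, size=60),
})

# One violin plot per statistic, side by side
fig, axes = plt.subplots(1, 2, figsize=(20, 10))
for ax, stat in zip(axes, ["EYPA", "YPA"]):
    sns.violinplot(x="drafted_2017", y=stat, data=rb_2018, ax=ax)
    ax.set_title(f"{stat} by draft class")

# The medians drawn inside the violins can also be computed directly
medians = rb_2018.groupby("drafted_2017")[["EYPA", "YPA"]].median()
```

Computing the medians with groupby alongside the plot gives exact numbers to compare, rather than reading them off the figure.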

We can simply change the data source and we’ll also have our 2019 violin plots!

Here we see the distribution and summary statistics of Effective Yards Per Attempt (EYPA) and Yards Per Attempt (YPA) across our two groups (in 2018). Just a year after coming into the league, running backs drafted in 2017 have a higher median in EYPA as well as YPA. EYPA is a full yard more! This in itself wouldn’t be as significant, but the violin plot also tells us that the distribution of these running backs looks roughly normal (except for non-2017 running backs for YPA), meaning a majority of them are near the median.

With that, we can say that the majority of running backs drafted in 2017 had a higher YPA, and more importantly EYPA, than running backs not drafted in 2017. Now let’s see if this holds going into 2019.

Again, we see that the plots look roughly normally distributed, so most running backs are performing close to the median. (However, we still see that for YPA, running backs not drafted in 2017 have an irregular distribution.) The median values have gotten very close: for EYPA the difference between the medians seems to be only ~0.2 yards, while for YPA they look almost the same. This supports what we saw in our lmplots, where the gap between the two groups decreased.

Conclusion

Wow! We were able to make visuals that represent our data, and learned how to draw conclusions from the data. Based on what we gathered, running backs drafted in 2017 took the league by storm in their second year and performed at a much higher level than other running backs. In their third year (this past NFL season), the two groups have evened out in performance, but in a lot of the major categories, our 2017 running backs are still performing a bit higher than others.

I hope this set of tutorials taught you the following things:

Data is all around us; you just need to go out and find what you are looking for.
Using requests and BeautifulSoup, you can query the web and get back HTML (then Pandas allows you to store it as a DataFrame).
Pandas is a great library that allows manipulation of data with its built-in functions (iterrows, loc, etc.).
We can use these DataFrames to visualize our data with matplotlib and seaborn, and to draw conclusions from the data.

That’s the biggest thing to remember: by looking at data, we can figure out things that normally aren’t visible to the naked eye.

The complete code from this tutorial is below:

As always, I’d love to hear what you thought about this post! Send any questions or comments to amanjaiman@outlook.com.