Telling a Story in Seaborn: Tips and Tricks

Berke Tezcan
Published in Analytics Vidhya
7 min read · Apr 2, 2021

Why visualizations matter and a few insights from a new Data Scientist

It has been about a month since I started my data science journey with the Flatiron School and I have realized, over the past several weeks, how important it is for a data scientist to make compelling visualizations. You may have the best analysis in the world but if you can’t tell the story behind it with clear and easy-to-understand visualizations, you will lose your audience very quickly. That’s why when we were tasked with our first project for the data science program, I wanted to make sure that I had clear visualizations that were specifically designed with my audience in mind.

The project posed a hypothetical business problem where Microsoft wanted to get involved in the movie industry by starting their own movie studio. To be able to compete and be successful in this new territory, they were hiring us to come up with actionable insights for them to implement in this new endeavor by analyzing movie data from online databases such as IMDb and The Numbers. I was going to be presenting my information to the Microsoft board and not a room filled with data scientists or statisticians, so I had to make everything as non-technical as possible.

After many hours of cleaning the data and getting everything ready to plot, I had to decide which visualization library to use. Do I keep everything simple and use matplotlib’s pyplot, or use seaborn (which is built on top of matplotlib), or something entirely different like plotly? I ended up deciding on seaborn because of how customizable everything is and the visually appealing options it offers.

The first relationship that I wanted to look at was between the release month and the gross revenue of movies. I wanted to convey the message that movies released in certain months had higher median revenues. I picked a boxplot because I wanted to show the spread of the data as well as the median revenue for each month, so I just plugged them into seaborn’s boxplot and voilà, a perfect visualization! Well… not really. It was all colorful but the graph was kind of a mess. Just to name a few problems: the axis labels weren’t clear, the months were all out of order, outliers were showing up, and the list goes on… In order for this visualization to tell a story, I needed to address these issues. Here’s the thought process I went through when making these changes, as well as my tips and tricks for seaborn. Let’s dive in!

import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(10,5))
sns.boxplot(x=imdb_tn_filtered['release_month'],
            y=imdb_tn_filtered['worldwide_gross'], ax=ax)

To start off, we can hide the outliers since they aren’t adding anything to the relationship we are trying to show but rather making it difficult to see. Seaborn’s boxplot function accepts an argument called showfliers (it is forwarded to matplotlib’s boxplot drawing) that allows you to exclude the outliers from the graph. The points stay in the data; they just aren’t drawn.

sns.boxplot(x=imdb_tn_filtered['release_month'],
            y=imdb_tn_filtered['worldwide_gross'], ax=ax,
            showfliers=False)
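
For context on what showfliers=False leaves out: by default, matplotlib draws the whiskers out to the furthest points that lie within 1.5 times the interquartile range of the box, and anything beyond that is drawn as a flier. Here is a rough sketch of that rule applied by hand. The revenue figures and the helper name flier_bounds are made up for illustration and are not taken from the project data.

```python
from statistics import quantiles

def flier_bounds(values, whis=1.5):
    """Return the (low, high) limits beyond which a point counts as a flier."""
    # method='inclusive' interpolates linearly between the data points
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# Hypothetical worldwide grosses in dollars, for illustration only
grosses = [20e6, 35e6, 50e6, 65e6, 80e6, 95e6, 110e6, 900e6]

low, high = flier_bounds(grosses)
fliers = [g for g in grosses if g < low or g > high]
print(fliers)  # only the 900 million title falls outside the limits
```

Hiding the fliers only changes what is drawn. The extreme titles still count toward the quartiles and the median that each box shows.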

Much better, but it is still very difficult to see the relationship between revenue and the release months since all the months are out of order. To fix this issue we need to create a list telling seaborn the exact order we would like to have on the x-axis and pass it into the argument called… you guessed it: “order.”

order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
sns.boxplot(x=imdb_tn_filtered['release_month'],
            y=imdb_tn_filtered['worldwide_gross'], ax=ax,
            showfliers=False, order=order)
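
If you would rather not type the month labels by hand, Python’s standard calendar module can generate them. Treat this as a convenience sketch rather than what I did in the project: it assumes the release_month column holds English three-letter abbreviations, and calendar.month_abbr follows the active locale, so it only matches the list above under the default (English) locale.

```python
import calendar

# month_abbr[0] is an empty string, so skip it and keep the 12 real months
order = list(calendar.month_abbr)[1:]
print(order)
```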

Now we’re getting somewhere! Take a look at the summer months. Their medians are noticeably higher than the rest of the months. This is definitely an important takeaway for Microsoft and their future studio’s success. The movies that they’ll be making should be released in the summer months, especially May followed by June, since historically these are the months in which successful movies have had the highest median revenue.

Since we will be presenting this to the Microsoft board, the rainbow colors are, in my opinion, a bit overwhelming and confusing. When I first look at this graph I may think that the different colors represent different variables but they don’t. So there is no reason to make each boxplot a different color. That being said, we still want the graph to have visual appeal to it, so we can use a gradient. Here, I picked a color that I thought looked professional and did a gradient from “light” to that color (#5A9) and then reversed it with “_r” to have January be the darkest and December be the lightest.

sns.boxplot(x=imdb_tn_filtered['release_month'],
            y=imdb_tn_filtered['worldwide_gross'], ax=ax,
            showfliers=False, order=order,
            palette="light:#5A9_r")
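
To preview what a palette string like this expands to before committing to it, one option is to ask seaborn for the colors directly. This is a small sketch, assuming a seaborn version that understands the "light:<color>" palette syntax (0.11 or later, as far as I know). The n_colors=12 is my choice here so that there is one color per month.

```python
import seaborn as sns

# One color per month; "_r" reverses the light-to-#5A9 gradient
colors = sns.color_palette("light:#5A9_r", n_colors=12)

# as_hex() turns the RGB tuples into hex strings that are easy to eyeball
print(colors.as_hex())
```

Dropping the “_r” suffix gives the same gradient running the other way, from light to the chosen color.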

We are definitely close but still have an issue: it’s still difficult for a non-technical person to be able to tell that the line in the middle of a boxplot is the median value. We need to highlight those lines somehow. We can easily achieve this by computing the median value for each month and drawing a pointplot of those medians on the same figure and axes that the boxplot is on.

medians = imdb_tn_filtered.groupby('release_month')['worldwide_gross'].median().reset_index()
# sort_values returns a new DataFrame (handy for inspecting the ranking); medians itself is unchanged
medians.sort_values(by='worldwide_gross', ascending=False)
sns.boxplot(x=imdb_tn_filtered['release_month'],
            y=imdb_tn_filtered['worldwide_gross'], ax=ax,
            showfliers=False, order=order,
            palette="light:#5A9_r")
sns.pointplot(data=medians, x='release_month',
              y='worldwide_gross', order=order, ax=ax,
              color='black')
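
To make the groupby step a bit more concrete, here is a toy version with a hypothetical five-row DataFrame standing in for imdb_tn_filtered (the numbers and the names movies and ranked are invented for illustration). It shows the shape of the medians table and that sort_values hands back a sorted copy rather than changing medians.

```python
import pandas as pd

# Hypothetical stand-in for imdb_tn_filtered, for illustration only
movies = pd.DataFrame({
    'release_month': ['May', 'May', 'Jan', 'Jan', 'Jan'],
    'worldwide_gross': [300e6, 500e6, 40e6, 60e6, 200e6],
})

# One row per month with the median gross; groupby sorts the months alphabetically
medians = movies.groupby('release_month')['worldwide_gross'].median().reset_index()
print(medians)

# sort_values returns a sorted copy; medians keeps its original row order
ranked = medians.sort_values(by='worldwide_gross', ascending=False)
print(ranked['release_month'].tolist())
```

Because groupby returns the months in alphabetical order, the pointplot still needs the same order list as the boxplot so that the medians line up with the right boxes.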

Nice! It is now much easier to see the actual relationship of median revenue between different months with the spread in the background for reference. But we are not done yet. We still have to fix our axis labels and give a title for the figure so the audience understands what they are looking at. We can do this by using the axes (ax) that the graph is on as follows:

ax.set_xlabel('Release Month')
ax.set_ylabel('Total Revenue ($)')
ax.set_title('Release Month vs. Total Revenue for Movies Released Between 2009–2019')

With this adjustment our labels make sense and anyone can easily understand what the graph is supposed to be displaying. “But wait! What about our y-axis numbers??” you may be asking. You are absolutely right. Currently it seems like our movies’ median revenues are ranging between 10 cents and 40 cents. Of course, this is not the case: matplotlib has pulled a scale factor of 1e9 out into a small label at the top of the axis, so a 0.1 actually denotes $100 million. We should adjust the formatting of our y-axis numbers by using matplotlib’s FuncFormatter and attaching it to the y-axis. Here’s how (to read more on it check out this link):

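from matplotlib.ticker import FuncFormatter
def millions(x, pos):
    # x is the raw tick value in dollars, pos is the tick position (unused)
    return '%1.0fM' % (x * 1e-6)
formatter = FuncFormatter(millions)
ax.yaxis.set_major_formatter(formatter)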
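
A quick way to sanity-check a formatter like this before attaching it to a plot is to call it directly with a few values. The tick values below are ones I made up for the check. The function itself is the same millions helper, repeated so the snippet runs on its own.

```python
from matplotlib.ticker import FuncFormatter

def millions(x, pos):
    """Format a raw dollar amount as whole millions (pos is the tick index, unused)."""
    return '%1.0fM' % (x * 1e-6)

formatter = FuncFormatter(millions)

# Calling the formatter directly shows the label a tick at that value would get
labels = [formatter(value, i) for i, value in enumerate([0, 100e6, 250e6, 400e6])]
print(labels)  # ['0M', '100M', '250M', '400M']
```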

Finally, we can pick an appealing style for our graph. If you already know what kind of style you would like to use, you can simply wrap your plotting code in a with plt.style.context(...) block and pass in the name of your chosen style. However, if you don’t know which style you want to use, here’s a nifty trick. By importing interact from ipywidgets and defining a function called plot_style, you can pick from a dropdown menu of the available styles and change your figure in real time. I think we can go with ggplot in this case since the horizontal gridlines help with seeing the corresponding y-axis values and the darker background makes our plots pop. So here’s our final product:

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter

# setting up an order list for x-axis ticks
order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# setting up a function to format y-axis values
def millions(x, pos):
    return '%1.0fM' % (x * 1e-6)

formatter = FuncFormatter(millions)

# computing the median values for the point plot
medians = imdb_tn_filtered.groupby('release_month')['worldwide_gross'].median().reset_index()
# sort_values returns a new DataFrame (handy for inspecting the ranking); medians itself is unchanged
medians.sort_values(by='worldwide_gross', ascending=False)

# plots
with plt.style.context('ggplot'):
    fig, ax = plt.subplots(figsize=(10,5))
    sns.boxplot(x=imdb_tn_filtered['release_month'],
                y=imdb_tn_filtered['worldwide_gross'], ax=ax,
                showfliers=False, order=order,
                palette="light:#5A9_r")

    sns.pointplot(data=medians, x='release_month',
                  y='worldwide_gross', order=order, ax=ax,
                  color='black')
    ax.set_xlabel('Release Month')
    ax.set_ylabel('Total Revenue ($)')
    ax.set_title('Release Month vs. Total Revenue for Movies Released Between 2009-2019')
    ax.yaxis.set_major_formatter(formatter)
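
The style dropdown trick mentioned before the final listing isn’t spelled out above, so here is one way it might look. This is a sketch under a few assumptions: the body of plot_style is my guess at the idea, the line chart is placeholder data standing in for the boxplot and pointplot calls, and the dropdown only appears when the code runs in a Jupyter notebook with ipywidgets installed.

```python
import matplotlib.pyplot as plt

def plot_style(style):
    """Redraw a small placeholder chart using the given matplotlib style."""
    with plt.style.context(style):
        fig, ax = plt.subplots(figsize=(10, 5))
        # Placeholder data; the real figure would use the seaborn calls instead
        ax.plot([1, 2, 3], [100e6, 250e6, 400e6], marker='o')
        ax.set_title(f'Style: {style}')
        plt.show()

try:
    # In a notebook, passing a list makes interact build a dropdown of its items
    from ipywidgets import interact
    interact(plot_style, style=plt.style.available)
except ImportError:
    # Without ipywidgets, fall back to drawing a single style
    plot_style('ggplot')
```

Each time a new entry is picked from the dropdown, interact calls plot_style again with that style name, which is what makes the figure update on the spot.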

To conclude, it is vital for data visualizations to tell a story. When the figures are tailored to their target audience, they can reflect the findings more easily and have more of an impact. Although this is not an exhaustive guide, making sure to pick the best type of figure for the relationship being explored as well as having relevant axis labels, a clear title, a legend (where one applies) and a visually appealing style will help in creating striking visuals. Oh, and don’t forget, if you happen to start a movie studio, make sure your movies are released in May or June!

For more information:

Check out my full notebook and analysis on GitHub by clicking here.

Berke Tezcan
Aspiring data scientist, video game enthusiast, bookworm, engineer. linkedin.com/in/ebtezcan/