Data Storytelling to help in your analysis
In this article, I want to show how the book Storytelling with Data helped me build charts that communicate complex datasets through text, figures, and graphics.
Below are two charts that show how this book made me think outside the box. I made the first one before reading the book and the second one after.
The book’s concepts showed me how to convey the information in a dataset to the audience more directly and with less effort, using clean and clear charts.
The book has a section that explains the Gestalt Principles of Visual Perception, which are:
- Proximity
- Enclosure
- Similarity
- Continuity
- Closure
- Connection
Applying these concepts to your charts will help the audience to understand the information more efficiently and focus on what you are showing.
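To make one of these principles concrete, here is a minimal sketch of similarity: bars that share a colour are read as one group, so giving a single bar an accent colour pulls the eye to it. The country names and values are made up for illustration, and the colour choice is my own assumption, not something taken from the book.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up values, only for illustration
countries = ["A", "B", "C", "D"]
values = [4, 9, 6, 3]
focus = "B"  # the bar we want the audience to notice

# Similarity: context bars share one muted colour, the focus bar gets an accent
colors = ["darkorange" if c == focus else "lightgray" for c in countries]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(countries, values, color=colors)

# Remove the frame lines that carry no information
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

ax.set_title(f"Country {focus} stands out", loc="left")
```

The grey bars still give context, but the reader does not have to search for the point of the chart.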
Storytelling with Data also offers ideas about how to express data in different kinds of charts, such as the line graph below, in which I compared the number of new COVID-19 cases in Brazil and the world:
Another concept that helps a lot in expressing the data is using numbers and text to convey the information instead of a chart, as shown in the following figure:
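When a single number carries the message, it helps to turn the raw value into a short, readable string before placing it on the figure. The helper below is a hypothetical sketch (the name `headline_number` and the rounding to one decimal are my own assumptions), not something from the book or from the project.

```python
def headline_number(value: float) -> str:
    """Format a large count as a short headline string (hypothetical helper)."""
    units = [(1e9, "billion"), (1e6, "million"), (1e3, "thousand")]
    for threshold, name in units:
        if abs(value) >= threshold:
            return f"{value / threshold:.1f} {name}"
    # Small values are shown as plain integers
    return f"{value:.0f}"


print(headline_number(1_234_567))      # 1.2 million
print(headline_number(4_200_000_000))  # 4.2 billion
print(headline_number(950))            # 950
```

The returned string can then be passed to `ax.text()` as the large headline of a text-only panel.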
As we have seen, the book repays the investment, and one of the ideas it gave me was to mix numbers and charts.
As figure 5 shows, the idea was to state the total number of people fully vaccinated up to December 8th, 2021 and, next to it, show how the number of fully vaccinated people developed per month.
Now it is time to explain the dataset and make some of these charts. The dataset I used came from the GitHub page of Our World in Data, where you will find an explanation of all the variables (columns) in the dataset. With that clear, let’s go to the project I made about the COVID-19 situation in the world up to December 8th, 2021.
Before showing the script that builds the graphics in figure 5, we first have to import the libraries that help extract, load, and transform the data, and then explore what the dataset has to offer.
# Calling the libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# configuring the graphics
%matplotlib inline
sns.set_style()
# DataFrame until 08/12/2021
PATH = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv"
df_w = pd.read_csv(PATH)
df_w = df_w[df_w.date <= '2021-12-08']
# The first five rows
df_w.head()
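If you prefer pandas to parse the dates while loading, `read_csv` accepts a `parse_dates` argument. The sketch below uses a tiny in-memory stand-in for the real CSV, so the rows and values are invented; only the column names follow the dataset.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV (values are made up for illustration)
csv_text = """location,continent,date,new_cases
Brazil,South America,2021-12-07,100
Brazil,South America,2021-12-08,120
Brazil,South America,2021-12-09,90
"""

# parse_dates turns the date column into a datetime column at load time
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Keep only the rows up to and including December 8th, 2021
sample = sample[sample["date"] <= pd.Timestamp("2021-12-08")]
print(sample)
```

With a datetime column, later steps such as extracting the month work without an extra conversion.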
After this, I inspected the data frame and its data types and made the needed changes. Then, moving forward in the project, I built an initial analysis, including a chart of the smoothed moving average and bar charts of the countries with the highest numbers of new cases and total COVID-19 deaths in the dataset.
To build a chart like figure 5, I first needed to understand the continent variable, that is, which continents appear in that column. To do that, I started the script this way:
# What are the entries in the continent column
df_w['continent'].unique()
The output was:
array(['Asia', nan, 'Europe', 'Africa', 'North America', 'South America',
'Oceania'], dtype=object)
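The `nan` entry is worth a second look. As far as I can tell, the rows without a continent in this dataset are aggregate rows such as "World" or whole continents, not individual countries. A minimal sketch of separating them, with made-up rows, could look like this:

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the structure of the dataset
sample = pd.DataFrame({
    "location": ["Brazil", "World", "Japan", "Europe"],
    "continent": ["South America", np.nan, "Asia", np.nan],
})

# Rows with a continent are individual countries
countries = sample.dropna(subset=["continent"])

# Rows without a continent are the aggregates
aggregates = sample[sample["continent"].isna()]

print(countries["location"].tolist())   # ['Brazil', 'Japan']
print(aggregates["location"].tolist())  # ['World', 'Europe']
```

Note that `groupby('continent')` already skips the rows with a missing continent by default, so the aggregates do not end up in the per-continent results.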
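After that, I built a month column that holds only the month of each entry in the date column. This uses the pandas .dt accessor, which only works on a datetime column, so the date column has to be converted first if it was read as text.
# make sure the date column has a datetime dtype (needed by the .dt accessor)
df_w['date'] = pd.to_datetime(df_w['date'])
# get the month from the column date and put it into the dataframe
df_w['month'] = df_w['date'].dt.month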
With those two steps done, I only needed the number of people fully vaccinated for each continent. To get it, I used the groupby() function.
# Total number of people fully vaccinated for each continent
df_w.groupby(['continent']).people_fully_vaccinated.sum().sort_values(ascending = False)
The output was:
continent
Asia 1.043767e+11
Europe 6.021483e+10
North America 5.600873e+10
South America 2.705435e+10
Africa 4.168491e+09
Oceania 2.053342e+09
Name: people_fully_vaccinated, dtype: float64
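One caveat worth checking: Our World in Data documents people_fully_vaccinated as a running total per country, so summing every daily row counts the same people many times, which is why the totals above are far larger than the world population. A possible alternative, sketched here with made-up data and under the assumption that the latest reported value per country is what we want, is to take the last non-missing value for each country and only then sum by continent.

```python
import pandas as pd

# Made-up cumulative counts for two countries on one hypothetical continent "X"
sample = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "continent": ["X", "X", "X", "X", "X"],
    "date": pd.to_datetime(
        ["2021-12-06", "2021-12-07", "2021-12-08", "2021-12-07", "2021-12-08"]
    ),
    "people_fully_vaccinated": [10.0, 12.0, None, 5.0, 7.0],
})

# Summing every row adds the running totals on top of each other
naive = sample.groupby("continent")["people_fully_vaccinated"].sum()

# Latest reported value per country, then the sum per continent
latest = (
    sample.dropna(subset=["people_fully_vaccinated"])
    .sort_values("date")
    .groupby(["continent", "location"])["people_fully_vaccinated"]
    .last()
)
per_continent = latest.groupby(level="continent").sum()

print(naive["X"])          # 34.0
print(per_continent["X"])  # 19.0
```

In the toy data, country A last reported 12 and country B last reported 7, so the continent total is 19 rather than 34.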
With the data extracted, transformed, and loaded, I just needed to apply the concepts I learned from Storytelling with Data using Matplotlib:
# Dashboard for the number of people fully vaccinated for each continent
sns.set_style('ticks')
fig, (ax, ax1) = plt.subplots(nrows = 1, ncols =2, figsize=(18,6))
# Remove the horizontal gap between the subplots
plt.subplots_adjust(wspace= -.50)
# Text figure
ax.text(0.0,0.900, s = 'South America', size = 30,color = 'DarkSlateGray',weight ='bold')
ax.text(0.0, 0.40, s ='26 billion',size = 65, color = 'DarkOrange', fontweight = 'bold')
ax.text(0.0,0.250, s = ' of people fully vaccinated until December \n 8th of 2021', size = 18,color = 'DarkSlateGray', weight = 'bold')
sns.despine(left = True, bottom = True)
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
# pf_v holds the South America rows that feed the bar plot
pf_v = (df_w.loc[(df_w.continent == 'South America') & df_w.continent.notnull()].sort_values(by ='people_fully_vaccinated', ascending = False)).copy()
# the barplot
pf_plot = sns.barplot(x = 'month', y = 'people_fully_vaccinated', data = pf_v, ax = ax1, color = 'DarkOrange', ci = None)
#ax1.set_title("Total people vaccinated in South America in 2021 \n", loc = 'left', size = 20, color = 'DarkSlateGray')
# hide the axis titles and the y tick labels
ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.set_xticklabels(["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"], fontweight='bold', fontsize=12)
ax1.set_yticklabels([])
sns.despine(left=True, bottom=True)
plt.tight_layout()
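It also helps to know what the bars actually represent. `sns.barplot` does not draw one bar per row: with its default estimator it draws the mean of `y` for each `x` category. The sketch below, with made-up values, reproduces that aggregation in pandas so the bar heights can be inspected as numbers.

```python
import pandas as pd

# Made-up rows: several daily values per month
toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "people_fully_vaccinated": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Same aggregation that sns.barplot applies by default (mean per category)
bar_heights = toy.groupby("month")["people_fully_vaccinated"].mean()

print(bar_heights.loc[1])  # 15.0
print(bar_heights.loc[2])  # 40.0
```

If a different summary fits the story better, for example the maximum per month, `sns.barplot` accepts it through its `estimator` argument.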
So those were the steps and the code I used to turn the data in a CSV file into graphics that convey the information in a straightforward way.
Click on this link to see the whole project and all the analyses I made on this subject with this dataset. If you want to see other projects I have made so far, you can visit my GitHub profile page, and for any feedback, you can contact me on my LinkedIn page.