Basic Data Viz with Matplotlib

Joel Sherman
Jul 9, 2020

So I’ve learned some of the basics of Pandas. I can explore my data, examine and treat missing values, pivot and summarize the frame, do some indexing, subsetting, etc. I practiced and wrote about that experience here.

But I’m a visual learner (and I bet you are too) and I now want to use Python’s matplotlib package to produce some basic plots. The purpose of this article is to review, at a very basic and high level for new learners, the utility and implementation of four (4) common plot types:

  1. Histograms
  2. Bar Graphs
  3. Line Plots
  4. Scatter Plots

Once again, I’ll work with data that I track and consolidate on my cycling training, which is available on my public GitHub account. After some subsetting and tidying, I’ll call my data frame ‘df’. Shocking, I know.
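
If you want to run the snippets in this article without the real training log, here is a minimal sketch of a stand-in data frame. Everything in it is an assumption made for illustration: the values are random, the ‘TrainType’ labels (‘Rest’, ‘Easy’, ‘Hard’) are hypothetical, and only the column names are taken from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real training log: the values are random
# and only the column names match the ones used in this article.
rng = np.random.default_rng(seed=0)
n_days = 60

df = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=n_days, freq="D"),
    # The three labels below are made up; the real data may use others.
    "TrainType": rng.choice(["Rest", "Easy", "Hard"], size=n_days),
    "SleepHrs": rng.normal(loc=8.0, scale=0.5, size=n_days).round(1),
    "RecScore": rng.uniform(low=5.0, high=9.0, size=n_days).round(1),
})

# Rest days get zero training stress; other days get a random TSS value.
tss = rng.integers(low=20, high=120, size=n_days)
df["TrainTSS"] = np.where(df["TrainType"] == "Rest", 0, tss)

print(df.head())
```

Because the numbers are random, plots made from this frame will not match the patterns described in the rest of the article.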

Histograms

Histograms are one of the most effective plot types for showing the distribution, or spread, of continuous numerical data, and are useful when initially exploring a new dataset. My dataset has a numerical feature called ‘TrainTSS’, which measures how stressful a particular training day was for me in terms of intensity and duration. Because it’s usually unproductive (and impossible, given that I’m a father of young children, work at home, and am learning data science!) to train every day, I’m going to look at a histogram of my ‘TrainTSS’ for only the days that I’ve trained (i.e., TSS is not zero). We call the pandas .hist() method, which uses matplotlib under the hood:

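# import pyplot, used for the labels and title throughout this article
import matplotlib.pyplot as plt
# create the histogram with 5 bins, keeping only the days with a ride
df.loc[df['TrainTSS'] != 0, 'TrainTSS'].hist(bins=5)
# add x and y axis labels, and a title
plt.xlabel('TSS')
plt.ylabel('# of Rides')
plt.title('Histogram of Ride Stress')
# show it
plt.show()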
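
To see what the bins argument is doing, here is a small sketch that computes the bin edges and counts directly with NumPy’s histogram function, which splits the range between the smallest and largest value into equal-width intervals. The TSS values are made up for illustration and are not taken from my training log.

```python
import numpy as np
import pandas as pd

# Made-up TSS values for illustration; zeros stand for rest days.
tss = pd.Series([0, 35, 42, 48, 0, 55, 61, 44, 90, 47, 0, 38])

# Keep only the days with a ride, mirroring the filter used above.
rides = tss[tss != 0]

# bins=5 splits the range [min, max] into 5 equal-width intervals.
counts, edges = np.histogram(rides, bins=5)

# Print each interval with the number of rides that fall inside it.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count} rides")
```

Each printed row corresponds to one bar in the plotted histogram, so changing bins changes how finely the same rides are grouped.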

Pretty easy. And I’m referring to both plotting a histogram and looking at my ride stress since the beginning of the year. No wonder I’m slowly losing fitness! My typical ride is only between 40 and 50 TSS, which is low for me and not enough to cause the adaptations that I need to get faster. But I digress. Let’s look at bar graphs now.

Bar Graphs

Bar graphs are typically used to compare a numerical measure (often one of central tendency, like the average or median) across the values of a dimension column. For example, we could compare the average height (measure) between men and women (gender dimension) using a bar graph. In my dataset, I have a feature called ‘TrainType’, an object-type column that describes how intense my training day was, as well as a feature called ‘SleepHrs’, which reports the hours I slept the night before a training day. Let’s see how my average ‘SleepHrs’ varied by ‘TrainType’ using the .plot() method:

# calculate avg sleep hours for each train type day
df_grouped = df.groupby('TrainType')['SleepHrs'].mean()
# create the bar graph
df_grouped.plot(kind = 'bar')
# add x and y axis labels, and a title
plt.xlabel('Type')
plt.ylabel('Avg Sleep Hours')
plt.title('Avg Sleep in Hours by Training Day Type')
# show it
plt.show()
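
It can help to look at what groupby hands to the plot before drawing it. The sketch below uses a tiny made-up frame (the ‘TrainType’ labels and sleep values are hypothetical) to show that the result is a Series with one average per training type, and that .plot() returns a matplotlib Axes object that can be labeled directly instead of going through plt.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Tiny made-up frame; the labels and hours are for illustration only.
toy = pd.DataFrame({
    "TrainType": ["Rest", "Hard", "Easy", "Hard", "Rest", "Easy"],
    "SleepHrs": [8.0, 7.5, 8.5, 8.5, 7.0, 8.0],
})

# One mean per group; the group labels become the index of the Series.
avg_sleep = toy.groupby("TrainType")["SleepHrs"].mean()
print(avg_sleep)

# .plot() returns the Axes it drew on, so labels can be set on it directly.
ax = avg_sleep.plot(kind="bar")
ax.set_xlabel("Type")
ax.set_ylabel("Avg Sleep Hours")
ax.set_title("Avg Sleep in Hours by Training Day Type")
plt.show()
```

The index of the grouped Series supplies the bar labels, which is why no x argument is needed in the bar plot call.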

Super boring stuff (I’m referring to my sleep patterns, not creating the bar graph). I maintain a pretty steady rhythm of about 8 hours of sleep per night, whether I have a rest day or an intense training day the next day. Sleep is vital and not to be messed with. OK, onward. Next up are line plots.

Line Plots

Line plots are typically used to plot a numerical measure over time. As an econometrician, I find them indispensable because I work almost exclusively with time series data. In this dataset, I’ll plot a daily time series of the ‘RecScore’ feature. ‘RecScore’ is a calculated score of my recovery for a particular day, on a scale between 1 and 10, with higher values representing higher parasympathetic activity, lower stress, and better recovery (in general). It’s calculated by a tool that I use called HRV4Training, which takes in heart rate variability data from a ring that I wear while sleeping. More information on those can be found here and here. Here again, we use the .plot() method:

# create the line plot
df.plot(x='Date', y='RecScore', kind='line')
# add x and y axis labels, and a title
plt.xlabel('Date')
plt.ylabel('Recovery Score')
plt.title('Trend in Recovery Score')
# show it
plt.show()
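
A line plot over time works best when the date column holds real datetimes and the rows are in date order. The sketch below shows one way to make sure of both before plotting. It assumes the dates arrive as text (as they often do from a CSV file), and the values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up log where the dates are strings and out of order.
log = pd.DataFrame({
    "Date": ["2020-01-03", "2020-01-01", "2020-01-02", "2020-01-04"],
    "RecScore": [7.2, 6.8, 7.5, 7.0],
})

# Convert the strings to datetimes, then sort so the line runs
# left to right in time instead of jumping back and forth.
log["Date"] = pd.to_datetime(log["Date"])
log = log.sort_values("Date")

ax = log.plot(x="Date", y="RecScore", kind="line")
ax.set_ylabel("Recovery Score")
plt.show()
```

If the column is already a sorted datetime column, these two steps change nothing and can be skipped.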

While January 2020 had a few measurements that look like anomalies, my recovery score has mostly held steady at about 7 since February, without much variability. And now, let’s look at scatter plots.

Scatter Plots

Just as a line plot displays a measure over time, a scatter plot displays each data point at the intersection of two measures, which makes it very useful for examining the relationship between two numerical features. Here, I’ll make a scatter plot of my training stress (‘TrainTSS’) against my recovery score (‘RecScore’) to see if there is any relationship. Once again, we’ll use the .plot() method:

# create the scatter plot
df.plot(x='TrainTSS', y='RecScore', kind='scatter')
# add x and y axis labels, and a title
plt.xlabel('TSS')
plt.ylabel('Recovery Score')
plt.title('Has Ride Stress Been Impacting my Recovery Score the Next Day?')
# show it
plt.show()
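
The plot title asks about the recovery score on the next day. If ‘RecScore’ in a given row were the same-day score rather than the next-day one (the article does not say which, so this is an assumption), one way to line each ride up with the following day’s score is to shift the column by one row. The sketch below uses made-up values, assumes one row per consecutive day, and also prints a Pearson correlation as a rough numerical check on what the scatter plot shows.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data, one row per consecutive day.
toy = pd.DataFrame({
    "Date": pd.date_range("2020-03-01", periods=6, freq="D"),
    "TrainTSS": [0, 45, 80, 0, 50, 40],
    "RecScore": [7.0, 7.5, 6.5, 6.0, 7.5, 7.0],
})

# shift(-1) moves each following day's score up onto the current row,
# so every ride sits next to the recovery score recorded the day after.
toy["NextDayRec"] = toy["RecScore"].shift(-1)

# The last day has no following day, so drop that row.
paired = toy.dropna(subset=["NextDayRec"])

# Pearson correlation between ride stress and next-day recovery.
r = paired["TrainTSS"].corr(paired["NextDayRec"])
print(f"Pearson r = {r:.2f}")

ax = paired.plot(x="TrainTSS", y="NextDayRec", kind="scatter")
ax.set_xlabel("TSS")
ax.set_ylabel("Next-Day Recovery Score")
plt.show()
```

With gaps in the dates, shifting by row would pair the wrong days, so the log would need to be reindexed to a full daily range first.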

It’s pretty clear that the TSS of my training rides doesn’t show much relationship to my recovery score the next day. On days that I have no ride (TSS = 0), my recovery score the next day fluctuates just as much as it does on days that I post mid- to high-TSS rides, on average.

Despite the lack of evidence in my case here, there is a well-established theoretical relationship between these features, and as my histogram above suggests, the relationship may not be present in my data because of the relatively low-stress rides I’ve been doing this year.

I hope this article has been as useful to you as it’s been to me. Not only have I gotten to practice some skills in visualizing data in matplotlib, but I’ve also learned that I need to up my training game if I want to get faster!

Joel Sherman

I’m an experienced data professional at the intersection of public policy and economics, trying to make sense of the world, one dataset at a time.