Working with USDA’s coffee database: Part 2
In Part 1, we built a simple script to automate data extraction from the USDA’s website and put all of their coffee statistics into a Pandas DataFrame. We also wrote two modules that made it easier for us to manipulate and query the dataset.
In this post, we’ll do some useful analysis on our usda
table and visualize the results with Matplotlib. The goal of our visualization will be to show which coffee producing countries are most dominant and how their market share has grown in recent years.
First, let’s import matplotlib
and set a few assumptions for our analysis.
Next, we’ll iterate through each country in the USDA dataset and use the is_producing_above
module to decide whether a country belongs in a list of ‘major’, ‘midsize’, or ‘minor’ producers.
Our major_producers
list, for example, contains the following countries: Brazil, Colombia, Ethiopia, Honduras, India, Indonesia, and Vietnam. These all represent countries that produced more than 5 million bags at least once in the past 5 years.(There are also 16 mid-size producers with between 0.5 and 5 million bags and 25 minor producers with less than 0.5 million bags.)
Once we have our lists, we can apply the attribute_query
function we built in Part 1 to get the production volume for each country. We’ll also apply the sum()
function to these mini-tables, which adds up the values for all countries in a given year. Now we have the total volume produced by all countries in our lists for each coffee year. We combine them into a new table and make a basic ‘stacked area’ plot of the result. The plot shows that the combined production from these countries has been declining since the 1990s.
We do something similar for our major producing countries and see that their production, as a whole, has risen dramatically since the 1990s.
Let’s combine our three tables into a single table and then focus our analysis just on the last 25 years of production. We’ll also transpose our table to make it easier to work with.
Here’s our initial (not very nice-looking) chart, produced by running: production_table.plot(kind='area', stacked=True, alpha=.4)
Now let’s make it look prettier:
- Change the production units from thousands to millions of bags
- Apply a color gradient (blues)
- Remove the legend (we’ll make our own labels)
- Tighten up the x-axis
Next, we annotate the chart with labels for each producing country. To get the vertical positioning right, we need to find the y-coordinate that represents the midpoint of each country’s stack. For example, Colombia should be at 73.5.
Our message would be even clearer if we show the growth rate (CAGR) for each country. This cagr
function returns a CAGR calculation as a string, which we’ll be able to append to our label.
Finally, we add the CAGR to the label and give our chart a title that summarizes the key insights from our analysis.
In the next post, we’ll build a map with our dataset.
View the complete notebook on github.