Working with USDA’s coffee database: Part 2

numbers.coffee
3 min readFeb 16, 2019

--

In Part 1, we built a simple script to automate data extraction from the USDA’s website and put all of their coffee statistics into a Pandas DataFrame. We also wrote two modules that made it easier for us to manipulate and query the dataset.

In this post, we’ll do some useful analysis on our usda table and visualize the results with Matplotlib. The goal of our visualization will be to show which coffee producing countries are most dominant and how their market share has grown in recent years.

First, let’s import matplotlib and set a few assumptions for our analysis.

Next, we’ll iterate through each country in the USDA dataset and use the is_producing_above module to decide whether a country belongs in a list of ‘major’, ‘midsize’, or ‘minor’ producers.

Our major_producers list, for example, contains the following countries: Brazil, Colombia, Ethiopia, Honduras, India, Indonesia, and Vietnam. These all represent countries that produced more than 5 million bags at least once in the past 5 years.(There are also 16 mid-size producers with between 0.5 and 5 million bags and 25 minor producers with less than 0.5 million bags.)

Once we have our lists, we can apply the attribute_query function we built in Part 1 to get the production volume for each country. We’ll also apply the sum() function to these mini-tables, which adds up the values for all countries in a given year. Now we have the total volume produced by all countries in our lists for each coffee year. We combine them into a new table and make a basic ‘stacked area’ plot of the result. The plot shows that the combined production from these countries has been declining since the 1990s.

Coffee production from minor and mid-size producing countries has not increased since the 1990s

We do something similar for our major producing countries and see that their production, as a whole, has risen dramatically since the 1990s.

Coffee production from several major producing countries has increased rapidly since the 1990s

Let’s combine our three tables into a single table and then focus our analysis just on the last 25 years of production. We’ll also transpose our table to make it easier to work with.

Here’s our initial (not very nice-looking) chart, produced by running: production_table.plot(kind='area', stacked=True, alpha=.4)

Output of our analysis — V0

Now let’s make it look prettier:

  • Change the production units from thousands to millions of bags
  • Apply a color gradient (blues)
  • Remove the legend (we’ll make our own labels)
  • Tighten up the x-axis
Output of our analysis — V1

Next, we annotate the chart with labels for each producing country. To get the vertical positioning right, we need to find the y-coordinate that represents the midpoint of each country’s stack. For example, Colombia should be at 73.5.

Output of our analysis — V2

Our message would be even clearer if we show the growth rate (CAGR) for each country. This cagr function returns a CAGR calculation as a string, which we’ll be able to append to our label.

Finally, we add the CAGR to the label and give our chart a title that summarizes the key insights from our analysis.

Output of our analysis — V3

In the next post, we’ll build a map with our dataset.

View the complete notebook on github.

--

--