How bad are current coffee prices by historical standards: Part 1

numbers.coffee
4 min read · Apr 26, 2019


This post provides some data analysis techniques to help answer an important question facing the coffee industry right now, “How bad are current coffee prices by historical standards?”

(See my post from six months ago “Is 100 the new 50?” for more context and analysis on the coffee price crisis.)

We’ll focus our analysis on Colombia, one of the countries hardest hit by the current slump in prices, and compare ‘nominal’ versus ‘real’ coffee prices for Colombian producers. This analysis allows us to compare a price of around $0.90 per pound today with what it would have been worth in Colombia 15 years ago (spoiler alert: $1.40).

To do this, we’ll need to combine historical data on Arabica coffee futures prices, exchange rates (US Dollar to Colombian Peso), and local inflation. Our goal will be to have a single table that shows these three elements (price, exchange rate, and inflation) for any given date in history.

In Part 1, we’ll show how with a few lines of Python we can merge three different tables on a date column. Merging data from different sources into a single table can be painful in Excel and even more painful when working with dates (unless you like writing complex lookup and index functions). In Part 2, we’ll manipulate the data and perform our economic analysis.

To get started, we’ll use pandas to ingest our three data tables from csv files. (The tables used in this analysis are available here and represent public data compiled from Trading Economics and the International Commodity Exchange. Note: I removed some extraneous columns from the original datasets.)

Load our three datasets from .csv files
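For readers following along in code, a minimal sketch of this step might look like the following (the filenames are my assumptions; point them at wherever you saved the csv files):

```python
import pandas as pd

# Filenames are assumptions; adjust to match your local copies of the data
currency = pd.read_csv('usdcop.csv')                # daily USD to COP exchange rates
inflation = pd.read_csv('colombia_inflation.csv')   # monthly year-over-year CPI changes
prices = pd.read_csv('arabica_prices.csv')          # daily Arabica futures closing prices
```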

Each table contains three columns: a symbol (e.g., ‘USDCOP’), a value (e.g., 1295), and a date.

Dates can be tricky. The first issue with these dates is that they are in different formats. In the currency and inflation tables, the dates are stored in a DateTime column and take the ISO format %Y-%m-%dT%H:%M:%S. In the price table, the dates are in the format %m/%d/%y. See this great article for more reading on different date format conventions and how to work with them in Python.

The second issue is that each table represents a different time series. Our currency DataFrame shows daily USD to COP exchange rates, but appears to omit weekends and Colombian bank holidays. The inflation DataFrame shows year-over-year changes in Colombia’s consumer price index (CPI) at monthly intervals, but only for the last day of each month. The prices DataFrame shows daily ‘close’ of trading prices, but omits weekends and US bank holidays.

We can solve issue #1 by converting our different date formats into a standard format. The datetime library exists for this purpose and includes a function called strptime that converts a date in string format into a datetime object. We’ll apply this function across each of our three tables.

Convert all our dates to standardized datetime objects
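In code, the conversion might look like this (the column names ‘DateTime’ and ‘Date’ are assumptions based on the table descriptions above; pd.to_datetime with a format argument would work equally well):

```python
from datetime import datetime

# Currency and inflation tables store ISO-formatted strings in a DateTime column
currency['Date'] = currency['DateTime'].apply(
    lambda s: datetime.strptime(s, '%Y-%m-%dT%H:%M:%S'))
inflation['Date'] = inflation['DateTime'].apply(
    lambda s: datetime.strptime(s, '%Y-%m-%dT%H:%M:%S'))

# The prices table stores dates like '03/29/19'
prices['Date'] = prices['Date'].apply(
    lambda s: datetime.strptime(s, '%m/%d/%y'))
```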

The Date column in all three of our tables now holds values that look like this: ‘2019-03-29’:

The tail of our currency DataFrame — note the new ‘Date’ column

Next, we need to solve issue #2: getting a continuous time series. It would be ideal to have a series that contained all possible dates in our historical range — not just the dates that have overlapping data points. Doing this requires making explicit decisions about how to handle missing dates. Should the price for a Sunday be derived from the most recent trading day (e.g., the preceding Friday) or the next trading day (e.g., the following Monday)?

There are probably more elegant, Pythonic ways of doing this, but here’s a simple, brute force method of creating a continuous time series using timedelta from datetime. We’ll focus our analysis on the date range 1 April 1999 to 31 March 2019 and create a new DataFrame that has a date for every day (including weekends and holidays) in that range.

The continuous_dates DataFrame contains all dates between 1 April 1999 and 31 March 2019
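Here’s one way to sketch that brute-force approach (with the more Pythonic pandas one-liner noted in a comment):

```python
from datetime import datetime, timedelta

start = datetime(1999, 4, 1)
end = datetime(2019, 3, 31)

# Brute force: one row per calendar day, weekends and holidays included
all_days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
continuous_dates = pd.DataFrame({'Date': all_days})

# The more elegant alternative:
# continuous_dates = pd.DataFrame({'Date': pd.date_range(start, end)})
```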

Pandas has a relatively new, magical function called merge_asof that allows us to perform merges even when data is missing. (A normal merge only joins on exact key matches, meaning if we merged our three tables on the Date column, we’d only get rows for dates that are common across all three tables.)

We will merge_asof our three indicator tables onto our continuous_dates DataFrame. We can also specify the direction of the merge. We’ll set direction='nearest' for currency (meaning it will fill missing dates with the nearest date that has a value), 'forward' for inflation (because this dataset only gives values for the last day of each month, 'forward' has the effect of ensuring all dates in a given month have the same value), and 'backward' for prices (meaning it will fill missing dates with the value from the last day of trading).

Applying merge_asof to create a continuous time series for the indicators in each of our three datasets
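A sketch of those three merges (merge_asof requires each side to be sorted on the merge key; the ‘Value’ column name and the suffixes are my assumptions):

```python
# merge_asof requires both sides of each merge to be sorted on the key
currency = currency.sort_values('Date')
inflation = inflation.sort_values('Date')
prices = prices.sort_values('Date')

df = pd.merge_asof(continuous_dates, currency[['Date', 'Value']],
                   on='Date', direction='nearest')
df = pd.merge_asof(df, inflation[['Date', 'Value']],
                   on='Date', direction='forward', suffixes=('', '_inflation'))
df = pd.merge_asof(df, prices[['Date', 'Value']],
                   on='Date', direction='backward', suffixes=('', '_price'))
```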

Finally, we’ll set our combined table’s index to the Date column and rename our columns to make them more intuitive.

Set the index to ‘Date’ and clean up our names a bit
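Something like this (the ‘Inflation’ name matters for the plot below; ‘USDCOP’ and ‘Price’ are just my guesses at intuitive names):

```python
# Rename the generic Value columns, then index the table by date
df = df.rename(columns={'Value': 'USDCOP',
                        'Value_inflation': 'Inflation',
                        'Value_price': 'Price'})
df = df.set_index('Date')
```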

And here’s a simple plot, generated with df.plot(kind='line', secondary_y='Inflation'), to show that it works:

Proof that our DataFrame has what we need in it! (Not a pretty chart.)
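To reproduce the chart (assuming matplotlib as the plotting backend):

```python
import matplotlib.pyplot as plt

# Plot all three series, putting inflation on a secondary y-axis
df.plot(kind='line', secondary_y='Inflation')
plt.show()
```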

In the next post, we’ll do some manipulations on our dataset so we can compare historical coffee prices in real terms.

View the complete notebook on GitHub.
