Colombian Coffee Prediction

8 min readApr 26, 2019

Hay una versión en español de este post, aquí

So one week from Bootcamp is gone, so fast. Week one from the Data Science Bootcamp I’m studying in New York (click here, to read about what a Bootcamp is and why I got myself into this).

Now is time to work on the next projects that are: personal (I choose the topic) and invividual (By myself, without a team, like the first project). Project #2, is all about scraping and Lineal Regression.

For this project I must:

Gather data using data scraping
Analize Data using Linear Regression, this includes: feature selection, categorical variables, time series, Regularization
Vizualise each step where possible to get a better view of the results and make a better decision.

As topic I decided to predict Coffee Production in Colombia for many reasons: I wanted a continuous variable to learn the best from Linear Regression, I wanted to analyze and scrape all sorts of data (weather, economical, etc.), and personally, I wanted a topic to tell a little bit about my country since many of my classmates had a not-very-positive impression.

Background:

Colombia is the second Coffee productor country in the world. Only 3% of the country produces 10% of the coffee consumed all over the globe. The purpose of this Project is to analize and determine what are those factors than influence high coffee production to ensure production for future Months. For that we will analyze several types of information:

Weather Data: Weather conditions and particularly, month of harvest, can influence the growing of the crops
Competitor Countries: Brazil is the biggest coffee productor, Vietnam has similar Production volume and other countries, might affect the production
Economical Factors: Most of the Coffee Produced in the country is exported, can exporting volumes (Demand) influence? Also, since Colombia doesn’t use the traditional market currency (dollar), the change rate between dollar and Colombian Peso should be analyzed as well

Extract Data From Sources

Gathering all this information was actually the main challenge for this project:

Weather Data was scraped using Scrapy from Weather Underground for the three most important areas in Colombia where coffee is produced
Coffee Production Data fom Colombia, as well as exporting volumes and dollar change rate were downloaded in csv format from Colombian Coffee Growers Organization webpage
Economical and Coffee Production information from other countries were gathered from the International Coffee Organization webpage

I’ll write a future post on Scraping with Scrapy which I found very challenging, but very cool and useful in the end

General EDA analysis for all variables

First, a general EDA analysis was made to find possible relations between variables and have a clearer view on the features that were going to be used in the final model.

We can see that the highest relation between variables can be found in economical data (Exportation Volumes and dollar rate), which makes sense, and coffee price between different countries, where a further analysis will be made. Weather variables, however, seem to be mode independent between each other in general.

Creating the Model

First Model

First of all, I split the data between training set (70%) and test set (30%), this is done to test the model with data it hasn’t “seen” before, this is, with data it wasn’t trained on, in order to see if the model gereralizes well and ot doesn’t “memorize” (overfits). Generaly the dataset is split into three parts, including the validation set as well, I didn’t use, in this case, a validation set, due to the low volume of rows I was working with: eight years of data, one row per month = 96 rows.

The very first model I ran was an Ordinary Linear Regression (OLS) using only weather variables scraped from Weather Underground webpage: Temperature, wind speed, precipitations, humidity, dew point (All average per month). I also included an Intercept to this model to give it more flexibility.

Results: R2(test) = 0.14, Pvalues between [0.039–0.745].
This means that some variables are relevant, but there’s some information missing. There’s still work to do…

Categorical Variables

Another variable that might be important, and its very related to weather, is the month in which the harvest was made. It is said the coffee grows better in may and october, so this could be a good variable to include in the model.

Now, although havest month is a variable that ranges from 1 to 12, it’s not an ordinal or numerical variable: There is no reason to say that february (value=2) is the double as january (value=1) or that they are both less than september (value=9) or december (value=12). All months are equally important, month-number does not represent an amount and it doesn’t give any order (In this part of the analysis, we’ll get to Time Series later). Harverst month is what we call a categorical variable, and instead of being treated as one column with values (1–12), we will transform it into 11 columns with values (1–0). Each row will have 10 values in 0 and one value in 1, the month it belongs to. This process is called One Hot Encoding and he reason we generate 11 columns and not 12 is to reduce redundancy and correlation between this new variables: If all rows from 1st to 11th (january to november) have the value 0, this means that the row belongs to december. There is no need to add a new column because it will create correlation between them.

Results: After including the new 11 features: R2(test) = 0.64, Pvalues between [0.002–0.759].
The Model is improving! Definitely, harvest month is a strong variable when it comes to Coffee Production.

Competing Country and Economical Analysis

Price vs Production in different countries

As you can see in the picture above, there seems to be a big negative relation between price of Coffee, in Colombia, Brazil or other countries, and the Production volume. Does this makes sense? Do lower prices lead to higher production volume? Actually, is the other way around: Higher volumes will make prices go lower, and low production will make the coffee more expensive. If you think about it, this is simple offer and demand theory.

Results: Production volume doesn’t depend on Coffee prices. On the contrary, prices are dependant from the production. Price in Colombia or in other countries shouldn’t be included in the analysis.

Time Series Analysis

We’ve already seen that harvest month is an important feature to determine Production Volume, but what about other months? Assuming that the process needed to export coffee apart from being harvested (Peeling, drying and in some cases grinding, all these processes in Colombia are done manually to get a better coffee!) takes around a month, lets look if the production is affected by the production from the month before. This is, is december production affected by november’s?, november’s by october’s? and so on? This is the simplest way to understand time series: adding the dependant variable from one month as feature for predicting the next month.

There are more ways to analyze time series,this is the simplest one. A complete analysis would include more months back, stationality, etc. I’m planning on writing a post on it as well

Results: R2(test) improved to 0.67. Below you can see an image of the predictions using only previous-month-production as feature: You can see that last months predictions (Purple stars, labeled Jan through March) are very close from the real values (Green stars). This model predicts for next month, a coffee production of 1.176 thousand bags, well just have to wait and see the results. The model improved so I’ll include previous-month-production as a feature in the model

Time Series Prediction using only one month back as variable for month Jan-Apr 2018

Update (May 2018): Real value for April 2018 was 1.037 thousands bags of coffee. Only about 130 thousand bags off from the models prediction!

Now, Hoy to find the best features? (Regularization)

Regularization is a process to find the most relevant features in our model. How can we do this? Mathematicaly this is done by adding a new term to our model’s equation representing the cost. There are multiple ways to do this, but all of them try to minimize the number or value of features included in the model. If we just added all features, the model will overfit, this is: it will accomodate very well to the training data but will do no good on test data. (The other way around would be underfit: The model will be too general and it won’t predict well in any of the cases).

The methods I used for this models where:

Lasso (L1): Takes the cost as the sum of the magnitude (absolute value) of all the coefficients. It will try to lower non-important feature to zero
Ridge (L2): Takes the cost as the sum of all the square of all the coefficients. It will try to make non-important features very low
Elastic-Net: Uses a mixture of both of this methods trying to determine, for each coefficient, the most optimal cost.

Feature importance according to each model y the final selected features

Results: I really love the graph above! It shows the absolute value of all features according to each of the regularization methods used (OLS is the original model without any regularization at all). The features marked in red arrows where the ones I chose (My own, personal, subjective decision, either because the models agreed on it, or because I saw some importance during my analysis and investigation) for the final model. R2(test)=0.783!!!

Conclusions:

Coffee Production, instead of what was believed in the beginning of this analysis, doesn’t seem to be affected by weather conditions. Coffee will grow in any temperature and humidity % inside the area of observations. Although protection against rain must be preserved. Precipitations seem to be the only weather significant factor

Economical facts from international competidors such as Brazil, Vietnam, Peru and Mexico, don’t seem to affect either. The production of these countries, along with Colombia, seem to be influenced all in paralell by other external features in common not included in this analysis.

Month of Harvest seem to be the most important feature in Coffee Production. Periods of May until August as well as October until December seem to have the highest impact.

If you like to look at the full analysis, the data or the code, check out my Github