Collecting a Hobby Dataset: Coffee

A short journey in curating a wild dataset

Robert McKeon Aloe
Towards Data Science

Data science in the modern era has allowed people to take a bootcamp where they’re given cleaned-up datasets to train classifiers on. While the results can be exciting, the analysis or training is only a small part of being a data scientist. If anything, experiment design, data collection, data validation, and data cleaning consume the majority of your time. If you have done all of those things well, the training and analysis are straightforward.

Professionally, I have done all of these steps individually and as part of a team. Personally, I have collected datasets for things I’m interested in.

When I wanted to buy a new car, I built a small dataset of the key features I wanted. Once I had all the features in front of me, the best option was clear, and oddly enough, it was not decided by what I thought would be the determining factor. With everything else equal, only small features like CarPlay distinguished the cars.

When I wanted to better understand turnover and the health of my previous company, I started building a dataset. This required patience. Nobody handed me data, or wanted to. In fact, my data came from something as simple as the company phone list. As time went on, my dataset got clearer, and after a few years, the unsustainable turnover rate made it clear the company was going under.

Coffee Dataset

A few months ago, I became interested in looking for a large dataset of Q-grades for coffee. Q-grading is a way to compare coffees using a standardized taste grading system. I wanted to do some analysis on them to see how useful they could be in determining better blends or explaining why blends work well.

All images by author unless stated

While I found one dataset (CQI) that was pretty well curated, I really wanted to build a more useful dataset from Sweet Maria’s because that is where I buy the majority of my green coffee. Sweet Maria’s also has coffee flavor notes, which were not contained in the CQI dataset.

The catch: it was a lot of manual work.

Raw Data

Sweet Maria’s keeps all of its old coffees as archived pages, and for each one, you get the price, the total Q-score, and some images. One image shows the submetrics of the Q-score and another shows the flavor notes. Both are drawn as spider graphs.

Images reproduced with permission by Sweet Maria’s

There were 300+ entries in this list. I had to view all the coffees on one page, tap the quick summary for each, tap the Q-score graph, then select all, and copy. I repeated this for the flavor graphs. This process took about two hours. I ended up doing this twice because the first time I built this database, I didn’t collect the flavor grades.

Screen Captures, fair-use

I also tapped compare for all of them so I could get the metadata, such as processing method, region, and cultivar.

Screen capture, fair-use

Score Extraction

For Q-score and flavor spider graphs, I wrote a script to segment the image, identify the circles and then extract the scores.

For flavors, I had to modify the script slightly, but I was able to pull the scores relatively easily.
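The last step of that extraction, turning a detected circle's pixel position into a score, can be sketched roughly as follows. This is not the script used for this dataset. It assumes the circle centers have already been found, that the radial scale is linear, and that the first spoke points straight up with the rest evenly spaced clockwise. The function names and parameters are hypothetical.

```python
import math


def marker_to_score(marker_xy, center_xy, outer_radius_px, min_score, max_score):
    """Map a marker's pixel position on a spider graph to a score.

    Assumption: the scale is linear, with min_score at the center of the
    graph and max_score at the outer ring.
    """
    dx = marker_xy[0] - center_xy[0]
    dy = marker_xy[1] - center_xy[1]
    radius = math.hypot(dx, dy)
    fraction = min(radius / outer_radius_px, 1.0)  # clamp to the outer ring
    return min_score + fraction * (max_score - min_score)


def nearest_axis(marker_xy, center_xy, n_axes):
    """Return the index of the spoke closest to the marker.

    Assumption: spoke 0 points straight up and spokes are evenly spaced,
    numbered clockwise.
    """
    dx = marker_xy[0] - center_xy[0]
    dy = center_xy[1] - marker_xy[1]  # image y grows downward, so flip it
    angle = math.atan2(dx, dy) % (2 * math.pi)  # clockwise angle from "up"
    step = 2 * math.pi / n_axes
    return round(angle / step) % n_axes
```

A different graph layout, such as the flavor graphs, would only need different values for the number of spokes and the score range.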

Each Q-score had a final total, which I had collected separately, and a cupper’s correction. I wrote a script that let me enter the cupper’s correction by hand for each image, which helped validate the total cupper’s score. This allowed me to cycle through all 300 images in a few minutes.

I used these two pieces of information to calculate the average error across the submetrics of the Q-scores, and I corrected all the submetrics by this error. As a result, when I validated the data, the error was less than 0.1 per submetric.
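A minimal sketch of that correction, assuming the final score equals the sum of the submetrics plus the cupper’s correction. That relationship, the function name, and the example numbers are assumptions for illustration, not values from the dataset.

```python
def correct_submetrics(extracted, final_score, cupper_correction):
    """Spread the gap between the known total and the extracted total evenly.

    Assumption: final_score = sum(submetrics) + cupper_correction.
    Returns the corrected submetrics and the average error per submetric.
    """
    expected_sum = final_score - cupper_correction
    avg_error = (expected_sum - sum(extracted.values())) / len(extracted)
    corrected = {name: value + avg_error for name, value in extracted.items()}
    return corrected, avg_error


# Hypothetical example with three submetrics instead of the full set.
extracted = {"aroma": 8.1, "body": 7.9, "finish": 8.3}
corrected, avg_error = correct_submetrics(extracted, final_score=25.6, cupper_correction=1.0)
```

After the correction, the submetrics plus the cupper’s correction add up to the known final score again.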

Then I validated the data by randomly sampling coffees, looking at their images, and verifying that the extracted submetric scores matched. This took some time, but it helped confirm that my spider graph extraction was working correctly.
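A spot check like this could be set up along the following lines. It is a sketch only: the sample size, the seed, and both function names are hypothetical, and the scores read by eye still have to be typed in by hand.

```python
import random


def pick_validation_sample(coffee_ids, sample_size, seed=0):
    """Pick a reproducible random subset of coffees to check by eye."""
    rng = random.Random(seed)  # fixed seed so the same sample can be revisited
    ids = list(coffee_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def largest_mismatch(extracted, read_by_eye):
    """Largest absolute gap between extracted scores and scores read off the image."""
    return max(abs(extracted[name] - read_by_eye[name]) for name in read_by_eye)
```

Fixing the seed is a small design choice that makes it possible to return to the same coffees after changing the extraction script.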

Then I linked this to the metadata, and the dataset was ready for processing.
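One way to do that linking is a simple join on the coffee name. The sketch below assumes both tables are keyed by the same name string. The field names and the example coffee are made up for illustration.

```python
def link_metadata(scores_by_name, metadata_by_name):
    """Join extracted scores with metadata on the coffee name.

    Returns the linked rows plus the names that had no metadata, so gaps
    are visible instead of silently dropped.
    """
    linked, missing = [], []
    for name, scores in scores_by_name.items():
        meta = metadata_by_name.get(name)
        if meta is None:
            missing.append(name)
            continue
        linked.append({"name": name, **meta, **scores})
    return linked, missing


# Hypothetical example rows.
scores = {"Coffee A": {"aroma": 8.2}, "Coffee B": {"aroma": 7.9}}
metadata = {"Coffee A": {"region": "Africa", "processing": "Washed"}}
linked, missing = link_metadata(scores, metadata)
```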

Data Analysis

I used this data to understand how each submetric relates to the total score, how similar Sweet Maria’s Q-grades are to each other, and how similar Sweet Maria’s Q-scores are to CQI Q-scores.

Here is how each metric correlates with the others:
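A correlation table of this kind can be computed with a few lines of plain Python. The sketch below assumes Pearson correlation, which may not be the measure used for the figure, and the column names and values are placeholders.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of scores.

    Raises ZeroDivisionError if either list has no variation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlate every submetric column with every other column."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Placeholder columns: one list of scores per submetric, one entry per coffee.
columns = {"aroma": [1, 2, 3], "body": [2, 4, 6], "finish": [3, 2, 1]}
matrix = correlation_matrix(columns)
```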

Here is how these scores can be used to compare coffees to one another and understand how coffees are similar:

Overall, the work was time-consuming, and the payoff was often unclear. The resulting analysis could have been a waste of time, but the more exciting part was preparing for it. Without that preparation, the analysis would not have been as interesting, nor would I have felt such an emotional attachment to it.
