Clustering Popular Beers

Identifying Non-Obvious Insight

Published in

INST414: Data Science Techniques

3 min read · Dec 7, 2023

In this post I use data on popular beers to perform an in-depth cluster analysis. The goal of this exploratory analysis is to identify groups or categories among popular beers based on their measurable characteristics. To extract a non-obvious insight from this dataset, I look at the relationships and patterns between attributes such as Alcohol by Volume (ABV), International Bitterness Units (IBU), and pH. Such an insight could inform decisions like identifying unique beer profiles for targeted marketing, predicting consumer preferences, or guiding new beer development based on popular clusters.

Data Source, Features and Similarity Metrics

To source the data, I wrote some code to access one of the free APIs listed on the course GitHub, which provides data on popular types of beer, and saved the response as a JSON file. The code snippet is provided here:
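A minimal sketch of this kind of fetch might look like the following. The URL is a placeholder (the endpoint is not named in this post), and the function names fetch_beers and save_json are illustrative; the linked notebook has the actual code.

```python
import json

import requests

# Placeholder only: the post does not name the API endpoint.
API_URL = "https://example.com/api/beers"


def fetch_beers(url, params=None, timeout=10):
    """Request beer records from a JSON API and return the parsed body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()


def save_json(records, path):
    """Write the fetched records to disk as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)


# Example usage (requires network access and a real endpoint):
# beers = fetch_beers(API_URL)
# save_json(beers, "beers.json")
```

Saving the raw response first means the cleaning and clustering steps can be rerun without calling the API again.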

Features used for determining similarity include ABV, IBU, pH, attenuation level, and target original gravity (OG). These features are indicative of beer characteristics like strength, bitterness, acidity, and fermentation progress.

I used the KMeans clustering algorithm. KMeans uses Euclidean distance as the similarity metric. It’s suitable here as we’re dealing with numerical, continuous data.
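As a rough sketch of this step with scikit-learn: the column names below are assumptions (the post lists the features but not their keys in the data), and the standardization step is also an assumption rather than something described here. It is included because Euclidean distance is sensitive to scale, so a large-valued column such as target OG could otherwise dominate.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed column names for the five features named in the post.
FEATURES = ["abv", "ibu", "ph", "attenuation_level", "target_og"]


def cluster_beers(df, k=4, features=FEATURES, random_state=0):
    """Return a copy of df with a 'cluster' column assigned by KMeans."""
    # Put every feature on a comparable scale before measuring distances
    # (an assumption of this sketch, not a step stated in the post).
    scaled = StandardScaler().fit_transform(df[features])
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    out = df.copy()
    out["cluster"] = model.fit_predict(scaled)
    return out
```

Fixing random_state keeps the cluster assignments reproducible between runs.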

Selection of k Value

I chose k = 4 because, for this dataset, four clusters strike a balance between having too few clusters (which might oversimplify the data) and too many (which might overcomplicate the interpretation). Four clusters also give a meaningful differentiation within the beer dataset, separating broad types such as ales, lagers, stouts, and IPAs, each of which has its own characteristic ABV, IBU, and other properties.
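One common way to sanity-check a choice of k is to compare inertia (the within-cluster sum of squares) and the silhouette score across several candidate values. The sketch below illustrates that check; it is not necessarily how k = 4 was arrived at here, and score_k_values is an illustrative name.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def score_k_values(X, k_values=range(2, 9), random_state=0):
    """Return {k: (inertia, silhouette)} for each candidate number of clusters."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        # Lower inertia means tighter clusters; a higher silhouette means
        # clusters that are better separated from one another.
        results[k] = (model.inertia_, silhouette_score(X, labels))
    return results
```

Plotting inertia against k gives the usual elbow chart. A k where inertia stops dropping sharply, or where the silhouette score peaks, is a reasonable candidate.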

Interpretation of Clusters

Each cluster represents a distinct group of beers based on the chosen features. For example, one cluster might represent high ABV, low IBU beers (strong but not bitter), while another could be low ABV, high IBU (light but bitter). Below are four clusters of beers:
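One way to put numbers on each cluster is to average the features within it. This is a small sketch that assumes a DataFrame with a cluster label column already attached; the function name cluster_profiles and the column name "cluster" are illustrative.

```python
import pandas as pd


def cluster_profiles(df, features, label_col="cluster"):
    """Mean of each feature per cluster, plus the number of beers in it."""
    grouped = df.groupby(label_col)
    profiles = grouped[features].mean()
    profiles["n_beers"] = grouped.size()
    return profiles.round(2)
```

Reading across a row of the result shows whether a cluster is, for example, high in ABV but low in IBU.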

Software & Tools Used and Data Cleaning

The analysis uses the following Python libraries: requests for fetching data from the API, pandas for data manipulation (filtering the relevant columns), numpy for numerical operations, scikit-learn for KMeans clustering, and matplotlib and seaborn for visualizations like the one below:

Cleaning this dataset involved filtering the relevant columns and dealing with null values. Missing values are filled with the mean of their respective columns; infinite values are replaced with NaN and the affected rows are dropped. Much of the debugging was spent fixing syntax errors in my code and figuring out how to create charts like the one above. To solve these issues, I referenced YouTube and Stack Overflow for examples of similar charts and figures.
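The cleaning steps can be sketched in pandas roughly as follows. The order of operations is an assumption: here rows with infinite values are dropped before the means are computed, so that they cannot distort the values used for filling. The function name clean_beers is illustrative.

```python
import numpy as np
import pandas as pd


def clean_beers(df, features):
    """Keep the feature columns, drop infinite rows, mean-fill missing values."""
    # Keep only the relevant columns and force them to numeric;
    # anything that cannot be parsed becomes NaN.
    data = df[features].apply(pd.to_numeric, errors="coerce")
    # Treat infinite values as missing, then drop the rows that had them.
    has_inf = np.isinf(data).any(axis=1)
    data = data.replace([np.inf, -np.inf], np.nan).loc[~has_inf]
    # Fill the remaining gaps with each column's mean.
    return data.fillna(data.mean())
```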

Limitations and Biases

The analysis may be biased by the choice of features: relevant features that were left out could change the clusters. The approach may also oversimplify the complexity of beer flavors and profiles, since it reduces them to just a few numerical parameters. Finally, the dataset is not a comprehensive list of every beer in the United States. There are many local breweries whose beers, if added, could alter my findings and conclusions.

Link to my code in GitHub:

https://github.com/dav1dalvaro/INST414/blob/main/Assignment4127.ipynb