Chatting With Data Scientist Amanda Dobbyn About Analyzing Beer Styles

Earlybird data scientist Amanda Dobbyn presenting at a recent R Ladies Chicago meetup.

This is the latest installment in our occasional series of interviews with Earlybird technologists on a variety of issues relevant to our work. Have any questions or ideas for topics you’d like to see us cover? Drop us a note at hello@earlybird.co.

Eddie VanBogaert, Partner: Alright! We’re here with Amanda Dobbyn, a data scientist here at Earlybird, and before we dig into one of your latest projects, let’s start with your role here at Earlybird. What did you do before joining us, and what’s a typical day or week in the life like?

Amanda Dobbyn, Data Scientist: Sure. So I came to Earlybird from UChicago where I did statistical analyses on experimental data in a cognitive neuroscience lab. There I had the chance to get steeped in cognitive psych and neuroscience literature — lots of cultural relativity, linguistic relativity, experiments testing how the language you speak influences the way you think — that sort of interesting academic stuff. Through this, I was able to get a solid foundation in responsible statistics: I did a lot of checking assumptions, making plots of residuals, measuring skew, that kind of thing. And you know your methods are going to be scrutinized by other academics in the journal space, so you better make sure you’re doing it right and anticipating criticism of your process and biases. If you’re going to [take the natural logarithm of] your data you’d better be able to back up why that was the right choice.

Anyway, academia was great — met tons of super smart and passionate people who are still my good friends — but it moved a little slowly for me, and I wanted to do something in quicker iteration cycles and have a more practical impact. And that’s when I found Earlybird. My last year and a half or so here has been an interesting mix of data science, some project management under [Earlybird co-founder Vlad Jornitski], and generally learning a lot about the tech industry and the data software tools being used in the private sector.

The data science that I do tends to be a good bit of hypothesis testing — both our clients’ hypotheses about their data and hypotheses we generate ourselves — as well as more exploratory stuff generally starting with visualizations and then branching off into things like network analysis and predictive modeling. Modeling, for instance, can help us predict when a certain customer might return to a service department or retail location, based on their past history, known demographic details, and even factors like what day of the month it is and other market trends. I generally like joining on third-party data like that as a control to give us indications of how much of the effect we see is actually attributable to business decisions our clients have made, and how much is probably a result of external factors beyond their control.

We’re generally allowed a good degree of creativity, a wide berth to suss out the parts of the data that seem odd or don’t make sense, and dive into them to see if it’s a situation where our insights can add some value. So I’ll do that type of end-to-end analysis, polish it up, and then present what I’ve found to our clients along with our prescriptive recommendations for how the client can use the data to inform their decisions in a particular functional area.

As for my day-to-day? Well, it always starts with morning stand-ups along with the rest of the crew where I wear my PM hat and get a sense of where everyone is, what they’re blocked on, where the bugs are, what’s the most pressing, what might be lower priority. Then, depending on the day, there’s a smattering of meetings mixed in with bursts of coding where I fit in work on client-facing data projects and sometimes analyses internal to our production operations here at Earlybird. During these I’m coding or bouncing ideas off other members of the team. In my PM role, at different stages of different projects, I might be in the conference room with a certain project’s team, whiteboarding out a kink in our business logic and brainstorming the best possible workarounds or fixes for it. Other times I’ll be talking to our clients to get a clearer sense of their vision for a certain feature, or checking people’s work in reviews and making sure things look good functionally, ensuring we’re on-track to put together a product that we’re going to be proud of.

Eddie: Cool. So, beer — beer is a different story… we have beer here sometimes. And you shared a project recently — a personal project, not an Earlybird project — with the R Ladies group here in Chicago as part of their Oktoberfest meetup. Tell me about that. What were your overarching questions, or what were you looking to discover?

Amanda: So, funny enough, this did actually start in the office. Our good friend [senior developer Kris Kroski] and I were having a talk after work one day — may or may not have been over some beers — and Kris shared a beer app he’s been building as a side project.

I’ll spare you the exact details, but a lot of what the app is trying to do is identify people’s ideal flavor palates based on their ratings of different types of beer. As you know, there are a lot of different styles of beer, and we were discussing how you’d want to go about objectively measuring the intensity of flavors in a way that makes sense across these different styles. This is sort of an interesting problem because, for example, what might be quite hoppy for a wheat beer might be considered average or even low hoppiness for an IPA (India Pale Ale) — that sort of thing. So you want to set up some sort of [Bayesian] prior that’ll tell you what distribution of hoppiness you should expect from a wheat beer and that’ll be different from the distribution among IPAs.

That discussion led me to ask a bit of a different question: do styles meaningfully define true flavor boundaries in beer? A reason we might think they don’t is that a lot of styles seem to emerge as an accident of circumstance or history, right — take the impact of German purity laws, for instance. So is the labeled style of a beer actually a useful construct for understanding what you’re about to drink, or is it a bit more random than that? And somewhat secondarily, I was also interested in more fully discovering what the craft brewing space looks like today, and seeing what patterns or trends I might be able to find in the data.

Kris showed me how he had access to some beer data through an online beer website called BreweryDB. It has a public API (application programming interface) and so all you have to do is create an API key and you can request their data back as JSON. So, instead of just stealing Kroski’s data, I set about writing a few scripts that would get me all the data I was interested in, built up my own MySQL database locally, and then just dove in.

I’ve been into beer for the past few years, home-brewed a bit — brewed some of President Obama’s White House Honey Ale with an old [ultimate frisbee] coach — and so I’ve been interested in how beer is brewed, the chemistry of it (at a very high level), and what makes various styles of beer taste so different. So, again, the main crux of my analysis was to see, across this wide range, whether styles do a good job of defining and classifying different kinds of beer.

Eddie: That’s great. Tell us how you started your analysis.

Amanda: So first I wanted to do a bit of factor-level reduction on my outcome variable to condense all these styles that had slightly different names but really were under the umbrella of a broader style, grouping them all into that main style heading. I did that by defining those broader categories — and you can certainly take issue with how I chose to define them, and please do — and then I looked for styles that contained the name of that category within them and lumped them under the broader heading. There was a little more nuance than that so I was able to retain the difference between, for example, a Double India Pale Lager and an India Pale Lager, but that was a fun bit of string munging acrobatics.
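As a rough illustration of the lumping step described above, here is a minimal Python sketch. The project itself was done in R, and the category list, the `collapse_style` helper, and the "most specific name first" ordering rule are all assumptions made for this example, not Amanda's actual code.

```python
# Hypothetical broad categories; the interview does not give the real list.
# More specific names come first so they win over the shorter names they contain.
BROAD_STYLES = [
    "Double India Pale Lager",
    "India Pale Lager",
    "India Pale Ale",
    "Pale Ale",
    "Stout",
    "Porter",
    "Wheat",
]


def collapse_style(style_name):
    """Return the first broad category whose name appears inside style_name."""
    lowered = style_name.lower()
    for category in BROAD_STYLES:
        if category.lower() in lowered:
            return category
    return "Other"  # styles that match no broad category


# Made-up style labels, only to exercise the function.
examples = [
    "American-Style India Pale Ale",
    "Double India Pale Lager",
    "Session India Pale Lager",
    "Oatmeal Stout",
    "Fruit Cider",
]
collapsed = {name: collapse_style(name) for name in examples}
```

Checking the longer names first is one simple way to keep a Double India Pale Lager from being absorbed into the plain India Pale Lager group.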

Then, when I started trying to answer the main question, the first way I went at it was by using what’s called unsupervised clustering. I often like to begin with clustering because it lets you look at the data without biasing yourself by labeling it with the thing that you’re interested in. So stepping back for a sec — what an unsupervised machine learning technique will do is approach a problem agnostic to what you’re looking to find, which in my case was the relationship between style and the different measurable dimensions of beer available through BreweryDB: ABV (alcohol by volume), IBUs (international bitterness units), and SRM (standard reference method), a standardized color system used by brewers. Then, once you cluster your datapoints using a set of input variables, you can see how those clusters map onto the outcome variable you’re interested in. If they line up well, then you’re more inclined to believe that there is a meaningful relationship to be defined between your input variables and your response variable.

I should mention that the predictor variables I chose were a deliberate subset of the possible variables I could have used. This is actually the type of thing I have to think about at work quite a bit, and it really requires logic and reasoning more than anything else. Reason being that, instead of always blindly throwing all possible features into a model, it’s better to think things through and say, okay, what really belongs in this model and what doesn’t and why? If you’re not thorough, your model can be, at best, hard to interpret, and, at worst, just wrong or misleading.

So in the beer case, I came up with a way to reason about which variables were good candidates for being predictors (and which weren’t) by thinking up some heuristics for classifying variables as either beer inputs, outputs, or style-defined attributes. I consider outputs the best, and I’ll try to give some reasons why — I define inputs as things a brewer can directly control, like the temperature a beer is brewed at or the amount of a certain malt that’s added. An output would be something that a brewer can’t directly touch or affect, like ABV or IBU. Once the beer’s brewed, it’s brewed, and you can’t bump up the ABV of a beer without doing something to it, something that no young beer should have to go through, like adding vodka or whatever. I considered outputs better candidates for these models because they’re what we as the drinkers interface with. If you’re a brewer and you’ve brewed something of indeterminate style and you’re trying to put a label on it, these are the cues that you use.

Inputs are problematic because you’ve got a chicken-and-egg problem of which direction the causality goes in — a brewer might have a certain style in mind and adjust the ingredients they put into the beer or the conditions under which it’s brewed such that it becomes that style. So if they set out to brew a kolsch, they’ll probably use kolschy ingredients and processes and then if, bam, it turns out more like a wheat beer they might still call it a kolsch anyway. I could be convinced that inputs are good things to consider (and in fact I ran a few models that did incorporate them), but what clearly is not a good predictor variable is a style-defined variable, such as glassware or serving temperature. Glassware is entirely dependent on style. You serve x-beer in y-vessel because that stein or tulip glass is prescribed for that style. It wouldn’t be useful to use it as a predictor because a glass is perfectly correlated with the beers that are served in that glass.

Another thing to note here is that these input features are all on different scales: IBU runs between about 0 and 120, ABV, for beers, is typically anywhere from 2.5% or 3.0% to 20.0% on the high end, and the color scale runs 0 to 40. So you’d typically standardize them to the same scale in order to make sure that the ones on the larger scales don’t have an inordinately large impact as compared to the ones on smaller scales.
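To make the scaling point concrete, here is a small Python sketch of z-score standardization (subtract the mean, divide by the standard deviation). The values are made up for illustration, and z-scoring is only one common choice; the interview does not say which method the project used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a column to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up measurements on their original, very different scales.
columns = {
    "ibu": [10, 35, 60, 100],
    "abv": [4.2, 5.5, 6.8, 9.5],
    "srm": [3, 8, 20, 40],
}
scaled = {name: standardize(col) for name, col in columns.items()}
```

After this step a one-unit difference means "one standard deviation" for every variable, so IBU no longer dominates a distance calculation simply because its raw numbers are larger.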

With k-means clustering, the type of algorithm I was using, you provide a certain number of clusters that you want data segmented into, and the algorithm will sort each data point into a certain cluster. So it’ll go through iteratively and say, okay, this data point, does it minimize the distance from here to the center of the cluster we’ve assigned it to? If not, okay, let’s put it into a different cluster. Those clusters can be of different sizes and get presented in multidimensional space, where I eventually added total number of hops and total number of added malts as inputs as well, to try and get some better differentiation.
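The assign-then-recenter loop described above can be sketched in plain Python as follows. This is a bare-bones version of the standard k-means procedure (Lloyd's algorithm), not the R routine used in the project, and the two-dimensional points are invented, already-standardized (ABV, IBU) pairs.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Cluster points (tuples of floats) into k groups; return labels and centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has settled
        assignments = new_assignments
        # Update step: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centers


# Two invented, well-separated groups of beers.
low = [(-1.0, -1.1), (-0.9, -1.0), (-1.1, -0.9)]
high = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9)]
assignments, centers = kmeans(low + high, k=2)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a label.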

So that’s where I started. I try to usually start my analyses in this sort of way, unless there’s a highly specific question we’re trying to answer, and we have a strong reason for going straight into a supervised learning model. But clustering, in this case, was a good way for me to see the overall landscape at the start.

However, the limitation I ran into was that I was generating one graph at a time, on either the entire dataset or the dataset filtered to a single cluster or a single style. What I was interested in looking for in the filtered case was how homogenous each style was when we look at the clusters the beers in that style were assigned to, and, on the flip side, when we look at a single cluster, whether the majority of the beers in that cluster come from a certain style. That’s a lot of graphs to be generating for 40-odd styles when you want to poke around. So I ended up building a small app that allows the user to choose to display certain things on the fly and the algorithm will rerun as necessary. You can vary the number of clusters, filter to certain styles, add and remove input variables, and see a bit of the underlying data.
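The two questions above (how concentrated is a style across clusters, and how concentrated is a cluster across styles) can also be answered with a simple cross-tabulation. The Python sketch below uses invented labels, and the `dominant_share` measure is an assumption chosen for illustration rather than a metric named in the interview.

```python
from collections import Counter, defaultdict


def cluster_breakdown(groups, buckets):
    """For each group, count how many of its items landed in each bucket."""
    table = defaultdict(Counter)
    for group, bucket in zip(groups, buckets):
        table[group][bucket] += 1
    return table


def dominant_share(counter):
    """Fraction of items sitting in the single most common bucket."""
    total = sum(counter.values())
    return counter.most_common(1)[0][1] / total


# Invented style labels and cluster assignments for seven beers.
styles = ["Stout", "Stout", "Stout", "IPA", "IPA", "IPA", "IPA"]
clusters = [0, 0, 0, 1, 1, 2, 0]

by_style = cluster_breakdown(styles, clusters)    # style -> cluster counts
by_cluster = cluster_breakdown(clusters, styles)  # cluster -> style counts
```

In this toy data every stout lands in cluster 0 (a share of 1.0), while the IPAs are spread over three clusters (a share of 0.5), which is the kind of contrast the graphs were meant to reveal.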

Eddie: And you used Shiny to build the app, yes?

Amanda: Exactly. RStudio developed Shiny as a way for data scientists to better be able to play with their data without having to learn D3 [JavaScript library] and other heavier or more complex software development tools. They make it pretty easy to host your app on their servers, configure the type of instance you need, monitor its usage, all that good stuff.

So yeah, the clustering approach was where I was messing around to see whether, in looking at a single style, most of the data points in that style map up to one single cluster or whether it’s more spread across the board. Then, from there, I took a more supervised learning approach. The question there was a little different: can we accurately classify a given beer into its correct style given these predictor variables? Now, granted, this was being done without the sorts of variables that Kris is trying to gather with his consumer-facing app, such as ratings and flavor profiles; we’re relying on a small subset of possible variables, and of course working without the benefit of human feedback.

Eddie: Great point. So what’d you do next?

Amanda: The next thing I did was move into supervised learning. This is what a lot of machine learning focuses on, looking at the question of whether we can classify new data based on the data that came before it. You train a model on a bunch of data, and then you attempt to classify on top of that foundation.

I used a couple different methods. One was a random forest algorithm, which is essentially a collection of decision trees that all vote on a classification together. You might think that a decision tree by itself would be a decent way to classify data, like a basic taxonomy that groups high IBUs here or low ABVs there, moving all the way down the tree until you’ve got every possible style, every outcome variable. But using a single tree tends to overfit the data. Of course that’s one of the major problems in machine learning — overfitting the data you have — since you don’t know when there might be a flock of Black Swans you didn’t see coming in the next batch of data.

What a random forest does is intentionally inject randomness into an algorithm. You take a bunch of decision trees and train them only on some subset of the data, and sometimes using just a subset of the predictor variables. Naturally, working on a subset, they don’t train as closely to the entire population, but that’s actually a good thing, because it avoids overfitting the whole of your data.
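As a sketch of the randomness being described, the Python snippet below draws the inputs each tree would train on: a bootstrap sample of rows plus a random subset of predictors. The feature names come from the variables mentioned in the interview, but the helper name and sizes are hypothetical. Note one simplification: many random forest implementations re-draw the predictor subset at every split rather than once per tree.

```python
import random


def draw_tree_inputs(n_rows, feature_names, n_features, rng):
    """Pick the rows and predictors one tree in the forest would see."""
    # Rows are sampled with replacement, so some repeat and some are left out.
    row_indices = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Predictors are sampled without replacement.
    features = rng.sample(feature_names, n_features)
    return row_indices, features


rng = random.Random(42)
feature_names = ["abv", "ibu", "srm", "total_hops", "total_malts"]

# Inputs for a toy forest of five trees over a ten-row dataset.
tree_inputs = [draw_tree_inputs(10, feature_names, 3, rng) for _ in range(5)]
```

Because each tree sees a slightly different slice of the data, the quirks one tree memorizes tend not to be shared by the others.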

Eddie: Can you expand a little more on why that’s so crucial? Maybe give an example that explains why overfitting a model is such an issue?

Amanda: Oh, yeah, sure — a model that’s overfit is one that relies too closely on the training data that you’ve given it, and that could bias its outcomes when you give it new testing data. This is a problem when the distribution of a variable in our sample set doesn’t accurately reflect the corresponding distribution in the full population. Generally we expect it to, if we’ve actually got a random sample, and if we take many random samples and we’re just interested in the population mean for instance, we can expect over repeated samples we’ll arrive at something close to that thanks to the Central Limit Theorem. But, other times our sample is biased in some way that we don’t know about.

So let’s say we’re looking at the population of the United States, and for whatever reason, we accidentally only surveyed people in California. We train a model on only Californians trying to predict, I don’t know, say number of words spoken per minute — like how quickly people talk. We’ve got our model predicting how fast someone will talk given their age, race, level of schooling, favorite ice cream, whatever. But then if we test the model on people from New York, they’re going to be fundamentally different in some key ways from people in California. We’ll probably predict that they speak way slower than they actually do because we’ve only trained on these laid-back West Coasters.

Similar principle here. If we’re training only a single tree on say 80% of our data and testing on 20%, if we train too closely to that 80% then we run the risk of overfitting to whatever nuances showed up in that 80% and then when it comes time to test on the 20% we’re actually not able to classify that accurately. Same principle in the broader sense, so with our sample of data, these beers — and there must be more beers out there, I only have about 60,000 of them in this set…

Eddie: That’s a lot of beers!

Amanda: It is, but it’s still small data, and you don’t want to overfit a single tree to those 60,000 or so because it might not be representative of the total population of beers out there. So that’s where a random forest helps. And basically, at the end, each decision tree will classify a given beer into a single outcome, a single style, and then they’ll all vote. So suppose we have seven trees and four of them say it’s a stout and three of them say it’s a porter, then okay, we’ll label it a stout — that kind of deal, but not always that closely divided.
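The seven-tree example above amounts to a simple majority vote, which can be written in a few lines of Python. The `majority_vote` helper is a name invented for this sketch.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Four trees say stout, three say porter.
votes = ["stout"] * 4 + ["porter"] * 3
winner = majority_vote(votes)
```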

While we’re talking accuracy, I’d tend to get around 40% accuracy, so meaning that, when presented with our input variables for an unknown beer, the algorithm was able to correctly identify its labeled style a little less than half the time. But remember the premise, right? I wasn’t necessarily expecting that styles naturally demarcated the beer landscape well, so high accuracy wasn’t the goal like it normally is. Instead, the classification model was a test to see whether the variables that we had were enough, in and of themselves, to be a strong indicator of whether a beer was one style or another. A high degree of accuracy and we would have been able to be like yes, style does seem to be a useful construct in separating beers. But a low accuracy measure doesn’t necessarily mean that the beer landscape is just a mush of different styles; it could be that we just don’t have enough of the right features in the model. Which I think is a very real possibility.
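For readers who want to see the mechanics, here is a minimal Python sketch of an 80/20 train/test split and an accuracy calculation. The labels are invented, and the uniform-guess baseline assumes exactly 40 styles, which is only an approximation of the "40-odd" mentioned earlier.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Share of predictions that match the labeled style."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


rows = list(range(100))  # stand-ins for 100 beers
train, test = train_test_split(rows)

# Invented predictions for five held-out beers: three of five are right.
actual = ["stout", "porter", "ipa", "ipa", "wheat"]
predicted = ["stout", "stout", "ipa", "wheat", "wheat"]
acc = accuracy(predicted, actual)

# Guessing uniformly at random among an assumed 40 styles.
chance = 1 / 40
```

Comparing a model's accuracy against a guessing baseline like `chance` is what gives a figure such as 40% its meaning.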

So, yeah, haha, on its face the accuracy measure seems not great, but random chance would’ve been like 3%, so not the worst either. And that’s without doing a ton of hyperparameter tuning or anything — this wasn’t for work…

Eddie: Haha, yeah, we’ll wait until Revolution calls us up…

Amanda: Ah, yes, the dream! Anyway, I was pretty happy with that, given only alcohol, bitterness, and color. And with a lot of these, I think an educated consumer would be able to distinguish them if they tasted them, but only because maybe there’s a wheaty flavor or carbonation that obviously wasn’t part of our training data.

Eddie: Did you try any other methods?

Amanda: Yeah, so, after that, I decided, for contrast, to run a small neural net to essentially do the exact same task from a different perspective. Neural nets use a totally different architecture and are based very loosely on how neurons fire and wire in the human brain. In an artificial neural net you can generate a classification for a given input — and you can do a regression as well, but we were interested in classifying here — so small neural net, one hidden layer, really simple, just kind of dipping my toes into another distinct approach.
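To show what "one hidden layer" means in practice, here is a forward pass through a tiny classifier in plain Python. Everything specific here is an assumption for illustration: the ReLU activation, the softmax output, the layer sizes, and the untrained, made-up weights. The interview does not describe the actual network's configuration.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(xs):
    return [max(0.0, x) for x in xs]


def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, params):
    hidden = relu(dense(features, params["w1"], params["b1"]))
    return softmax(dense(hidden, params["w2"], params["b2"]))


# Made-up, untrained parameters: 3 inputs -> 4 hidden units -> 3 styles.
params = {
    "w1": [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1], [0.6, 0.2, -0.2], [0.1, 0.1, 0.1]],
    "b1": [0.0, 0.1, -0.1, 0.0],
    "w2": [[0.3, -0.2, 0.5, 0.1], [-0.4, 0.6, 0.2, 0.0], [0.1, 0.1, -0.3, 0.2]],
    "b2": [0.0, 0.0, 0.0],
}

# Standardized (ABV, IBU, SRM) values for one hypothetical beer.
probs = predict_proba([0.5, -1.2, 0.3], params)
predicted_index = probs.index(max(probs))
```

Training would adjust the weights so that the highest probability tends to fall on the labeled style; that step is omitted here.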

Eddie: And did that work better or worse than the random forest?

Amanda: It actually worked about the same — around 40% — which is, to the point, either a sign that we don’t have all the right variables or that the “lines” separating styles are fairly blurry. This sort of problem is well-suited to random forests, which are usually pretty good at classification when applied properly. In this case, yeah, maybe I could broaden the input variables or tune the models better, but perhaps it starts to suggest that the common styles don’t do as good of a job describing beers as everyone thinks.

Eddie: That’s really interesting. Was this what surprised you the most? Or was there something else in your findings that really stood out?

Amanda: Hmm… good question…

Eddie: Maybe you weren’t surprised by anything?

Amanda: So, this wasn’t an experimental finding, per se, but I was a little surprised by how dirty the data was. And that’s no knock on BreweryDB — their documentation was really great and they’re a community-run operation — but it’s a reminder that even well-curated databases can have data cleanliness issues. There was a good amount of beer names that, say, started with quotes and jumped to the top of the alphabetical list, that kind of thing.

I did do a little foray into hops to see whether, if you add more hops — not by volume, but the total number of distinct varieties — the bitterness of the beer increases. If you think about it there’s no a priori reason to think that there’d necessarily be a causal relationship there because you could add a bucket of a single type of hops to one beer versus adding a few teaspoons each of five different types of hops to a second beer and the first beer with the bucket of a single type of hops would definitely be more bitter. But, still, not too surprising there was a significant positive trend: the more types of hops you had, the more bitter the beer tended to be on average, and that had some style implications.
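One simple way to look for a trend like this is a Pearson correlation between the number of hop varieties and IBU. The Python sketch below uses invented numbers, and correlation is only one possible approach; the interview does not say which test produced the significant result, and no significance test is computed here.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Invented data: distinct hop varieties per beer and the beer's bitterness.
hop_varieties = [1, 2, 2, 3, 4, 5]
ibu = [20, 25, 40, 45, 60, 70]
r = pearson_r(hop_varieties, ibu)
```

A value of `r` near 1 indicates that more varieties go along with higher bitterness in the sample, without saying anything about why.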

Even though most of the findings weren’t too surprising, I still figured the project was worth showing to some people who might be interested in seeing the approach, or interested in contributing. So when asked, I drafted everything up (using an R Markdown format that lets you combine code and slides pretty seamlessly and plus is easier to version control than something like PowerPoint) and presented the findings at R Ladies Chicago. I was really excited to see how people engaged with the project and tested my assumptions, methods, code, all that — on the spot. It was a great night and part of the reason why I’ve become a co-organizer of the group. I love the collaborative and insightful energy these women bring to data and programming problems, and the organization is a great space where those ideas are heard and valued. It’s taught me a ton already.

Eddie: That’s awesome. I was really glad to hear you did that.

Amanda hosting a recent Lunch-and-Learn presentation on behavioral economics at Earlybird.

Eddie: Switching gears a little — you mentioned data cleanliness being a factor — what would you tell a prospective client about using this sort of data or public datasets in general? Are cleanliness and availability still leading factors in businesses getting the most from data science efforts? Companies that are new to using data to solve problems — where are they stumbling?

Amanda: It really depends company to company. In my experience, newer companies tend to have cleaner data, at least for their internal operations, because it’s not so legacy and not cobbled together from what used to be flat files or converted from antiquated databases. But I think data, like anything, tends toward entropy. Unless you work to maintain a dataset, keeping up with changes and keeping up strong documentation, data tends toward dirtiness.

So I guess the advice I’d have for businesses — other than call us, haha — is to find any way possible to get around collecting or inputting data by hand. Humans are human, and if you must use a human-facing form, it’s better to use a dropdown or autocomplete — something with a set number of choices that’s stored as an enum in the database. That’s almost always better than having to go back later and do some NLP (natural language processing) stuff on any free text that’s been entered. Of course, there are plenty of times when that’s unavoidable, and we’re definitely not against using NLP around here…

Eddie: Those tools have gotten a lot better.

Amanda: Absolutely. But yeah, collecting fewer things manually. Nothing super advanced or overly technical, just a matter of keeping operational data as clean and useful as possible.
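The dropdown-backed-by-an-enum idea mentioned above can be sketched in Python with the standard `enum` module. The `BeerStyle` choices and the `parse_style` helper are hypothetical names for this example.

```python
from enum import Enum


class BeerStyle(Enum):
    """The fixed set of choices a dropdown would offer."""
    IPA = "India Pale Ale"
    STOUT = "Stout"
    WHEAT = "Wheat Ale"


def parse_style(raw):
    """Accept only values that match a known choice; reject free text."""
    try:
        return BeerStyle(raw)
    except ValueError:
        return None  # a typo or free-text entry never reaches the database


valid = parse_style("Stout")
rejected = parse_style("stoutt")
```

Constraining input this way means a misspelling is caught at entry time instead of surfacing later as a data-cleaning task.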

And that’s not to say that I don’t like cleaning up data. They say that 80% of a data scientist’s day is spent cleaning and munging data, and they’re not wrong. If anything it can be more. But in my book that’s the fun part. You’re wrangling chaos into a neat tabular format of values that can be meaningfully compared to each other to uncover real relationships between things you’re interested in.

Eddie: Cool. That’s great. So let’s wrap it up here with some quick final questions: favorite kind of beer, least favorite kind of beer, and most overrated beer?

Amanda: Haha — least favorite and most overrated might be the same… I’m going to take the opportunity to knock sours here…

Eddie: Oh no!

Amanda: Yep. I’ll just say it: sours are barely beer. We’re talking like bastardized cider here. Now, I’m here for the ciders, but I’m not here for the sours. They’ll ruin your palate for the night, and I’m sure there’s a good one out there somewhere, but I certainly haven’t met it yet.

As for a favorite? Hmm… I started off being a real fan of wheat beers — especially German hefeweizens, you gotta love a good witbier — “vit” beer if you’re Belgian…

Eddie: And I am…

Amanda: Ha! Of course. But now, I’m on the IPA train and can’t seem to get off…

Eddie: Haha, nothing wrong with that. Anyway, thanks, Amanda, for sitting down and sharing some knowledge with us. We’ll include some links to your GitHub repository and presentation for the project, the Shiny app you mentioned, and encourage anyone reading to drop us a note at hello@earlybird.co if they have any questions, suggestions, or sour-loving hate mail.

Amanda Dobbyn is a data scientist at Earlybird and one of our project management leads. She holds a degree in psychology from the University of Chicago, where she was previously a research fellow in the Experience and Cognition Lab.

Transcript has been lightly edited for accuracy and readability.