Asking ‘Data’ Questions

Sarah Smith
4 min read · Nov 11, 2019

Using Data Science to answer business questions.

As I move further through my Data Science bootcamp, I keep thinking about ways I could apply my new skills, and I keep coming back to my previous job as a restaurant manager. The owner of the restaurant always believed in “running the business by the numbers”, but, despite having a great sales system with a breakdown of our daily sales, we never quite knew what to do with those numbers. This reminds me of something Hugo Bowne-Anderson said to us in a talk about the Data Science industry: that Data Scientists need to

“take a business question and turn it into a data science question, get a data answer and turn that back into a business answer.”

With this in mind, I think that in my previous position, not only did I not have the data science skills necessary to evaluate the data, but the business owner did not know the question he wanted answered (which is a topic I will explore at a later date).
This brings me to the main subject of this blog, which is something of a continuation of my previous blog about the brewing industry, and Matthew’s blog about uses of machine learning in biological processes. When Matt and I were paired for a project, we wanted to take our love for craft beer and see how data science could help the industry (at a very basic level). We reached out to a number of breweries, not quite sure what question we were looking to answer, but hoping for some data we could explore and, with luck, a question we could answer. Spoiler alert: I didn’t discover that question!

We were lucky enough to get data from one of the breweries we reached out to! Here’s a little preview of what the data looks like:

We were told what the stages stood for, but that was it. From our (limited) domain knowledge, we were able to work out what the rest of the columns referred to. There are a number of different ways of approaching this data: we have readings for individual batches at 3–4 different stages of the process, so we could look at it at a batch level or at a stage level.

Our initial thought was that this could be a time-series type problem. In that case we would look at the data at a stage level and use the model to predict the next stage in the process (and so be able to predict when fermentation would finish). The other approach would be a predictive model at a batch level: use each batch's readings to tell whether or not that batch had finished fermenting.

To begin, I looked at the stages of each batch, and, as with most EDA, my first step was to draw some graphs!

hmmmm…

Pretty quickly I realized that actually, this was not going to work as a time series model, as the stages weren’t really time dependent. Yes, they do change with time, but the factor that we’re measuring is more a product of how the properties of the beer change.

I then moved on to the idea of a predictive model at a batch level, but I was stopped in my tracks. For this I would need readings of the beer from before it had finished fermenting. However, all of the readings we have for OE, AE & RE begin at the ‘EF’ stage, which is the end of fermentation.
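A check like this can be sketched in a few lines of Python. This is only a minimal illustration, not the code we actually ran: the batch ids, the stage labels other than ‘EF’, the stage ordering and every number below are made up, and the real data would need its own column names.

```python
# Hypothetical stage ordering: only 'EF' comes from the brewery's data;
# 'START' and 'POST' are placeholder labels invented for this sketch.
STAGE_ORDER = ["START", "EF", "POST"]

# Made-up rows for illustration only; these are not the brewery's readings.
rows = [
    {"batch": "B1", "stage": "START", "OE": None, "AE": None, "RE": None},
    {"batch": "B1", "stage": "EF", "OE": 12.1, "AE": 2.4, "RE": 4.2},
    {"batch": "B1", "stage": "POST", "OE": 12.1, "AE": 2.3, "RE": 4.1},
    {"batch": "B2", "stage": "EF", "OE": 14.0, "AE": 3.0, "RE": 5.0},
    {"batch": "B2", "stage": "POST", "OE": 14.0, "AE": 2.9, "RE": 4.9},
]


def earliest_stage_with_readings(rows, columns=("OE", "AE", "RE")):
    """Return, per batch, the earliest stage that has a value in every column."""
    earliest = {}
    for row in rows:
        # Skip rows where any of the readings is missing.
        if any(row.get(col) is None for col in columns):
            continue
        batch, stage = row["batch"], row["stage"]
        if (batch not in earliest
                or STAGE_ORDER.index(stage) < STAGE_ORDER.index(earliest[batch])):
            earliest[batch] = stage
    return earliest


earliest = earliest_stage_with_readings(rows)
print(earliest)
```

If every batch comes back with ‘EF’ (or something later) as its earliest stage, there are no in-progress readings for a model to learn from.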

At the end of this exploration, two things became obvious to me:
1. Realistically, the best way to know if a beer is fermented is to simply monitor the ‘RDF’.
2. We don’t have enough data to build a predictive model. We would need at least the initial starting readings for each batch in order to build a model that predicts when that batch has finished fermenting.
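To make the first point a little more concrete, here is a rough sketch of what “simply monitoring the RDF” could look like as a rule rather than a model. It assumes we had a series of RDF readings for a batch over time (which we did not), and the `tolerance` and `window` values are placeholders I picked for illustration, not brewing guidance.

```python
def rdf_has_plateaued(readings, tolerance=0.5, window=2):
    """Return True if the last `window` changes in RDF are all below `tolerance`.

    `readings` is a list of RDF values (in %) in the order they were taken.
    """
    # We need window + 1 readings to compute `window` consecutive changes.
    if len(readings) < window + 1:
        return False
    recent = readings[-(window + 1):]
    changes = [abs(later - earlier) for earlier, later in zip(recent, recent[1:])]
    return all(change < tolerance for change in changes)


# Made-up readings for illustration only.
still_fermenting = [20.0, 45.0, 60.0, 66.0]
levelled_off = [45.0, 60.0, 66.0, 66.2, 66.3]

print(rdf_has_plateaued(still_fermenting))
print(rdf_has_plateaued(levelled_off))
```

A rule like this needs no training data at all, which is part of why a machine learning model felt like the wrong tool here.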

While it may seem that I didn’t get anything that I set out to achieve, this has been an excellent learning process for me. Here are some of the main things I’ve taken away from this:

  • Domain knowledge is important.
  • Having the right data is important.
  • Just because you have data doesn’t mean you will get something from it without posing the right question.
  • Just because you can build a machine learning model doesn’t always mean you should.
  • Even if you do have a ‘business question’, turning that into a data question is harder than you might think, and it is a skill that needs to be developed.

In summary, while we couldn’t build a model in this case, I do think that there is still more to be explored in the brewing industry. The fermentation stage just might not be the best place to start. I think that if you were to look at recipe development, you could potentially optimize the process at the start rather than at the end.

Written by Sarah Smith

Data Scientist @ Flatiron, passionate about data, science and beer.