Iowa Liquor Data

For our second Data Science Immersive project we were asked to work with a huge spreadsheet of data from the Iowa Liquor board containing sales data on class E liquor licenses from 2015 and the first quarter of 2016. Using that data we had to build a model to predict the sales for all of 2016.

This is an excellent project to display the use of linear regressions.

I’ll walk through some of the steps I took to prepare and analyze the data.

First, I had to clean the data, convert some columns to datetime, rename a few columns, and find out which columns had null values, etc


A few examples of my data cleaning process.


I knew in advance what I wanted my model to look like:

I would predict the sales for the last 3/4 of the year based on the sales from the first quarter of the year.

To build the model, I needed to transform the data frame by summing all of the sales by store by quarter. I did this using a series of pandas groupby statements, which I then merged into one dataframe.

There’s definitely a much simpler way to do this, but if it works, it works.
Generating the final dataframe I can use to model.

Some store information had to be dropped in the final version of the model because there were some null values in the dataframe. Some stores didn’t have sales in one of the quarters, which meant I couldn’t include that in the modeling.

This isn’t ideal, but since there are millions of rows of data, I don’t think it will have a big impact on my model.


Variable Selection

I created some correlation matrices to see which variables are closely correlated and the answer was…all of them.

That’s not too surprising since they’re all describing the same thing — the amount of alcohol sold by store.

Since they’re all so related I’ll just chose one variable — the total sales per store by quarter.

Model Building

Since the goal of the model is to predict the sales for the entire year based on the first quarter, I built the model to take the first quarter 2015 sales and predict the quarter two, three and four sales for 2015.

Again, the model can be visualized like this:

So to create it I

Applying the model

Now that I have this model, I can use the model to predict the Q2,3 and 4 sales for 2016. By inputting the 2016 Q1 data like:

The results are:

So our total predicted sales for 2016 is:



It’s important to note that there are some shortcomings to this model.

This model does not account for any potential changes in economic conditions or tax policy in the remainder of 2016. We assume nothing changes in the larger economy between Q1 and the other quarters.

The model also only accounts for stores that had sales for Q1 of 2016. There may be new stores opening that could change our predictions.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.