Drink, Iowans. Drink!

Predicting Iowa Liquor Sales

Chaim Gluck
3 min read · Sep 6, 2017


For this project, I used a data set that is available here from the Iowa state website. (Fair warning: the full data set is huge! If you download the whole thing, it will take a long time to perform many kinds of operations on it.) The Iowa state government maintains a bunch of interesting, publicly available data sets that make for some really cool projects. The data is updated monthly and runs up to the current month, but the version I used for this project contained the information for 2015 and the first quarter of 2016. The goal of the project was to predict the amount of sales for the rest of 2016. Here is a look at the data set:

In order to do a project like this, you need some interesting tools. The first, and most fundamental, is Python. Python is a powerful, robust, flexible, and friendly programming language. That's right, I said friendly. Its easy syntax makes it a popular language for beginners, and its unending malleability endears it to experts. The crazy thing about Python is that a beginner can write and implement code on their first day, while seasoned programmers still learn new tricks after years of coding.

First, I loaded the data into a pandas DataFrame. Pandas is a Python library that gives us the ability to easily and flexibly manipulate and perform operations on large sets of data. It's sort of like a spreadsheet, only much quicker and far more powerful. For comparison, Excel only allows you 1,048,576 rows, while pandas is limited only by your computer's memory. This data set had over 2 million rows, so Excel wouldn't even have been able to hold it.
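Loading the data is a one-liner with `pd.read_csv`. Here's a minimal sketch using a tiny in-memory stand-in for the real file (the column names and values are illustrative, not the exact Iowa schema):

```python
import io
import pandas as pd

# A tiny stand-in for the real CSV download from the Iowa site.
csv_data = io.StringIO(
    "Date,Store Number,Sale (Dollars)\n"
    "2015-01-05,2633,81.00\n"
    "2015-02-11,2633,45.50\n"
    "2016-03-02,4829,120.75\n"
)

# parse_dates converts the Date column up front so we can filter by year/quarter later.
df = pd.read_csv(csv_data, parse_dates=["Date"])

print(df.shape)   # (3, 3) here; over 2 million rows in the full data set
print(df.dtypes)
```

For the real file you'd pass the path instead of the `StringIO` buffer, and the load itself can take a while at 2 million-plus rows.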

After prodding, pushing, and coaxing the data into the necessary form comes the most strategic aspect of the process: building a model. A model is how we describe the relationship between the information we have and that which we seek to predict. By understanding the relationship between sales in the first quarter of 2015 and sales for all of 2015, we can extrapolate that relationship to predict 2016 from its first quarter. When building a model, you have to decide which variables you have (or can create from information you have) will most accurately and effectively capture that relationship.
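In pandas, setting up that relationship amounts to aggregating sales two ways per store: first-quarter totals (the predictor) and full-year totals (the target). A sketch with toy numbers (the column names and figures are hypothetical, not the real Iowa data):

```python
import pandas as pd

# Toy per-store sales standing in for the real 2015 records.
df = pd.DataFrame({
    "Store Number": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime([
        "2015-02-01", "2015-07-15", "2015-11-30",
        "2015-03-10", "2015-06-20", "2015-12-05",
    ]),
    "Sale (Dollars)": [100.0, 150.0, 120.0, 200.0, 180.0, 220.0],
})

df_2015 = df[df["Date"].dt.year == 2015]

# Q1 sales per store (the predictor)...
q1 = (df_2015[df_2015["Date"].dt.quarter == 1]
      .groupby("Store Number")["Sale (Dollars)"].sum())

# ...and full-year sales per store (the target).
full_year = df_2015.groupby("Store Number")["Sale (Dollars)"].sum()

features = pd.DataFrame({"q1_sales": q1, "year_sales": full_year})
print(features)
```

Each row of `features` then pairs a store's Q1 sales with its full-year sales, which is exactly the relationship the model learns.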

For this model, I used a technique called linear regression. Linear regression is a relatively simple and very common model. It uses some algebra and calculus to find the line that best describes the relationship between the variables with the least amount of error. Then, when we obtain new data, that line gives us an estimate of the target variable for each new point.
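With scikit-learn, fitting that line takes a few lines. Here's a minimal sketch on made-up numbers, where full-year sales run roughly four times Q1 sales plus some noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: Q1 sales per store (X) and matching full-year sales (y).
X = np.array([[100.0], [200.0], [300.0], [400.0]])
y = np.array([410.0, 790.0, 1210.0, 1600.0])

model = LinearRegression()
model.fit(X, y)

# The fitted line: full_year ~ coef * q1 + intercept
print(model.coef_[0], model.intercept_)

# Estimate full-year sales for a store with $250 in Q1 sales.
pred = model.predict(np.array([[250.0]]))
print(pred[0])
```

The same pattern applies to the real data: fit on 2015's Q1-vs-full-year pairs, then feed in Q1 2016 to predict the rest of 2016.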

Another interesting technique I used is cross-validation, which splits the data set into several parts, trains the model on all the parts except one, and then tests its performance on the leftover one. It saves the model's score, and then iteratively performs the same process on all the other parts. This ensures that we're not building a model that fits the training data too well but wouldn't perform well on new data. If you run cross-validation and the scores for each section are very different, you know that your model isn't generalizing well across different data.
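Scikit-learn wraps that whole loop in `cross_val_score`. A minimal sketch on synthetic data standing in for the store-level features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: full-year sales ~ 4x Q1 sales, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(50, 500, size=(40, 1))
y = 4.0 * X[:, 0] + rng.normal(0, 20, size=40)

# 5-fold cross-validation: train on 4 folds, score (R^2) on the
# held-out fold, and repeat so every fold is the test set once.
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(scores)         # one R^2 score per fold
print(scores.mean())  # similar scores across folds => the model generalizes
```

If one fold's score is far below the others, that's the warning sign that the model is overfitting or that the data isn't uniform across the splits.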

My regression model predicted that sales would rise about 2.6% in 2016 over 2015. This was largely based on increased sales in the first quarter of 2016, which the model extrapolated to the rest of the year.

This project was the first I’ve done with a data set this large. It was the first time that I had to worry about writing efficient code to process my data quickly. I had new challenges, and I learned new things.


Chaim Gluck

Freelance Data Scientist. Published by Towards Data Science. Say hello at www.linkedin.com/in/chaimgluck