My First Jump Into Big Data
This past week I had my first experience with a really huge dataset. Project 2 contained nearly 3 million rows from the Iowa Liquor sales data, a data-set maintained by the State of Iowa to keep track of its liquor distribution system (which is pretty crazy in and of itself; more info on how it works can be found here: Iowa Liquor Info). Each row in the data-set represented a sale from the state to a private retailer, with columns including store numbers, sale totals, bottle volumes and prices, as well as geographic location by city, ZIP code, and county. The goal of the project was to clean this data-set of null values and formatting errors, look for outliers, and do all the other fun EDA-type things, as well as create what we thought would be the best predictive model of sales for the coming year based on all of the info that we had. The two main learning take-aways of the project were supposed to be using pandas to manipulate the data frame, and modeling.
While modeling is definitely the “sexier” of the data science terms, my write-up will be about some cool (for a data scientist at least) pandas code snippets that I used to play around with the data-set. Learning to “view” the data without actually being able to even look at it (keep in mind that the final shape of this data frame was 2.7 million rows by 22 columns) was a really cool experience for me and something I would like to write about.
NOTE: For the following examples I am using a ten percent data sample to make the numbers easier to work with.
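Since the original notebook cells aren't shown here, a minimal sketch of that sampling step looks something like this (the tiny frame below is a made-up stand-in for the real sales table, and the column names are my assumption):

```python
import pandas as pd

# Toy stand-in for the full sales table (the real one is ~2.7M rows x 22 cols).
df = pd.DataFrame({"Store Number": range(100),
                   "Sale (Dollars)": [x * 9.99 for x in range(100)]})

# Take a ten percent random sample; random_state keeps it reproducible.
sample = df.sample(frac=0.10, random_state=42)
print(sample.shape)  # (10, 2)
```

`DataFrame.sample(frac=...)` draws rows without replacement, so the sample keeps the same column structure while cutting the row count.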
During my initial EDA, the data-set contained 2,854 missing values in total. The missing values were all in the county name, county number, category name, and category number columns. Since those county/number and category/number pairings are all referenced elsewhere in the data-set, I looked for a way to fill in the missing values by plugging them in from somewhere else. I began by creating a dictionary of each unique county name/county number and category name/category number pairing, in both directions.
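Counting the nulls and building the two-way lookup dictionaries can be sketched like this (the little frame below is made-up example data, not the real sales table):

```python
import numpy as np
import pandas as pd

# Toy frame with the same pattern of missing values as the real data-set:
# a row may be missing its county name or its county number, but the
# intact pairing appears elsewhere in the frame.
df = pd.DataFrame({
    "County": ["Polk", np.nan, "Linn", "Polk", np.nan],
    "County Number": [77, 77, 57, np.nan, 57],
})

# Total missing values across the whole frame.
print(df.isnull().sum().sum())  # 3

# Build lookups in both directions from the rows where both fields exist.
complete = df.dropna(subset=["County", "County Number"])
name_to_num = dict(zip(complete["County"], complete["County Number"]))
num_to_name = dict(zip(complete["County Number"], complete["County"]))
print(num_to_name)  # {77.0: 'Polk', 57.0: 'Linn'}
```

Dropping the incomplete rows first means each dictionary is built only from pairings that are known to be intact.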
I then created a “rule” for each type of missing value, which I saved under the variable ‘mask’, and ran it against my data-set, giving me a boolean Series of Trues and Falses for each of those columns.
NOTE: I will use column one as an example.
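A sketch of what such a mask rule might look like for the category-name column (again on made-up rows; the category numbers are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Category Name": [np.nan, "VODKA", np.nan, "WHISKEY"],
    "Category Number": [1031100.0, 1031100.0, np.nan, 1011200.0],
})

# The "mask" rule: True where the name is missing but the row still
# carries a category number we can look up elsewhere in the frame.
mask1 = df["Category Name"].isnull() & df["Category Number"].notnull()
print(mask1.tolist())  # [True, False, False, False]
```

Note that the row missing both fields stays False: there is nothing there to look up, which is exactly why some nulls survive to the end.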
I then ran a map function, referencing the appropriate dictionary, against the slice of the data-frame selected by mask1 (all of the rows where the category name was missing) and used the dictionary of category numbers to fill in the missing names.
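Putting the mask and the dictionary together, the fill step can be sketched roughly like this (the frame and dictionary values are toy examples, not the real data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Category Name": [np.nan, "VODKA", np.nan],
    "Category Number": [1031100.0, 1031100.0, np.nan],
})

# Lookup built earlier from the rows where both fields were present.
num_to_name = {1031100.0: "VODKA"}

# Rule for the rows we can repair: name missing, number known.
mask1 = df["Category Name"].isnull() & df["Category Number"].notnull()

# Map each masked row's category number through the dictionary and
# write the result back into the name column, touching only those rows.
df.loc[mask1, "Category Name"] = df.loc[mask1, "Category Number"].map(num_to_name)
print(df["Category Name"].tolist())  # ['VODKA', 'VODKA', nan]
```

Because the assignment goes through `df.loc[mask1, ...]`, rows that already had a name (or had nothing to look up) are left untouched.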
All in all I filled in over 550 values; the remaining null rows were either missing both the county number and the county name, or had a category name that wasn't referenced by a category number anywhere else in the data-frame. This is just a taste of the really cool things you can do with pandas that I discovered in this project.