How I need you.
To store finished models,
for my CPU.
As Data Scientists, a key part of our workflow is generating models. More often than not, we are fitting multiple models to our data to find which one works best and providing analysis based on the results. This means that if we’re dealing with 30,000 rows and 100 columns of data, which isn’t atypical, modeling may take a very long time. Furthermore, if we’re grid-searching to find the best hyperparameters for each model, the time it takes to fit a model and get results increases significantly. And there’s the problem: rerunning those models every time you run your code is computationally expensive and wastes a lot of time. Thankfully, enough people have had this problem to merit an amazing and whimsically named invention: pickling.
Pickling is a way to store Python objects for the future, and because scikit-learn models are ordinary Python objects (instances of classes), they can be pickled too. Functions and classes can be pickled as well, though only as references to where they are defined, not as copies of their code. To pickle an object is to serialize it, which means transforming it into a byte stream. A byte stream is a sequence of bytes, and a byte is an 8-bit unit of memory. Using the object again requires us to unpickle it, meaning we deserialize the byte stream back into an object.
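To make the serialize and deserialize steps concrete, here is a minimal sketch using Python’s built-in pickle module. The snack dictionary is a made-up example object, not something from a real project; pickle.dumps and pickle.loads work on bytes in memory rather than on a file.

```python
import pickle

# A made-up object to pickle (hypothetical example data).
snack = {"name": "pickle", "crunchy": True, "slices": [1, 2, 3]}

# Serialize: turn the object into a byte stream (a Python bytes object).
snack_bytes = pickle.dumps(snack)
print(type(snack_bytes))  # <class 'bytes'>

# Deserialize: rebuild an equal object from the byte stream.
restored = pickle.loads(snack_bytes)
print(restored == snack)  # True: same contents
print(restored is snack)  # False: it is a new copy, not the original object
```

The last two lines are the point: unpickling hands back a copy with the same contents, not the original object itself.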
A great analogy for pickling is a timeout in a game. The game is “saved” in a serialized state, and when the timeout is over, play resumes from that saved state. In other words, it’s deserialized. Neither team gained extra points, and the clock didn’t change. The only things that may have changed are where you’re continuing, what strategies you’re using, and what players you’re using. In terms of coding, that would be the same as a programmer using the pickled object in a different notebook, returning to the pickled object after analyzing their results, or reusing that pickled object in a different project where it’s relevant.
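It’s important to note that pickles don’t always travel well across versions. You can’t take a timeout during the November 12th Lakers vs Warriors game and resume play in the December 25th Lakers vs Warriors game. The same can apply in programming: a pickle saved in Python 2.7 may not load cleanly in Python 3.6 because the two handle strings and bytes differently, Python 2.7 can’t read pickles written with the newer protocols, and a scikit-learn model pickled under one version of the library isn’t guaranteed to load correctly under another. The safest bet is to load a pickle in the same environment that created it.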
I’m sure your next question is how to pickle. Well, that’s fairly straightforward. Let’s say we have a list named chips that contains strings of various brands of chips (e.g. Doritos, Cheetos, Fritos, Tostitos, … all the o’s). To create a pickle, we need to decide two things. First is the name of the object we want to pickle, which we named chips, and second is the name of the pickle file itself, which we will call chip_brands. With Python’s built-in pickle module imported (import pickle), exporting our pickle for future use is very easy, only 2 lines of code:
with open('chip_brands.pkl', 'wb') as pickle_out:
    pickle.dump(chips, pickle_out)
The 'wb' string tells our computer to open the file for writing in binary mode, so our pickle is written as a byte stream to a file on disk. Once that’s done, we just need to load the pickle when we need it, which is just as easy. We need 2 things: the name of the pickle file and a name for the variable that will hold the pickle’s contents, chip_brands in our case.
with open('chip_brands.pkl', 'rb') as pickle_in:
    chip_brands = pickle.load(pickle_in)
The 'rb' string tells our computer to open the file for reading in binary mode (aka read our pickle back from disk), and that’s it: we’ve created, exported, imported, and loaded our pickle. The chip_brands variable now holds a copy of the chips list we made earlier, and we can load that list into whatever notebook we want without recreating it.
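Put together, the whole round trip might look like the sketch below. The temporary folder is an assumption added here so the example cleans up after itself; in a real notebook you would probably just write chip_brands.pkl next to your code.

```python
import os
import pickle
import tempfile

chips = ["Doritos", "Cheetos", "Fritos", "Tostitos"]

# Assumption: a throwaway folder, so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "chip_brands.pkl")

    # Export: write the list to disk as a byte stream.
    with open(path, "wb") as pickle_out:
        pickle.dump(chips, pickle_out)

    # Import: read the byte stream back and rebuild the list.
    with open(path, "rb") as pickle_in:
        chip_brands = pickle.load(pickle_in)

print(chip_brands)
```

Here chip_brands ends up equal to chips, but it is a separate list rebuilt from the file.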
Now let’s go over a more realistic example. Let’s say we have a pipeline that count vectorizes some data and then passes the results to a Decision Tree Classifier model. We’ve grid-searched that model and found the hyperparameters that generate the best estimator, and we’ve cross-val scored that model 10 times. Let’s also say that whole process took us 12 hours. Without pickling, we would have to rerun that code every time we wanted to see the results of that model, which would be very time consuming. With pickling, we could simply assign the best estimator to some variable, pipe_cvdtc, pick a name for the pickle we want to export, pipeline, and export the pipe using:
with open('pipeline.pkl', 'wb') as pickle_out:
    pickle.dump(pipe_cvdtc, pickle_out)
Once again, all we had to specify were two things: the name of the pickle file to export and the name of the object we wanted to export. Loading the pickle is just as simple. All we need is the name of the pickle we want to load and a variable name for its contents, cvdtc_results.
with open('pipeline.pkl', 'rb') as pickle_in:
    cvdtc_results = pickle.load(pickle_in)
This gives us back the fitted best estimator pipeline that we pickled earlier, without having to wait 12 hours for the code to run again.
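One way to use this in practice is a “load it if it exists, otherwise fit and save it” pattern. The sketch below is only an illustration: expensive_fit is a hypothetical stand-in for the 12-hour grid search (it returns a placeholder dictionary, not a real model), and load_or_fit and fit_log are names made up for this example.

```python
import os
import pickle
import tempfile

fit_log = []  # records each time the "expensive" step actually runs


def expensive_fit():
    """Hypothetical stand-in for a long grid search; returns placeholder data."""
    fit_log.append("fit ran")
    return {"model": "placeholder", "squares": [n * n for n in range(5)]}


def load_or_fit(path):
    """Load the pickled result if it exists, otherwise compute and pickle it."""
    if os.path.exists(path):
        with open(path, "rb") as pickle_in:
            return pickle.load(pickle_in)
    result = expensive_fit()
    with open(path, "wb") as pickle_out:
        pickle.dump(result, pickle_out)
    return result


# Assumption: a throwaway folder, so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "pipeline.pkl")
    first = load_or_fit(path)   # no pickle yet, so the slow step runs
    second = load_or_fit(path)  # pickle exists, so it is loaded instead

print(len(fit_log))  # 1: the slow step ran only once
```

With a real scikit-learn pipeline, the same pattern would apply with the fitted best estimator in place of the placeholder dictionary.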
There are a couple of important things to consider with pickling. First, be careful of opening unknown pickles! Unpickling can run arbitrary code, so a malicious pickle can install malware, crash your computer, etc. Only open pickles from sources you trust. Second, pickling does not compress the object stored inside. Some pickled models can be very large files, so keep that in mind. And that’s it! Four lines of code and you’ve saved your time, your computational power, and your sanity. Now go eat a burger with extra pickles. You deserve it, you pickler you!
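P.S. If the warning about unknown pickles sounds abstract, here is a harmless sketch of why it matters. The Sneaky class is made up for this demonstration: its __reduce__ method tells pickle to call a function when the bytes are loaded. Here that function is just the built-in print, but it could be any callable.

```python
import pickle


class Sneaky:
    """Made-up class: instructs pickle to call print(...) when unpickled."""

    def __reduce__(self):
        # pickle stores "call this function with these arguments" and
        # runs that call at load time.
        return (print, ("This ran just because the pickle was loaded!",))


payload = pickle.dumps(Sneaky())

# Loading the bytes runs print(...); no Sneaky object comes back at all.
restored = pickle.loads(payload)
print(restored)  # None, which is what print() returned
```

Nothing in the loading code asked for a message to be printed, yet it was, which is exactly why pickles from untrusted sources are risky.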