Pickle your model in Python

Maziar Izadi
6 min read · Jan 6, 2020

As a data scientist, I am a big fan of Jupyter Notebook: it provides a user-friendly UI for writing code and getting feedback straight away. However, I’ve been wondering how to package my work and make it reusable.

  • What if I need to share my code with a stakeholder who doesn’t use Jupyter?
  • Does the model need to be trained every single time someone wants to see the result?
  • What if they don’t even have Python installed? How can a new dataset be fed into the model to get output?

On top of that, your code uses data in different formats, such as dictionaries, DataFrames and lists, as well as objects from modules and libraries, and you might need to save these to a file so that they can be used later on.

I will answer all these questions in my upcoming articles. But part of the solution appeared when I came across Python’s pickle module. As GeeksforGeeks explains it:

Python pickle module is used for serialising and de-serialising a Python object structure. Any object in Python can be pickled so that it can be saved on disk. What pickle does is that it “serialises” the object first before writing it to file. Pickling is a way to convert a python object (list, dict, etc.) into a character stream. The idea is that this character stream contains all the information necessary to reconstruct the object in another python script.

In other words, the pickle module implements binary protocols for serialising and de-serialising a Python object structure: pickling converts a Python object into a byte stream, and unpickling turns that byte stream back into an object.
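To make the idea concrete, here is a minimal sketch of a round trip in memory, using dumps() and loads(), which work on bytes instead of files. The record dictionary and its contents are made up for illustration.

```python
import pickle

# A made-up object to serialise (illustrative values only)
record = {"name": "model-v1", "weights": [0.5, 1.25], "trained": True}

blob = pickle.dumps(record)     # serialise: Python object -> bytes
restored = pickle.loads(blob)   # de-serialise: bytes -> a new Python object

print(type(blob))               # the pickled form is a bytes object
print(restored == record)       # same content after the round trip
print(restored is record)       # but a separate object in memory
```

The restored object is an equal copy, not the original object itself.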

The main benefit of pickle for data scientists shows up when working with machine learning algorithms. Imagine that you have split your data set, built a Logistic Regression model, trained it, tested it, re-trained it, and gone through the whole cycle until it reached your desired level of accuracy and precision.

  • What happens when you need to make a new prediction in the future using the same model?

Pickle lets you do that without having to rewrite everything or train the model all over again: you save the fitted model once and load it whenever you need it.

What can be Pickled?

In order to store data with pickle, you should know what can be pickled directly and what the workarounds are for objects that cannot. As DataCamp explains, the following can be pickled:

  • Booleans,
  • Integers,
  • Floats,
  • Complex numbers,
  • (normal and Unicode) Strings,
  • Tuples,
  • Lists,
  • Sets, and
  • Dictionaries that contain picklable objects.
  • You can also pickle classes and functions, if they are defined at the top level of a module.

Not everything can be pickled (easily), though: examples are generators, inner classes, lambda functions and defaultdicts. To pickle lambda functions, you need an additional package named dill. A defaultdict can be pickled only if its default factory is itself picklable, so create it with a module-level function rather than a lambda. For more info on dill, please check its PyPI documentation.
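A quick way to get a feel for these rules is to try pickling a few objects and see which ones fail. The sketch below uses a small helper, is_picklable, which is a hypothetical name introduced here and not part of the pickle module. It uses the built-in list as the defaultdict factory, as a stand-in for any importable top-level callable.

```python
import pickle
from collections import defaultdict

def is_picklable(obj):
    """Hypothetical helper: True if pickle.dumps(obj) succeeds."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

square = lambda x: x * x            # a lambda function
numbers = (n for n in range(3))     # a generator

results = {
    "dict of basic types": is_picklable({"a": [1, 2.5, True, 3 + 4j]}),
    "built-in function": is_picklable(len),
    "lambda": is_picklable(square),
    "generator": is_picklable(numbers),
    "defaultdict(list)": is_picklable(defaultdict(list)),
    "defaultdict(lambda)": is_picklable(defaultdict(lambda: 0)),
}

for label, ok in results.items():
    print(label, "->", "picklable" if ok else "not picklable")
```

The two defaultdict cases differ only in their factory, which is why swapping the lambda for a top-level callable is enough to make the object picklable.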

Pickling use-cases

PythonTips has gathered a few use cases from Stack Overflow, which I cite below. Pickle is very useful when you want to dump an object while coding in the Python shell: after dumping it, you can load the pickled object and deserialise it whenever you restart the shell.

  1. Saving a program’s state data to disk so that it can carry on where it left off when restarted (persistence)
  2. Sending Python data over a TCP connection in a multi-core or distributed system (marshalling)
  3. Storing Python objects in a database
  4. Converting an arbitrary Python object to a string so that it can be used as a dictionary key (e.g. for caching & memoization).

Other advantages of using Pickle Module, cited from:

  1. Recursive objects (objects containing references to themselves): Pickle keeps track of the objects it has already serialised, so later references to the same object won’t be serialised again. (The marshal module breaks for this.)
  2. Object sharing (references to the same object in different places): This is similar to self-referencing objects; pickle stores the object once, and ensures that all other references point to the master copy. Shared objects remain shared, which can be very important for mutable objects.
  3. User-defined classes and their instances: Marshal does not support these at all, but pickle can save and restore class instances transparently. The class definition must be importable and live in the same module as when the object was stored.
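The first two points can be checked with a short sketch. The variable names are illustrative; the behaviour shown is that of pickle itself.

```python
import pickle

# Object sharing: two references to one and the same list
shared = [1, 2, 3]
container = {"a": shared, "b": shared}

restored = pickle.loads(pickle.dumps(container))
still_shared = restored["a"] is restored["b"]

# Because the list is still shared, a change via one key is visible via the other
restored["a"].append(4)

# Recursive object: a list that contains itself
loop = []
loop.append(loop)
restored_loop = pickle.loads(pickle.dumps(loop))
still_recursive = restored_loop[0] is restored_loop

print(still_shared, restored["b"], still_recursive)
```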

Pickle vs cPickle vs Marshal vs Joblib vs JSON

A comparison between Pickle and other similar tools

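The cPickle module implements the same algorithm as pickle, in C instead of Python. It is many times faster than the Python implementation (some references claim up to 1000 times), but it does not allow the user to subclass Pickler and Unpickler. If subclassing is not important for your use, you probably want to use cPickle. Note that cPickle is a separate module only in Python 2; in Python 3, import pickle picks up the C implementation automatically when it is available.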

For more info on Joblib, I suggest reading its documentation here.

How to Pickle files

As usual, you start by importing the module:

import pickle

For this article, I am going to keep it simple with an easy example. However, I will write another article soon to introduce Flask and show how convenient it makes showcasing your work as a data scientist (stay tuned, very fun stuff). In that article, you will see a more complex use of pickle.

For now, let’s assume that we have a fitted Linear Regression model called regressor, created as in the following sample:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
# Fitting the model with training data (X: features, y: target, assumed to exist)
regressor.fit(X, y)

Now we want to save the model to disk. We simply use the dump() function in pickle, as follows:

with open('model.pkl','wb') as f:  # the with block closes the file afterwards
    pickle.dump(regressor, f)

There are a couple of points to pay attention to.

  1. A file name has to be passed to open(). This is the name of the pickled file that will be created. The file name does not need a .pkl extension; open('model','wb') works as well.
  2. When using the open() function, 'wb' is required. As explained above, the pickle module uses a binary protocol, so the file must be opened in write ('w') and binary ('b') mode.

In addition, the dump() function writes the pickled representation of the object (the model in our example) to the open file. Its signature is shown below; for further details, refer to the Python docs here.

pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
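The protocol argument is the one you are most likely to touch. The sketch below uses dumps(), which accepts the same protocol argument but returns bytes, to pickle a made-up dictionary with three protocol choices. Loading needs no protocol argument because the version is detected from the data.

```python
import pickle

# A made-up object standing in for whatever you want to save
data = {"coefficients": [0.1, 0.2, 0.3], "intercept": 2.0}

oldest = pickle.dumps(data, protocol=0)    # the original text-based protocol
default = pickle.dumps(data)               # protocol=None uses pickle.DEFAULT_PROTOCOL
newest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# loads() works out the protocol by itself
all_equal = all(pickle.loads(blob) == data for blob in (oldest, default, newest))

print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL, all_equal)
```

A higher protocol requires a more recent Python version to read the file, so the default is usually a sensible choice when sharing pickles with others.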

De-pickling files

Let’s say that I have my model pickled (or received a pickled file from another data scientist) and need to load it back into my Python program.

It is very simple and similar to the pickling process we saw earlier. We need to use the open() function again, but this time with 'rb' as the second argument: r stands for read mode and b for binary, like before.

The function used to de-pickle the file is load() from the pickle library, as below:

# Loading model to compare the results
with open('model.pkl','rb') as f:
    model = pickle.load(f)
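Putting both halves together, here is a self-contained sketch of the full cycle that runs without scikit-learn. The dictionary and the predict helper are hypothetical stand-ins for a fitted model; with a real regressor you would call its predict() method instead. The file is written to a temporary directory so the example leaves nothing behind in your working folder.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: y = slope * x + intercept (made-up values)
model = {"slope": 2.0, "intercept": 1.0}

def predict(m, x):
    """Hypothetical helper that applies the stand-in model to one input."""
    return m["slope"] * x + m["intercept"]

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

with open(path, "wb") as f:     # write + binary
    pickle.dump(model, f)

with open(path, "rb") as f:     # read + binary
    loaded = pickle.load(f)

# The loaded model should give the same prediction as the original one
same_prediction = predict(loaded, 3.0) == predict(model, 3.0)
print(same_prediction)
```

One caution when you receive a pickled file from someone else: unpickling can execute arbitrary code, so only load files from sources you trust.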

In a nutshell

As a data scientist, you need to serialise your objects (often several components at once) for reasons such as saving a fitted model to disk. Python provides the pickle library, which makes life much easier for data scientists who work with ML algorithms all the time. Using pickle, simply save your model to disk with the dump() function and de-pickle it into your Python code with the load() function. Use the open() function to create and/or read a .pkl file, and make sure you open the file in binary mode: 'wb' for writing and 'rb' for reading.

Upcoming next

  • Joblib and how to use it
  • Python Flask — Model output presentation has never been so easy


Maziar Izadi

I set goals ambitiously…I take actions quickly…I write…to learn…I play music… to meditate. https://www.linkedin.com/in/maziarizadi/