There are many libraries in the wild that take different views on how machine learning should be implemented. There are also common stages of developing a model, such as feature engineering, training, and validation, and library developers try to make them easier and faster. But most libraries don't care about the later steps of your ML workflow [https://cloud.google.com/ml-engine/docs/tensorflow/ml-solutions-overview/].
In software engineering, for instance, after you implement a program, you need to make sure that it works correctly. Once testing shows that your program works as intended, you release it, i.e. prepare some kind of artifact: a tarball, jar file, installer, or even a Docker image. This artifact can then be repeatedly deployed on client machines.
Machine learning, on the other hand, is a completely different story. Each library promotes its own vision of what a model artifact is, or even worse, doesn't care at all.
Pickle is the go-to export solution for libraries such as scikit-learn that don't provide a proper export mechanism of their own. This binary serialization format, developed for Python, really comes in handy in certain situations, as long as you keep in mind slow deserialization, large artifact sizes, and security vulnerabilities that allow execution of arbitrary code [https://www.benfrederickson.com/dont-pickle-your-data/].
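To see why the arbitrary-code-execution point matters, here is a minimal, stdlib-only sketch (the `Exploit` class name is mine): a pickle stores instructions for rebuilding objects, not just data, so unpickling can run attacker-chosen code.

```python
import pickle

# Any class can override __reduce__ to tell pickle how to "rebuild" it.
# A malicious pickle uses this hook to run arbitrary code on load.
class Exploit:
    def __reduce__(self):
        # On unpickling, pickle will call eval(...) instead of
        # reconstructing a harmless object.
        return (eval, ("__import__('os').getcwd()",))

payload = pickle.dumps(Exploit())
result = pickle.loads(payload)  # silently executes the embedded code
print(result)                   # prints the current working directory
```

This is why `pickle.loads` on untrusted bytes is equivalent to executing untrusted code, and why the format is fine for internal checkpoints but risky as a distribution artifact.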
There are many articles about the issues with Pickle, but people still use it. What makes the situation worse is the awkwardness of integrating pickled models into production. In my personal experience, I integrated a pickled model into a microservice and ended up building my own metadata-for-pickle file format with a complete description of the model artifact.
Adding insult to injury, slight differences in the Python version, or even in the dependencies installed in your Python environment, can cause weird runtime errors during unpickling. So you need to be extra careful to recreate the proper environment for pickled models.
Overall, I would use scikit-learn only for prototyping and Pickle only for internal purposes such as checkpointing. But if you are balls deep in scikit-learn and can't ditch it, I recommend looking into third-party model export solutions.
On the bright side, not all libraries abandon users with their models. TensorFlow, for instance, provides two ways to export a model: Freeze, which puts your model into a protobuf message as a checkpoint, and SavedModel, which, in addition to freezing, provides useful metadata and is meant to be the correct way of exporting your model for further deployment. The benefits are obvious: instead of an obscure binary archive, you get a strictly typed and documented protocol buffer message. No code execution inside the artifact and no implicit dependencies, just pure data. However, there is a catch: protocol buffers are not designed for big messages, and you might encounter problems parsing them (https://developers.google.com/protocol-buffers/docs/techniques).
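As a rough sketch of what the SavedModel round trip looks like (shown with the TensorFlow 2.x API; the `Doubler` module and the output path are mine, for illustration):

```python
import tensorflow as tf

# A minimal model: a tf.Module with one typed, traced entry point.
class Doubler(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def __call__(self, x):
        return 2.0 * x

model = Doubler()

# SavedModel writes a directory: the graph as a protobuf message
# (saved_model.pb) plus variables and metadata. Pure data, no code.
tf.saved_model.save(model, "/tmp/doubler")

# The artifact can be reloaded (or served) without the original class.
restored = tf.saved_model.load("/tmp/doubler")
print(restored(tf.constant([1.0, 2.0])).numpy())  # [2. 4.]
```

The resulting directory is what serving tools consume, which is exactly the "release an artifact" workflow software engineers expect.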
Keras is another example of a custom solution for model persistence. Since Keras by default executes on top of the TensorFlow backend, it's logical to assume that every valid Keras model is a valid TensorFlow graph, which opens up the option of using Freeze and SavedModel. But in addition to that, the Keras developers adopted the HDF5 format as a way to save your model. HDF5 is a hierarchical file format developed to store and organize tons of scientific data. The choice looks reasonable, but edge cases may hide pitfalls. For example, I had to write a Scala program to extract and analyze a model architecture from such a file. The API and tooling for the format turned out to be quite obscure and poorly documented, and what makes things worse is that Keras's FAQ is the only documentation of how models are saved, which is obviously not a proper source of information.
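A small sketch of what lives inside a Keras HDF5 file, assuming `h5py` is available (the model, path, and the exact attribute layout shown are illustrative; Keras stores the architecture as a JSON attribute and the weights in HDF5 groups):

```python
import h5py
from tensorflow import keras

# Save a tiny model to HDF5 (the .h5 suffix selects the format).
model = keras.Sequential([keras.Input(shape=(4,)), keras.layers.Dense(2)])
model.save("/tmp/model.h5")

# HDF5 is self-describing, so generic tools like h5py can peek inside
# without Keras: architecture in a JSON attribute, weights in groups.
with h5py.File("/tmp/model.h5", "r") as f:
    print(str(f.attrs["model_config"])[:80])  # start of the JSON architecture
    print(list(f["model_weights"].keys()))    # per-layer weight groups
```

This self-describing property is what made my Scala extraction tool possible at all; the pain was the lack of documentation of the layout, not the format itself.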
Keeping in mind all the possible file formats and library nuances, it's very easy to lose your temper. The community quickly realized that we need one single file format with sensible documentation and the ability to develop your own tools. Enter ONNX, the Open Neural Network eXchange format. Ideally, it allows us to deploy ONNX-exported models in any of the supported libraries. However, I encountered issues trying to serve a CNTK model in Caffe2. It looks like the format is still in development and lacks proper implementation and support in popular libraries.
In addition, there is a worrying tendency to implement different sets of operations. For instance, your CNTK model may use one set of operations while your target environment runs Caffe2 with a different set. As a result, despite using ONNX, you won't be able to run the model, and you are still tied to your training library.
Though plenty of frameworks and tools for model training have been developed, closing the gap between model training and ML production does not seem to be among their creators' concerns. In most cases, all you have is dependencies placed in your Python virtual environment, model files scattered across your filesystem, and a model architecture described by code in some cryptic Jupyter notebook filled with scary formulas and tensor math. That is a model, but not yet a product ready to serve.
There is a road to walk, together with DevOps and engineers, to finally deliver value to the consumers: business, operations, and the market. It is not the part of the journey Data Scientists like most, but they have to go through it. Moreover, it consumes most of their time, time that could otherwise be spent working with data, features, and models.
That’s the thing that really grinds my gears.