Putting Scikit-Learn models into production with PMML

Using Sklearn2pmml from pipelining to modeling to production

Xiao Wei
Oct 21, 2018

Predictive models are typically built in languages like R or Python. However, most websites and backend services are built in other languages, such as Java. How do you move a model built in Python or R into another language? Enter PMML.

PMML stands for “Predictive Model Markup Language”. It is an XML-based file format that serves as an intermediary between different programming languages. A model can be created in Python or R, saved as an XML file, and then handed off to a software engineer for production. Below is an excerpt from a PMML file. It is legible to a human reader; however, producing or editing such a file by hand would be difficult and is inadvisable.

PMML file sample
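To make the XML structure concrete, here is a minimal sketch that parses a tiny PMML fragment with Python's standard library. The fragment is hand-written for illustration (the field names are made up, and it is trimmed far below what a real, complete model file contains); it only shows the kind of markup a PMML file is made of.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed fragment for illustration only.
# A real PMML file also carries a header, the model itself, and more.
PMML_SAMPLE = """
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <DataDictionary>
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="churn" optype="categorical" dataType="integer"/>
  </DataDictionary>
</PMML>
""".strip()

# PMML elements live in a versioned XML namespace.
NS = {"pmml": "http://www.dmg.org/PMML-4_3"}


def list_data_fields(pmml_text):
    """Return (name, optype, dataType) for every field in the data dictionary."""
    root = ET.fromstring(pmml_text)
    fields = root.findall("pmml:DataDictionary/pmml:DataField", NS)
    return [(f.get("name"), f.get("optype"), f.get("dataType")) for f in fields]


for field in list_data_fields(PMML_SAMPLE):
    print(field)
```

Because the format is plain XML, any language with an XML parser can read it, which is what makes it usable as a hand-off format.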

The sklearn2pmml package can automatically package the sklearn data pipeline that feeds your model, along with the model itself, into a single PMML file!

Below is a simple sklearn data prep and grid search pipeline that ends in the creation of a PMML file. Note that the entire sklearn pipeline is saved in the PMML file, not just the model itself.

The code below creates an sklearn pipeline that preprocesses the data and uses grid search to create a model. The sklearn2pmml package takes that pipeline and creates a PMML file that encompasses all the steps in the sklearn pipeline.
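A minimal sketch of such a pipeline follows. It is not the original code from this post: the iris dataset, the logistic regression model, the parameter grid, and the helper names make_steps and export_to_pmml are all assumptions chosen for illustration. The sklearn2pmml imports are kept inside the export function because the conversion needs the sklearn2pmml package and a Java runtime, while the rest runs on sklearn alone.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_steps():
    """Preprocessing followed by a grid-searched classifier (illustrative choices)."""
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.1, 1.0, 10.0]},  # plain Python floats
        cv=3,
    )
    return [("scaler", StandardScaler()), ("classifier", search)]


def export_to_pmml(X, y, path):
    """Fit the same steps inside a PMMLPipeline and write them to a PMML file."""
    # Deferred imports: sklearn2pmml is optional and needs Java at conversion time.
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    pipeline = PMMLPipeline(make_steps())
    pipeline.fit(X, y)
    sklearn2pmml(pipeline, path)
    return pipeline


# The steps can be checked with a plain sklearn Pipeline before exporting.
X, y = load_iris(return_X_y=True)
check = Pipeline(make_steps()).fit(X, y)
print(check.named_steps["classifier"].best_params_)
```

With sklearn2pmml and Java installed, calling export_to_pmml(X, y, "iris.pmml") should write the scaler and the selected model into one file. Whether a given estimator or search wrapper can be converted depends on the installed sklearn2pmml version, so treat the choices here as a starting point.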

More advanced sklearn2pmml

sklearn2pmml is also capable of more sophisticated control over your pipelines. You can create new features, apply simple data transformations (such as a natural log transformation), impute nulls with user-specified values, and more, all within sklearn2pmml with some help from the sklearn-pandas package.

Below is a file that uses the sklearn2pmml package to create the imputation, standardization, light feature engineering, and model building pipeline. It is a little more involved than using just the sklearn pipeline functionality from above, but it can save time during the model production process and reduce errors in implementation.

The code below uses sklearn’s built-in pipelines along with sklearn-pandas and sklearn2pmml to…

  1. Create one variable that is a function of two other columns using sklearn2pmml’s ExpressionTransformer
  2. Impute missing values on certain columns and then apply a natural log transformation to them
  3. Concatenate the columns created in steps 1 and 2 with the rest of the columns for the rest of the pipeline
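The three steps above can be sketched roughly as follows. This is not the original code from this post: the column names (debt, income, balance, age, tenure), the fill value, and the function name build_feature_union are hypothetical. The expression syntax accepted by ExpressionTransformer has also changed between sklearn2pmml releases, so the X[0] style used here is an assumption to check against your installed version.

```python
def build_feature_union():
    """Build a FeatureUnion of three column groups (all column names hypothetical)."""
    # Deferred imports: sklearn-pandas and sklearn2pmml are optional third-party packages.
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import FeatureUnion
    from sklearn_pandas import DataFrameMapper
    from sklearn2pmml.preprocessing import ExpressionTransformer

    # Step 1: one new variable computed from two existing columns.
    ratio = DataFrameMapper([
        (["debt", "income"], ExpressionTransformer("X[0] / X[1]")),
    ])

    # Step 2: impute a user-specified value, then take the natural log.
    # The fill value is a plain Python float, not a NumPy scalar.
    logged = DataFrameMapper([
        (["balance"], [
            SimpleImputer(strategy="constant", fill_value=1.0),
            ExpressionTransformer("numpy.log(X[0])"),
        ]),
    ])

    # Step 3: pass the remaining columns through unchanged and concatenate.
    rest = DataFrameMapper([(["age", "tenure"], None)])

    return FeatureUnion([("ratio", ratio), ("logged", logged), ("rest", rest)])
```

Assigning the result with featureU = build_feature_union() gives a single transformer that accepts a pandas DataFrame containing those columns and returns the engineered features side by side.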

The code below uses the featureU created above to implement the rest of the pipeline and model.
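As a rough sketch of that last stage, assuming a feature union like featureU is passed in as an argument: the scaler, the random forest, and the helper names model_steps and fit_and_export are illustrative choices, not the original code. A stand-in identity feature union is used for the quick check so the block runs with sklearn alone.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def model_steps(feature_union):
    """Feature engineering, then standardization, then the model."""
    return [
        ("features", feature_union),
        ("scaler", StandardScaler()),
        ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]


def fit_and_export(feature_union, X, y, path):
    """Wrap the steps in a PMMLPipeline (outermost layer only) and write the file."""
    # Deferred imports: conversion needs sklearn2pmml and a Java runtime.
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    pipeline = PMMLPipeline(model_steps(feature_union))
    pipeline.fit(X, y)
    sklearn2pmml(pipeline, path)
    return pipeline


# Quick check with a stand-in feature union (identity transform on iris).
X, y = load_iris(return_X_y=True)
stand_in = FeatureUnion([("identity", FunctionTransformer())])
fitted = Pipeline(model_steps(stand_in)).fit(X, y)
print(fitted.predict(X[:5]))
```

In real use, the stand-in would be replaced by the actual feature union and a DataFrame holding the raw columns.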

For the full code please see my Github repository.

Below is a presentation by the creator of sklearn2pmml for reference on the more sophisticated pipelines.

Some last-minute tips…

  1. At the time of this writing, numeric arguments to sklearn2pmml functions require Python data types (int, float). Using NumPy datatypes (int64/32) will result in a cryptic error code when converting pipelines into PMML.
  2. PMMLPipeline should only be used as the outermost layer of the overall model pipeline. Nested pipelines should be plain sklearn Pipeline objects.
  3. In the near future it may make more sense to use sklearn’s newly added ColumnTransformer rather than sklearn_pandas’ DataFrameMapper.
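For the first tip, one way to avoid passing NumPy scalars by accident (for example, when parameters come out of a grid built with NumPy) is to convert them before handing them to sklearn2pmml. The helper below is a hypothetical sketch, not part of any library; it relies only on the fact that NumPy scalars expose an item() method that returns the equivalent built-in Python value.

```python
def to_builtin(value):
    """Recursively convert NumPy-style scalars to built-in Python types.

    Anything exposing a callable .item() (as NumPy scalars do) is unwrapped.
    Intended for scalars and simple containers, not multi-element arrays.
    """
    if isinstance(value, dict):
        return {key: to_builtin(val) for key, val in value.items()}
    if isinstance(value, list):
        return [to_builtin(val) for val in value]
    if isinstance(value, tuple):
        return tuple(to_builtin(val) for val in value)
    item = getattr(value, "item", None)
    return item() if callable(item) else value


# Values that are already built-in types pass through unchanged.
print(to_builtin({"fill_value": 1.0, "depths": [3, 5]}))
```

Running parameter dictionaries through a helper like this before building the pipeline keeps every numeric argument an int or float.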

Also, see this article which contains links to other articles I have written on Medium detailing advanced preprocessing steps using sklearn2pmml.
