Creating dummy variables from status codes in Sklearn and PMML using LookupTransformer, ExpressionTransformer, and Alias

Xiao Wei
2 min read · Apr 24, 2019

Sometimes raw data comes in the form of status codes. For example, the codes may arrive in four columns, where each value represents a category.

In the table below, each status code means the same thing regardless of which status column it appears in. Sklearn’s OneHotEncoder and LabelBinarizer will not transform these variables correctly, because they treat each column as a separate variable.

If OHE were applied to the table above, status code “201” would get two columns, “Status_1_201” and “Status_2_201”. The correct transformation produces only one dummy variable column for status code “201”.

The code below first reads in the three status code columns and uses sklearn2pmml’s LookupTransformer to transform them into dummy variable columns for whether each column contains ‘203’. ExpressionTransformer is then used to sum across the three freshly created dummy variable columns into a single column representing the total number of ‘203’ values across the original columns. Note that for status codes, a single row won’t have more than one of the same status code across all columns; therefore, ExpressionTransformer won’t create values other than 1 or 0.
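Stripped of the library machinery, the two steps described above reduce to a per-column lookup followed by a row-wise sum. The following is a minimal plain-Python sketch of that logic, not the sklearn2pmml API itself; the toy rows, the column names `Status 2` and `Status 3`, and the helper names `lookup` and `count_code` are assumptions made for illustration.

```python
# Toy data (assumed): three status columns, each holding one status code.
rows = [
    {"Status 1": "201", "Status 2": "203", "Status 3": "305"},
    {"Status 1": "203", "Status 2": "410", "Status 3": "201"},
    {"Status 1": "201", "Status 2": "305", "Status 3": "410"},
]
STATUS_COLUMNS = ["Status 1", "Status 2", "Status 3"]


def lookup(value, mapping, default_value=0):
    """Return the mapped value, or default_value when the key is absent."""
    return mapping.get(value, default_value)


def count_code(row, code):
    """Sum the per-column 0/1 indicators for one status code."""
    # Step 1 (lookup): one indicator per status column.
    indicators = [lookup(row[col], {code: 1}) for col in STATUS_COLUMNS]
    # Step 2 (expression): add the indicators into a single value.
    return sum(indicators)


has_203 = [count_code(row, "203") for row in rows]
print(has_203)  # [1, 1, 0]
```

Because a row is assumed to hold a given code at most once, the sum never exceeds 1 and behaves like an ordinary dummy variable.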

If the modeler wants to create dummy variables for more than one status code, the kind of code above should be generated programmatically rather than written by hand.
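One hedged way to do this is to build the transformer settings in a loop. The sketch below only produces plain dictionaries describing each step; `build_dummy_plan` and its dictionary keys are hypothetical, and turning the plan into actual LookupTransformer and ExpressionTransformer objects is left out. The `X[0] + X[1] + ...` string follows the positional indexing style that ExpressionTransformer expressions use.

```python
def build_dummy_plan(status_columns, codes):
    """Describe, per status code, the lookup steps and the summing expression."""
    plan = {}
    for code in codes:
        # One lookup per status column: the code maps to 1, anything else to 0.
        lookups = [
            {"column": col, "mapping": {code: 1}, "default_value": 0}
            for col in status_columns
        ]
        # Expression that adds the lookup outputs by position.
        expression = " + ".join(f"X[{i}]" for i in range(len(status_columns)))
        plan[code] = {"lookups": lookups, "expression": expression}
    return plan


plan = build_dummy_plan(["Status 1", "Status 2", "Status 3"], ["201", "203"])
print(plan["203"]["expression"])  # X[0] + X[1] + X[2]
```

Keeping the plan as data makes it easy to inspect before any transformers are constructed from it.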

As of this writing, you will run into an error if you use LookupTransformer on the same variable more than once. If an analyst were to edit the above code and add another set of status codes that reuse the same variables and transformers, they would get an error akin to the excerpt below when attempting to convert the Sklearn pipeline to a PMML pipeline.

SEVERE: Failed to convert
java.lang.IllegalArgumentException: lookup('Status 1')

Below is an example of how to get around this error. The transformers need to be wrapped in an Alias so that each transformer has a unique name for the JPMML converter.
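Since the names must not collide, one option is to derive each alias from both the column and the status code. This is a small plain-Python sketch of such a naming scheme, not sklearn2pmml code; `alias_name` and the `_is_` pattern are hypothetical, and the resulting strings are what would be supplied as the name when wrapping each transformer.

```python
from itertools import product


def alias_name(column, code):
    """Build a name that is unique per (column, status code) pair."""
    return f"{column.replace(' ', '_')}_is_{code}"


columns = ["Status 1", "Status 2", "Status 3"]
codes = ["201", "203"]

names = [alias_name(col, code) for col, code in product(columns, codes)]

# Reusing a column for several codes still yields distinct names.
assert len(names) == len(set(names))
print(names[:2])  # ['Status_1_is_201', 'Status_1_is_203']
```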
