(WOE) Binning Discrete Values and Avoiding the Dummy Variable Trap in Scikit-Learn & PMML

Xiao Wei
Apr 19, 2019

One-hot encoding of categorical variables comes up in almost every model. Sklearn provides two classes that can handle categorical variables: LabelBinarizer and OneHotEncoder, which was reworked in v0.20 to accept string categories directly.

OneHotEncoder is more fully featured than LabelBinarizer and is the recommended choice between the two when PMML is not a consideration. One big advantage OneHotEncoder (“OHE”) has over LabelBinarizer is how it handles new, unknown values at prediction time. If LabelBinarizer encounters a value that it did not see during training, it will throw an error. With OHE, the modeler can pass ‘ignore’ to the ‘handle_unknown’ parameter, and OHE will output zeros in all transformed columns for that value (which should be the correct behavior).
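
As a minimal sketch of the ‘handle_unknown’ behavior, assuming scikit-learn 0.20 or later (the color values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: one categorical column with three known values.
train = np.array([["red"], ["green"], ["blue"]])

# "purple" never appears in the training data.
new = np.array([["green"], ["purple"]])

# handle_unknown="ignore" encodes unseen values as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; convert it to a dense array for display.
encoded = encoder.transform(new).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # the row for "purple" contains only zeros
```

With the default handle_unknown='error', the same transform call would raise an error on "purple".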

Sklearn does not provide a built-in transformer for mapping discrete values to other discrete values (bucketing).

This problem can be solved by combining either LabelBinarizer or OneHotEncoder with the LookupTransformer class from sklearn2pmml.

LookupTransformer maps raw values to their transformed (bucketed) values. A ColumnTransformer can then drop the first column produced by the lookup and binarization steps, which avoids the dummy variable trap. (You don’t need to do this if you are using OneHotEncoder, as you can use its ‘drop’ parameter instead.)

It is recommended that LookupTransformer steps be preceded by null imputation, as the pipeline will run very slowly when it encounters null values.
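
One way to do the imputation, sketched here with sklearn's SimpleImputer; the placeholder string "missing" and the sample values are assumptions for illustration, not something the original pipeline prescribes:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one null value.
raw = np.array([["Boston"], [np.nan], ["Miami"]], dtype=object)

# Replace nulls with a constant placeholder before any lookup step runs.
imputer = SimpleImputer(strategy="constant", fill_value="missing")
filled = imputer.fit_transform(raw)

print(filled.ravel())
```

The placeholder can then be given its own entry in the lookup mapping, or left to fall through to the default value.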
