Weight of Evidence Binning in Scikit-Learn & PMML

Xiao Wei
2 min read · Apr 18, 2019

Frequently in credit risk modeling, it makes sense to transform a continuous variable into one or more discrete variables. A binned variable lets a regression model treat the variable much as a tree-based model treats its splits. Another reason to bin may be to comply with lending laws. In the U.S., lenders can use age as part of a creditworthiness model; however, persons aged 62 and above cannot be penalized due to their age. Binning the age variable can help ensure compliance with the law.

Unfortunately, there isn't any support for weight of evidence binning in Sklearn. However, the sklearn2pmml package provides a CutTransformer. CutTransformer is intended to discretize a continuous variable into categorical labels rather than to perform weight of evidence binning. Even so, one can use it for binning purposes by specifying float values as the labels instead of string objects. CutTransformer is based on the pandas.cut function.

The first part of the process is to create the bin breaks and the bin values. This first part is exogenous to Sklearn and the PMML process. You can use whatever package or language you choose. I personally prefer R’s woeBinning package.
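As a rough illustration of what the bin values represent, here is a minimal sketch of computing a weight of evidence value per bin in plain Python. The bin edges and the good/bad counts are hypothetical, and the sign convention (log of the share of goods over the share of bads) is an assumption; some packages flip the sign or scale the result.

```python
import math

# Hypothetical counts of good and bad loans per bin (illustrative only).
bin_counts = {
    "(0, 1000]": {"good": 400, "bad": 100},
    "(1000, 2200]": {"good": 600, "bad": 50},
}

total_good = sum(c["good"] for c in bin_counts.values())
total_bad = sum(c["bad"] for c in bin_counts.values())

# Assumed convention: WoE = ln(share of goods in bin / share of bads in bin).
woe = {}
for name, c in bin_counts.items():
    pct_good = c["good"] / total_good
    pct_bad = c["bad"] / total_bad
    woe[name] = math.log(pct_good / pct_bad)

for name, value in woe.items():
    print(f"{name}: {value:.4f}")
```

A bin with a larger share of bads than goods gets a negative value under this convention, and a bin with a larger share of goods gets a positive one.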

Below is sample code that creates two bins for values ranging from 0 to 2,200. Note that values that fall outside this range will be encoded as np.nan. You can fill in these null values using SimpleImputer or the ContinuousDomain() functionality.
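Since CutTransformer is based on pandas.cut, the encoding can be sketched with pandas.cut directly. This is not the original sample code: the inner break at 1,000 and the two float bin values are hypothetical, and it is an assumption here that the same bins and labels would be passed on to CutTransformer in a PMML pipeline.

```python
import pandas as pd

# Hypothetical bin breaks and weight of evidence values (illustrative only).
bin_breaks = [0, 1000, 2200]
bin_values = [-0.51, 0.59]  # float labels instead of string labels

amounts = pd.Series([0, 1000, 1500, 2200, 2500], name="amount")

# include_lowest=True keeps the left edge (0) inside the first bin.
# Float labels come back as a categorical, so cast to float afterwards.
encoded = pd.cut(
    amounts,
    bins=bin_breaks,
    labels=bin_values,
    include_lowest=True,
).astype(float)

# 0 and 1000 fall in the first bin, 1500 and 2200 in the second,
# and 2500 lies outside the breaks, so it becomes NaN.
print(encoded)
```

The trailing NaN is the kind of value that would then be handled by an imputation step.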
