Extreme/default value specification in Scikit-Learn & PMML utilizing ContinuousDomain

Xiao Wei
2 min read · Apr 18, 2019

Frequently, a modeler will want to limit the range of values a predictor can take. For example, many real-life variables, like income or time to completion, tend to be log-normally distributed, and the modeler may not want a very high value on one variable to overwhelm the other variables in the model's predictions. Likewise, many modelers don’t want predictor values at prediction time to fall outside the range of values seen during model training.

Scikit-Learn doesn’t provide a dedicated function for limiting extreme values, although you can always write your own custom function and wrap it in FunctionTransformer. However, custom functions are not supported if you want to port the model into a PMML file.
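As a rough illustration of the custom-function route, the sketch below wraps NumPy's `np.clip` in a `FunctionTransformer`. The helper name `clip_to_range` and the bounds 300 and 1500 are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def clip_to_range(X):
    # Hypothetical helper: cap every value to the closed range [300, 1500].
    return np.clip(X, 300, 1500)


clipper = FunctionTransformer(clip_to_range)

X = np.array([[100.0], [500.0], [2000.0]])
clipped = clipper.fit_transform(X)
print(clipped.ravel())  # 100 is raised to 300, 2000 is lowered to 1500
```

This works fine inside a regular Scikit-Learn pipeline, but the Python function itself is the part that cannot be carried over to PMML.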

The sklearn2pmml package provides the ContinuousDomain class, which lets the user specify minimum, maximum, and default values for one or more variables.

In the example below, the ContinuousDomain pipeline step sets all values less than 300 to 300 and all values greater than 1500 to 1500. It also sets the specified missing values (np.nan and -1) to 350. I have included a couple of subsequent pipeline steps as an example of how ContinuousDomain can be incorporated into a larger pipeline.

Sometimes you will want to encode a binary variable indicating whether or not a value is null. In regular Sklearn/Python you can use `pd.isnull` along with `FunctionTransformer` to create the column. For sklearn2pmml, `ExpressionTransformer` can use the pandas `isnull` function. See the gist below for the step that creates a binary column from a column with null values.
