Mocking a Heuristic as a scikit-learn Estimator

Eric Ness
Published in When I Work Data
Sep 27, 2018

scikit-learn models are often embedded in data pipelines to make predictions on the data passing through the pipeline. For any production pipeline, a model needs to be trained on historical data to provide accurate predictions. However, in some cases it makes sense to process the pipeline data using a simple rule that doesn’t take into account any previous data. A few cases where this would be useful are:

Pipeline MVP

When a data science project is just starting, it is critical to move as quickly as possible toward a minimum viable product (MVP). This MVP will contain all of the components of the final data product, but each will only be minimally functional. Once the project hits this point, it is faster to iterate on and improve an already existing pipeline. Since a trained model takes significant time and energy to create, placing a mock model in the data pipeline lets the data engineers begin their work before the data scientists have finished training the final model.

Performance Baseline

Another case where a mock model is useful is in establishing the minimum performance a real model must reach to be valuable. For example, if the model is trying to predict which customers will leave and which will stay, a naive model might predict that every customer will stay. While that rule can have high accuracy, its precision will be poor. Any viable model will need to beat the naive model’s performance. A mock model that follows a simple rule and plugs into the same analysis code lets data scientists measure its performance in exactly the same way they measure it for real models.
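As a rough illustration with made-up churn labels (1 = leaves, 0 = stays), scikit-learn’s metrics make the gap between accuracy and precision obvious:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical labels: most customers stay, a few leave.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_naive = np.zeros_like(y_true)   # the naive rule: predict "stays" for everyone

print(accuracy_score(y_true, y_naive))                    # 0.8
print(precision_score(y_true, y_naive, zero_division=0))  # 0.0 -- never catches a churner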

Mock Estimator

Now that we know the purpose behind creating a mock model, let’s look at how to accomplish it in practice. The following code creates the MockBinaryClassifier class, which acts as a binary predictive model. The model follows a simple rule: if the value of the first feature is less than or equal to 0, it returns class 0; if the first feature is greater than 0, it returns class 1. All the code in this story is available as a single script on GitHub.
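A minimal sketch of the class looks like this (the docstring and internals here are illustrative; the essential parts are the heuristic itself and inheriting from BaseEstimator):

import numpy as np
from sklearn.base import BaseEstimator


class MockBinaryClassifier(BaseEstimator):
    """Mock binary classifier driven by a fixed heuristic rather than training:
    class 1 if the first feature is greater than 0, otherwise class 0."""

    def fit(self, X, y=None):
        # Intentionally a no-op: the heuristic ignores historical data.
        self.classes_ = np.array([0, 1])
        return self

    def predict(self, X):
        # Apply the rule to the first feature of each instance.
        X = np.asarray(X)
        return np.where(X[:, 0] > 0, 1, 0)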

There are a few interesting things to note about MockBinaryClassifier. The first is that the class inherits from sklearn.base.BaseEstimator, which means it can be used anywhere a scikit-learn estimator is used. Another point of interest is that the fit function is essentially a no-op. Since we already know how instances will be classified, there is no need to take any historical data into account.

Demonstration

Here is some example code to show MockBinaryClassifier in action:
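Something along these lines will do the trick; the single-feature values below are arbitrary, chosen only so they straddle zero:

import numpy as np

features = np.array([
    [-1.0],
    [ 2.5],
    [ 0.3],
    [ 0.0],
])

model = MockBinaryClassifier()
model.fit(features)            # no-op, but keeps the familiar estimator workflow
print(model.predict(features))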

[0 1 1 0]

The example shows the predict function scoring a set of one-feature instances. You might object that this doesn’t demonstrate that the class can be used like any other scikit-learn estimator, since the code would work even without inheriting from sklearn.base.BaseEstimator. Here is code that shows MockBinaryClassifier working inside a scikit-learn pipeline:
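In sketch form, with a MaxAbsScaler standing in for whatever preprocessing the pipeline actually needs (it scales by the maximum absolute value, so it preserves the sign of the first feature and leaves the heuristic’s output unchanged):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

pipeline = Pipeline([
    ("scale", MaxAbsScaler()),          # sign-preserving preprocessing step
    ("model", MockBinaryClassifier()),  # drop-in replacement for a trained estimator
])

pipeline.fit(features)
print(pipeline.predict(features))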

[0 1 1 0]

The same MockBinaryClassifier will work as a replacement for a scikit-learn estimator in a data pipeline of arbitrary complexity. The only potential change necessary would be implementing additional methods such as predict_proba, depending on what the pipeline calls.
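For instance, if a downstream step expects probability estimates, a minimal (hypothetical) extension could simply echo the heuristic as hard 0/1 probabilities:

class MockBinaryClassifierWithProba(MockBinaryClassifier):
    def predict_proba(self, X):
        # Hard 0/1 "probabilities" that mirror the heuristic's predictions.
        predictions = self.predict(X)
        return np.column_stack([1 - predictions, predictions]).astype(float)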

One caveat to be aware of: if the data pipeline loads models from persistent storage using pickle, the MockBinaryClassifier class will need to be importable in the loading environment.
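For example, if the pipeline is persisted with pickle (the file name here is just illustrative), the class definition has to be available wherever the file is read back:

import pickle

# Save the mock exactly as a trained model would be saved.
with open("mock_model.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# Loading only works if MockBinaryClassifier is importable here,
# e.g. from a shared module on the PYTHONPATH.
with open("mock_model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict(features))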

Conclusion

Creating mock models from a heuristic is an excellent way to remove bottlenecks in the development cycle. It lets data engineers and data scientists work in parallel, without the engineers waiting for a real model to be built. It also allows data scientists to set a baseline performance standard using the exact evaluation code they will use for the real model.


Eric Ness
Principal Machine Learning Engineer at C.H. Robinson, a Fortune 250 supply chain company.