The Preprocessing Collection Project

A library of various transformers for all kinds of feature engineering

Chris Lemke
Geek Culture
3 min read · Sep 25, 2022


Picture by the author (actually by Stable Diffusion)

Normally you would expect this article to cover certain types of preprocessing or something similar. But today it is about something else.

Reuse and recycle ♺

Data preprocessing plays a major role in the daily life of every data scientist: formatting one column here, converting the values there, and so on. You know what I mean. It is good that there are Pandas, scikit-learn, etc.
If you work with tabular data, you can be happy to reuse the same preprocessing method every now and then. scikit-learn’s pipelines help to make the preprocessing steps formal and easy to implement. After all, the structure of the projects and processes is somehow similar: there is an X and a y, the X gets transformed, sometimes an encoder or similar is fitted, and in the end the product of the preprocessing steps ends up in a model. Of course, this oversimplifies the complexity, but I guess you get my point. If you have done it well, you can use the preprocessing steps that led to the training one-to-one in the production system.
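To make this concrete, here is a minimal sketch of such a pipeline. The column names, the encoder, and the model are illustrative assumptions, not taken from any specific project:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical tabular data: one categorical and one numerical feature.
X = pd.DataFrame({"city": ["Berlin", "Hamburg", "Berlin", "Munich"],
                  "age": [25, 40, 33, 51]})
y = [0, 1, 0, 1]

# Each preprocessing step is fitted on the training data ...
preprocessing = ColumnTransformer([
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("scale", StandardScaler(), ["age"]),
])

# ... and the whole pipeline, preprocessing plus model, can be
# reused one-to-one in the production system.
pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("model", RandomForestClassifier(random_state=42)),
])

pipeline.fit(X, y)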

As a software developer and as someone who doesn’t want to completely screw up this planet, I have learned:

“Reuse and recycle”

That’s why I’m especially happy when I can recycle my preprocessing code. Internally we have a small package where we collect, more or less organized, all our preprocessing methods. Of course, it is important that the implementations always share a clear and consistent API. Otherwise, recycling becomes a pain and does not bring the hoped-for simplicity of use.

A while ago I started to search for a similar collection of those kinds of steps. Of course, you can find some in the vastness of the Internet: Feature-engine, a great project, should be mentioned here as only one among many. However, I have not found a collection of more or less simple preprocessing methods that also contains more specific steps.

Why is there no simple collection of preprocessing methods to which everyone can contribute easily and quickly?

A formalized collection, so it is easy to use, easy to implement, and the steps can be combined to fit more complex preprocessing needs. A collection where participation is quick and uncomplicated and where data scientists can not only find new preprocessing steps but also get inspired by the ideas of others: “Ah, that’s quite a good idea. I will try this on my data too.”

Since I have not found such a thing, I had to start it myself: feature-reviser.

For now, the collection is pretty small, but I think together we can gather a lot of useful ideas and inspire each other. Even if datasets and projects are diverse, a lot of code can be shared between projects, companies, and people. So feel free to reuse, recycle, and contribute!

A short insight

scikit-learn’s fit/transform API is widely used: XGBoost, CatBoost, and the methods of Feature-engine all follow this structure, and so do the preprocessing steps of this collection. Here is an oversimplified example:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DummyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, string_to_replace: str, column: str) -> None:
        self.string_to_replace = string_to_replace
        self.column = column

    def fit(self, X=None, y=None) -> "DummyTransformer":
        # Nothing needs to be learned from the data, so fit just returns self.
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy so the caller's DataFrame is not mutated.
        X = X.copy()
        X[self.column] = X[self.column].replace(self.string_to_replace, "DUMMY!")
        return X

Yes! For those who didn’t know: it is that easy to formalize preprocessing steps and make them compatible with scikit-learn and friends.
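For illustration, here is how such a transformer could be used. The toy DataFrame and its column are made up for this sketch:

df = pd.DataFrame({"status": ["old", "new", "old"]})

transformer = DummyTransformer(string_to_replace="old", column="status")
print(transformer.fit_transform(df))
#    status
# 0  DUMMY!
# 1     new
# 2  DUMMY!

Because it follows the fit/transform API, the same object can also be dropped into a scikit-learn Pipeline like any built-in transformer.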

But enough reading. Just have a look at the project and check out the transformer collection. Use it, express and share your preprocessing ideas, check out the docs, and start recycling code! Together we can make everybody’s data science life a bit better.

Thanks for reading!
