A framework for feature engineering and machine learning pipelines

Jacques Peeters
Apr 17, 2020

This blog post presents a simple yet efficient framework to structure machine learning pipelines and aims to avoid the following pitfalls:

  • Spaghetti code
  • Long debugging
  • Asymmetries between training and inference

On the opposite, it aims at:

  • Enhancing code maintainability
  • Improving iteration speed
  • Making transition to production easier

We refined this framework through experiments both in data science competitions and at ManoMano (a European DIY & Gardening marketplace with ~1M daily users).

This post introduces two core concepts at the same time: feature engineering (the process of transforming raw data into meaningful features to feed the desired algorithm) and the machine learning pipeline (the sequential data transformation workflow from data collection to prediction). Discussing both at once might seem to complicate things, but it is key to understand one while keeping the other in mind because they are heavily linked. They must be applied in coordination for your project to succeed.

Rather than giving a theoretical presentation, we decided to explain our generic strategy by example. To do so, we are going to build a pipeline to predict the conversion rate of a product (number of conversions / number of views).

Can we expect a good conversion rate for these products?

We will iteratively enrich and increase the complexity of our machine learning pipeline by:

  1. Starting simple with a single dataset
  2. Adding a second dataset
  3. Adding more datasets

Alongside building it, we will incorporate different families of feature engineering and give examples for each of them.

Start simple with a main dataset

Here is what a basic pipeline will look like:

  1. Extraction: Extract data from a source (database, CSV file, API, …), e.g. product information such as price, title, description, delivery time, … (steps 1 and 2 are sketched after this list)
  2. Preprocessing: Clean the data, e.g. rename columns, clean strings, handle (or not) missing values, …
  3. Row-wise Feature Engineering: Enrich existing features by creating new columns, e.g. add the shipping price to the price to get the full price of a product, compute the description length, …
  4. Training: Train whatever model you want (the easiest part for sure!)
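
For illustration, here is a minimal sketch of what the first two steps (extraction and preprocessing) could look like for the product dataset. The CSV source and the title column are assumptions, not the actual ManoMano code:

import pandas as pd


def extract_product() -> pd.DataFrame:
    # Hypothetical source: a CSV export; in practice this could just as
    # well be a database query or an API call.
    return pd.read_csv("products.csv")


def preprocessing_product(products: pd.DataFrame) -> pd.DataFrame:
    # Cleaning only: rename columns, clean strings, cast types.
    products = products.rename(columns=str.lower)
    products["title"] = products["title"].str.strip()
    return products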

Our first pipeline schema (without training) looks like this:

The Python code looks like this:

import extract
import preprocessing as prep
import feature_engineering as fe

products = extract.extract_product()
products = prep.preprocessing_product(products)
products = fe.fe_product(products)

# columns_to_keep is the list of feature columns; 'cr' is the conversion rate target
X = products[columns_to_keep]
y = products['cr']
my_model.fit(X, y)

At this step, any column-wise operations, such as group by or window functions, are strictly prohibited! We will discuss why later on.
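
For concreteness, here is a minimal sketch of what a row-wise feature engineering function such as fe_product could contain, restricted to row-wise operations in line with the rule above. The columns price, shipping_price and description are assumptions for illustration:

import pandas as pd


def fe_product(products: pd.DataFrame) -> pd.DataFrame:
    # Row-wise only: each new column is derived from values on the same row.
    products = products.copy()
    products["full_price"] = products["price"] + products["shipping_price"]
    products["description_length"] = products["description"].str.len()
    return products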

Add a new dataset

We can see this new dataset as a new “block”, which must respect the same pattern as previously described (extract, preprocessing, feature engineering). As soon as this new dataset is compatible with the prediction granularity (date, product_id), it is ready to be merged with the main dataset.

Tip: A good sanity check to validate that the granularity is respected is to verify that the number of rows is the same before and after any merge with the main dataset.
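
One way to make this check systematic is to wrap every merge in a small helper. This is a sketch; safe_merge is a hypothetical name, not part of the pipeline below:

import pandas as pd


def safe_merge(main: pd.DataFrame, other: pd.DataFrame, **merge_kwargs) -> pd.DataFrame:
    # A dataset merged at the wrong granularity would duplicate rows in the
    # main dataset; asserting on the row count catches this immediately.
    nrow_begin = main.shape[0]
    merged = pd.merge(main, other, how="left", **merge_kwargs)
    assert nrow_begin == merged.shape[0], "Granularity mismatch: row count changed after merge"
    return merged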

Our current pipeline schema looks like this:

The Python code looks like this:

import pandas as pd

import extract
import preprocessing as prep
import feature_engineering as fe

products = extract.extract_product()
products = prep.preprocessing_product(products)
products = fe.fe_product(products)

ratings = extract.extract_ratings()
ratings = prep.preprocessing_ratings(ratings)
ratings = fe.fe_ratings(ratings)

learning = products
nrow_begin = learning.shape[0]
learning = pd.merge(learning, ratings, how='left')
learning = fe.fe_learning(learning)
assert nrow_begin == learning.shape[0]

X = learning[columns_to_keep]
y = learning['cr']
my_model.fit(X, y)

We obtain a learning set. We refer to this dataset as a “learning set” to keep it independent of the training, validation and test sets, which are more common terms but relate to the cross-validation framework.

By doing so, it is clear where your data comes from and which transformations are performed at each step. It avoids mixing the extract, preprocessing and feature engineering steps. The philosophy is to not postpone what can be done earlier: transform your data at the earliest stage possible.

Note: The learning set also has its own feature engineering, namely contextual feature engineering. We will discuss this later on.

Add more datasets

Following the same block pattern, we can keep adding datasets, for example:

  • Product attributes (e.g. power of a drill, diameter of a pipe, depth of a swimming pool), which can be one-hot encoded or summarized by counting the number of attributes found
  • Datasets created from existing ones with column-wise operations, such as group by and window functions, at different granularities (see the sketch below), to compute features such as:
      • Historical conversion rate by product
      • Historical conversion rate by category, region, brand, …
      • Average product price over the last K days

Russian dolls: an illustration of creating new datasets from existing ones with different group-by granularities.
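
As an illustration, here is a minimal sketch of how such an aggregated dataset could be built as its own block before being merged into the learning set. The column names (category_id, nb_views, nb_conversions) and the function name build_category_cr are assumptions, not the actual ManoMano code:

import pandas as pd


def build_category_cr(history: pd.DataFrame) -> pd.DataFrame:
    # Aggregate a historical dataset at a coarser granularity (category).
    # The history is assumed to be extracted upstream on a past time window,
    # so the resulting feature does not leak the target.
    agg = (
        history
        .groupby("category_id", as_index=False)
        .agg(nb_views=("nb_views", "sum"), nb_conversions=("nb_conversions", "sum"))
    )
    agg["category_cr_hist"] = agg["nb_conversions"] / agg["nb_views"]
    return agg[["category_id", "category_cr_hist"]]

Like any other block, the resulting dataset would then be left-joined onto the main dataset, with the row-count assertion as a guard.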

Keeping the principle that each dataset is independent, we can scale our pipeline horizontally. Our code is modular, thus easing maintainability. It is extremely simple to understand how each feature is obtained. If a problem arises, it is trouble-free to trace errors back to the root cause. To sum it up, we avoid mixing preprocessing, feature engineering and merging steps.

Our final pipeline looks something like this:

# ... previous code ...

learning = products
nrow_begin = learning.shape[0]
for df in list_df_to_merge:
    learning = pd.merge(learning, df, how='left')
learning = fe.fe_learning(learning)
assert nrow_begin == learning.shape[0]

X = learning[columns_to_keep]
y = learning['cr']
my_model.fit(X, y)

I previously mentioned contextual feature engineering in the learning set. Contextual feature engineering is a special type of row-wise feature engineering: it relies on multiple datasets to compute row-wise operations.

Examples of contextual feature engineering which have to be done on the learning set, because they depend on multiple datasets (a sketch follows the list):

  • Difference between current date and last purchase date (number of days since last purchase)
  • Difference between current date and birthday (age of a product)
  • Difference between current price and average price (price trend)
  • Difference between average category price and product price
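
To illustrate, here is a minimal sketch of what such contextual features could look like inside fe_learning, once all datasets have been merged. Column names such as current_date, last_purchase_date, price and category_avg_price are assumptions:

import pandas as pd


def fe_learning(learning: pd.DataFrame) -> pd.DataFrame:
    # Contextual features are still row-wise, but they combine columns
    # originating from different upstream datasets, so they can only be
    # computed on the merged learning set.
    learning = learning.copy()
    # Assumes current_date and last_purchase_date are datetime columns.
    learning["days_since_last_purchase"] = (
        learning["current_date"] - learning["last_purchase_date"]
    ).dt.days
    learning["price_vs_category_avg"] = learning["price"] - learning["category_avg_price"]
    return learning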

Summary and conclusion

We learnt:

  • How to cleanly separate datasets and the specific operations (extract, preprocessing, feature engineering) expected at each step
  • The differences between row-wise, historical/aggregated and contextual feature engineering

Overall, respecting this framework will boost your productivity by making your code more understandable, easing debugging, and naturally leading to a symmetrical training/inference pipeline. Spending less time on these time-consuming tasks will give you more time to let your creativity shine on business problem solving.

