Published in ManoMano Tech Team

A framework for feature engineering and machine learning pipelines

This blog post presents a simple yet efficient framework to structure machine learning pipelines and aims to avoid the following pitfalls:

  • Spaghetti code
  • Long debugging
  • Asymmetries between training and inference

Instead, it aims at:

  • Enhancing code maintainability
  • Improving iteration speed
  • Making transition to production easier

We refined this framework through experiments both in data science competitions and at ManoMano (a European DIY & Gardening marketplace with ~1M daily users).

This post introduces two core concepts at the same time: feature engineering (the process of transforming raw data into meaningful features to feed the desired algorithm) and the machine learning pipeline (the sequential data transformation workflow from data collection to prediction). Covering both at once might seem to complicate things. However, the two are heavily linked, so it is key to understand one while keeping the other in mind. They must be applied in coordination for your project to succeed.

Rather than giving a theoretical presentation, we explain our generic strategy through examples. To do so, we are going to build a pipeline to predict the conversion rate of a product (number of conversions / number of views).

Can we expect a good conversion rate for these products?

We will iteratively enrich and increase the complexity of our machine learning pipeline by:

  1. Starting simple with a single dataset
  2. Adding a second dataset
  3. Adding more datasets

Alongside building it, we will incorporate different families of feature engineering and give examples for each of them.

Start simple with a main dataset

A classic starting point is to extract a dataset at the granularity of the desired prediction; let's call it the main dataset. In our case it is a dataset with primary keys (date, product_id), the product’s information (e.g. category_id, price, title, description, delivery time, …) and a column indicating whether the product was purchased or not when it was viewed. This prediction-granularity dataset is most likely what you have faced at school, in MOOCs and in data science competitions.

Here is what a basic pipeline will look like:

  1. Extraction: Extract data from a source (database, CSV file, API, …), e.g. the product’s information such as price, title, description, delivery time, …
  2. Preprocessing: Clean the data (e.g. rename columns, clean strings, decide whether or not to handle missing values, …)
  3. Row-wise Feature Engineering: Enrich existing features by creating new columns (e.g. add the shipping price to the price to get the full price of a product, compute the description length, …)
  4. Training: Train whatever model you want (the easiest part for sure!)

Our first pipeline schema (without training) looks like this:

The Python code looks like this:

import extract
import preprocessing as prep
import feature_engineering as fe
# One block: extract, preprocess, then row-wise feature engineering
products = extract.extract_product()
products = prep.preprocessing_product(products)
products = fe.fe_product(products)
# columns_to_keep and my_model are assumed to be defined elsewhere
X = products[columns_to_keep]
y = products['cr']
my_model.fit(X, y)

At this step, any column-wise operation, such as a group by or a window function, is strictly prohibited! We will discuss why later on.
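As an illustration of what "row-wise" means here, the following is a minimal sketch of what a function like fe_product could contain. The column names (shipping_price, description) and the toy data are assumptions made for the example, not the actual ManoMano schema.

```python
import pandas as pd


def fe_product(products: pd.DataFrame) -> pd.DataFrame:
    """Row-wise features only: each new value depends on its own row."""
    out = products.copy()
    # Full price paid by the customer (hypothetical column names).
    out["full_price"] = out["price"] + out["shipping_price"]
    # Length of the description; a missing description counts as empty.
    out["description_length"] = out["description"].fillna("").str.len()
    return out


# Toy data, invented for the example.
products = pd.DataFrame(
    {
        "product_id": [1, 2, 3],
        "price": [10.0, 25.0, 99.0],
        "shipping_price": [2.0, 0.0, 5.0],
        "description": ["Cordless drill", None, "PVC pipe 32 mm"],
    }
)

features = fe_product(products)
# Transforming one row on its own gives the same values as transforming
# it inside the full dataset, because no other row is involved.
single_row = fe_product(products.iloc[[1]])
```

Since no value depends on other rows, the same function can be applied to one product or to millions without changing the result.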

Add a new dataset

Consider an additional dataset that relates to product ratings. It cannot be merged directly because products and ratings form a one-to-many relationship. A simple merge would change the granularity of the dataset from (date, product_id) to (date, product_id, rating_id). To solve this problem, an aggregation is needed to keep one row per product. For example, we can compute the average rating and the number of ratings for each product.
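A minimal sketch of such an aggregation with pandas is shown below. The ratings table and its column names are invented for the example, and for simplicity it aggregates by product_id only, ignoring the date key.

```python
import pandas as pd

# Toy ratings table: several ratings per product (one-to-many).
ratings = pd.DataFrame(
    {
        "rating_id": [1, 2, 3, 4],
        "product_id": [10, 10, 10, 20],
        "rating": [5, 4, 3, 2],
    }
)

# Aggregate down to one row per product so that a later merge
# cannot multiply the rows of the main dataset.
ratings_by_product = ratings.groupby("product_id", as_index=False).agg(
    avg_rating=("rating", "mean"),
    n_ratings=("rating", "count"),
)
```

The result has exactly one row per product_id, which is what makes it safe to merge onto the main dataset.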

We can see this new dataset as a new “block”, which must respect the same pattern as previously described (extract, preprocessing, feature engineering). As soon as this new dataset is compatible with the prediction granularity (date, product_id), it is ready to be merged with the main dataset.

Tip: A good sanity check to validate that the granularity is respected is to verify that the number of rows is the same before and after any merge with the main dataset.
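One possible way to automate this check, sketched here on toy data, is to combine the row-count assertion with the validate argument of pd.merge, which raises an error when the right-hand keys are not unique. The use of validate is a suggestion of this sketch, not part of the original pipeline.

```python
import pandas as pd

# Toy data, invented for the example.
products = pd.DataFrame({"product_id": [1, 2, 3], "price": [10.0, 25.0, 99.0]})
ratings = pd.DataFrame({"product_id": [1, 2], "avg_rating": [4.5, 3.0]})

nrow_begin = products.shape[0]
learning = pd.merge(
    products, ratings, on="product_id", how="left", validate="many_to_one"
)
# Same number of rows before and after the merge.
assert learning.shape[0] == nrow_begin

# A dataset with duplicated keys is rejected before it can add rows.
bad_ratings = pd.DataFrame({"product_id": [1, 1], "avg_rating": [4.5, 2.0]})
try:
    pd.merge(
        products, bad_ratings, on="product_id", how="left", validate="many_to_one"
    )
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True
```

Products without any rating simply get a missing avg_rating, which is the expected behaviour of a left merge.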

Our current pipeline schema looks like this:

The Python code looks like this:

import pandas as pd
import extract
import preprocessing as prep
import feature_engineering as fe
# Block 1: products (main dataset)
products = extract.extract_product()
products = prep.preprocessing_product(products)
products = fe.fe_product(products)
# Block 2: ratings, aggregated to the prediction granularity
ratings = extract.extract_ratings()
ratings = prep.preprocessing_ratings(ratings)
ratings = fe.fe_ratings(ratings)
# Merge the blocks into the learning set
learning = products
nrow_begin = learning.shape[0]
learning = pd.merge(learning, ratings, how='left')
learning = fe.fe_learning(learning)
# Sanity check: merging must not change the number of rows
assert nrow_begin == learning.shape[0]
X = learning[columns_to_keep]
y = learning['cr']
my_model.fit(X, y)

We obtain a learning set. We call this dataset a “learning set” to keep it distinct from the training, validation and test sets, which are more common terms but are tied to the cross-validation framework.

By doing so, it is clear where your data comes from and which transformation is performed at each step. It avoids mixing the extract, preprocessing and feature engineering steps. The philosophy is to never postpone what can be done earlier: transform your data at the earliest possible stage.

Note: The learning set also has its own feature engineering, namely contextual feature engineering. We will discuss this later on.

Add more datasets

Let’s now consider additional datasets and their possible transformations:

  • Product attributes (e.g. power of a drill, diameter of a pipe, depth of a swimming pool), which can be one-hot encoded or summarized by counting the number of attributes found
  • New datasets created from existing ones with column-wise operations, such as group by and window functions, at different granularities, to compute column-wise features such as:
      • Historical conversion rate by product
      • Historical conversion rate by category, region, brand, …
      • Average product price over the last K days

Russian dolls, an illustration of creating new datasets from an existing one with different group by granularities.
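To make the historical features more concrete, here is a minimal sketch of a historical conversion rate by product, built as its own dataset at the (date, product_id) granularity. The daily table, its column names and the choice to exclude the current day from the history are assumptions made for the example.

```python
import pandas as pd

# Toy daily statistics per product, invented for the example.
daily = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01", "2021-01-02"]
        ),
        "product_id": [1, 1, 1, 2, 2],
        "views": [100, 50, 50, 10, 30],
        "conversions": [10, 5, 0, 1, 0],
    }
).sort_values(["product_id", "date"])

grouped = daily.groupby("product_id")
# Cumulative totals per product minus the current day, so that each row
# only uses strictly earlier days (the current day is what we predict).
past_views = grouped["views"].cumsum() - daily["views"]
past_conversions = grouped["conversions"].cumsum() - daily["conversions"]

# New dataset keyed by (date, product_id), ready to be merged later.
hist_cr = daily[["date", "product_id"]].copy()
hist_cr["hist_cr_product"] = past_conversions / past_views
```

The first day of each product has no history, so its value is missing rather than zero. The same pattern with a different group by key (category, region, brand) gives the other granularities.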

Keeping the principle that each dataset is independent, we can scale our pipeline horizontally. Our code is modular, which eases maintainability. It is extremely simple to understand how each feature is obtained. If a problem arises, it is easy to trace the error back to its root cause. To sum up, we avoid mixing preprocessing, feature engineering and merging steps.

Our final pipeline looks something like this:

# previous code
learning = products
nrow_begin = learning.shape[0]
for df in list_df_to_merge:
    learning = pd.merge(learning, df, how='left')
learning = fe.fe_learning(learning)
# Sanity check: merging must not change the number of rows
assert nrow_begin == learning.shape[0]
X = learning[columns_to_keep]
y = learning['cr']
my_model.fit(X, y)
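One way to package the merging loop, sketched below under our own assumptions, is a small helper that takes the main dataset and a list of (dataset, keys) pairs and checks the row count after every merge. The name build_learning_set, the explicit merge keys and the toy data are hypothetical and not part of the original code.

```python
import pandas as pd


def build_learning_set(main, blocks):
    """Left-merge each (dataframe, keys) block onto the main dataset.

    Raises AssertionError as soon as a merge changes the number of rows.
    """
    learning = main
    nrow_begin = learning.shape[0]
    for df, keys in blocks:
        learning = pd.merge(learning, df, on=keys, how="left")
        assert learning.shape[0] == nrow_begin, f"merge on {keys} changed the row count"
    return learning


# Toy blocks, invented for the example.
products = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-03-10", "2021-03-10", "2021-03-11"]),
        "product_id": [1, 2, 1],
        "category_id": [7, 8, 7],
    }
)
ratings = pd.DataFrame({"product_id": [1, 2], "avg_rating": [4.5, 3.0]})
category_prices = pd.DataFrame(
    {"category_id": [7, 8], "avg_category_price": [10.0, 40.0]}
)

learning = build_learning_set(
    products,
    [(ratings, ["product_id"]), (category_prices, ["category_id"])],
)
```

Checking after each merge, rather than once at the end, points directly at the block that broke the granularity. Calling the same helper at training time and at inference time is one way to keep the two pipelines symmetrical.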

We previously mentioned contextual feature engineering in the learning set. Contextual feature engineering is a special type of row-wise feature engineering: it relies on multiple datasets to compute row-wise operations.

Examples of contextual feature engineering, which has to be done in the learning set because it depends on multiple datasets:

  • Difference between current date and last purchase date (number of days since last purchase)
  • Difference between current date and birthday (age of a product)
  • Difference between current price and average price (price trend)
  • Difference between average category price and product price
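A minimal sketch of what fe_learning could look like for two of these features follows. It assumes the learning set already contains columns merged from other blocks (last_purchase_date, avg_category_price); these names and the toy data are hypothetical.

```python
import pandas as pd


def fe_learning(learning: pd.DataFrame) -> pd.DataFrame:
    """Contextual features: row-wise operations across merged blocks."""
    out = learning.copy()
    # Number of days since the last purchase (missing if never purchased).
    out["days_since_last_purchase"] = (
        out["date"] - out["last_purchase_date"]
    ).dt.days
    # Product price relative to the average price of its category.
    out["price_vs_category"] = out["price"] - out["avg_category_price"]
    return out


# Toy learning set after all merges, invented for the example.
learning = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-03-10", "2021-03-10"]),
        "product_id": [1, 2],
        "price": [12.0, 30.0],
        "last_purchase_date": pd.to_datetime(["2021-03-01", None]),
        "avg_category_price": [10.0, 40.0],
    }
)

learning = fe_learning(learning)
```

Each operation is still row-wise, but its inputs come from different source datasets, which is why it can only run once everything has been merged.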

Summary and conclusion

Throughout this post, we detailed how to standardize pipelines for machine learning projects.

We learnt:

  • How to cleanly separate datasets and the specific operations (extract, preprocessing, feature engineering) expected at each step
  • The differences between row-wise, historical/aggregated and contextual feature engineering

Overall, respecting this framework will boost your productivity by making your code more understandable, easing debugging and naturally leading you to a symmetrical training/inference pipeline. Spending less time on these time-consuming tasks will give you more time to let your human creativity shine on business problem solving.

Acknowledgements

Thanks to Alexandre Cazé, Romain Ayres, Baptiste Rocca, Yohan Grember, Marin de Beauchamp, Hugo Epicier, Antoine Vastel for many corrections and suggestions.

Jacques Peeters

Senior Data Scientist @ ManoMano | Kaggle Competition Expert https://www.linkedin.com/in/j4cquespeeters/
