Feature Engineering pipeline for production ML

Deeptij
3 min read · Jul 25, 2021


Feature Engineering (FE) is a crucial step in the machine learning process: how do we represent data in a form that increases its predictive power while concentrating information in fewer features? The latter goal, commonly known as “dimensionality reduction”, reduces the computational cost of production-level ML models.
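As a toy illustration of dimensionality reduction (not from the article), the snippet below sketches PCA with plain numpy: 3-dimensional points are projected onto their two directions of highest variance, so downstream models see fewer features.

```python
import numpy as np

# Generate some toy 3-D data and centre it.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)

# The principal directions are the right singular vectors of the
# centred data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-2 components: 3 features become 2.
X_reduced = X @ Vt[:2].T
print(X_reduced.shape)  # (100, 2)
```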

Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.

— Prof. Andrew Ng.

An iterative process

Feature Engineering: an iterative process

In a production ML framework, FE should be seen as an iterative process that evolves as the model is optimised. We begin with the available features and transform/combine them using FE techniques. The parameters of the ML model are then optimized with the engineered features as input. Model performance can only be improved to a certain extent unless new information is added. New features are combined with the existing ones, transformed using FE techniques, and the model is retrained towards optimal performance.

Feature engineering techniques

The techniques most commonly used for FE are shown in the figure below:

Commonly applied feature engineering techniques

Looking at the techniques shown above, a common thread emerges: FE techniques map data from one vector space to another.
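To make this concrete, here is a hypothetical example (feature names invented for illustration) of two such maps: z-score scaling sends a real-valued feature to a standardised real value, while one-hot encoding sends a category id into a higher-dimensional binary space.

```python
import numpy as np

# Scaling: R -> R, standardised to zero mean and unit variance.
ages = np.array([22.0, 35.0, 58.0, 41.0])
z = (ages - ages.mean()) / ages.std()

# One-hot encoding: a category id in {0, 1, 2} -> a vector in {0, 1}^3.
cities = np.array([0, 2, 1, 2])
one_hot = np.eye(3)[cities]   # shape (4, 3), one 1 per row
```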

Feature Engineering pipeline

To build an FE pipeline we shall use TensorFlow Transform and the various components provided by the TensorFlow Extended (TFX) package.

Feature Engineering pipeline schema (https://www.tensorflow.org/tfx/guide)

ExampleGen

ExampleGen pipeline
  1. ExampleGen takes input in three formats: CSV, TFRecord and BigQuery
  2. It splits the data into training and evaluation datasets (default partition: 2/3 and 1/3)
  3. Both training and evaluation datasets are stored in “TFRecord” format. TFRecord optimizes read and write operations within TensorFlow, particularly for large datasets.
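The default 2/3–1/3 partition above can be mimicked with a simplified stand-in in plain Python (the real ExampleGen uses hash-based partitioning over TFRecords; this sketch only illustrates the split ratio):

```python
import random

# 90 stand-in records; ExampleGen would read these from CSV/TFRecord/BigQuery.
records = list(range(90))
random.Random(42).shuffle(records)

# Default partition: 2/3 train, 1/3 eval.
cut = (2 * len(records)) // 3
train, evaluation = records[:cut], records[cut:]
print(len(train), len(evaluation))  # 60 30
```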

Statistics Gen

Statistics Gen computes various statistics characterising the training and evaluation datasets. Behind the scenes it uses the TFDV package (https://medium.com/@deeptij2007/tensorflow-data-validation-tfdv-5e36fc74d19a).

Schema Gen

Schema Gen is used to infer the schema of the training and evaluation datasets. Like the Statistics Gen component, it uses the TFDV package under the hood. The usage of Schema Gen is described in the article https://medium.com/@deeptij2007/tensorflow-data-validation-tfdv-5e36fc74d19a.

Example Validator

This component detects anomalies in the training and evaluation datasets using the TFDV package. The Python code for the same can be found here: https://medium.com/@deeptij2007/tensorflow-data-validation-tfdv-5e36fc74d19a.
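Wiring these three components together in a TFX pipeline typically looks like the sketch below (a pipeline-definition fragment, assuming a TFX environment and an `example_gen` component defined earlier, as in the ExampleGen section):

```python
from tfx.components import StatisticsGen, SchemaGen, ExampleValidator

# Compute statistics over the splits produced by ExampleGen.
statistics_gen = StatisticsGen(
    examples=example_gen.outputs['examples'])

# Infer a schema from those statistics.
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True)

# Flag examples that deviate from the inferred schema.
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
```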

Transform

Schematic diagram showing the input and output of transform module

The Transform module uses the TensorFlow Transform library (https://www.tensorflow.org/tfx/transform/get_started). It takes as input the outputs from Example Gen and Schema Gen, plus a preprocessing module that implements the transformations.

The following are the outputs from the transform component:

  1. Transform Graph: graph that can perform the preprocessing operations
  2. transformed_examples: the preprocessed training and evaluation data
  3. updated_analyzer_cache: cached analyzer results saved from previous runs
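A minimal sketch of such a preprocessing module and the Transform component is shown below (assumes a TFX/TFT environment; the feature names 'age' and 'city' are hypothetical, and the module file path is an example):

```python
# preprocessing.py — module file passed to the Transform component.
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Defines the transformations baked into the Transform graph."""
    outputs = {}
    # Full-pass analyzer: scales using the mean/stddev of the whole dataset.
    outputs['age_scaled'] = tft.scale_to_z_score(inputs['age'])
    # Builds a vocabulary over the data, then maps strings to integer ids.
    outputs['city_id'] = tft.compute_and_apply_vocabulary(inputs['city'])
    return outputs
```

In the pipeline definition, the component is then wired up roughly as:

```python
from tfx.components import Transform

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file='preprocessing.py')
```

Because `preprocessing_fn` is traced into a graph, the exact same transformations are replayed at serving time, avoiding training/serving skew.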

The python code implementing the different components of FE pipeline can be found here https://github.com/deeptij2007/ML_datalifecycle_production.
