MLOps: Data (Features, Versioning and Feature Store)

Sammer Puran
5 min read · Dec 22, 2023


As previously discussed, in a production setup we need to make sure that data is continuously ingested into our cloud platform. This is done via ETL, so that our machine learning model has data to train on. The extracted data is put into a staging area for further processing. At this stage the data is raw: in order to be usable, certain transformation steps need to be applied (a small pandas sketch follows the list). These can for example include:

  • Filtering out unnecessary data or columns
  • Mapping data: converting dates to a common format, converting numbers to the same unit, etc.
  • Converting timestamps to UTC
  • Converting categorical variables to 0/1
  • Aggregating data
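To make these steps concrete, here is a minimal pandas sketch; the DataFrame, the column names (order_date, amount_chf, country, internal_note) and the conversion rate are made up purely for illustration:

```python
import pandas as pd

# Hypothetical raw extract from the staging area; all columns are made up.
raw_df = pd.DataFrame({
    "order_date": ["2023-12-01 10:00:00+01:00", "2023-12-02 15:30:00+01:00"],
    "amount_chf": [120.0, 80.5],
    "country": ["CH", "DE"],
    "internal_note": ["ignore", "ignore"],
})

df = raw_df.drop(columns=["internal_note"])                     # filter out unnecessary columns
df["order_date"] = pd.to_datetime(df["order_date"], utc=True)   # common date format, converted to UTC
df["amount_eur"] = df["amount_chf"] * 1.05                      # convert numbers to the same unit (made-up rate)
df = pd.get_dummies(df, columns=["country"])                    # categorical variables to 0/1
daily_totals = df.groupby(df["order_date"].dt.date)["amount_eur"].sum()  # aggregate per day
```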

After the transformation has been done, it is time to load the data, for example into a database, for further processing. In my experience, a sample ETL process in AWS could look like this:

Here you extract data from CSV files or other sources and put them into S3, transform them with AWS Glue, which writes Parquet files back to S3, and then query them with AWS Athena. For further information, please read the comprehensive guide by Dogukan Ulu here: https://medium.com/@dogukannulu/aws-cloud-data-engineering-end-to-end-project-aws-glue-etl-job-s3-apache-spark-967d6ebe1d88
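Since AWS Glue ETL jobs are essentially PySpark scripts, the transformation step could look roughly like the sketch below; the bucket names, paths and the event_time column are placeholders, and a real Glue job would additionally use the awsglue job wrappers:

```python
from pyspark.sql import SparkSession, functions as F

# In an actual AWS Glue job the Spark session is provided by the Glue runtime.
spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Placeholder S3 path; in Glue this would typically come from the job arguments.
raw = spark.read.option("header", "true").csv("s3://my-bucket/staging/*.csv")

cleaned = (
    raw.dropna()                                                 # drop incomplete rows
       .withColumn("event_time", F.to_timestamp("event_time"))   # hypothetical timestamp column
)

# Write Parquet back to S3; registered in the Glue Data Catalog it can be queried with Athena.
cleaned.write.mode("overwrite").parquet("s3://my-bucket/processed/")
```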

Feature Engineering Pipeline

Once our data is ingested, transformed and loaded, it is ready for feature generation (a minimal sketch follows the list). This can include the following steps:

  • Handling of outliers
  • Log Transform
  • One Hot Encoding
  • Scaling
  • etc.
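As a small illustration of such a feature pipeline, here is a hedged scikit-learn sketch; the column names and the clipping threshold are assumptions, not taken from a real project:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

numeric_cols = ["amount_eur"]      # hypothetical numeric feature
categorical_cols = ["country"]     # hypothetical categorical feature

numeric_pipeline = Pipeline([
    ("clip", FunctionTransformer(lambda x: np.clip(x, 0, 10_000))),  # handle outliers (made-up bound)
    ("log", FunctionTransformer(np.log1p)),                          # log transform
    ("scale", StandardScaler()),                                     # scaling
])

feature_pipeline = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one hot encoding
])

# features = feature_pipeline.fit_transform(df)   # df would be the output of the ETL step
```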

You can of course do feature engineering as part of the code where you also handle your models. That is fine when you have one model in a notebook where you also do the prediction. But let’s say you have to develop multiple models that share the same features, or you want the same feature processing code to run in training and inference so that the preprocessing is identical. In that case, it is best to decouple the feature engineering as a separate step in a pipeline. It can, for example, be done with TensorFlow.
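One way to sketch that with TensorFlow is to put the preprocessing into Keras preprocessing layers and wrap them in their own small model; the price and category columns and the toy data below are made up for illustration:

```python
import numpy as np
import tensorflow as tf

# Hypothetical training data; in practice this would come from the ETL output.
prices = np.array([[10.0], [20.0], [30.0]], dtype="float32")
categories = np.array([["a"], ["b"], ["a"]])

# The preprocessing layers learn their state (mean/std, vocabulary) from the training data.
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(prices)

lookup = tf.keras.layers.StringLookup(output_mode="one_hot")
lookup.adapt(categories)

# A model that only does feature engineering; it can be saved and reused
# in front of any downstream model, both for training and for inference.
price_in = tf.keras.Input(shape=(1,), name="price")
cat_in = tf.keras.Input(shape=(1,), dtype=tf.string, name="category")
features = tf.keras.layers.concatenate([normalizer(price_in), lookup(cat_in)])
preprocessing_model = tf.keras.Model([price_in, cat_in], features)
```

Because the preprocessing lives in its own model, training and serving call exactly the same code, which is the consistency we are after.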


Data Versioning

As previously mentioned, with data versioning you can track changes in your dataset. It is actually quite similar to how you track changes in your software, namely with git. The industry standard for tracking changes in data is DVC (https://dvc.org/doc/start), which I have used in many projects. With it you can track the transformed data and pull the latest or a specific version for training. So how does data versioning work exactly?

Let’s say you have defined your ETL and feature engineering pipelines. The ETL pipeline pushes data daily into our cloud. From there a feature pipeline generates features. Sometimes you update the dataset, add new data to enrich it and so forth. So our changes over time could look like this:

Data Versioning with DVC

To track these changes, DVC uses git to version the data: the data itself stays outside of git, while small metadata files are committed. These are the steps done by DVC.

Add the data you want to track. Much like with git, you can use a command-line command:

dvc add data.csv

This will create a .dvc file, a small metadata file containing the information about the dataset, which is then tracked via git. If you modify your data and add it again, this creates another commit, so you can track the history of your data in git.

That way you can get the same dataset for training a specific model, or, if something was wrong with your model, go back to a specific commit and test against exactly that data.
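Inside a training script you can also pull a specific version through DVC’s Python API; a minimal sketch, assuming a data.csv tracked in the current repo and a hypothetical git tag v1.0, could look like this:

```python
import pandas as pd
import dvc.api

# Open the version of data.csv that was committed under the (hypothetical) git tag "v1.0".
with dvc.api.open("data.csv", repo=".", rev="v1.0") as f:
    train_df = pd.read_csv(f)
```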

The pipeline with data versioning could look like this:

Pipeline with data versioning

Feature Stores: what are they, and do we need them?

Let’s say you have features you want to reuse for training and prediction, you have a lot of models which reuse the same kind of features, the latency of retrieving the features matters, or you simply compute a lot of features.

To design a pipeline with a feature store in mind, you can start from this ETL pipeline:

ETL pipeline

You can see there is a step for feature engineering, which ideally would call a feature pipeline. After the feature pipeline, the features would be stored in the feature store, which could look like this:

AWS Feature store pipeline
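Feeding features into a feature store such as Amazon SageMaker Feature Store from the feature pipeline could look roughly like the sketch below; the feature group name, the DataFrame, the S3 URI and the role ARN are placeholders, and the exact calls may differ between SDK versions:

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

# Hypothetical output of the feature pipeline; a record identifier and an
# event-time column are required by SageMaker Feature Store.
features_df = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "avg_basket_eur": [42.0, 17.5],
    "event_time": [time.time(), time.time()],
})
features_df["customer_id"] = features_df["customer_id"].astype("string")

session = sagemaker.Session()
feature_group = FeatureGroup(name="customer-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=features_df)
feature_group.create(
    s3_uri="s3://my-bucket/feature-store/",                        # placeholder offline store location
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/my-sagemaker-role",   # placeholder role
    enable_online_store=True,
)
# In practice you would wait until the feature group is active before ingesting.
feature_group.ingest(data_frame=features_df, max_workers=2, wait=True)
```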

There are multiple advantages of using feature stores:

  • With a feature store you can make sure that the features used for training and inference are the same
  • If you have features that are used by multiple models, you can remove feature pipeline duplication
  • You get lower latency when retrieving features from a feature store (a small retrieval sketch follows this list)
  • Modern feature store applications have a feature catalog where you can browse the features in the feature store and reuse them
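For the latency point, reading a single record from the online store at inference time could look roughly like this; the feature group name and the record identifier are placeholders:

```python
import boto3

# Read one record from the online feature store at inference time.
runtime = boto3.client("sagemaker-featurestore-runtime")
response = runtime.get_record(
    FeatureGroupName="customer-features",       # placeholder feature group
    RecordIdentifierValueAsString="c1",         # placeholder record id
)
features = {item["FeatureName"]: item["ValueAsString"] for item in response["Record"]}
```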

With that said, feature stores really shine when you have a lot of features to compute, features that get reused, latency is relevant for you, or training and inference congruency is a priority. For many companies with only a handful of models, a feature store might be too much. I am linking this comprehensive guide for further information



Sammer Puran

I am an MLOps specialist / data scientist working for the Swiss national television. I have 5 years of experience in the data science realm.