MLOps: Data (Features, Versioning and Feature Store)

Sammer Puran
5 min read · Dec 22, 2023


As previously discussed, in a production setup we need to make sure that data is continuously ingested into our cloud platform. This is done via ETL, so that our machine learning model has data to train on. The extracted data is put into a staging area for further processing. At this stage the data is raw: in order to be usable, certain transformation steps need to be applied (a small pandas sketch follows the list). These can for example include:

  • Filtering out unnecessary data or columns
  • Mapping data: converting dates to a common format, converting numbers to the same unit, etc.
  • Converting timestamps to UTC
  • Converting categorical variables to 0/1
  • Aggregating data
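To make these steps concrete, here is a minimal pandas sketch; the DataFrame, the column names (order_date, amount_chf, country, internal_note) and the conversion rate are made up purely for illustration:

```python
import pandas as pd

# Hypothetical raw extract from the staging area; all columns are made up.
raw_df = pd.DataFrame({
    "order_date": ["2023-12-01 10:00:00+01:00", "2023-12-02 15:30:00+01:00"],
    "amount_chf": [120.0, 80.5],
    "country": ["CH", "DE"],
    "internal_note": ["ignore", "ignore"],
})

df = raw_df.drop(columns=["internal_note"])                     # filter out unnecessary columns
df["order_date"] = pd.to_datetime(df["order_date"], utc=True)   # common date format, converted to UTC
df["amount_eur"] = df["amount_chf"] * 1.05                      # convert numbers to the same unit (made-up rate)
df = pd.get_dummies(df, columns=["country"])                    # categorical variables to 0/1
daily_totals = df.groupby(df["order_date"].dt.date)["amount_eur"].sum()  # aggregate per day
```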

After the transformation has been done, it is time to load the data, for example into a database, for further processing. In my experience, a sample ETL process in AWS could look like this:

Here you extract data from CSV files or other sources and put them into S3, transform them with AWS Glue, which writes Parquet files back to S3, and then query them with AWS Athena. For further information, please read the comprehensive guide by Dogukan Ulu here: https://medium.com/@dogukannulu/aws-cloud-data-engineering-end-to-end-project-aws-glue-etl-job-s3-apache-spark-967d6ebe1d88
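Since AWS Glue ETL jobs are essentially PySpark scripts, the transformation step could look roughly like the sketch below; the bucket names, paths and the event_time column are placeholders, and a real Glue job would additionally use the awsglue job wrappers:

```python
from pyspark.sql import SparkSession, functions as F

# In an actual AWS Glue job the Spark session is provided by the Glue runtime.
spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Placeholder S3 path; in Glue this would typically come from the job arguments.
raw = spark.read.option("header", "true").csv("s3://my-bucket/staging/*.csv")

cleaned = (
    raw.dropna()                                                 # drop incomplete rows
       .withColumn("event_time", F.to_timestamp("event_time"))   # hypothetical timestamp column
)

# Write Parquet back to S3; registered in the Glue Data Catalog it can be queried with Athena.
cleaned.write.mode("overwrite").parquet("s3://my-bucket/processed/")
```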

Feature Engineering Pipeline

Once our data is ingested, transformed and loaded, it is ready for feature generation (a minimal sketch follows the list). This can include the following steps:

  • Handling of outliers
  • Log Transform
  • One Hot Encoding
  • Scaling
  • etc.
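As a small illustration of such a feature pipeline, here is a hedged scikit-learn sketch; the column names and the clipping threshold are assumptions, not taken from a real project:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

numeric_cols = ["amount_eur"]      # hypothetical numeric feature
categorical_cols = ["country"]     # hypothetical categorical feature

numeric_pipeline = Pipeline([
    ("clip", FunctionTransformer(lambda x: np.clip(x, 0, 10_000))),  # handle outliers (made-up bound)
    ("log", FunctionTransformer(np.log1p)),                          # log transform
    ("scale", StandardScaler()),                                     # scaling
])

feature_pipeline = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one hot encoding
])

# features = feature_pipeline.fit_transform(df)   # df would be the output of the ETL step
```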

You can of course do feature engineering as part of the code where you also handle your models. That is fine when you have one model in a notebook where you also do the prediction. But let’s say you have to develop multiple models that share the same features, or you want the same feature processing code to run in training and inference so that the preprocessing is identical. In that case, it is best to decouple the feature engineering as a separate step in a pipeline. It can, for example, be done with TensorFlow.
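One way to sketch that with TensorFlow is to put the preprocessing into Keras preprocessing layers and wrap them in their own small model; the price and category columns and the toy data below are made up for illustration:

```python
import numpy as np
import tensorflow as tf

# Hypothetical training data; in practice this would come from the ETL output.
prices = np.array([[10.0], [20.0], [30.0]], dtype="float32")
categories = np.array([["a"], ["b"], ["a"]])

# The preprocessing layers learn their state (mean/std, vocabulary) from the training data.
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(prices)

lookup = tf.keras.layers.StringLookup(output_mode="one_hot")
lookup.adapt(categories)

# A model that only does feature engineering; it can be saved and reused
# in front of any downstream model, both for training and for inference.
price_in = tf.keras.Input(shape=(1,), name="price")
cat_in = tf.keras.Input(shape=(1,), dtype=tf.string, name="category")
features = tf.keras.layers.concatenate([normalizer(price_in), lookup(cat_in)])
preprocessing_model = tf.keras.Model([price_in, cat_in], features)
```

Because the preprocessing lives in its own model, training and serving call exactly the same code, which is the consistency we are after.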


Data Versioning

As previously mentioned, with data versioning you can track changes in your dataset. It is actually quite similar to how you track changes in your software, namely with git. The industry standard for tracking changes in data is DVC (https://dvc.org/doc/start), which I have used in many projects. With it you can track the transformed data and pull the latest or a specific version for training. So how does data versioning work exactly?

Let’s say you have defined your ETL and feature engineering pipelines. The ETL pipeline pushes data daily into our cloud. From there a feature pipeline generates features. Sometimes you update the dataset, add new data to enrich it and so forth. So our changes over time could look like this:

Data Versioning with DVC

To track these changes, DVC uses git to version the data: the data itself stays outside of git, while small metadata files are committed. These are the steps done by DVC.

Add the data you want to track. Much like with git, you can use a command-line command:

dvc add data.csv

This will create a .dvc file, a small metadata file containing the information about the dataset, which is then tracked via git. If you modify your data and add it again, this creates another commit, so you can track the history of your data in git.

That way you can get the same dataset for training a specific model, or, if something was wrong with your model, go back to a specific commit and test against exactly that data.
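Inside a training script you can also pull a specific version through DVC’s Python API; a minimal sketch, assuming a data.csv tracked in the current repo and a hypothetical git tag v1.0, could look like this:

```python
import pandas as pd
import dvc.api

# Open the version of data.csv that was committed under the (hypothetical) git tag "v1.0".
with dvc.api.open("data.csv", repo=".", rev="v1.0") as f:
    train_df = pd.read_csv(f)
```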

The pipeline with data versioning could look like this:

Pipeline with data versioning

Feature Stores: what are they, and do we need them?

Let’s say you have features you want to reuse for training and prediction, you have a lot of models which reuse the same kind of features, the latency of retrieving the features matters, or you simply compute a lot of features.

To design a pipeline with a feature store in mind, you can start from this ETL pipeline:

ETL pipeline

You can see there is a step for feature engineering, which ideally would call a feature pipeline. After the feature pipeline, the features would be stored in the feature store, which could look like this:

AWS Feature store pipeline
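Feeding features into a feature store such as Amazon SageMaker Feature Store from the feature pipeline could look roughly like the sketch below; the feature group name, the DataFrame, the S3 URI and the role ARN are placeholders, and the exact calls may differ between SDK versions:

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

# Hypothetical output of the feature pipeline; a record identifier and an
# event-time column are required by SageMaker Feature Store.
features_df = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "avg_basket_eur": [42.0, 17.5],
    "event_time": [time.time(), time.time()],
})
features_df["customer_id"] = features_df["customer_id"].astype("string")

session = sagemaker.Session()
feature_group = FeatureGroup(name="customer-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=features_df)
feature_group.create(
    s3_uri="s3://my-bucket/feature-store/",                        # placeholder offline store location
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/my-sagemaker-role",   # placeholder role
    enable_online_store=True,
)
# In practice you would wait until the feature group is active before ingesting.
feature_group.ingest(data_frame=features_df, max_workers=2, wait=True)
```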

There are multiple advantages of using feature stores:

  • With a feature store you can make sure that the features used for training and inference are the same
  • If you have features that are used by multiple models, you can remove feature pipeline duplication
  • You get lower latency when retrieving features from a feature store (a small retrieval sketch follows this list)
  • Modern feature store applications have a feature catalog where you can browse the features in the feature store and reuse them
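For the latency point, reading a single record from the online store at inference time could look roughly like this; the feature group name and the record identifier are placeholders:

```python
import boto3

# Read one record from the online feature store at inference time.
runtime = boto3.client("sagemaker-featurestore-runtime")
response = runtime.get_record(
    FeatureGroupName="customer-features",       # placeholder feature group
    RecordIdentifierValueAsString="c1",         # placeholder record id
)
features = {item["FeatureName"]: item["ValueAsString"] for item in response["Record"]}
```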

With that said, feature stores really shine when you have a lot of features to compute, features that get reused, latency is relevant for you, or training and inference congruency is a priority. For many companies with only a handful of models, a feature store might be too much. I am linking this comprehensive guide for further information



Sammer Puran

I am an MLOps specialist / data scientist working for the Swiss national television. I have 5 years of experience in the data science realm.