Building a Feature Store with Apache Iceberg on AWS

Deniz Parmaksız
Insider Engineering
4 min read · Sep 30, 2023

Insider has been leveraging AI/ML since the company's early years to differentiate itself in the MarTech field. Machine learning applications have always supported the product line in areas such as end-user segmentation, engagement prediction, product recommendations and personalised search, and more recently in content generation and more complex use cases built on generative AI.

The individual model pipelines before the design and implementation of the Delphi platform.

Individual Model Pipelines

In the initial days of our journey towards a robust machine learning infrastructure, each model we developed had its own feature pipeline. This architecture was easy to develop and manage: we created a feature extraction query for every model type and ran it to generate the datasets.

This approach served us well in the early stages. However, as our model variety grew, the architecture became unmanageable. Maintaining numerous pipelines was time-consuming, and developing a new model with similar input features meant duplicating nearly identical queries across pipelines, as the sketch below illustrates.
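A minimal PySpark sketch of that duplication: two model pipelines each computing nearly identical aggregates from the same raw data. The raw_events table and all column names here are hypothetical, not our production schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per-model-pipelines").getOrCreate()

# Pipeline for the churn model: 30-day engagement aggregates per user.
churn_features = spark.sql("""
    SELECT user_id,
           COUNT(*)          AS session_count,
           AVG(basket_value) AS avg_basket_value
    FROM raw_events
    WHERE event_date >= date_sub(current_date(), 30)
    GROUP BY user_id
""")

# Pipeline for the purchase-prediction model: recomputes almost the same
# aggregates from the same raw data, plus one extra column.
purchase_features = spark.sql("""
    SELECT user_id,
           COUNT(*)          AS session_count,
           AVG(basket_value) AS avg_basket_value,
           MAX(event_date)   AS last_seen
    FROM raw_events
    WHERE event_date >= date_sub(current_date(), 30)
    GROUP BY user_id
""")
```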

We clearly needed a standardised, central feature store to streamline the process. The evolution from isolated pipelines to a centralised feature store was the beginning of a significant transformation in our ML infrastructure.

The Need for a Central Feature Store

A feature store serves as a centralised repository of pre-computed features, making them readily available to ML models. The term ‘feature store’ was first coined in a blog post by Uber about their machine learning platform, Michelangelo. Michelangelo also inspired the development of Insider’s machine learning platform, Delphi, which we started to design, build and migrate to in 2019.

The transition to a feature store streamlined our ML processes and created a level of consistency that was difficult to achieve with individual model pipelines. The centralised nature of a feature store ensured uniform feature computation and storage, which in turn simplified model development significantly. The feature store was not just a solution to our problem but a substantial upgrade to our ML infrastructure, as each model gained access to a wider range of features.

The infrastructure of our Machine Learning platform Delphi.

The Infrastructure for the Feature Store

Our primary objective was to establish a standardised, central offline feature store. We had been using Apache Hive heavily for storing our tabular data in Amazon S3, which served both ETL and ML pipelines and was processed with Apache Spark on Amazon EMR. Consequently, we leveraged our pre-existing knowledge in this domain for the feature store design.

In this architecture, each feature catalog is represented as a table, with each feature within the catalog stored as a column. These tables are partitioned by client (ensuring multi-tenancy), date, and aggregation time window to optimise read queries. Every feature catalog within the feature store is associated with a Spark feature extraction job, which periodically computes the features from raw data. This setup let us generalise the Spark queries scattered across individual pipelines and move them into the feature store pipeline, streamlining the model pipelines.
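As a minimal sketch, such a feature catalog table could be declared like this with Spark SQL on Iceberg; the catalog, table, and column names are hypothetical, not our production schema.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Iceberg runtime and a catalog named
# "glue" already configured.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS glue.feature_store.user_engagement (
        user_id          STRING,
        session_count    BIGINT,
        avg_basket_value DOUBLE,
        partner          STRING,  -- client identifier for multi-tenancy
        dt               DATE,    -- feature computation date
        window_days      INT      -- aggregation time window, e.g. 7, 30, 90
    )
    USING iceberg
    PARTITIONED BY (partner, dt, window_days)
""")
```

Partitioning on (partner, dt, window_days) means a model pipeline reading one client's features for a single date and window scans only a narrow slice of the table.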

With the centralised feature store operational, every model pipeline was restructured to consume the pre-computed features from this repository. This architectural shift not only simplified the data preparation step for each model but also significantly reduced the time and resources needed to create datasets, train models, and generate predictions, because it eliminated duplicate and redundant feature computations. The integration of the feature store with the model pipelines marked a significant milestone in optimising our ML workflows.
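A hypothetical sketch of a restructured pipeline: rather than running its own extraction query, it reads the pre-computed features and joins them to labels. The table names continue the earlier sketch and are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read one client's pre-computed 30-day features for a given date.
features = (
    spark.table("glue.feature_store.user_engagement")
    .where("partner = 'some_client' AND dt = DATE '2023-09-01' AND window_days = 30")
)

labels = spark.table("glue.ml.churn_labels")  # hypothetical label table
training_df = features.join(labels, "user_id")
```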

Scale of the Feature Store

As Insider has 1,200+ clients worldwide, our solutions must be multi-tenant and scalable. A primary requirement for the offline feature store was a scalable, durable, and secure storage layer, and Amazon S3 addresses all three well.

Initially, we used Apache Hive as the data lake table format; however, as partitions and data grew, Hive began bottlenecking our Spark jobs during the query planning phase due to its O(N) directory listing for partitions. This performance hiccup decreased efficiency and increased costs, a trend that worsened over time. To resolve it, we migrated to Apache Iceberg, modernising our lakehouse and feature store architecture. The transition resulted in a 20–30% increase in Spark job performance and a dramatic 90% reduction in our Amazon S3 costs. More details can be found in our earlier blog posts on the Iceberg migration.
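For reference, Iceberg ships Spark SQL procedures that can migrate an existing Hive table in place; a minimal sketch, assuming the Iceberg Spark runtime and a suitably configured catalog (table names hypothetical, not a description of our exact migration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# snapshot() creates a temporary Iceberg copy for validation without
# touching the source Hive table; migrate() then replaces it in place.
spark.sql("CALL glue.system.snapshot('feature_store.user_engagement', "
          "'feature_store.user_engagement_iceberg_test')")
spark.sql("CALL glue.system.migrate('feature_store.user_engagement')")
```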

Today the feature store holds 10+ feature catalogs, 1,000+ feature columns, and 500,000+ partitions across hundreds of clients. Every day, 3,000+ feature extraction jobs run on Amazon EMR, and every week, 5,000+ model variants are trained. All of these jobs are orchestrated by Apache Airflow running on Amazon EKS, along the lines of the sketch below.
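A minimal, hypothetical Airflow sketch of how one such feature extraction job can be scheduled as an EMR step; the operator choice, cluster id, and S3 paths are assumptions, not our production DAGs.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

with DAG(
    dag_id="feature_extraction_user_engagement",
    schedule="@daily",  # Airflow 2.4+; use schedule_interval on older versions
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # Submit the feature extraction Spark job as a step on a running cluster.
    extract_features = EmrAddStepsOperator(
        task_id="extract_user_engagement_features",
        job_flow_id="{{ var.value.emr_cluster_id }}",  # hypothetical Airflow variable
        steps=[{
            "Name": "user-engagement-features",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit",
                         "s3://example-bucket/jobs/extract_user_engagement.py"],
            },
        }],
    )
```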

Conclusion

The evolution of our machine learning platform has significantly simplified our ML pipelines, reducing operating costs and improving model quality. The establishment of a central feature store cut our model development time by weeks, if not months. While we enjoyed the benefits of adopting a feature store early, developing an in-house solution came at a cost. Today, for teams building an ML pipeline from scratch, a managed cloud service such as Amazon SageMaker Feature Store is a more straightforward route to meet these needs.
