Building a Feature Store

Weslley Lioba Caldas
Aug 27 · 3 min read

A containerized approach using Apache Kafka, Spark, Cassandra, Hive, PostgreSQL, Jupyter, and Docker Compose.


Feature extraction is one of the essential processes in machine learning pipelines. Unfortunately, when the data volume grows fast, performing repetitive operations in ETL pipelines becomes expensive. A simple solution to this problem is to build a feature store, where you can store features and reuse them across different machine learning projects. This post's objective is to propose a guide on building a feature store, whether for study purposes or for deployment.

You can read more about feature stores here. You can also check the Git repository for this post here.

We will use the Butterfree framework to build our ETL pipeline. According to its authors:

The main idea is for this repository to be a set of tools for easing ETLs. The idea is using Butterfree to upload data to a Feature Store, so data can be provided to your machine learning algorithms.

The feature store is where features for machine learning models and pipelines are stored. A feature is an individual property or characteristic of a data sample, such as the height of a person, the area of a house, or an aggregated feature such as the average price of the houses seen by a user within the last day. A feature set can be thought of as a set of features. Finally, an entity is a unified representation of a specific business context.
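To make these concepts concrete, here is a minimal sketch of how a key, a timestamp, and a regular feature are declared with Butterfree's building blocks. The names and descriptions are illustrative, chosen to match the output schema we will target later in this post.

```python
from butterfree.constants import DataType
from butterfree.transform.features import Feature, KeyFeature, TimestampFeature

# Key feature: identifies the entity each row refers to.
id_employer = KeyFeature(
    name="id_employer",
    description="Unique identifier of the employee taking the order",
    dtype=DataType.INTEGER,
)

# Timestamp feature: the event time used to order and window the data.
event_time = TimestampFeature()

# Regular feature: an individual property of a data sample.
final_price = Feature(
    name="final_price",
    description="Final price paid for the drink",
    dtype=DataType.FLOAT,
)
```

A feature set groups declarations like these under an entity name, as we will see when assembling the pipeline.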

Let's simulate the following scenario:

  • We have a streaming JSON data source with events of Starbucks orders being captured in real-time.
  • We have a CSV data set with more information about drinks.

Objective:

We want to parse the JSON from the streaming source, perform aggregation operations, store all rows in cheap storage (like S3), and keep the most recent transactions in a low-latency database like Cassandra.

We want the output to have the following schema (an illustrative input event and output row are sketched right after the list):

  • id_employer: int
  • name_employer: string
  • name_client: string
  • payment: string
  • timestamp: timestamp
  • product_name: string
  • product_size: string
  • product_price: int
  • percent_carbo: float
  • final_price: float
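To make this concrete, here is what a single order might look like before and after the pipeline. The raw field values below are invented for illustration; they are not taken from the actual dataset.

```python
# Hypothetical raw order event arriving on the Kafka topic.
raw_event = {
    "id_employer": 42,
    "name_employer": "Maria",
    "name_client": "John",
    "payment": "credit_card",
    "timestamp": "2020-08-27 14:35:10",
    "product_name": "caffe_latte",
    "product_size": "grande",
}

# Hypothetical enriched row after joining with the drinks CSV and
# computing the derived columns of the target schema.
output_row = {
    **raw_event,
    "product_price": 5,      # int, looked up in the drinks CSV
    "percent_carbo": 12.5,   # float, looked up in the drinks CSV
    "final_price": 5.75,     # float, derived during the transformation step
}
```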

Here is a solution using the Butterfree library and the architecture described below (a simplified sketch of the resulting pipeline follows the list):

  • Apache Kafka as the data source (streaming input data);
  • A Hive metastore to store metadata (such as table schemas and locations) in a relational database (for this tutorial, we will use PostgreSQL);
  • Apache Cassandra to store the most recent data;
  • Amazon S3 to store historical features, or temporary table views when running in debug mode. In this post, we will use debug mode for simplicity, but you only need to add your AWS credentials to use the S3 feature.
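To give a feel for how these components map onto Butterfree, below is a simplified, minimal sketch of a feature set pipeline. It only shows the CSV reader; the real notebook also plugs in Butterfree's Kafka reader for the streaming events. Reader ids, file paths, the query, and the feature list are assumptions for illustration, not the exact code from the repository.

```python
from butterfree.constants import DataType
from butterfree.extract import Source
from butterfree.extract.readers import FileReader
from butterfree.load import Sink
from butterfree.load.writers import (
    HistoricalFeatureStoreWriter,
    OnlineFeatureStoreWriter,
)
from butterfree.pipelines import FeatureSetPipeline
from butterfree.transform import FeatureSet
from butterfree.transform.features import Feature, KeyFeature, TimestampFeature

# Source: each reader is exposed as a temp view named after its id,
# and the query combines them (here, just the drinks CSV).
source = Source(
    readers=[
        FileReader(
            id="drinks",
            path="data/drinks.csv",  # illustrative path
            format="csv",
            format_options={"header": True},
        ),
    ],
    query="SELECT * FROM drinks",
)

# Feature set: the entity, its keys, the event time, and the features to keep.
feature_set = FeatureSet(
    name="orders_feature_set",
    entity="orders",
    description="Starbucks orders enriched with drink information",
    keys=[
        KeyFeature(
            name="id_employer",
            description="Employee id",
            dtype=DataType.INTEGER,
        ),
    ],
    timestamp=TimestampFeature(),
    features=[
        Feature(
            name="final_price",
            description="Final price paid for the drink",
            dtype=DataType.FLOAT,
        ),
    ],
)

# Sink: the online writer targets Cassandra, the historical writer targets S3.
# With debug_mode=True, both write to temporary Spark views instead.
sink = Sink(
    writers=[
        OnlineFeatureStoreWriter(debug_mode=True),
        HistoricalFeatureStoreWriter(debug_mode=True),
    ]
)

FeatureSetPipeline(source=source, feature_set=feature_set, sink=sink).run()
```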

All the infrastructure was built upon Docker Compose, based on this repository:

I have opted to put each component in a separate container, so you can easily replace whatever you need in a real deployment. Let's go through each of them:

  • Cassandra: A running Cassandra instance to store the online features.
  • Hive-metastore: Contains an instance of the Hive metastore, one of the requirements of the Butterfree framework. A metastore is useful for storing tables and schemas and can sit on top of different databases; here, PostgreSQL is the chosen option.
  • Postgresql: Contains an instance of PostgreSQL configured for the Hive metastore.
  • Spark: Contains Spark configured with Jupyter Notebook and Python 3.
  • Namenode: The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
  • Datanode: The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
  • Zookeeper: Acts as a centralized service and is used to maintain naming and configuration, keeping track of the nodes' status, topics, and partitions.
  • Broker: A Kafka broker receives messages from producers and stores them on disk keyed by a unique offset. It also allows consumers to fetch messages by topic, partition, and offset.

To run the application, you just need to type docker-compose up -d. Then open localhost:8888 and run the Kafka notebook to start the streaming process. After that, you can run the ETL notebook to create your feature store.
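For reference, the streaming side can be as simple as a small producer loop. The sketch below uses kafka-python with an invented topic name, broker address, and event payload; the actual Kafka notebook in the repository may do this differently.

```python
import json
import time

from kafka import KafkaProducer  # assumes kafka-python is available in the notebook image

# Connect to the broker container; the address and topic name are assumptions.
producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit fake Starbucks order events to simulate the real-time stream.
fake_orders = [
    {
        "id_employer": 42,
        "name_employer": "Maria",
        "name_client": "John",
        "payment": "credit_card",
        "timestamp": "2020-08-27 14:35:10",
        "product_name": "caffe_latte",
        "product_size": "grande",
    },
]

for order in fake_orders:
    producer.send("orders", order)
    time.sleep(1)  # crude pacing so events arrive gradually

producer.flush()
```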

You can check the notebook that produces the ETL pipeline here.
