A containerized approach using: Apache Kafka, Spark, Cassandra, Hive, PostgreSQL, Jupyter, and docker-compose.
Feature extraction is one of the essential processes in machine learning pipelines. Unfortunately, when data volume grows fast, performing repetitive operations in ETL pipelines becomes expensive. A simple solution to this problem is to build a feature store, where you can store features and reuse them across different machine learning projects. This post’s objective is to propose a guide on building a feature store, whether for study purposes or for deployment.
We will use the Butterfree framework to build our ETL pipeline. According to its authors:
The main idea is for this repository to be a set of tools for easing ETLs. The idea is using Butterfree to upload data to a Feature Store, so data can be provided to your machine learning algorithms.
The feature store is where features for machine learning models and pipelines are stored. A feature is an individual property or characteristic of a data sample, such as the height of a person, the area of a house, or an aggregated feature such as the average price of houses seen by a user within the last day. A feature set can be thought of as a set of features. Finally, an entity is a unified representation of a specific business context.
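To make the aggregated-feature idea concrete, here is a small pure-Python sketch (not Butterfree code; the event fields are hypothetical) computing the example feature above: the average price of houses a user has seen within the last day.

```python
from datetime import datetime, timedelta

# Hypothetical "house seen" events for illustration only.
events = [
    {"user_id": 1, "house_price": 300_000.0, "seen_at": datetime(2021, 5, 1, 9, 0)},
    {"user_id": 1, "house_price": 500_000.0, "seen_at": datetime(2021, 5, 1, 18, 0)},
    {"user_id": 1, "house_price": 900_000.0, "seen_at": datetime(2021, 4, 28, 12, 0)},  # outside the window
]

def avg_price_last_day(events, user_id, now):
    """Aggregated feature: average price of houses seen by `user_id` in the last 24 hours."""
    window_start = now - timedelta(days=1)
    prices = [e["house_price"] for e in events
              if e["user_id"] == user_id and e["seen_at"] >= window_start]
    return sum(prices) / len(prices) if prices else None

feature = avg_price_last_day(events, user_id=1, now=datetime(2021, 5, 1, 20, 0))
print(feature)  # → 400000.0
```

In a real pipeline this windowed aggregation would be expressed as a Spark transformation, but the logic is the same: filter by key and time window, then aggregate.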
We will simulate the following scenario:
- We have a streaming JSON data source with events of Starbucks orders being captured in real-time.
- We have a CSV data set with more information about drinks.
We want to parse the JSON from the streaming source, perform aggregation operations, store all rows in cheap storage (like S3), and keep the most recent transactions in a low-latency database like Cassandra.
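As a minimal sketch of the parsing and aggregation step (plain Python, independent of Spark and Butterfree; the sample payload is an assumption about what an order event might look like):

```python
import json
from collections import defaultdict

# A hypothetical order event, shaped like the JSON the Kafka topic might carry.
raw = '{"id_employer": 3, "name_client": "Ana", "payment": "credit", "product_name": "latte", "product_price": 5}'

order = json.loads(raw)  # parse the JSON payload into a dict

# Example aggregation: running revenue per payment method.
revenue_by_payment = defaultdict(float)
revenue_by_payment[order["payment"]] += order["product_price"]

print(dict(revenue_by_payment))  # → {'credit': 5.0}
```

In the actual pipeline, Spark Structured Streaming performs this parsing and aggregation continuously over the Kafka topic rather than one message at a time.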
We want the output to have the following schema:
- id_employer: int
- name_employer: string
- name_client: string
- payment: string
- timestamp: timestamp
- product_name: string
- product_size: string
- product_price: int
- percent_carbo: float
- final_price: float
Solution using the Butterfree library and the following architecture:
- Apache Kafka as the data source (streaming input data);
- A Hive metastore to store metadata (like table schemas and locations) in a relational database (for this tutorial, we will use PostgreSQL);
- Apache Cassandra to store the most recent data;
- Amazon S3 to store historical features, or table views when running in debug mode. In this post, we will use debug mode for simplicity; to use S3, you just need to supply your AWS credentials.
All infrastructure was built upon docker-compose, based on this repository:
I have opted to put each component in a separate container, so you can easily replace whatever you need in a real deployment. Let’s go through each of them:
- Cassandra: A Cassandra instance running to store online features.
- Hive-metastore: Contains an instance of hive-metastore, one of the requirements of the Butterfree framework. A metastore is useful for storing tables and schemas and can be backed by different databases; here, PostgreSQL is the chosen option.
- Postgresql: Contains an instance of PostgreSQL configured for the hive-metastore.
- Spark: Contains Spark configured with Jupyter Notebook and Python 3.
- Namenode: The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
- Datanode: The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
- Zookeeper: acts as a centralized service and is used to maintain naming and configuration, keeping track of the nodes’ status, topics, and partitions.
- Broker: A Kafka broker receives messages from producers and stores them on disk keyed by unique offset. Also, it allows consumers to fetch messages by a topic, partition, and offset.
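The offset-based storage model described above can be illustrated with a toy in-memory broker (plain Python; a simplified model, not Kafka’s actual implementation):

```python
class ToyBroker:
    """Minimal model of Kafka's log: messages appended per (topic, partition), keyed by offset."""

    def __init__(self):
        self.logs = {}  # (topic, partition) -> list of messages

    def produce(self, topic, partition, message):
        log = self.logs.setdefault((topic, partition), [])
        log.append(message)
        return len(log) - 1  # the unique, monotonically increasing offset

    def fetch(self, topic, partition, offset):
        # Consumers read from an explicit offset, so each consumer controls its own progress.
        return self.logs.get((topic, partition), [])[offset:]

broker = ToyBroker()
broker.produce("orders", 0, "order-1")
broker.produce("orders", 0, "order-2")
print(broker.fetch("orders", 0, 1))  # → ['order-2']
```

This is why Kafka consumers can replay history or resume after a crash: the broker keeps the log on disk, and the consumer simply asks for a topic, partition, and offset.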
You need to type docker-compose up -d to run the application. Open localhost:8888 and run the Kafka notebook to start the streaming process. After that, you can run the ETL notebook to create your feature store.
You can check the notebook that produces the ETL pipeline here.