
Real-Time Feature Extraction with Spark Structured Streaming


ML/AI is a core component of Insider’s Customer Data Platform (CDP), enriching customer profiles with predictions and recommendations that enable personalized experiences. A fundamental decision in generating these predictions is whether to use batch or real-time processing, as this choice shapes the system architecture. While real-time predictions are appealing, they are often more complex and costly to implement and maintain than batch predictions. If there is no clear business benefit, real-time processing is not worth the added complexity.

Our machine learning platform implements the Lambda architecture: a batch layer for feature extraction, offline feature stores, model training, and batch inference; a speed layer for real-time feature extraction and an online feature store; and a serving layer for real-time inference. We use Apache Spark on Amazon EMR in both the batch and speed layers throughout the ML pipeline. For more details about the overall platform, see our blog post Delphi: Insider’s Machine Learning Platform.

Lambda architecture representation of our machine learning pipeline

Spark Structured Streaming for Stream Processing

A stream processing system consists of data sources, a stream processing engine, and sinks. There are several popular open-source streaming engines, such as Apache Flink, Apache Spark, Apache Samza, and Apache Storm. We opted for Apache Spark Structured Streaming because we were already using Apache Spark in our batch pipelines, allowing us to leverage existing know-how and infrastructure.

Spark Structured Streaming is a scalable, high-performance stream processing engine with exactly-once fault-tolerance guarantees. Transformations and aggregations are expressed exactly as in batch processing, using the Dataset/DataFrame API, so the same query can run on both static datasets and continuous data streams. That is a critical property for ML pipelines, as it ensures that online and offline features are computed in exactly the same way.
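
As a minimal illustration of this unified API, the Scala sketch below applies one aggregation function to both a static and a streaming DataFrame; the paths, schema, and column names are hypothetical, and only the read and write endpoints differ between the two modes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("unified-api").getOrCreate()
import spark.implicits._

// One query definition, reused for both batch and streaming inputs.
def eventCountsPerUser(events: DataFrame): DataFrame =
  events.groupBy($"userId").agg(count("*").as("eventCount"))

// Batch: a bounded DataFrame read from static files.
val batchEvents = spark.read.json("s3://bucket/events/2024-01-01/")
eventCountsPerUser(batchEvents).write.parquet("s3://bucket/features/batch/")

// Streaming: an unbounded DataFrame over the same schema; the identical
// query now runs incrementally as new data arrives.
val streamEvents = spark.readStream.schema(batchEvents.schema).json("s3://bucket/events/live/")
eventCountsPerUser(streamEvents)
  .writeStream
  .outputMode("complete") // the aggregated result table is re-emitted each micro-batch
  .format("console")
  .start()
```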

Data stream as an unbounded table (Source)

Spark treats a data stream as an unbounded table that is continuously appended to. The engine processes the stream as a series of fast micro-batch jobs, updating both the intermediate state and the result table accordingly. Spark takes care of fault tolerance, consistency, and the handling of late-arriving data, relieving the user of most low-level streaming concerns. For more detailed information, refer to the Structured Streaming Programming Guide.
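
For instance, late-arriving data can be handled declaratively with a watermark: Spark keeps updating windows as late rows arrive and eventually drops their state once they fall behind the threshold. The sketch below (continuing with the spark session from the previous snippet) uses Spark’s built-in rate source to simulate events; the window size, watermark threshold, and trigger interval are arbitrary.

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

// Simulated event stream: the built-in rate source emits (timestamp, value) rows.
val events = spark.readStream
  .format("rate").option("rowsPerSecond", "100").load()
  .withColumnRenamed("timestamp", "eventTime")
  .withColumn("userId", ($"value" % 50).cast("string"))

// Count events per user in 5-minute windows, accepting events that
// arrive up to 10 minutes late; state for older windows is dropped.
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"), $"userId")
  .count()

windowedCounts.writeStream
  .outputMode("update")                          // emit only rows changed in this micro-batch
  .trigger(Trigger.ProcessingTime("30 seconds")) // run a micro-batch every 30 seconds
  .option("checkpointLocation", "/tmp/checkpoints/windowed-counts/")
  .format("console")
  .start()
```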

Requirements of Real-Time Feature Extraction

Several applications benefit from leveraging online data and real-time predictions to provide significant business value. For us, key business cases include predicting a user’s likelihood to purchase, recommending the most relevant products based on their current session, and improving search result relevance through session-based behaviour analysis.

Our machine learning models generate predictions to meet the aforementioned business needs. However, to enable near real-time business actions, raw online data must be aggregated to generate feature vectors, and the machine learning models must be deployed for real-time inference. These requirements are addressed by the speed layer and the serving layer in our architecture, respectively.

Architecture of the Speed Layer

The speed layer handles the generation of feature vectors for online data that is unavailable in the batch layer’s results due to batch processing latency. An end-user may conclude their session before the batch layer completes, which makes the speed layer essential.

To compute feature vectors through a series of transformations and aggregations on streaming data, we use Apache Spark Structured Streaming on Amazon EMR. The streaming data source is Amazon Kinesis Data Streams, which is part of our data ingestion pipelines. The resulting feature vectors are stored in an Amazon ElastiCache cluster, enabling fast access for real-time inference by the serving layer.
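
A hedged sketch of the sink stage is shown below. It assumes a featureVectors streaming DataFrame with userId and eventCount columns (for instance, a flattened version of the windowed counts above) and a hypothetical ElastiCache Redis endpoint; foreachBatch hands each micro-batch to ordinary batch-style write logic, here using the Jedis client.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import redis.clients.jedis.Jedis

featureVectors.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Invoked once per micro-batch; the partition writes run on the executors.
    batch.foreachPartition { rows: Iterator[Row] =>
      // One connection per partition; a pooled client would be preferable in production.
      val redis = new Jedis("features.abc123.cache.amazonaws.com", 6379) // hypothetical endpoint
      rows.foreach { row =>
        redis.hset(
          s"features:${row.getAs[String]("userId")}",
          "eventCount",
          row.getAs[Long]("eventCount").toString)
      }
      redis.close()
    }
  }
  .option("checkpointLocation", "s3://bucket/checkpoints/online-features/")
  .start()
```

Since foreachBatch provides at-least-once write semantics, idempotent writes such as these hash upserts keep the online store consistent even if a micro-batch is retried.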

Architecture of Amazon Kinesis Data Streams Connector for Spark Structured Streaming (Source)

Integration with Amazon Kinesis Data Streams

While the older Spark Streaming (DStream and RDD API) natively supports Kinesis, Spark Structured Streaming (DataFrame and Dataset API) does not offer this integration out of the box. Fortunately, the open-source community developed a Kinesis connector for Structured Streaming, as proposed in SPARK-18165.

The Kinesis Connector for Structured Streaming by Qubole was the standard approach until Spark 3.0, when maintenance of the project was discontinued. Following this, AWS released its own implementation, the Amazon Kinesis Data Streams Connector for Spark Structured Streaming. Starting with EMR release 7.1, this connector is integrated into the EMR software, eliminating the need to manage it as an additional dependency. Notably, the AWS-provided library supports the enhanced fan-out feature, which dedicates 2 MB/s of read throughput per shard to each consumer, significantly improving stream processing performance; previous implementations lacked this capability.
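
For illustration, reading a stream with this connector might look like the sketch below. The format name and kinesis.* option keys follow the connector’s public README at the time of writing, and the region, stream, and consumer names are hypothetical; verify the exact options against the connector version you deploy.

```scala
import org.apache.spark.sql.functions.col

// Read a Kinesis stream with the AWS-provided Structured Streaming connector.
val kinesisEvents = spark.readStream
  .format("aws-kinesis")
  .option("kinesis.region", "eu-west-1")                      // hypothetical region
  .option("kinesis.streamName", "clickstream-events")         // hypothetical stream
  .option("kinesis.startingPosition", "LATEST")
  // Enhanced fan-out: a dedicated 2 MB/s pipe per shard for this consumer.
  .option("kinesis.consumerType", "SubscribeToShard")
  .option("kinesis.consumerName", "spark-feature-extraction") // hypothetical consumer
  .load()

// Kinesis record payloads arrive as bytes; parse them before the feature logic.
val parsed = kinesisEvents.select(col("data").cast("string").as("json"))
```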

Tips for Running Spark Structured Streaming in Production

  • Streaming applications are long-running by nature, so they must be configured to handle failures. Ensure the streaming job restarts automatically if it crashes, and for stateful queries, like those typically used in feature extraction, enable checkpointing so the state is recovered when the application restarts (see the sketch after this list).
  • Using the RocksDB state store can improve performance by reducing JVM memory pressure and long garbage collection (GC) pauses: the default state store keeps state in JVM memory, whereas the RocksDB implementation keeps it in native memory.
  • If micro-batch latencies are high, Asynchronous Progress Tracking can reduce them by taking the synchronous writing of progress checkpoints off the critical path between micro-batch computations. The trade-off is a potentially longer recovery when the application restarts.
  • Monitoring streaming health metrics is crucial for meeting business SLAs and responding to breaches. Spark’s StreamingQueryListener can be attached to streaming queries to asynchronously push key metrics to any service (also shown in the sketch below). We send metrics such as latency, duration, result table size, and updated row counts to Amazon CloudWatch, where we track the vitals and alert on critical thresholds.
  • Each Kinesis shard maps to a task in a Spark executor. The application creates one task per shard and distributes them across the executors, so configure the number of Kinesis shards and Spark executors in relation to each other for better utilization.
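
The sketch below pulls several of these tips together, reusing the windowedCounts query from the earlier snippet: a checkpointed query with the RocksDB state store provider and a StreamingQueryListener that forwards per-batch metrics. The configuration keys are standard Spark ones, while the sink and the metrics destination are placeholders; note that the programming guide documents restrictions on where asynchronous progress tracking applies.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Keep streaming state in RocksDB (native memory) rather than the default
// JVM-memory state store, reducing heap pressure and GC pauses.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

// Push per-micro-batch health metrics to an external service; here they are
// printed, but in a setup like ours they would go to CloudWatch.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch=${p.batchId} inputRows=${p.numInputRows} durationsMs=${p.durationMs}")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})

val query = windowedCounts.writeStream
  .outputMode("update")
  .format("console")                                                 // placeholder sink
  .option("checkpointLocation", "s3://bucket/checkpoints/features/") // state survives restarts
  // For stateless queries on supported sinks, asynchronous progress tracking
  // can additionally be enabled (check the guide for its current limitations):
  //   .option("asyncProgressTrackingEnabled", "true")
  //   .option("asyncProgressTrackingCheckpointIntervalMs", "1000")
  .start()

query.awaitTermination() // let the orchestrator restart the job on failure
```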

Conclusion

Real-time analytics and predictions on online data can be highly valuable for businesses. However, building and maintaining a real-time machine learning pipeline is complex and costly. Such systems require the integration of data streams, stream processing, low-latency key-value stores, and scalable model-serving capabilities.

To meet these needs, we leveraged Amazon Kinesis Data Streams, Apache Spark on Amazon EMR, Amazon ElastiCache, and Amazon EKS to build a low-latency, high-throughput, multi-tenant, scalable, and fault-tolerant architecture.

Written by Deniz Parmaksız, Staff Machine Learning Engineer at Insider | AWS Ambassador