Processing over 450M events per day in real time with AWS

Neylson Crepalde
4 min read · Jan 14, 2024

I recently had the pleasure of speaking at AWS re:Invent 2023 to share the story of how my team and I built a real-time log analytics solution capable of processing over 450 million events per day for a large financial institution in Brazil. In this post, I'm going to tell you how we worked hard on building and (especially) optimizing this solution until it was production-ready.

The Challenge

Our client was facing a significant challenge — they needed to analyze multiple terabytes per month of system, application, and AWS logs to identify potential security incidents in real time. Their security rules were also highly volatile, changing multiple times per day, so having their security team manually update rules and parse through logs was not feasible.

They already had a process to aggregate logs into an S3-based security lake, so we had our centralized data source to build on top of. But we needed to create a scalable architecture that could:

  1. Load and parse JSON log files from S3
  2. Allow their security team to easily define, update and manage security rules
  3. Run streaming computations to identify security events based on those rules
  4. Visualize results and alerts in real time dashboards

No small task!

Our Initial Architecture

Given the real-time processing and visualization requirements, we started by identifying AWS services that could help us build this type of pipeline:

  • Amazon EMR for distributed data processing using Spark Structured Streaming. This gave us scalable, resilient, sub-second streaming directly from files in S3 (a minimal read sketch follows this list).
  • Amazon OpenSearch for real-time dashboards and alerts using OpenSearch Dashboards. OpenSearch makes ingesting and visualizing streaming data simple.
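
To give a feel for the EMR piece, here is a minimal sketch of a Spark Structured Streaming read over the S3 security lake. The bucket path, schema, and trigger setting are illustrative assumptions, not the client's actual values:

```python
# Minimal sketch of the streaming read; paths and schema are illustrative, not production values.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("security-log-stream").getOrCreate()

# File sources need an explicit schema for streaming reads
log_schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("source", StringType()),
    StructField("principal", StringType()),
    StructField("action", StringType()),
    StructField("detail", StringType()),
])

logs = (
    spark.readStream
    .schema(log_schema)
    .option("maxFilesPerTrigger", 1000)       # cap files per micro-batch to keep latency steady
    .json("s3://security-lake-bucket/logs/")  # hypothetical security lake prefix
)
```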

The core missing piece was letting the security team manage rules dynamically. For this, we built a custom web application using Python and Streamlit. The application let analysts add, update, activate, and deactivate rules, which we stored in an Amazon DynamoDB table.
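
As a rough illustration (the table name and rule attributes here are hypothetical, not the actual app), the rule-management idea boils down to a Streamlit form writing items to DynamoDB:

```python
# Sketch of the rule-management app; table name and rule attributes are hypothetical.
import uuid

import boto3
import streamlit as st

rules_table = boto3.resource("dynamodb").Table("security-rules")

st.title("Security rule manager")

name = st.text_input("Rule name")
condition = st.text_area("Condition (SQL expression evaluated against the log stream)")
active = st.checkbox("Active", value=True)

if st.button("Save rule"):
    rules_table.put_item(
        Item={
            "rule_id": str(uuid.uuid4()),
            "name": name,
            "condition": condition,
            "active": active,
        }
    )
    st.success("Rule saved")
```

Deactivating a rule is just another put_item (or update_item) flipping the active flag.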

Here was the initial architecture:

Diagram by the author

This design worked, but we quickly ran into a few serious challenges:

The Trouble With Our First Try

  1. OpenSearch sizing is hard — undersizing would crash under load, oversizing would get expensive. We needed auto-scaling.
  2. The pipeline wasn’t fast enough for 450M+ events per day. We identified bottlenecks related to many tasks being performed sequentially in the streaming queries.
  3. Getting data from Spark into OpenSearch involved custom coding instead of using native Spark connectors.
  4. Frequent calls to refresh the DynamoDB rules and to write results to S3 slowed down streaming performance.

Back to the Drawing Board

After a few weeks of performance testing and optimization, we implemented several key improvements:

  1. Added a preprocessing query to convert JSON logs to Parquet for faster processing.
  2. Split the pipeline into two parallel queries, one for OpenSearch and one for S3. This improved latency for the real-time data viz.
  3. Used the opensearch-py Python library to index data into OpenSearch.
  4. Instead of reading the DynamoDB table on every iteration, we read it as a fixed DataFrame and updated Streamlit to restart the streaming query every time there was a rule update.
  5. Configured bulk parallel indexing into OpenSearch. (The sketches below illustrate how these pieces fit together.)

Image by the author
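
To make optimizations 1, 2, and 4 more concrete, here is a rough sketch of how such a split pipeline can be wired up, continuing from the read sketch above (where `logs` is the streaming DataFrame). The paths, table name, and rule-matching approach are my illustrative assumptions, not the exact production code:

```python
# Sketch of the split pipeline; names and matching logic are illustrative.
from functools import reduce

import boto3
from pyspark.sql import functions as F

# Optimization 4: read the rules once instead of hitting DynamoDB on every micro-batch.
# Streamlit restarts the streaming queries whenever a rule changes, so this stays in sync.
rule_items = boto3.resource("dynamodb").Table("security-rules").scan()["Items"]
active_rules = [r for r in rule_items if r.get("active")]

# Optimization 1: a preprocessing query that lands the raw JSON stream as Parquet on S3.
to_parquet = (
    logs.writeStream
    .format("parquet")
    .option("path", "s3://security-lake-bucket/parquet/")
    .option("checkpointLocation", "s3://security-lake-bucket/checkpoints/parquet/")
    .start()
)

# Optimization 2: a second, parallel query evaluates each rule's condition (stored as a
# SQL expression) against the stream and tags matches with the rule id.
alerts = reduce(
    lambda a, b: a.unionByName(b),
    [
        logs.where(F.expr(rule["condition"])).withColumn("rule_id", F.lit(rule["rule_id"]))
        for rule in active_rules
    ],
)
# The OpenSearch sink for `alerts` is sketched next.
```

When an analyst saves a rule change, the Streamlit app restarts these queries, so the rules table is re-read exactly once per change instead of on every micro-batch.
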
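For optimizations 3 and 5, one way to wire the OpenSearch side is a foreachBatch sink using opensearch-py's bulk helpers. Again, a sketch under assumptions: the endpoint, index name, and credentials are placeholders, and `alerts` comes from the previous sketch:

```python
# Sketch of the OpenSearch sink; endpoint, index, and credentials are placeholders.
from opensearchpy import OpenSearch, helpers

def index_batch(batch_df, batch_id):
    # Collecting to the driver keeps the sketch short; at scale you would index per partition.
    docs = [row.asDict() for row in batch_df.collect()]

    client = OpenSearch(
        hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],  # hypothetical
        http_auth=("user", "password"),  # use IAM / fine-grained access control in practice
        use_ssl=True,
    )

    # Optimization 5: parallel_bulk fans the bulk indexing requests out across threads.
    actions = ({"_index": "security-alerts", "_source": doc} for doc in docs)
    for ok, info in helpers.parallel_bulk(client, actions, thread_count=4, chunk_size=500):
        if not ok:
            print(f"Indexing error in batch {batch_id}: {info}")

to_opensearch = (
    alerts.writeStream
    .foreachBatch(index_batch)
    .option("checkpointLocation", "s3://security-lake-bucket/checkpoints/opensearch/")
    .start()
)
```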

With these optimizations, we were able to achieve our goal of 450 million events per day, with room to scale further!

The Outcome

This real-time security pipeline had a huge business impact. By processing security logs in seconds rather than hourly or daily batches, the security team could identify and respond to threats in minutes rather than days. For a bank dealing with sensitive financial data, this dramatically improved their security posture.

Additionally, by demonstrating a successful high volume real-time pipeline on AWS, we helped evolve the traditional “batch processing first” mindset often present in financial services. This led the client to invest more in streaming and low latency architectures.

By leveraging technologies like Spark Structured Streaming, OpenSearch, and DynamoDB, we built an agile, scalable, and high-performance architecture that met the customer's needs. The project shows how combining AWS's fully managed services with custom development lets you create solutions capable of ingesting, processing, and analyzing vast volumes of data in real time.

If you face similar large-scale, time-sensitive processing challenges, I hope you find some useful lessons and architectural patterns here. I enjoyed sharing this story of how my team pushed the boundaries of what we thought possible with AWS. Please reach out if you have any questions!
