Realtime Production Monitoring & Alerting using Machine Learning, ElasticSearch & Grafana

Anas El Khaloui
Published in hipay-tech
4 min read · Jun 20, 2023

In today’s complex business landscape, ensuring seamless operations and identifying critical issues in real-time is a formidable challenge. Gone are the days when extensive teams would monitor charts and maintain thousands of alert rules to keep important systems running smoothly.

This article describes a fully functional approach that combines machine learning, ElasticSearch, and Grafana to achieve real-time production monitoring and alerting.

Payment platforms are complex products

HiPay’s all-in-one payment platform serves the needs of thousands of clients while interfacing with a large network of technological and financial partners. The intricate nature of HiPay’s ecosystem presents a significant monitoring challenge. Merely ensuring server uptime is insufficient. It is vital to guarantee uninterrupted payment flow, with a special emphasis on timely detection and response. After all, even a momentary payment disruption can mean lost revenue and lost customers for our clients.

The building bricks of efficient payment monitoring

To address these challenges, several key elements come into play:

  1. Real-time detailed data: Obtaining comprehensive, up-to-the-minute insights into platform activity for all clients is the first step. This requires capturing granular data on payment transactions, system performance, and customer interactions. Armed with this information, you gain a holistic understanding of your operations. We use payment logs that are parsed in real time, then stored and indexed in ElasticSearch.
  2. Establishing normal payment activity at scale: HiPay’s diverse client base comprises a range of businesses, from large international online retailers to small local craft shops. Each client has unique payment patterns, which fluctuate throughout the day. Successfully discerning what constitutes “normal payment activity” for each client at any given moment is an essential task. This needs to be robust and automated. We use time-series modeling and forecasting to predict how payment data should look in the future for each merchant.
  3. Continuous monitoring and alerting: Once the baseline of expected payment activity is established, a robust comparison mechanism is needed. Constantly measuring the ongoing activity of individual clients against their expected behavior enables timely detection of anomalies. When discrepancies are detected, appropriate alerts must be raised so that our Ops teams can intervene swiftly and rectify any issues. We use Grafana dashboards and alerts for this.

By the way, we’re looking for a Data Engineer proficient in ElasticSearch to join the Team, here’s the job description 😉

How we did it

Overall architecture:

[Figure: How we designed the solution]

Technically speaking:

[Figure: Technical components and data flows]

Two streams run in parallel here.

Batch stream: producing accurate predictions on a weekly basis

The purpose here is to generate precise predictions every week and load them into ElasticSearch. High-quality predictions are important because they guarantee a low rate of false alerts. We use our time series forecasting engine, which is scaled with DataFlow Prime (Google Cloud) and was designed and battle-tested on multiple payment-related prediction use cases.
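To make the idea of a weekly forecast with a “normal” band concrete, here is a minimal Python sketch. It is not our forecasting engine: it swaps in a deliberately simple seasonal baseline (mean and standard deviation per hour of the week), and the function names, the hourly granularity and the band width of 3 standard deviations are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def hour_of_week(ts: datetime) -> int:
    """Return a slot in 0..167, where Monday 00:00 is slot 0."""
    return ts.weekday() * 24 + ts.hour


def fit_weekly_profile(history):
    """Build {slot: (mean, std)} from hourly (timestamp, count) points.

    `history` is assumed to hold several past weeks for ONE merchant.
    """
    buckets = defaultdict(list)
    for ts, count in history:
        buckets[hour_of_week(ts)].append(count)
    return {slot: (mean(v), pstdev(v)) for slot, v in buckets.items()}


def forecast_week(profile, week_start: datetime, width: float = 3.0):
    """Return (timestamp, lower, expected, upper) rows for the coming week.

    Slots never seen in the history are skipped rather than guessed.
    """
    rows = []
    for h in range(168):
        ts = week_start + timedelta(hours=h)
        slot = hour_of_week(ts)
        if slot not in profile:
            continue
        expected, spread = profile[slot]
        lower = max(0.0, expected - width * spread)  # counts cannot be negative
        rows.append((ts, lower, expected, expected + width * spread))
    return rows
```

A real engine would also handle trend, holidays and merchants with very sparse traffic, which this baseline ignores.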

This stream leverages a classic modern data stack based on Airflow, BigQuery, Meltano and dbt.

The resulting predictions are computed once a week and uploaded into ElasticSearch indexes.
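As an illustration of the upload step, the sketch below serializes forecast rows into the newline-delimited JSON body expected by the ElasticSearch `_bulk` endpoint (one action line followed by one document line, with a trailing newline). The index name `payment-forecasts`, the field names and the document id scheme are hypothetical, not our actual mapping.

```python
import json
from datetime import datetime


def to_bulk_ndjson(rows, merchant_id: str, index: str = "payment-forecasts") -> str:
    """Serialize (timestamp, lower, expected, upper) rows for the _bulk API.

    Index name and field names are illustrative assumptions.
    """
    lines = []
    for ts, lower, expected, upper in rows:
        # Deterministic id: re-running the weekly job overwrites the same
        # documents instead of creating duplicates.
        doc_id = f"{merchant_id}-{ts.isoformat()}"
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({
            "@timestamp": ts.isoformat(),
            "merchant_id": merchant_id,
            "lower": lower,
            "expected": expected,
            "upper": upper,
        }))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [(datetime(2023, 6, 26, 0), 80.0, 100.0, 120.0)]
    print(to_bulk_ndjson(sample, merchant_id="merchant-42"), end="")
```

The resulting string can be sent as the body of a `POST /_bulk` request with the `application/x-ndjson` content type.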

Real-time stream: constant surveillance with the ElasticSearch stack and Grafana

ElasticSearch, a highly scalable and powerful data storage and search engine, acts as the central repository for HiPay’s real-time monitoring data. The utilization of ElasticSearch indexes ensures efficient data storage, retrieval, and blazing-fast search capabilities.
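To give a feel for the kind of aggregation involved, here is a sketch of a query body that counts transactions per time bucket for one merchant, using a `date_histogram` aggregation. The field names `merchant_id` and `@timestamp`, the 6-hour window and the 5-minute interval are assumptions for illustration, not our production query.

```python
import json


def build_volume_query(merchant_id: str, window: str = "now-6h", interval: str = "5m") -> dict:
    """Build a search body that returns transaction counts per time bucket.

    Field names are hypothetical; adapt them to the actual index mapping.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"merchant_id": merchant_id}},
                    {"range": {"@timestamp": {"gte": window, "lte": "now"}}},
                ]
            }
        },
        "aggs": {
            "transactions_over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": interval,
                    # Also return empty buckets that fall between the first
                    # and last matching documents.
                    "min_doc_count": 0,
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_volume_query("merchant-42"), indent=2))
```

Using filter clauses rather than scoring clauses keeps the query cheap, since relevance scores are irrelevant for a count.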

Sitting atop ElasticSearch, Grafana takes charge of data visualization, comparison analysis, and alert management. Grafana’s versatile dashboarding displays monitored metrics, empowering us to gain real-time insights into payment performance. Furthermore, Grafana computes alerts and triggers notifications, enabling timely intervention in response to anomalies. The entire provisioning and maintenance of this stack is achieved through Terraform infrastructure-as-code, ensuring consistency and ease of deployment.

In the Grafana dashboard, the yellow zone is the “normal zone”, based on our ML forecast. The green line is the actual number of payment transactions. If the green line stays outside the “normal zone” for more than 30 minutes, an alert is triggered and our infrastructure teams look into the issue.
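The rule itself is simple enough to sketch in a few lines of Python. This is only an illustration of the logic described above, not how Grafana evaluates alerts internally; the function name and the shape of the input points are assumptions.

```python
from datetime import datetime, timedelta


def should_alert(points, hold: timedelta = timedelta(minutes=30)) -> bool:
    """Return True if the actual value stays outside the band for more than `hold`.

    `points` is a chronologically ordered list of
    (timestamp, actual, lower, upper) tuples for one merchant.
    """
    breach_start = None
    for ts, actual, lower, upper in points:
        if lower <= actual <= upper:
            breach_start = None  # back in the normal zone: reset the timer
            continue
        if breach_start is None:
            breach_start = ts  # first point outside the normal zone
        if ts - breach_start > hold:
            return True
    return False
```

Requiring a sustained breach rather than a single out-of-band point is what keeps short, harmless dips from paging anyone.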

This has been delivering exceptional results so far, despite a clearly undersized ElasticSearch cluster. ElasticSearch (Apache Lucene under the hood) shines in swiftly aggregating and serving data, and well-tuned time series ML with Python can be easily scaled to fit tens of thousands of metrics using adaptable cloud services like Google BigQuery and DataFlow.

Thanks for reading!

Many thanks to @vlegendre (Data Engineering) and @lvial (Data Science) for the awesome work on this project 🙏 🔥

Anas El Khaloui
Data Science Manager ~ anaselk.com / I like understanding how stuff works, among other things : )