Streaming Data Platform at Exness: Overview

Gleb Shipilov
Exness Tech Blog · Mar 10, 2024

Co-authors: Aleksei Perminov, Ilya Soin, Yury Smirnov.

Read more on Streaming Data Platform at Exness:

  1. Overview by Gleb Shipilov (you are here!)
  2. Flink SQL and PyFlink by Aleksei Perminov
  3. Deployment Process by Ilya Soin
  4. Monitoring and Alerting by Yury Smirnov

In today’s business world, the ability to quickly extract valuable insights from constantly streaming data is not merely a convenience but a crucial necessity. At Exness, data underpins every aspect of decision-making: from improving customer relationships and enhancing operational efficiency to minimizing fraud risk and discovering new business opportunities.

To keep up with the fast pace of the company’s growth, our Information Technology department had to prioritize the transition to an event-driven architecture, improving agility and letting teams operate more independently and quickly. At the heart of it all is Apache Kafka, which serves as a bridge between dozens of services within the company.

When teams switched to data exchange via Kafka, a distinct set of problems gradually emerged related to data processing in a streaming fashion, such as:

  • Enriching records from one topic with additional data from another topic;
  • Enriching data from a topic via a REST API and writing to a different topic;
  • Aggregating events by a time window (see the sketch after this list);
  • Replicating data as is to a highly available storage, such as S3.
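
To give a taste of what such pipelines look like, here is a minimal sketch of the windowed-aggregation case using Apache Flink’s DataStream API in Java (we introduce Flink below). The bootstrap servers, topic, and group id are hypothetical placeholders; the job simply counts identical messages per one-minute window:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical Kafka source: address, topic, and group id are placeholders.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setGroupId("sdp-overview-example")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "events")
                // Pair each message with a count of 1 ...
                .map(value -> Tuple2.of(value, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                // ... then count identical messages per one-minute window.
                .keyBy(pair -> pair.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1)
                .print();

        env.execute("windowed-count-example");
    }
}
```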

At first, each team solved these problems on its own, which led to a growing number of internal tools with overlapping functionality and rising maintenance costs. Teams without prior experience in streaming data processing needed a tool that would let them start development at the lowest possible cost.

To help our development teams out, we in the Data Integration team started building a company-wide self-service Streaming Data Platform (SDP) — the go-to solution for any streaming-related task.

Platform overview

Exness operates its data centers around the world and uses Kubernetes for deploying and managing containerized applications. Therefore, all our infrastructure must be deployed to K8S.

We decided to integrate Kafka Connect and Apache Flink into a single platform because developers frequently need data flows that load data from diverse sources into Kafka, process it, forward it to other Kafka topics for consumption by other services in real time, and simultaneously sink it into a database.

This decision significantly bolstered the effectiveness of our development process for streaming data flows, enhancing user experience, streamlining operations, and consequently reducing the time to market for new solutions.

The four main components of our platform are:

  • Kafka;
  • Kafka Connect (KC);
  • Flink;
  • Terraform.

In the following sections, we’ll look at each of these components in more detail.

Choosing a Data Integration tool for Kafka

Considering that Exness has been transitioning towards event-driven architecture, we’ve accumulated a significant amount of data in Kafka, which needs to be continuously loaded into various databases. Additionally, a substantial portion of the data is stored in row and columnar databases, necessitating an efficient and scalable method for sourcing it into Kafka.

We also needed the data integration framework to be declarative, so that developers from other teams could quickly copy and paste configs and run a build pipeline to create new integrations, without getting bogged down in the internals of the framework.

We chose Kafka Connect for this task, for the following reasons:

  • Simplicity;
  • Many databases are supported out of the box;
  • Extensibility;
  • Easy deployment to K8S;
  • Supports deploying new connectors via a REST API (see the sketch after this list);
  • Supports CDC data replication via Debezium.
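
As an illustration of the REST-based deployment model, here is a sketch that registers a declarative sink connector with a Connect cluster; the Connect URL, connector name, topic, and credentials are all hypothetical placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Declarative connector config: a JDBC sink that copies a topic into a table.
        // Connector name, topic, URL, and credentials are hypothetical.
        String config = """
                {
                  "name": "orders-jdbc-sink",
                  "config": {
                    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
                    "topics": "orders",
                    "connection.url": "jdbc:postgresql://db:5432/analytics",
                    "connection.user": "etl",
                    "connection.password": "secret",
                    "auto.create": "true",
                    "insert.mode": "upsert",
                    "pk.mode": "record_key"
                  }
                }
                """;

        // POST the config to the Connect REST API to create the connector.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://kafka-connect:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

This call is roughly what the build pipeline mentioned earlier performs on behalf of the teams: they only edit the declarative JSON config.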

Choosing a data processing engine

The main criteria we considered when choosing a data processing engine were:

  • Rich streaming data processing capabilities out of the box: must be able to join streams, enrich data from REST APIs or other data streams, and work with state on a per-event level;
  • Open Source: we need to be able to extend the solution with our custom formats, sources, and sinks;
  • Ability to declaratively define business logic in data processing pipelines;
  • Ability to deploy pipelines into the company’s existing infrastructure;
  • Ability to read from different Kafka instances within one job;
  • Designed to be horizontally scalable and fault-tolerant.

We were choosing between three solutions:

  • Apache Spark;
  • Apache Flink;
  • ksqlDB.

We rejected ksqlDB because, unfortunately, it doesn’t support all the functionality we need: it can only work with a single Kafka instance and doesn’t allow REST API lookups within streams.
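
For contrast, Flink (to which we’ll come back shortly) has no such restriction: each source in a job carries its own connection settings, so one job can read from several Kafka clusters. A minimal sketch with hypothetical cluster addresses and topics:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoClustersJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Two sources, each pointing at its own Kafka cluster (hypothetical addresses).
        KafkaSource<String> clusterA = KafkaSource.<String>builder()
                .setBootstrapServers("kafka-a:9092")
                .setTopics("orders")
                .setGroupId("sdp-two-clusters")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
        KafkaSource<String> clusterB = KafkaSource.<String>builder()
                .setBootstrapServers("kafka-b:9092")
                .setTopics("orders")
                .setGroupId("sdp-two-clusters")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> a = env.fromSource(clusterA, WatermarkStrategy.noWatermarks(), "cluster-a");
        DataStream<String> b = env.fromSource(clusterB, WatermarkStrategy.noWatermarks(), "cluster-b");

        // One job, two clusters: merge the streams and process them together.
        a.union(b).print();

        env.execute("read-from-two-kafka-clusters");
    }
}
```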

Another rejected option was Apache Spark. Spark was initially designed for processing large amounts of batch data, therefore its APIs are a great fit for batch processing but lack some features that were critical for us in the streaming environment. Some of the features we found missing are:

  • Lack of “co-process” functionality: there is no easy way to share state between two data streams, which makes it difficult to read data from one stream and enrich it with data from another (see the sketch after this list);
  • Limited initial state support;
  • No support for Side Outputs;
  • Limited Timer semantics;
  • Limited state TTL support.
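
For context, here is roughly what the “co-process” pattern looks like in Flink: a single function receives events from two keyed streams and shares state between them. The types are simplified to String, and the stream contents are hypothetical:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Enrich a stream of business events with the latest reference record
// seen for the same key on a second stream, via shared keyed state.
public class EnrichmentFunction
        extends KeyedCoProcessFunction<String, String, String, String> {

    private transient ValueState<String> referenceState;

    @Override
    public void open(Configuration parameters) {
        referenceState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("reference", String.class));
    }

    // Stream 1: business events to be enriched.
    @Override
    public void processElement1(String event, Context ctx, Collector<String> out)
            throws Exception {
        String reference = referenceState.value();
        if (reference != null) {
            out.collect(event + " enriched with " + reference);
        }
        // A production job would buffer events that arrive before their
        // reference data, or emit them to a side output.
    }

    // Stream 2: reference data; keep the latest record per key in state.
    @Override
    public void processElement2(String reference, Context ctx, Collector<String> out) {
        referenceState.update(reference);
    }
}
```

In a job, this is wired up by connecting the two streams and keying both by the same field: events.connect(references).keyBy(eventKeySelector, referenceKeySelector).process(new EnrichmentFunction()).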

The Apache Spark community has ambitious plans to solve these problems in the future, but at the time of our evaluation we had to choose Flink.

Lastly, we wanted to maintain our own fork of the tool to add new features and fix bugs independently. Our team primarily relies on Java, which aligns well with Apache Flink’s Java code base; Apache Spark’s code base, by contrast, is written in Scala, making it more challenging for us to modify.

Key features improving our platform’s usability include the SQL runner and Python runner.

You can find detailed information about them in the article Streaming Data Platform at Exness: Flink SQL and PyFlink by Aleksei Perminov.
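
To give a flavor of what the SQL runner enables, here is a minimal Table API sketch that expresses the same kind of windowed aggregation as the earlier DataStream example, but declaratively; the topic, schema, and field names are hypothetical:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlPipelineSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.inStreamingMode());

        // Declare a hypothetical Kafka topic as a SQL table.
        tEnv.executeSql(
                "CREATE TABLE payments ( " +
                "  user_id STRING, " +
                "  amount DECIMAL(18, 2), " +
                "  ts TIMESTAMP(3), " +
                "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND " +
                ") WITH ( " +
                "  'connector' = 'kafka', " +
                "  'topic' = 'payments', " +
                "  'properties.bootstrap.servers' = 'kafka:9092', " +
                "  'scan.startup.mode' = 'latest-offset', " +
                "  'format' = 'json' " +
                ")");

        // A one-minute windowed aggregation, expressed entirely in SQL.
        tEnv.executeSql(
                "SELECT user_id, " +
                "       TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start, " +
                "       SUM(amount) AS total " +
                "FROM payments " +
                "GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)")
            .print();
    }
}
```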

And that’s not all!

Explore our other articles in the series:

Deployment Process by Ilya Soin and Monitoring and Alerting by Yury Smirnov.

Future Horizons: Expanding the Capabilities of Our Self-Service Data Platform

Today, 20 teams are developing their data integration flows with Kafka Connect, and 13 teams are building data processing flows with Flink. All of this has been achieved by a team of only 8 engineers. We are striving to bring more and more teams onto the platform to help the business decrease time to market and raise the technological maturity of streaming data processing projects without increasing the headcount of our team.

Our journey in building this platform has been remarkable, but it is far from over: we envision numerous enhancements, including a unified user interface for all components, streaming data analysis via SQL clients, and other features aimed at improving the user experience.
