Streaming Events Architecture for Data and Analytics

ZairaH
Zattoo’s Tech Blog
5 min read · Jun 3, 2024

[Ongoing Data Architecture Design & Enhancements at Zattoo — updated through 2023]

Zattoo has always wanted to deliver a high-quality watch experience, and our data has historically focused there: what users watch, how long they watch it for, and the quality of the stream. However, we would like to deliver an even better end-user experience, and for that we need to learn more. But learn more about what? Why? And how? Read on to find out …

Why are we doing this?

As our platform grows, we have to rethink the data we collect and expand our focus to all the great features we're developing, not just the watch experience. For example: understanding all the features a user engages with to find a show to watch (the guide, favourite channels, recommendations, search, etc.); how a user interacts with ads and with functionality like skip, pause or play; and whether, where and which accessibility features users interact with. Understanding our app performance across all these features, and many more, required a generic streaming events architecture capable of collecting, processing and analysing this data.

What did we consider?

Unlike many apps on the market, ours has to deliver a high-quality experience across a wide range of devices with different codebases, often with unique limitations: not just smartphones, tablets and browsers, but also connected TVs and set-top boxes. Unfortunately, we can't just deprecate anything more than a few years old!

Lastly, while the original intention was to collect behavioural user data for generating insights about feature usage, there was also a clear need for other types of streaming data, for use cases like performance monitoring, telemetry or developing microservices.

After a long evaluation, we came to the conclusion that third party tracking solutions (like Google Analytics) wouldn’t work for our needs and we would need to build this data architecture ourselves.

How we did it!

We started discussing and brainstorming the architecture and components with the frontend and backend teams.

There were three main challenges that needed to be solved:

  1. Data Collection at scale
  2. Monitoring and visibility
  3. Data Transformation at scale

Data is captured from the client apps by a RESTful service, Zolagus, which validates it against a schema registry and then pushes it to Kafka. From Kafka, the data can be read directly through ClickHouse, while it is simultaneously consumed by an Apache Beam pipeline running on a Flink cluster. The Flink cluster and the event hub Zolagus are both hosted on an on-premises Kubernetes cluster. The Beam pipeline writes the processed data as partitioned files to GCS, which are then imported into BigQuery by Airflow.
Streaming Events Architecture

Data Collection at Scale

In order to collect events at scale from all the various client platforms, we needed a solution that works across multiple ecosystems without requiring customisation for each and every one of them. After much deliberation about the pros and cons of the approaches we considered, we decided to go ahead with a RESTful endpoint which accepts a JSON payload and can scale on demand. The data collection framework was thus built with the following components:

Data Contracts

One of the important parts of the architecture is to have visibility, documentation and transparency among all the stakeholders. The client platform teams should have documented knowledge of the fields, their data types and the constraints for the events they will be sending. The schema of these events should be documented and enforced. The schema will evolve over time as business requirements change, and therefore there could be situations where multiple schemas are live at a given time. This can happen for various reasons: for example, some platforms might be slower in adding new fields or deprecating old ones.

In order to achieve this requirement we implemented a schema registry. This schema registry acted as a data contract between the data team and other stakeholders. We chose Protobuf as the schema format.
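
To make the idea of a data contract more concrete, here is a minimal sketch of what a Protobuf event schema could look like. The message and field names below are purely illustrative assumptions, not our actual schema:

```protobuf
// events.proto -- illustrative only; message and field names are assumptions,
// not the actual Zattoo schema.
syntax = "proto3";

package zattoo.events.v1;

// A generic client event. New optional fields can be added over time without
// breaking older clients, which is how multiple schema versions can coexist.
message ClientEvent {
  string event_name   = 1;  // e.g. "search_performed", "channel_favourited"
  string session_id   = 2;
  string platform     = 3;  // e.g. "android", "webos", "browser"
  string app_version  = 4;
  int64  client_ts_ms = 5;  // client-side timestamp, epoch millis
  map<string, string> properties = 6;  // feature-specific attributes
}
```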

Event Hub

Once the schema of the events was frozen and the platform teams were ready to implement the event capture, the next step was to develop a scalable application that would receive these events from multiple platforms. The idea was to develop a generic service which would act as an event hub and could be extended to multiple projects with different schemas and scalability requirements.

Thus a Golang application was developed. The idea was to keep the application as lightweight as possible by keeping all the business logic out of it. The only task of the application was to expose a RESTful endpoint which accepts a JSON payload, checks the payload against the schema registry, and forwards the correct events to the configured Kafka topic. Incorrect events are sent to a dead letter queue. We named this configurable, generic application Zolagus. It was deployed on Kubernetes, optimised, and designed to scale horizontally according to the load.
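
To give a feel for the shape of such a service, here is a minimal Go sketch of an endpoint that accepts a JSON event, runs a validation step and forwards it to Kafka, with rejects going to a dead-letter topic. It uses the segmentio/kafka-go client; the endpoint path, topic names and the simplified validation function are assumptions for illustration, since Zolagus itself validates against the Protobuf schema registry.

```go
// Minimal sketch of a Zolagus-like event hub endpoint (illustrative only).
// The endpoint path, topic names and the validation step are assumptions;
// the real service validates payloads against the Protobuf schema registry.
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"

	"github.com/segmentio/kafka-go"
)

var (
	eventsWriter = &kafka.Writer{Addr: kafka.TCP("kafka:9092"), Topic: "client-events"}
	dlqWriter    = &kafka.Writer{Addr: kafka.TCP("kafka:9092"), Topic: "client-events-dlq"}
)

// validate is a stand-in for the real schema-registry check: here we only
// require a parseable JSON object containing an "event_name" string.
func validate(payload []byte) bool {
	var event map[string]any
	if err := json.Unmarshal(payload, &event); err != nil {
		return false
	}
	_, ok := event["event_name"].(string)
	return ok
}

func handleEvents(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "cannot read body", http.StatusBadRequest)
		return
	}

	// Valid events go to the main topic, everything else to the dead-letter queue.
	writer := eventsWriter
	if !validate(body) {
		writer = dlqWriter
	}
	if err := writer.WriteMessages(r.Context(), kafka.Message{Value: body}); err != nil {
		http.Error(w, "kafka write failed", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/v1/events", handleEvents)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```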

Kafka

We already had a Kafka cluster in the organization, which is used to receive the events sent by Zolagus. Each topic has its own retention policy, configured according to the business needs and data volume, and is divided into an appropriate number of partitions according to its scalability requirements.
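
As an illustration of how partitioning and retention are tied to the topic, the sketch below creates a topic with an explicit partition count and retention period using the kafka-go admin API; the broker address, topic name and values are made up for the example.

```go
// Illustrative only: create a topic with an explicit partition count and
// retention policy. Broker address, topic name and values are assumptions.
package main

import (
	"log"
	"net"
	"strconv"

	"github.com/segmentio/kafka-go"
)

func main() {
	conn, err := kafka.Dial("tcp", "kafka:9092")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Topic creation has to go through the controller broker.
	controller, err := conn.Controller()
	if err != nil {
		log.Fatal(err)
	}
	ctrlConn, err := kafka.Dial("tcp", net.JoinHostPort(controller.Host, strconv.Itoa(controller.Port)))
	if err != nil {
		log.Fatal(err)
	}
	defer ctrlConn.Close()

	err = ctrlConn.CreateTopics(kafka.TopicConfig{
		Topic:             "client-events",
		NumPartitions:     12, // sized for the expected throughput
		ReplicationFactor: 3,
		ConfigEntries: []kafka.ConfigEntry{
			{ConfigName: "retention.ms", ConfigValue: "604800000"}, // 7 days
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```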

ClickHouse

In a project where the data volume is huge and client rollouts happen all the time, it is important to have quick visibility into the data. This helps in quickly identifying issues and bugs. We had a ClickHouse cluster which we configured to receive events from the individual Kafka topics. ClickHouse has powerful features in the form of the Kafka engine and materialized views, which allowed us to quickly configure a Kafka-to-ClickHouse pipeline. This was a low-latency pipeline which dumped events into ClickHouse as soon as they were received by Kafka. We then built a Superset dashboard on top of the ClickHouse table to give real-time visibility into the data.
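
The Kafka engine plus materialized view pattern looks roughly like this; the table names, columns and settings below are illustrative assumptions rather than our actual schema:

```sql
-- Illustrative only: names, columns and settings are assumptions.
-- 1. A Kafka engine table that consumes raw JSON events from the topic.
CREATE TABLE client_events_queue
(
    event_name   String,
    session_id   String,
    platform     String,
    client_ts_ms UInt64
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'client-events',
         kafka_group_name = 'clickhouse-client-events',
         kafka_format = 'JSONEachRow';

-- 2. A MergeTree table that actually stores the events.
CREATE TABLE client_events
(
    event_name   String,
    session_id   String,
    platform     String,
    event_time   DateTime
)
ENGINE = MergeTree
ORDER BY (event_time, event_name);

-- 3. A materialized view that moves rows from the queue into storage
--    as soon as Kafka delivers them.
CREATE MATERIALIZED VIEW client_events_mv TO client_events AS
SELECT
    event_name,
    session_id,
    platform,
    toDateTime(intDiv(client_ts_ms, 1000)) AS event_time
FROM client_events_queue;
```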

Data Enrichment and Transformation

The data received in the Kafka topics consisted of the raw events sent from the clients. The next step was to clean, enrich and transform this data before delivering it to the analytics platform.

The idea was to build a reliable, fault-tolerant and scalable system which does this at scale, and which is flexible enough to be enhanced without becoming overly complex.

Since we already had an on-premises Kubernetes cluster, we decided to deploy Flink using the Flink Kubernetes operator. This gave us the capability to deploy Apache Beam pipelines as Flink jobs in Kubernetes, as well as the flexibility to migrate to Google Cloud Dataflow whenever we feel that the on-premises architecture cannot meet our reliability and scalability requirements. Hence, the following components were put to use in deploying the transformation and enrichment framework:

Apache Beam

We wrote the data collection and basic cleaning pipeline in Apache Beam, in Java, and deployed it on the Flink cluster running in Kubernetes. The pipeline collects data from Kafka in real time and, after basic checks and transformations, writes the events as partitioned Parquet files to GCS. This gave us scalability and fault tolerance.
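
A heavily simplified sketch of such a pipeline is shown below: it reads from Kafka, applies a parsing and cleaning step, and writes windowed Parquet files to GCS. The topic, bucket, Avro schema and the JsonToRecordFn used here are placeholders for illustration, not our production code.

```java
// Illustrative sketch only: topic, bucket, schema and JsonToRecordFn are
// placeholders, not the production pipeline.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class EventsToParquet {
  // Minimal placeholder Avro schema for the cleaned events.
  private static final String EVENT_AVRO_SCHEMA =
      "{\"type\":\"record\",\"name\":\"ClientEvent\",\"fields\":["
      + "{\"name\":\"event_name\",\"type\":\"string\"},"
      + "{\"name\":\"platform\",\"type\":\"string\"},"
      + "{\"name\":\"client_ts_ms\",\"type\":\"long\"}]}";

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    Schema schema = new Schema.Parser().parse(EVENT_AVRO_SCHEMA);

    p.apply("ReadFromKafka", KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")
            .withTopic("client-events")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
     .apply(Values.<String>create())
     // JsonToRecordFn is a placeholder DoFn that runs the basic checks and
     // converts each raw JSON event into an Avro GenericRecord.
     .apply("ParseAndClean", ParDo.of(new JsonToRecordFn(schema)))
     .setCoder(AvroCoder.of(GenericRecord.class, schema))
     .apply(Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(10))))
     .apply("WriteParquet", FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(schema))
            .to("gs://example-bucket/events/")  // partitioned output path on GCS
            .withSuffix(".parquet")
            .withNumShards(4));                 // required for unbounded input

    p.run();
  }
}
```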

Apache Airflow and BigQuery

BigQuery is our main data warehouse, where the data transformation and storage happen. We use Airflow to orchestrate the transformation tasks that run in BigQuery. There are several steps of modelling and transformation before the final datasets are ready. The final table is then fed as input to the graphical event charting tool Indicative, which is used for analysis by various stakeholders.
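
As a rough sketch of what this orchestration could look like, the hypothetical DAG below loads the Parquet files from GCS into a raw BigQuery table and then triggers one of the modelling steps as a BigQuery job; the bucket, dataset, table and procedure names are placeholders.

```python
# Illustrative sketch only: bucket, dataset, table and procedure names are
# placeholders, not the production DAG.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="client_events_to_bq",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:

    # Load the partitioned Parquet files written by the Beam pipeline.
    load_raw_events = GCSToBigQueryOperator(
        task_id="load_raw_events",
        bucket="example-events-bucket",
        source_objects=["events/{{ ds }}/*.parquet"],
        destination_project_dataset_table="analytics.raw_client_events",
        source_format="PARQUET",
        write_disposition="WRITE_APPEND",
    )

    # One of several modelling/transformation steps run inside BigQuery.
    build_events_model = BigQueryInsertJobOperator(
        task_id="build_events_model",
        configuration={
            "query": {
                "query": "CALL analytics.build_client_events_model()",
                "useLegacySql": False,
            }
        },
    )

    load_raw_events >> build_events_model
```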

Monitoring & Visibility

More on this section in our next blog!
