How Kafka helped us to restructure our Analytics solution

Leonardo Gomes da Silva · Mobimeo Technology · Apr 8, 2021 · 5 min read
Photo by Luke Chesser on Unsplash

What you will find in this post

From our first implementation to a scalable and flexible solution: this post shows how Kafka helped us redesign our analytics solution, the phases we went through, the issues we faced and the solutions we decided to adopt. We are not covering every detail of our solution, such as our process for schema management, the handling of invalid/non-matching events or the management of AWS Athena tables; this post focuses on how Kafka was used throughout the transition.

The beginning of Analytics

As in many other companies, at Mobimeo we want to get insights into our product and the use of our apps, and having a proper analytics solution is a crucial part of the game. Our first analytics solution was built some years ago and had a pretty simple design: clients could send events as JSON to our service either through a message broker (ActiveMQ) or through a REST API. When the events arrived in our service, they were uploaded to an S3 bucket and were then available to be queried through AWS Athena. The flow worked as shown in the image below:

After some time, we realized that working with JSON was a problem for us for a few reasons:

  • JSON does not enforce any structure, which caused many inconsistencies whenever an event's structure changed
  • There was no clear definition for each event: each client (iOS and Android) was defining its own events with its own names and structures. This made our life harder whenever we had to produce results based on data from all types of clients.

Our approach at that time was Code First, meaning that the code we wrote in the clients defined the content of our events. After understanding the issue, it was clear to us that we needed to centralize and organize the way we work with our events. So we changed our approach from Code First to Contract First, which means that we first write the contract (definition) of an event before writing the code and, once the contract is defined, our clients must follow it.

Welcome AVRO

To solve this problem we decided to use AVRO. AVRO is a first-class citizen when working with the Kafka platform, since we can rely on the Confluent Schema Registry and it has native support in the other Kafka libs/frameworks (Connect, Streams, KSQL).
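To give an idea of how the Schema Registry fits in, here is a minimal sketch of a producer setup using the Confluent Avro serializer. The broker and registry addresses and the class name are placeholders, not our actual configuration:

```java
import java.util.Properties;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;

public class AvroProducerSetup {

    // Builds a producer that serializes values as AVRO via the Confluent Schema Registry.
    public static KafkaProducer<String, GenericRecord> buildProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");                      // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // The Confluent Avro serializer registers/looks up the event schema in the Schema Registry.
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");   // placeholder registry address
        return new KafkaProducer<>(props);
    }
}
```

With a setup like this, the Schema Registry becomes the single source of truth for event definitions, instead of each client shipping its own idea of an event.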

By having the event schemas centralized in our repository, we managed to generate the necessary code for our clients, so they could use it to enforce the required structure.
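As an illustration, an event contract could look like the schema below. The event name, namespace and fields are made up for this example and are not one of our real events:

```json
{
  "type": "record",
  "name": "ScreenViewed",
  "namespace": "com.example.analytics",
  "doc": "Fired when a user opens a screen in the app.",
  "fields": [
    { "name": "eventId", "type": "string" },
    { "name": "screenName", "type": "string" },
    { "name": "platform", "type": { "type": "enum", "name": "Platform", "symbols": ["IOS", "ANDROID"] } },
    { "name": "occurredAt", "type": { "type": "long", "logicalType": "timestamp-millis" } }
  ]
}
```

From schemas like this, client code can be generated, for example with the Avro Gradle/Maven plugins or avro-tools.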

What came together with AVRO

With AVRO in place our architecture changed a bit:

  • The message broker was removed, since the REST API integration proved to work better for our mobile clients
  • Messages received as JSON were published to a Kafka topic (when the events matched an AVRO schema)
  • An S3 Sink Connector was configured to put the events, now in AVRO, into a new S3 bucket (a configuration sketch follows this list)
  • The previous process of sending JSON events to the S3 bucket was kept in place, so that the existing Athena queries would keep working
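For reference, registering such a sink through the Kafka Connect REST API could look roughly like the configuration below. The connector, topic and bucket names are placeholders and the flush size is arbitrary, so this is a sketch rather than our actual setup:

```json
{
  "name": "analytics-avro-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "topics": "analytics-events-avro",
    "s3.bucket.name": "analytics-events-avro",
    "s3.region": "eu-central-1",
    "flush.size": "1000",
    "tasks.max": "1"
  }
}
```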

The flow can be seen in the image below:

At this point we had a solution that worked quite well when it came to querying structured data (AVRO in our case), but we had some other issues:

  • The REST API depended on both S3 and Kafka
  • The REST API was doing too much: apart from sending data to S3, it was still trying to match each and every JSON event to an AVRO schema and then publishing it to a Kafka topic
  • AWS Athena performance was bad when we had to query events coming from the JSON bucket. The reason is that every file contained only one JSON event, which meant Athena had to read many more files, making the process much slower

The final solution

After analysing all of these issues, we came up with our current solution:

  • The REST API is only responsible for receiving events as JSON, performing some sanity checks and publishing them to a Kafka topic (no more S3 dependency)
  • JSON events are put into the S3 bucket by an S3 Sink Connector, but now each file contains multiple events, which makes our Athena queries faster
  • A new component, a Kafka Streams app, was created to consume JSON events from Kafka, try to match them against an AVRO schema and publish the matching events as AVRO to a new topic (a sketch of this app follows this list)
  • AVRO events are put into the S3 bucket by another S3 Sink Connector
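The heart of the new setup is the Kafka Streams app that turns raw JSON events into AVRO. The sketch below shows the general idea under some simplifying assumptions: it matches against a single hypothetical schema (ScreenViewed.avsc), relies on Avro's own JSON decoder for brevity, silently drops non-matching events and uses placeholder topic and service names. Our real app handles many schemas and treats invalid events separately:

```java
import java.io.IOException;
import java.util.Map;
import java.util.Properties;

import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde;
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class JsonToAvroStreamApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "json-to-avro-matcher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder broker address

        // The AVRO serde writes GenericRecords using the Confluent Schema Registry.
        GenericAvroSerde avroSerde = new GenericAvroSerde();
        avroSerde.configure(Map.of("schema.registry.url", "http://schema-registry:8081"), false);

        Schema schema = loadSchema(); // single hypothetical schema for this sketch

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> jsonEvents = builder.stream(
                "analytics-events-json", Consumed.with(Serdes.String(), Serdes.String()));

        jsonEvents
                .mapValues(json -> tryMatch(json, schema))   // JSON -> GenericRecord, or null
                .filter((key, record) -> record != null)     // drop non-matching events (simplified)
                .to("analytics-events-avro", Produced.with(Serdes.String(), avroSerde));

        new KafkaStreams(builder.build(), props).start();
    }

    // Tries to decode a JSON event against the given AVRO schema; returns null when it doesn't match.
    private static GenericRecord tryMatch(String json, Schema schema) {
        try {
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            return reader.read(null, DecoderFactory.get().jsonDecoder(schema, json));
        } catch (IOException | AvroRuntimeException e) {
            return null;
        }
    }

    private static Schema loadSchema() {
        try {
            // Hypothetical schema file on the classpath.
            return new Schema.Parser().parse(
                    JsonToAvroStreamApp.class.getResourceAsStream("/ScreenViewed.avsc"));
        } catch (IOException e) {
            throw new IllegalStateException("Could not load AVRO schema", e);
        }
    }
}
```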

The flow can be seen in the image below:

With the new architecture we managed to fix our previous issues of service dependencies and query performance. In addition, we gained some more benefits:

  • Easier to debug and reproduce errors: since we now have all the raw events as JSON inside Kafka, we can check why some events didn't match any AVRO schema and retry matching them
  • Flexibility: if we want to send our events to an external tool for further analysis, we can attach another Sink Connector to the JSON or AVRO topic, or even implement a new Kafka consumer to send data out of Kafka (see the sketch below)
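As an example of that flexibility, a small consumer that forwards events to an external tool could look like the sketch below; the topic, group id and forwarding logic are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AnalyticsForwarder {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // placeholder broker address
        props.put("group.id", "analytics-forwarder"); // placeholder consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("analytics-events-json")); // placeholder topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Placeholder: push the raw event to an external analytics tool here.
                    System.out.println("Forwarding event with key " + record.key());
                }
            }
        }
    }
}
```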
