Designing A Real-Time Streaming Application By Using Kafka, KSQL & Socket.IO

Veli Can Ünal · Trendyol Tech · Feb 18, 2021 · 6 min read

Trendyol is the most popular e-commerce site in Turkey. In November 2020, 60 million products were purchased through Trendyol.

We created a platform where we can observe purchases in real time. We shape the data we pull from Kafka topics with KSQL, then send this parsed data to the client side with Socket.IO. Let’s talk about the development details of our streaming application.

City Analytics Map

Stream Processing with KSQL

The data we take from the client side is collected in Kafka topics. We will consume this data from those topics and put it to use.

Our goal is to group the sales data by city, every second.

First, we parse the sales source data to prepare it for aggregation.

https://gist.github.com/vlcanunal/5a17aba2d010e87ccf652805a8aa1f62

As you know, an unbounded stream cannot be aggregated directly. We have to create borders around the stream by using tables for the aggregation.

https://gist.github.com/vlcanunal/a6d9b393e290b0f4705773c09f890484

This query returns the data that we actually want. Since it is fed by the stream we created above, we can say that it works in real time.

As you might notice, we didn’t collect our cart count and cart amount by grouping them per second in KSQL; we did that operation on the backend side. When we developed this system, we could not take the final value of a window aggregation because of the KSQL version we were on. There are three types of windowed aggregation: the tumbling window, the hopping window, and the session window.

A tumbling window groups the data according to a fixed time span that we give as a parameter. There are no gaps and no overlaps between tumbling windows.

At the end of the query we used “EMIT FINAL” instead of “EMIT CHANGES”. EMIT CHANGES detects every change and emits a row for every update: for instance, if you want to sum the cart amount over an hour, it will return every intermediate sum to you. A query that ends with EMIT FINAL, however, returns only the last value of each window. This is one of the new features of KSQL version 0.12.0.
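To make this concrete, here is a minimal sketch of a one-second tumbling-window table submitted through ksqlDB’s REST API. The stream, column, and host names (SALES_STREAM, CITY, AMOUNT, ksqldb-server) are illustrative assumptions, not our production schema, and EMIT FINAL needs a KSQL version that supports it.

```python
import requests

# Hypothetical tumbling-window aggregation: one-second buckets per city.
# The stream and column names are assumptions, not the production schema.
TUMBLING_QUERY = """
CREATE TABLE CITY_SALES_PER_SECOND AS
  SELECT CITY,
         COUNT(*)    AS CART_COUNT,
         SUM(AMOUNT) AS CART_AMOUNT
  FROM SALES_STREAM
  WINDOW TUMBLING (SIZE 1 SECOND)
  GROUP BY CITY
  EMIT FINAL;
"""

# ksqlDB accepts statements on its /ksql REST endpoint (host is assumed).
response = requests.post(
    "http://ksqldb-server:8088/ksql",
    json={"ksql": TUMBLING_QUERY, "streamsProperties": {}},
)
response.raise_for_status()
```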

The hopping window technique computes each window’s value by jumping forward by the advance interval. In our example, the size value is 2 days and the advance value is 5 days: it shows the total cart amount and total cart count of weekend shopping between 7 November and 30 November. The hopping window is very useful for this kind of special case. Hopping windows can also overlap with other windows: if we set the size to 5 days and the advance to 1 day, it will calculate the shopping values of the last 5 days, every day.
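As a sketch of that overlapping variant (a 5-day window advancing one day at a time), the statement below could be submitted the same way as the tumbling example; the names are again assumptions.

```python
# Hypothetical hopping-window aggregation: every day, recompute the totals
# of the last 5 days. Windows overlap because the advance is smaller than
# the size.
HOPPING_QUERY = """
CREATE TABLE ROLLING_5_DAY_SALES AS
  SELECT CITY,
         COUNT(*)    AS CART_COUNT,
         SUM(AMOUNT) AS CART_AMOUNT
  FROM SALES_STREAM
  WINDOW HOPPING (SIZE 5 DAYS, ADVANCE BY 1 DAY)
  GROUP BY CITY
  EMIT CHANGES;
"""
```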

Session windows are bounded by the maximum idle period that we give as a parameter. In our example, the session value is 2 hours. Accordingly, if no data arrives for 2 hours, the window closes itself, and a new window opens when data starts to flow again. If there is an uninterrupted data flow, the window never closes and keeps collecting data. This type of window is useful for systems that only evaluate active hours or days.
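A session-window sketch under the same assumed schema:

```python
# Hypothetical session-window aggregation: a window stays open while events
# keep arriving and closes after 2 hours of inactivity.
SESSION_QUERY = """
CREATE TABLE ACTIVE_PERIOD_SALES AS
  SELECT CITY,
         COUNT(*)    AS CART_COUNT,
         SUM(AMOUNT) AS CART_AMOUNT
  FROM SALES_STREAM
  WINDOW SESSION (2 HOURS)
  GROUP BY CITY
  EMIT CHANGES;
"""
```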

Real Time Communication: Socket.IO

Count Version of City Analytics Map

As you can see, we created a data flow. The only thing left to do is consume this data and send it to the client side. For this purpose we use the Socket.IO library.
Socket.IO is a Node.js module used to develop real-time applications that exchange data in both directions. In our project, the data that we take from the Kafka topic is sent to the client side.

Since the KSQL version that we used didn’t have the EMIT FINAL operation for windowed aggregation, we had to process the window data on the Node.js side.

The Socket.IO module retrieves data in real time. However, we have to group the data into 1.5-second batches if we want to create the pop-up effect. We solved this issue by using a local variable: totalCityList. The totalCityList variable stores the city information of the products sold at that moment. We parse it according to our standards and send the data in the finalTotalCityList variable. If we had been able to use EMIT FINAL at the time, we would not have had to take on this data collection burden.

We collect data for every city individually, every 1.5 seconds, and send it to the client side. The client side makes use of the total order count and total order amount data that we calculate here.
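Our production consumer is written in Node.js; the sketch below expresses the same accumulate-and-flush idea in Python purely for illustration. Only the totalCityList and finalTotalCityList names come from our code; the message shape, event name, and emit callback are assumptions.

```python
import asyncio
from collections import defaultdict

# Accumulates per-city sales between flushes (the role of totalCityList).
total_city_list = defaultdict(lambda: {"count": 0, "amount": 0.0})

def on_sale(message):
    """Called for every sale consumed from the Kafka topic (shape assumed)."""
    entry = total_city_list[message["city"]]
    entry["count"] += 1
    entry["amount"] += message["amount"]

async def flush_loop(emit):
    """Every 1.5 seconds, send the grouped snapshot and reset the buffer."""
    while True:
        await asyncio.sleep(1.5)
        final_total_city_list = dict(total_city_list)  # batch to send
        total_city_list.clear()
        emit("city-sales", final_total_city_list)  # e.g. a Socket.IO emit
```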

Retrieving Data From BigQuery

We have to send a request to Google BigQuery to get the current active users, total shoppers, total cart count, and total cart amount for that day; if there is an event, we can also retrieve these values for the event.

Let’s take a look at our most basic structure. When a user visits the website, we send a request to our server to get the data, and our server provides this data from BigQuery. Wait… why do we need to write our data to local storage? Why should we take data from BigQuery only once every two minutes?

Let’s dig into the structure!

At first glance, it seems there should be no problem with taking data from a cloud database. However, BigQuery has a request limit per second, so the system should not send a request for every user. For this purpose, we created a caching structure. It doesn’t matter how many users visit our city analytics page: when one user visits the page, we store the current values in local storage, and for all other visitors the system uses this stored value without sending any request on their behalf.
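A minimal sketch of that caching idea, assuming the official google-cloud-bigquery client and a hypothetical table name; the real query and refresh handling live in our Python service.

```python
import time
from google.cloud import bigquery

CACHE_TTL_SECONDS = 120  # refresh from BigQuery at most once per 2 minutes
_cache = {"fetched_at": 0.0, "values": None}
client = bigquery.Client()

def get_daily_totals():
    """Serve cached totals; only the first request in each 2-minute window
    actually reaches BigQuery, keeping us under the request limit."""
    now = time.time()
    if _cache["values"] is None or now - _cache["fetched_at"] > CACHE_TTL_SECONDS:
        # Hypothetical query -- the real table and columns are internal.
        rows = client.query(
            "SELECT total_shoppers, total_cart_count, total_cart_amount "
            "FROM `project.dataset.daily_totals`"
        ).result()
        _cache["values"] = [dict(row) for row in rows]
        _cache["fetched_at"] = now
    return _cache["values"]
```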

Okay, we understand how the system works. But what about events like Black Friday or the Great Summer Deals?

We store the event dates in a Python object. When there is a visitor, we check the current date against the event start and end dates. If the current date falls between them, we make the event section below our map visible.
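A sketch of that check, with a hypothetical event calendar; the real dates are configured in the server-side Python object.

```python
from datetime import date

# Hypothetical event calendar; the real start/end dates are configured
# on the server.
EVENTS = {
    "black-friday": {"start": date(2020, 11, 27), "end": date(2020, 11, 30)},
}

def active_event(today=None):
    """Return the name of the event covering today's date, or None."""
    today = today or date.today()
    for name, window in EVENTS.items():
        if window["start"] <= today <= window["end"]:
            return name
    return None
```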

Container Management of City Analytics

There are 5 different services in our program.

  • live-kafka manages the pop-ups on the Turkey map. These pop-ups show the current orders by city. Since the pop-ups are the main feature of our application, we have 5 different containers for load balancing.
  • event-kafka manages the total order count and total order amount of the current event. The working logic of the event-kafka containers is just the same as the live-kafka containers; the only differences are the Kafka topics we take the data from and how the data is accumulated. In the live-kafka containers, we show the current values of the orders, whereas the event-kafka containers sum up all the values within the current event period.
  • sidebar-kafka containers create the horizontal slider that contains product information. The data source of this sidebar is the same as the live-kafka containers. In contrast to live-kafka, there is no summation in the sidebar-kafka containers.
  • python containers retrieve data from BigQuery for the current values of total shoppers, total active users, total cart count, and total cart amount, as mentioned above.

Conclusion

Our goal in this project was to see how we could create a dashboard where we can check instant sales by city, rather than steering sales strategies over a handful of fixed metrics. Although the pandemic took many opportunities from our lives, it provided us a great environment for working on real-time data.

Daily routines changed by Covid-19 bring new challenges ahead and push us to re-examine our processes.

Stay safe!
