Fledge over the flow …

How to recognize a streaming problem.

Kian Jalali
Feb 20 · 6 min read

This is the story of our experience integrating Akka-Stream, Alpakka-Kafka, Kafka, Alpakka-Slick, and TimescaleDB (Part 1/2).


Hi, I’m hajjijo, Data Engineer @ Sanjagh.pro

So today I would like to share my experience integrating Akka-Stream, Alpakka-Kafka, and Alpakka-Slick to design a small, fully streamed data pipeline.

This blog post has three parts:

1- Why we chose these tools

2- How we integrated them

3- Which issues and challenges we faced (service clients)

Why?

Imagine that you have three services, say A, B, and C, and they communicate with each other through messages. There are many ways to implement the communication channels: HTTP requests, Akka remoting (over TCP, using Netty), and many others. We'll look at some of them below:

Problem

| How should we implement these communication channels?

As you may know, when we face a problem we shouldn't jump straight to specific tools. Instead, depending on the case, we sketch the high-level approaches that might solve the problem, and then research which tools cover our requirements.

In our case: we have a big monolithic back-end service, plus some independent services created step by step over time. Now we want to break the monolith into microservices (again, step by step). So what should we do? Where should we start? What exactly are our problems?

First, we picked a part of the project that was easy to decompose and largely independent from the rest. We settled on the metric aggregator: this service is responsible for aggregating our metrics and gathering them into a database so we can run complex analytic queries on our data. The first step was decomposing the metric aggregation engine into a standalone service. This part of our story is the “creation of the service”…

Step 1:

Our team, after some short research, found Prometheus + InfluxDB, which are commonly used for this kind of use case (metric aggregation). So we built on that stack and used it for a few months.

So why did we stop using this stack?

• Prometheus is powerful for the specific use case it targets, which is what we wanted (metric aggregation). But we needed a way to break our monolith into microservices, and decomposing the metric aggregator was only the first step. (In fact, we wanted to validate a process general-purpose enough to meet our wider requirements.)

Step 2:

After that, we decided to migrate to TimescaleDB. Since we had to create some REST API endpoints for other internal projects and expose them, we used Akka-HTTP. From then on, other services would send their HTTP requests to this new service. In this scenario, any service that wants to send a metric to our aggregator just makes one simple function call; that function is designed into an external module (we call these modules service clients). The clients are added to the metric-generating services as library dependencies.

This is the Request/Response message exchange pattern.
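To make that concrete, here is a rough sketch of what such a service client could look like with Akka-HTTP. The `MetricClient` name, the `/metrics` endpoint, and the JSON shape are all hypothetical illustrations, not our actual code:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import scala.concurrent.Future

// Hypothetical service client: metric generators just call `send`
// and stay unaware of the HTTP round trip happening underneath.
class MetricClient(aggregatorUri: String)(implicit system: ActorSystem) {
  import system.dispatcher

  def send(name: String, value: Double): Future[Boolean] = {
    val request = HttpRequest(
      method = HttpMethods.POST,
      uri    = s"$aggregatorUri/metrics",
      entity = HttpEntity(
        ContentTypes.`application/json`,
        s"""{"name":"$name","value":$value}"""
      )
    )
    Http().singleRequest(request).map { response =>
      response.discardEntityBytes() // drain the body to free the connection
      response.status.isSuccess()   // callers wait on this future
    }
  }
}
```

Notice that every call pays a full request/response round trip; that is exactly the cost that bites us in Step 3.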

Step 3:

The way we handled requests in Step 2 had performance issues, and they emerged fast. When thousands of metrics are generated per minute and every metric-sending function call has to wait for an HTTP response, this model becomes a headache. So we migrated away from Akka-HTTP and landed on a socket-based API plus a buffer. We then needed something to back-pressure the bulk of requests coming into the aggregator service, so we powered the socket API with Akka-Stream.
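A minimal sketch of that idea, assuming newline-delimited metric lines over TCP; the port, the framing, and the `storeMetric` placeholder are assumptions for illustration:

```scala
import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Flow, Framing, Tcp}
import akka.util.ByteString

object AggregatorSocketApi extends App {
  implicit val system: ActorSystem = ActorSystem("aggregator")

  // Placeholder for the real write into our database.
  def storeMetric(line: String): Unit = println(s"storing: $line")

  // One flow per connection: frame lines, buffer a bounded amount,
  // and back-pressure the socket when the buffer is full.
  val handler = Flow[ByteString]
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024))
    .map(_.utf8String)
    .buffer(1000, OverflowStrategy.backpressure)
    .map { line =>
      storeMetric(line)
      ByteString("ok\n") // acknowledge back to the client
    }

  Tcp().bind("0.0.0.0", 9000).runForeach(_.handleWith(handler))
}
```

The key property is that slow storage slows down the TCP reads themselves, instead of letting requests pile up without bound.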

Step 3 worked for a few months, but it didn't hold up at scale and was hard to develop against; we needed something better to be ready to scale out…

For example, imagine you need to add many new metrics to this system. You have to design endpoints, streams, and models, and worry about every change to the shared endpoints and models you rely on, because they are not backward compatible. And note that our use case is not just metrics: we need a general-purpose communication channel, and metric aggregation is only one of its purposes.

Step 4:

Step 3 is suited to handling “Queries”, “Commands”, and “Events”. But why use that pattern for a huge bulk of lazy “Commands” and “Events” when we could use the Pub/Sub pattern instead? That would make our system more effective, powerful, and reactive, and we have enough resources for it now… (Pub/Sub is, after all, a good pattern for working with “Events”.)

So this is where Apache Kafka comes in…
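As a teaser for Part 2, publishing metrics as events might look roughly like this with Alpakka-Kafka. The topic name, the bootstrap server, and plain string serialization are assumptions, not our production setup:

```scala
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

object MetricPublisher extends App {
  implicit val system: ActorSystem = ActorSystem("publisher")

  val settings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092")

  // Publishers fire events and move on; the aggregator subscribes
  // and consumes at its own pace, with Kafka absorbing the bursts.
  Source(List("latency:42", "requests:1000"))
    .map(metric => new ProducerRecord[String, String]("metrics", metric))
    .runWith(Producer.plainSink(settings))
}
```

The design win over Step 3 is decoupling: producers no longer care whether the aggregator is up, slow, or being redeployed.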

What we'll cover in the next post:

1- How to design a streaming pipeline with the Akka stack.

2- How to use Alpakka-Slick and Kafka.

3- How to write Kafka producer and consumer clients behind a simple function call.

4- How to work with time series and TimescaleDB in this use case.

See you soon!


