Apache Kafka as Primary Data Store

Vlad Krava
Jan 8 · 11 min read

Event streaming platform as a single source of truth for distributed platforms



Different Use-Cases Require Different Solutions

This statement may sound obvious; however, not every product follows this simple rule.

Over the past years, the Internet has become available to billions of people all over the globe thanks to low connection costs and wide coverage, while the cost of running processing centres is cheaper than ever. As a result, enterprise systems increasingly consider handling, storing and processing every piece of information possible.

Some use-cases (e.g. deposit transfers and goods purchasing) require transactional data processing with the lowest latency possible: OLTP databases. Other systems are best suited for high-performance search operations: search engines, where documents are not immediately available after storing due to a number of structural optimisations. There are also multiple types of data storage systems which fit neither transactional platforms nor search engines, such as column-oriented databases.

In the end, software engineers choose multiple data storage platforms with the most suitable algorithms and characteristics in order to satisfy the expected use-cases. However, this requires a careful look at data storage characteristics, e.g. the “CAP” properties.


A Trade-Off: Consistency vs. Availability

There is a quite entertaining “CAP” theorem formulated by Eric Brewer:

“A computing system is not able to provide partition tolerance, consistency and availability at the same time. More specifically, a distributed computing system must choose two of the following three: partition tolerance, data consistency or availability.”

Partition Tolerance

A partition-tolerant system can sustain any amount of network failure that doesn’t result in a failure of the entire network, such as a connection loss between several system nodes or even racks. In this case, data blocks are still retrieved (if present on the available node(s), which highly depends on the consistency property), stored and replicated across the available nodes.

Partition tolerance is a given when it comes to distributed systems. Hence, the trade-off is only between data consistency and availability.

Data Consistency

A system with strong data consistency guarantees ensures that all read requests from all available nodes return identical sets of data, including the most recent writes. Such an expensive behaviour can be managed by introducing a transactional wrapper or a similar engine which controls the status of all modification commands. This sort of transactional model makes it possible to handle data inconsistencies by rolling back uncommitted state changes, which leads to request failures: a low-availability trade-off.

Availability

Systems with high availability guarantees focus on providing basic read/write operations by ensuring that all available nodes are able to handle such requests, regardless of how many components are down, as long as at least one node is “alive”.

Meanwhile, consistency guarantees are not respected: an eventual-consistency trade-off.

This may happen, or be done on purpose, for several reasons:

  • To improve performance for write-intensive systems where consistency is not that important (e.g. Cassandra)
  • Appends may not be persistent after conflicts are reconciled
  • Reads may not return the latest write due to a node crash

On the other hand, there is no need to worry about data loss in such systems: all data entries will eventually become available.


Data Flow in Polyglot Persistent Environment

The “CAP” theorem taught us that handling data in a distributed environment has never been easy. Things get even harder when an engineering team needs to maintain a variety of content on multiple storage platforms at the same time: a Polyglot Persistent environment.

Twitter’s Content Variety. Posts, Images, Analytics

At first glance at my Twitter page we can see different types of dynamic content being stored, served, cached and analysed: posts, images, some information I “might” like, etc. According to the official blog post “The Infrastructure Behind Twitter: Scale”, it is no surprise that different data is handled in different ways in order to satisfy business use-cases and guarantee the expected behaviour.

Twitter’s Infrastructure Footprint. Data Ratio Managed by Storage Layer

The concept of Polyglot Persistent architecture comes from the idea of enabling different data storage layers to satisfy different needs within a given software platform, sometimes even behind the same API, in order to benefit from the strengths and minimise the side effects of each data storage layer.

As an example, Apache HBase is used by many products as a reliable data storage layer with high write performance, but incredibly slow scanning speed. ElasticSearch can compensate for the low scanning performance; in this case, HBase is still used as the primary database, even though the data for searching is sunk into ElasticSearch as well. This can lead to another side effect: an index latency of one second or so, plus inefficiencies on single-document requests. If needed, this behaviour can be addressed by introducing a Redis cache, or a similar solution, which holds recently stored data in order to optimise write performance on ElasticSearch with the help of bulk loads, which can potentially also work around the index latency. And so on.

However, this will require the construction of a non-trivial pipeline, given that a single store request has to be persisted to all listed storage layers regardless of the technology and its behaviour:

Response time dependency on the number of storage/processing layers with synchronous execution and strong consistency guarantees.

An attempt to satisfy strong consistency by storing a single document to each storage layer sequentially impacts the response time. Introducing a concurrent approach to this flow technically solves the response-time issue, although such an approach is far from good, too: the probability of a request failure increases with the number of data storage layers present, which seriously impacts the system’s availability guarantee. Basically, we are facing the rationale of the “CAP” theorem.
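As a rough illustration of why the fan-out itself becomes the problem, here is a minimal sketch with a hypothetical StorageLayer interface (not a real API for any of the products mentioned above): sequential writes add the layer latencies together, while parallel writes couple the request’s success to every layer being healthy.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical abstraction over HBase, ElasticSearch, Redis, etc.; for illustration only.
interface StorageLayer {
    CompletableFuture<Void> store(String id, byte[] document);
}

class PolyglotWriter {

    private final List<StorageLayer> layers;

    PolyglotWriter(List<StorageLayer> layers) {
        this.layers = layers;
    }

    // Sequential, "strongly consistent" variant: total latency is roughly the sum of all layer latencies.
    void storeSequentially(String id, byte[] doc) {
        for (StorageLayer layer : layers) {
            layer.store(id, doc).join();   // block on each layer in turn
        }
    }

    // Concurrent variant: latency is roughly that of the slowest layer, but a single failing layer
    // fails the whole request, and a partial failure leaves the layers permanently out of sync.
    void storeConcurrently(String id, byte[] doc) {
        CompletableFuture.allOf(
                layers.stream()
                      .map(layer -> layer.store(id, doc))
                      .toArray(CompletableFuture[]::new)
        ).join();
    }
}
```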

The “Database Per Service” pattern, on the other hand, can improve the availability guarantee by decoupling storage responsibilities from the primary “Service/Data Access” layer into separate (parallel) handlers; however, it may lead to another issue: permanent data desynchronisation in case one of the components fails.

An example of permanent data desynchronisation between several storage layers with strong availability guarantees in case of component(s) failure.

Permanent data desynchronisation is not the same situation as eventual consistency, as none of the application components knows which data set is relevant and which one is corrupted.

Hence, Polyglot Persistent architecture in combination with standard request-handling mechanisms impairs not only the consistency and availability guarantees from the “CAP” theorem but also overall system characteristics such as response time, and may lead to possible data loss.

The least demanding aspect of these approaches, yet far from the last by priority, is architecture maintenance. Due to the number of dependencies between the data layers in a Polyglot Persistent architecture, any code maintenance can turn into a nightmare.

It is time to make a difference.

Inside the Log-Based Architecture

Try to imagine an infinite data stream starting at index 0… And “voilà”, you’ll get a high-level representation of a Monolog in Log-Based architecture. Such a Log is an append-only, time-ordered sequence of incoming entries.

Log Visualisation in Log-Based Architecture

In order to be efficient in a real-time, distributed and highly available environment, the Log should satisfy the following criteria:

  • Distributed. Besides the distribution of the basic component itself, this also includes characteristics such as replication, which prevents entries from being inaccessible during node downtime and also enables the Log to scale out. Scaling out is a key aspect for all distributed platforms, as it helps to tune the performance of application services based on the load at a particular moment by increasing or decreasing the number of running tasks. Otherwise, the Log may end up being a performance and reliability bottleneck for the entire application system.
  • Transparent. I love algorithms and different types of optimisations, but not in the case of the Log component in Log-Based architecture. Such a component should be as simple as it is fast, performing only two actions: “append” and “get last uncommitted”; the remaining actions are not important performance-wise. Furthermore, a focus on any kind of optimisation unrelated to those two actions obscures the Log’s underlying responsibilities and purpose of existence, which is to be the simplest possible storage abstraction (see the sketch after this list). So, leave it to other layers.
  • Adjustable. This criterion relates to having control over the “CAP” characteristics of the Log, i.e. the Log cannot be 100% strict about either the data consistency or the availability attribute; furthermore, it has to be adjustable with regards to the replication factor and other related settings. By strictly depending on only one guarantee, consistency or availability, the Log turns into a bottleneck for the entire application system.
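A minimal sketch of what such a Log abstraction could look like (a hypothetical interface, not a concrete product API): an append-only sequence with offset-based reads and nothing else.

```java
import java.util.List;

// Hypothetical Log abstraction: append-only, time-ordered and replayable from any offset.
// Anything beyond these two operations is left to the other layers.
interface Log<T> {

    // Appends an entry and returns the offset it was written at.
    long append(T entry);

    // Reads up to maxEntries starting from a given offset, e.g. the last offset a consumer
    // has not yet committed; the same call also enables full replay from offset 0.
    List<T> readFrom(long offset, int maxEntries);
}
```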

Write-Ahead Logging (WAL), one of the key mechanisms against data corruption in database systems like HBase and PostgreSQL, shares a common principle with the Log-Based approach: appending all state-changing events to a log file. However, there is a major difference between these two approaches as well: WAL files get compacted upon reaching a certain size or event, while the Log in Log-Based architecture is designed to continuously accumulate entries and allow replaying the Log from any point in time.

The fact that the Log is capable of storing all historical data is another mind-blowing topic.

Besides the failure aspect, the Log’s entries can be replayed whenever the time comes to integrate a new data storage layer, which in turn reduces development, maintenance and research (i.e. the pursuit of the most suitable technology) effort while adding overall flexibility, as this action can be done without putting a tremendous load on the platform itself.

Apache Kafka as a Log Component

We have just seen that many solutions revolve around principles such as WALs and Logs; furthermore, entire systems were built with the idea of serving as distributed, transparent and adjustable Log components. One of the best candidates for this role, if not the best IMHO, is Apache Kafka.

Technically speaking, Kafka still serves its most primitive purpose of being a message broker, like Apache ActiveMQ and other implementations of the JMS API; that is where the query “Kafka vs. JMS” comes from. However, Apache Kafka is something bigger, something more suitable for the role of a Log component due to a number of additional features:

  • Real-Time Processing with Kafka Streams (see the sketch after this list).
  • It is a distributed and highly available component, yet it still allows adjusting the balance of availability/consistency with the replication factor, min.insync.replicas and acks properties.
  • It can store petabytes(!!!) of messages (entries) indefinitely, directly on the filesystem.
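As a minimal sketch of the first point, a Kafka Streams topology reading the Log and feeding a derived topic could look roughly like this; the topic names, application id and filtering logic are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class LogEnrichmentApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> entries =
                builder.stream("monolog", Consumed.with(Serdes.String(), Serdes.String()));

        // Derive a filtered view of the Log in real time and publish it to a downstream topic.
        entries.filter((key, value) -> value != null && !value.isEmpty())
               .to("monolog-filtered", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```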

Overall, it is a perfect match for the Log component based on the given criteria.

Monolog vs. Monolog with Compaction vs. Polylog

Partitioning is key to Kafka’s performance, fault tolerance, scalability and flexibility. Each partition in Apache Kafka is completely independent and ordered by the time sequence of its messages; hence, in order to guarantee the write order of entries in a single Log component, the configuration should look like:

1 Topic + 1 Partition = Monolog

The Monolog is the simplest structure in Log-Based architecture; it works fine for small/medium-size datasets where the Monolog’s size does not exceed the disk capacity in the long run. In the case of a Monolog, the horizontal scalability feature is disregarded, as only an entire partition unit can be placed on a separate node.

In such a scenario, a logical question about the necessity of keeping historical data may arise. A log retention policy can be enabled without any changes to the Monolog structure of “1 Topic + 1 Partition”, which might be quite a valid approach if the all-time log-replay feature is abandoned: a Monolog with Compaction.
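Assuming the Kafka Java Admin client is used, a Monolog (and, optionally, its compacted variant) could be created roughly like this; the topic names and broker address are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateMonolog {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // 1 Topic + 1 Partition = Monolog: a single partition keeps a global time-order of entries.
            NewTopic monolog = new NewTopic("monolog", 1, (short) 3);

            // "Monolog with Compaction": same structure, but only the latest entry per key is kept,
            // which gives up the all-time log-replay feature.
            NewTopic compacted = new NewTopic("monolog-compacted", 1, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));

            admin.createTopics(List.of(monolog, compacted)).all().get();
        }
    }
}
```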

However, there is a much more elegant and versatile way to enable horizontal scaling without compromising the all-time log-replay functionality: the Polylog. The idea behind the Polylog is ridiculously simple: if there is no way to split the topic’s partition without losing the time-order guarantee, we split the topic.

(1 Topic + 1 Partition) * N = Polylog

There is no straightforward rule by which a Monolog can be sharded into a Polylog structure; any criterion is fine as long as it gives a decent distribution.
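One possible way to shard entries into a Polylog, sketched below with a hypothetical key-based split across N single-partition topics; the topic prefix and the hashing criterion are assumptions, not a prescribed rule:

```java
// Hypothetical router: picks one of N single-partition topics for every entry, so time-order
// is preserved per key while the Log as a whole can be spread across brokers.
class PolylogRouter {

    private final int topicCount;

    PolylogRouter(int topicCount) {
        this.topicCount = topicCount;
    }

    String topicFor(String entryKey) {
        int bucket = Math.floorMod(entryKey.hashCode(), topicCount);
        return "polylog-" + bucket;   // e.g. polylog-0 ... polylog-(N-1)
    }
}
```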

Applying the Log

Lesson learned. Both synchronous and asynchronous parallel writes in any system, especially in a Polyglot Persistent environment, lead to huge problems with response time or to such disasters as permanent data desynchronisation. In order to solve all these issues, the parallel-writes approach needs to be avoided; instead, the Log component takes the role of the primary data source and, more importantly, also serves as the single source of truth for the remaining application components and data layers.

Log-Based Architecture with Polylog in Charge of Polyglot Persistent Environment

What about consistency/availability guarantees for the Log component? No issues there: a Log component, i.e. an Apache Kafka cluster, can be efficiently configured in order to provide decent availability without data loss or consistency compromises. An example of such a configuration: 3 broker nodes, a replication factor of 3 and min.insync.replicas of 2, with acks set to all; a detailed guideline can be found here.
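As a rough sketch of the producer side of such a setup, assuming the topic already exists with a replication factor of 3 and min.insync.replicas set to 2 (broker addresses, topic name and payload are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LogWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092,broker-2:9092,broker-3:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Wait for all in-sync replicas to acknowledge; combined with min.insync.replicas=2
        // on the topic, a single broker failure causes neither data loss nor unavailability.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("monolog", "entry-key", "entry-payload"));
        }
    }
}
```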

What about high-enough availability at the application level? With one of the lowest latencies possible, ~1 ms (in the end, we are still dealing with asynchronous messages and replication), Log entries are handled by completely independent consumers (multiple consumer groups), which in their turn feed the data to the other layers. In this way, whenever a complete failure of one or several data layers occurs, it does not provoke the failure of the entire system.
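A minimal sketch of one such independent consumer, assuming one consumer group per downstream data layer; the group id, topic name and the indexing call are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SearchIndexFeeder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092");
        props.put("group.id", "elasticsearch-feeder");   // each data layer gets its own group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("monolog"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    indexIntoSearchEngine(record.key(), record.value());   // placeholder for the layer-specific write
                }
            }
        }
    }

    private static void indexIntoSearchEngine(String id, String document) {
        // Layer-specific persistence would go here (ElasticSearch bulk request, cache write, etc.).
    }
}
```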

What about high-intensity delete and update operations? The strength of Log-Based architecture lies in the field of high-intensity read and write operations. If there is also a need to satisfy high-intensity deletes/updates, a pattern such as CDC (Change Data Capture) has to be applied to the Log accordingly.

Quick… Conclusion

Log-Based architecture with Apache Kafka in charge, as one of the best representatives of systems designed with the word “Log” in mind, tends to resolve issues related to extensive component dependencies, response time and data loss in a Polyglot Persistent environment with high-intensity read/write operations.

The Log component is able to provide good consistency/availability guarantees at the upper level; at the same time, by loosely coupling the remaining data layers, Log-Based architecture improves the overall platform availability.

On the other hand, Log-Based architecture isn’t a cure for all problems, as the Log component can’t work effectively with high-intensity delete/update operations due to its nature, which is fine for most cases. However, a separate CDC mechanism can be applied on top of the Log accordingly.

Experiment and share your thoughts!


