Book Review: Making Sense of Stream Processing

If you’re a software architect of any kind, I encourage you to read Making Sense of Stream Processing by Martin Kleppmann. It’s one of those free O’Reilly books handed out at conferences, or you can download a copy after providing your contact information.

Stream processing is something you hear a lot about these days (especially alongside Big Data). However, this is one of the few books that clearly explains why you should stream data around your distributed system rather than rely on a centralized data store. This applies especially to large-scale SaaS applications that process large amounts of data: think of your favourite social media application, such as Facebook or LinkedIn.

Throughout the book, a selection of practical use cases is discussed, first with the traditional non-streaming solution, followed by a detailed explanation of how a streaming-based solution does a better job. These design trade-offs are what make the book worth reading from cover to cover. They’ll make you think twice before building an application around a centralized database!

Although this book is sponsored by Confluent, the company founded by the creators of open source Apache Kafka, at no point does it feel like a sales pitch. Instead, the book is about concepts and design patterns that apply to any data streaming solution, including Amazon Kinesis or Akka Streams. Don’t expect to see much source code, though; that’s not what this book is about.

Key Lessons

Here are the key takeaways that make this book worth reading:

There’s No Single Way to Access Data

For me, the most important message of this book is that it’s unwise to store all of your application’s data in a single database, using a single schema. Instead, different parts of the application should store the same data in different ways, depending on their access patterns.

For example, when you press the “Like” button on a social media post, it may look like a single update to the application’s state, but it actually results in multiple things happening: notifications are sent via email, statistics databases are updated to track the number of likes per article, and machine learning algorithms are triggered to recommend similar articles.

It’s unreasonable for all parts of the application to query the same database for all these different purposes, so instead we publish the “like” event to an event stream for all those services to listen to. Each one can then store and process the data in whatever way suits it best.
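The book stays at the conceptual level, but to make the idea concrete, here’s a minimal sketch of what publishing such an event might look like with Kafka’s Python client (the kafka-python package). The broker address, topic name, and event fields are my own illustrative choices, not taken from the book:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Connect to the Kafka cluster (broker address is illustrative).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# A single "like" becomes one immutable event on the stream. The email,
# statistics, and recommendation services each subscribe to the topic
# and consume the same event independently, at their own pace.
producer.send("likes", value={"user_id": 42, "post_id": 123, "action": "like"})
producer.flush()
```

The key design point is that the producer doesn’t know or care who the consumers are; new services can be added later simply by subscribing to the topic.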

Logging to Sequential Files Is Very Scalable

A second key point is that appending data changes to a sequential log is more efficient (in terms of disk access time) than adding to, updating, or splitting a B-tree index, as used inside relational databases. If your application doesn’t actually need random read/write access to its data, a sequential stream of changes can serve it better.

Log streams also benefit from partitioning (sharding) to reach higher levels of scale, they preserve the order in which events actually happened (a total order within each partition), and they allow events from the past to be replayed (until you decide to purge old data).
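As a toy illustration of why appends are so cheap, here’s an append-only log reduced to its essentials in Python; real log systems such as Kafka layer partitioning, offsets, and retention on top of this same idea (the file name and events are hypothetical):

```python
import json

LOG_FILE = "events.log"  # illustrative file name

def append_event(event: dict) -> None:
    # Appending to the end of a file is a purely sequential write:
    # no index pages to update and no B-tree nodes to split.
    with open(LOG_FILE, "a") as log:
        log.write(json.dumps(event) + "\n")

def replay_events():
    # Replaying is a sequential scan from the beginning, yielding
    # events in exactly the order they were appended.
    with open(LOG_FILE) as log:
        for line in log:
            yield json.loads(line)

append_event({"action": "like", "post_id": 123})
append_event({"action": "unlike", "post_id": 123})

for event in replay_events():
    print(event)
```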

Integrating Streams into Your Existing Application

If you’re working on an application that currently uses a centralized database, and you’d like to start streaming your data instead, consider the Change Data Capture (CDC) approach. By listening to the incremental changes (inserts, updates, and deletes) occurring inside your database, it’s possible to generate a stream of changes in the order they occurred. You then get all the benefits of streaming data changes around the system without modifying the original application.
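The book doesn’t prescribe a particular tool, but as a sketch of how CDC events tend to look in practice: Debezium, one widely used open source CDC tool, publishes each row change to Kafka as an event carrying the old and new row values plus an operation code. The topic name and handling below are my own, and the event structure is simplified:

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to a CDC topic (name is illustrative). Debezium-style events
# carry "before" and "after" row images plus an "op" code:
# c = create (insert), u = update, d = delete.
consumer = KafkaConsumer(
    "dbserver.public.posts",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    change = message.value
    if change["op"] == "c":
        print("row inserted:", change["after"])
    elif change["op"] == "u":
        print("row updated:", change["before"], "->", change["after"])
    elif change["op"] == "d":
        print("row deleted:", change["before"])
```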

Similarity to Unix Pipelines

Towards the end of the book, a great comparison is drawn with the Unix shell pipelines that many of us have used for decades. Standard tools such as grep, awk, sort, and uniq resemble microservices that each perform a single function, and the pipe operator | joins them together by passing streams of character data.
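To push the analogy a little further, here’s a Python sketch (mine, not the book’s) in which each stage is a generator that streams lines to the next, much as | streams bytes between processes. It’s roughly equivalent to grep ERROR app.log | sort | uniq -c; the file name and search pattern are hypothetical:

```python
# Each stage is a small, single-purpose function that consumes and
# produces a stream of lines, just like a process in a shell pipeline.
def grep(pattern, lines):
    return (line for line in lines if pattern in line)

def sort(lines):
    # Like Unix sort, this stage must buffer its entire input.
    return iter(sorted(lines))

def uniq_c(lines):
    # Like uniq -c: count runs of adjacent identical lines.
    count, previous = 0, None
    for line in lines:
        if line == previous:
            count += 1
        else:
            if previous is not None:
                yield f"{count} {previous}"
            count, previous = 1, line
    if previous is not None:
        yield f"{count} {previous}"

# Compose the stages, reading lazily from the file.
with open("app.log") as f:  # illustrative file name
    stripped = (line.rstrip("\n") for line in f)
    for line in uniq_c(sort(grep("ERROR", stripped))):
        print(line)
```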

Summary

I recommend that anybody interested in software architecture take the time to understand the concepts in this book. It’s free to download, and at 170 pages you can easily skip the sections you’re already familiar with.

I’m looking forward to reading Martin Kleppmann’s latest book, titled Designing Data-Intensive Applications.