The book I selected for July 2019 is Designing Data-Intensive Applications by Martin Kleppmann. This is a tome in the O’Reilly family of technical texts, and therefore I had high hopes. Spoiler alert: It didn’t disappoint.
Who should read it: Professional software engineers and computer programmers, especially those starting out in the field (yours truly) or those transitioning from single-node to distributed systems.
Why it’s great: The book covers a lot of ground, but prevents information overload by delivering its messages in concise and clear terms.
Where it could be better: Some chapters could benefit from more lead-in to a new topic.
Designing Data-Intensive Applications promises to give the reader a (hearty) introduction to the principles of building fault-tolerant, consistent, and robust distributed systems for handling large volumes of data. Notice I said “large volumes of” and not “big data”? In the preface, the author expresses his reluctance to use the term “big data”:
“Many of the technologies described in this book fall within the realm of the ‘Big Data’ buzzword. However, the term ‘Big Data’ is so overused and underdefined that it is not useful in a serious engineering discussion.”¹
Despite this disclaimer, in my opinion the book manages to neatly (although indirectly) address many of the inherent challenges associated with the notorious “V’s” of big data: volume, velocity, veracity and variety (try making a useful mnemonic out of those!). In my reading of the book, I felt that the information presented could be most easily applied to the first three “V’s”. To wit:
Volume is addressed in terms of the growing need to handle very large data sets, and, especially, to safeguard against loss or corruption of those large data sets. The author talks in detail about replication and partitioning, and compares and contrasts common approaches to storing vast amounts of data in a scalable and accessible way.
Velocity is highlighted in the need for users to have quick, responsive systems to interact with, retrieving and writing data with minimal latency or downtime. The author explains situations where large-scale distributed systems can hinder or complicate data reading and writing, especially in multi-user transactional systems. (These sections were some of my favorites, since I felt that the challenges of read/write asynchronicity, in particular, were well illustrated and his explanations of potential mitigation plans very clear.)
Veracity goes hand in hand with velocity here; the author shows that in some cases, there is a trade-off between data that is guaranteed to be accurate and up-to-date, and data that is guaranteed to be low-latency. Different systems might have different needs here, and the author highlights a few different techniques that can be used when either veracity or velocity is paramount.
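To make that trade-off concrete, here is a minimal sketch of my own (not code from the book) of a leader that acknowledges writes immediately and copies them to a follower later. The `Leader` and `Follower` classes and the `replicate_pending` method are hypothetical names invented for this illustration; a real database would ship writes over a network rather than through a Python list.

```python
class Follower:
    """Replica that only sees a write once the leader has shipped it."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Leader:
    """Accepts writes right away and replicates them later (asynchronously)."""

    def __init__(self, follower):
        self.data = {}
        self.follower = follower
        self.pending = []  # writes acknowledged but not yet replicated

    def write(self, key, value):
        # Fast path: the write is acknowledged without waiting for the follower.
        self.data[key] = value
        self.pending.append((key, value))

    def read(self, key):
        return self.data.get(key)

    def replicate_pending(self):
        # Stand-in for the replication stream catching up.
        for key, value in self.pending:
            self.follower.data[key] = value
        self.pending.clear()


follower = Follower()
leader = Leader(follower)

leader.write("balance", 100)
stale = follower.read("balance")      # follower has not caught up yet
fresh = leader.read("balance")        # leader always has the latest write
leader.replicate_pending()
caught_up = follower.read("balance")  # follower now agrees with the leader
```

Reading from the follower is the low-latency option but can return stale data; reading from the leader (or waiting for replication before acknowledging a write) favors accuracy at the cost of speed.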
What the above information gives you is a foundation for designing a distributed data system. The book does not promote one technology or design over any other in outlining how to design said systems; while it does make reference to some of the options available, it is not trying to evangelize a particular solution. Rather, by fully outlining the pros and cons of various implementations and setups, it gives the reader a starting point for deciding how they should make decisions for their own unique applications and data requirements.
In a nutshell, here is what I loved about the book:
- The book is split logically into three sections (Foundations of Data Systems, Distributed Data, and Derived Data) and into further chapter divisions from there. These sections are broad enough to contain a lot of useful information, but also delve deep enough to get past the basics and into some of the specifics. It also makes the book a handy reference: the section and chapter headings double as an index for zeroing in on a particular topic.
- The book is very careful to explain and define any technical terms that it uses, and also clarifies the differences between some of the more common buzzwords that have infiltrated the field. In the journey from being coined to becoming part of the general parlance, these terms can grow fuzzier in definition, sometimes forking so that one word can be used to refer to multiple distinct concepts. And that’s confusing. To avoid perpetuating this, the author is deliberate in his selection of specific words and clearly defines them from the get-go (replication and partitioning come to mind). He’s also consistent with his naming conventions across chapters, helping keep ambiguity at bay, even if you don’t read the book in a linear fashion. When similar terms come up that could potentially cause confusion, the author takes the time to clarify what he’s referring to and to point out explicitly how the term is used in other cases or in other texts. For example, in Chapter 7 the author deliberately differentiates between the various ways that the word “consistency” is used across data engineering, and points out how each usage differs (however subtly) from the others². Having multiple concepts with the same names and descriptions can quickly get confusing and cause you to second-guess (or worse, muddle) your understanding, so the clarification is appreciated.
- The diagrams are top-notch and truly help to illustrate the author’s points. In particular, his figures portraying synchronicity versus asynchronicity in distributed database reads and writes helped make the tricky examples crystal clear to me.
- This is one of those books that makes you excited about what you’re learning. The language is simple and key points are reiterated in every section, and because the text is written in an accessible way, it allows you to absorb the information without struggling through paragraphs of dry explanations. Comparisons between different technologies and techniques are laid out with clear pros and cons but without specific bias. When describing a new concept, the author adds specific examples or analogies to drive the point home. End result? I felt at the end of each chapter that I had truly developed a better understanding of the topics and could reliably recall some of the key messages — and I was eager to see how I could apply that knowledge at work.
- The book goes over just enough, but not too much. It covers the topics it is focused on very well, but doesn’t overextend and dilute the message.
One of the chapters that stuck with me most is Chapter 8, The Trouble with Distributed Systems. Before reading this book, I had only a vague understanding of the downsides of distributed systems. It seemed to me that the improvements in data resiliency and service uptime would completely outweigh any downsides. But reading this chapter made me realize how naive that assumption was. I particularly liked this passage:
“There is no fundamental reason why software on a single computer should be flaky: when the hardware is working correctly, the same operation always produces the same result (it is deterministic)… An individual computer with good software is usually either fully functional or entirely broken, but not something in between…”³
What a lovely bit of foreshadowing!
The chapter goes on to elucidate some of the problems that are unique to distributed systems, and that catapult them from the deterministic behavior of single-node systems into something more complicated. I was struck in particular by the descriptions of what can go wrong with respect to clocks. While I have had some personal experience with the headache of dealing with unreliable clocks and conflicting time zones, I didn’t realize just how badly things can go wrong until Chapter 8. When dealing with very time-sensitive systems, something as minute (no pun intended) as the innate differences between quartz crystal oscillators on each node needs to be taken into account.
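A bit of back-of-the-envelope arithmetic shows why such tiny differences matter. The sketch below is my own, and the drift rate of 200 parts per million is an assumed, illustrative figure, not a measurement of any particular hardware; `max_drift_seconds` is a hypothetical helper name.

```python
def max_drift_seconds(drift_ppm: float, seconds_since_sync: float) -> float:
    """Worst-case clock error accumulated since the last synchronization.

    drift_ppm is the assumed oscillator drift in parts per million, i.e.
    how many microseconds the clock may gain or lose per elapsed second.
    """
    return drift_ppm / 1_000_000 * seconds_since_sync


# Assumed drift rate of 200 ppm (illustrative only).
drift_30s = max_drift_seconds(200, 30)            # resynchronized every 30 s
drift_day = max_drift_seconds(200, 24 * 60 * 60)  # resynchronized once a day

print(f"{drift_30s * 1000:.1f} ms after 30 seconds")  # 6.0 ms
print(f"{drift_day:.2f} s after one day")             # 17.28 s
```

Under that assumption, two nodes that each synchronize only once a day could disagree about the current time by many seconds, which is more than enough to reorder events that actually happened milliseconds apart.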
When it comes to the bad, there isn’t much to report. One thing I would have appreciated is a little more lead-in to the topics at the head of each chapter. When a new topic is introduced, the book often jumps into the thick of it right off the bat, and a longer introduction could occasionally be warranted. In particular, the Partitioning chapter goes straight into methods of partitioning, and I think it would be well served by a brief recap of what partitioning is and its potential use cases beforehand, especially as the book can so easily be used as an à la carte reference.
The book does not contain any lengthy code samples or instructions on implementing any of the tech it discusses. It is more focused on theory, so this book may not be the top choice for people looking for a step-by-step guide or more hands-on content.
All in all, I think that this book is not only an excellent resource, but also one of those rare things: a highly technical book that’s also readable cover to cover. Highly recommended!
¹ Martin Kleppmann, Designing Data-Intensive Applications (Sebastopol, CA: O’Reilly Media, 2017), xvi.
² Kleppmann, 224.
³ Kleppmann, 274.