Pipe Dreams

How messaging, scalability and failure formed a new kind of architecture.

Scott Haines
97 Things
2 min read · Jun 8, 2019

One of the more subtle yet novel paradigms in computer systems architecture is the concept of message passing. This simple but useful construct enabled order-of-magnitude gains in parallel processing by allowing processes (applications, kernel tasks, and so on) within a single operating system to take part in a conversation. The beauty of this was that processes could now participate in conversations either synchronously or asynchronously and distribute work among many applications without the usual overhead of locking and synchronization.
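
As a rough sketch of the idea, here is a small Python example (with the multiprocessing module standing in for lower-level OS message primitives) in which worker processes coordinate purely by passing messages over queues, with no explicit locks:

```python
# A minimal sketch of intra-machine message passing: processes exchange
# messages over queues instead of sharing state, so no locking is needed.
from multiprocessing import Process, Queue

def worker(inbox: Queue, outbox: Queue) -> None:
    # Each worker consumes messages asynchronously from its inbox.
    while True:
        msg = inbox.get()
        if msg is None:           # sentinel: no more work
            break
        outbox.put(msg * msg)     # "process" the message

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    workers = [Process(target=worker, args=(inbox, outbox)) for _ in range(4)]
    for p in workers:
        p.start()

    for task in range(10):        # distribute work by passing messages
        inbox.put(task)
    for _ in workers:             # one sentinel per worker
        inbox.put(None)

    results = [outbox.get() for _ in range(10)]
    for p in workers:
        p.join()
    print(sorted(results))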

This novel approach to parallel processing within a single system was further expanded to distributed processing with the advent of the message queue as a service. The basic message queue allowed one or many channels (or topics) to be created in a distributed FIFO (first-in, first-out) style queue that could be run as a service at a network-addressable location (e.g. ip-address:port). Now many systems across many servers could communicate and share work through a distributed decomposition of tasks.
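
For a flavor of what "a queue as a service at host:port" looks like in practice, here is a minimal sketch using a Redis list purely as an illustrative backend; the host, port, and channel name are assumptions:

```python
# A sketch of a network-addressable FIFO queue, backed here by a Redis list
# for illustration. Host, port, and channel name are assumptions.
import redis

CHANNEL = "orders"  # hypothetical channel / topic name

client = redis.Redis(host="queue.internal", port=6379)

def produce(message: bytes) -> None:
    # Any process on any server can append work to the shared channel.
    client.rpush(CHANNEL, message)

def consume() -> bytes:
    # BLPOP blocks until a message arrives and pops the oldest one,
    # giving FIFO semantics to every consumer pointed at the same host:port.
    _key, message = client.blpop(CHANNEL)
    return message
```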

It isn’t hard to see how the pipeline architecture grew out of distributed systems communicating over network-addressable message queues. It makes sense on a macro level: we had all of the components of the modern pipeline architecture sitting on the assembly line, just waiting to be put together. But first we had to solve one tiny little issue: failure in the face of partially processed messages.

As you may imagine, if we have a distributed queue and we take a message from that queue, we could assume that the queue would purge said message and life would go on. In the face of failure, however, no matter where you point the blame, that purged message is simply lost: the work in progress disappears with no means of recovery. This is where things evolved.
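
A short sketch of the hazard, against the same kind of Redis-backed queue as above (again with assumed names), shows why a destructive read is risky:

```python
# The problem with a destructive read: the broker forgets the message the
# moment it hands it out. Host and names are illustrative assumptions.
import redis

client = redis.Redis(host="queue.internal", port=6379)

def handle(payload: bytes) -> None:
    ...  # application-specific processing

def fragile_consume() -> None:
    _key, message = client.blpop("orders")  # message is removed from the queue here
    handle(message)  # if this raises, or the process dies mid-way,
                     # the message is gone with no way to replay it
```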

Given that we wanted to ensure our applications would complete all of the work they took from the queue, it made sense to store a log of the messages within a channel (or topic) and allow our systems to keep track of what they had consumed, essentially where their processing left off. This simple idea of acknowledging where an application was in the queue led to the concepts of offset tracking and checkpointing for a consumer group within a queue. Apache Kafka was among the first projects to treat the message queue as a reliable and, more importantly, replayable log that could easily be divided and shared among multiple applications within a shared consumer group. Now there was a reliable, highly available, and highly scalable system that could be used for more than just message passing, and it essentially created the foundation of the streaming pipeline architecture.
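
A minimal sketch of that consumer-group and offset-checkpointing pattern, using the kafka-python client with an assumed topic, broker address, and group id, might look like this:

```python
# A sketch of the consumer-group / offset-tracking pattern with kafka-python.
# Topic name, broker address, and group id are illustrative assumptions.
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    ...  # application-specific work, e.g. transform and write downstream

consumer = KafkaConsumer(
    "clickstream",                    # hypothetical topic (channel)
    bootstrap_servers="broker:9092",  # assumed broker address
    group_id="pipeline-stage-1",      # consumers sharing this id split the partitions
    enable_auto_commit=False,         # we checkpoint offsets explicitly
    auto_offset_reset="earliest",
)

for record in consumer:
    process(record.value)
    consumer.commit()  # checkpoint only after the work is done: a crash means
                       # replaying from the last committed offset, not data loss
```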
