The basic architecture of data flow
As I look forward to the coming evolution of application architectures from client-server’s “stack” approach to a data flow centric model, I think it is important to have a “point on the horizon” to shoot for. What, at least as far as we can tell today, will be the central tenets of a “real time” or data flow architecture?
Several books have been written to address the concept of real-time data flow handling. My favorite so far is Jay Kreps’ I ♥︎ Logs, which covers the database-log concepts that led to the creation of Apache Kafka (and that, Kreps argues, were copied in the creation of AWS Kinesis). However, I think it makes sense for me to give a high level overview here, in my own words.
The basics of data flow
First, let’s think about the basic actions we want to take when we talk about processing streaming data. At its simplest, the image below shows the flow of data from a source to a queue to a client.
That’s a pretty simplistic view of the problem, but it gets the basic concept across. Sources of data send data to a queue, and clients that want to consume that data grab it from the queue and do whatever it is they want to do.
Unfortunately, we know in the real world that it is quite likely that the client will require some specific format for the data in order to interpret it. And even the queue might require the data to get into a “queue-able” format before accepting it.
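As a toy illustration of that flow (standing in for no particular product), here is a source, a queue, and a client, with a format-normalizing step in between — the field names and the normalization rule are assumptions for the sake of the example:

```python
import json
import queue

# A toy in-memory queue standing in for Kafka, Kinesis, etc.
events = queue.Queue()

def normalize(raw):
    """Transform step: coerce source data into the format the client expects."""
    return {"device": raw.get("id", "unknown"), "value": float(raw["temp"])}

def source_publish(raw):
    """Source side: normalize into a queue-able format, then enqueue."""
    events.put(json.dumps(normalize(raw)))

def client_consume():
    """Client side: pull the next message off the queue and interpret it."""
    return json.loads(events.get())

source_publish({"id": "sensor-1", "temp": "21.5"})
print(client_consume())  # {'device': 'sensor-1', 'value': 21.5}
```

The point is not the code itself but where the transforms live: the client never sees the source’s raw shape, only the normalized one.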
Furthermore, the technologies and protocols needed to even connect sources to queues to clients might be widely different, depending on context. So, our model needs to account for integration and processing facilities wherever data moves between elements of our simple model, as in the next image.
Another consideration, of course, is that there will be more than one source, and more than one client.
Queueing (aka Event Logs)
OK, so we have a simple data flow model for real time data sources to queue data for clients to process as fast as they are able (or they see fit to do). This is great! But, how do we execute this model?
This is where real-time log and event processing systems, such as Kafka, Kinesis, NATS, Microsoft Azure Event Hubs, and Google Pub/Sub come into play. The heart of this flow is the queueing capability, and these tools and services are built for just that — publish-and-subscribe mechanisms that can accept vast amounts of data, queue it sequentially, then allow clients to come and consume that data in order, at will.
The basics of these systems can be thought of like other messaging systems, where developers create messages containing their data, designate “topics” that are essentially named queues of these messages, and client systems request data from the queue in order to process it. In these real-time systems, data is typically processed first-in, first-out, and each message has a sequential identifier that guarantees any given program can compare messages to determine the order in which they were received.
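To make the ordering guarantee concrete, here is a minimal sketch of a topic as an append-only log, where every message gets a sequential offset on write. The names here are illustrative; this is not any real client API:

```python
class Topic:
    """A toy append-only log: messages get sequential offsets on write."""

    def __init__(self):
        self.log = []

    def publish(self, message):
        offset = len(self.log)  # the next sequential identifier
        self.log.append((offset, message))
        return offset

    def read(self, from_offset=0):
        """Consumers read in order, starting from any offset they choose."""
        return self.log[from_offset:]

topic = Topic()
topic.publish("first")
topic.publish("second")
topic.publish("third")

# Any consumer can compare offsets to recover arrival order,
# or replay the log from an earlier point.
for offset, message in topic.read(from_offset=1):
    print(offset, message)
```

Because the offset is assigned at publish time, two consumers reading independently will always agree on the order of any pair of messages — that is the property the sequential identifier buys you.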
A number of these options have additional message-handling features, such as limiting how long a message is retained in the queue, and throttling message throughput to control scale and cost, among others.
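Retention, for instance, can be pictured as a time-based cutoff on that log. A hypothetical sketch (the window and timestamps here are invented for illustration):

```python
import time

class RetainingTopic:
    """Toy log that hides messages older than a retention window."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.log = []  # list of (timestamp, message)

    def publish(self, message, now=None):
        self.log.append((now if now is not None else time.time(), message))

    def read(self, now=None):
        now = now if now is not None else time.time()
        # Only messages still inside the retention window are visible.
        return [m for ts, m in self.log if now - ts <= self.retention]

topic = RetainingTopic(retention_seconds=60)
topic.publish("old", now=0)
topic.publish("fresh", now=100)
print(topic.read(now=120))  # ['fresh']
```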
Again, for a more in-depth explanation of how these event logs work, check out Jay’s book.
Integration problems are handled by both technology-specific “gateways”, like AWS IoT Platform, and custom integrations utilizing functions attached to the queues (more on that in a second). The gateways typically handle one or more communication protocols, extract the data from those protocols, and forward the data to another data environment (in this case the queue, but it could be basic storage such as S3 or a data mining cluster, for example).
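The gateway’s job — handle the protocol, extract the data, forward it onward — can be sketched as below. The payload shape and the forwarding target are assumptions for illustration, not any specific product’s API:

```python
import json

def forward_to_queue(record):
    """Hypothetical stand-in for publishing to Kafka, Kinesis, or similar."""
    print("queued:", record)

def gateway_handle(raw_bytes):
    """Accept a device payload, extract the data, forward it to the queue."""
    # Protocol-specific step: here, decode a JSON-over-the-wire payload.
    payload = json.loads(raw_bytes.decode("utf-8"))
    record = {"device": payload["device_id"], "reading": payload["reading"]}
    forward_to_queue(record)
    return record

gateway_handle(b'{"device_id": "thermostat-7", "reading": 20.1}')
```

Swapping the decode step for MQTT or another protocol, or the forwarding target for S3, changes the edges of this function but not its shape.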
Processing is handled by event-driven function services, such as AWS Lambda, Microsoft Azure Functions, or Google Cloud Functions. I can’t emphasize enough how powerful this is. No need to do more than write a function that takes in data, does something to it, and then forwards it on.
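The shape of such a function is essentially this — the handler signature is modeled loosely on AWS Lambda’s Python handlers, and the downstream target is a hypothetical stand-in:

```python
import json

def send_downstream(record):
    """Hypothetical stand-in for the next queue, store, or service."""
    print("forwarded:", json.dumps(record))

def handler(event, context=None):
    """Take data in, do something to it, forward it on."""
    out = []
    for record in event["records"]:
        # Example transformation: derive Fahrenheit from a Celsius reading.
        transformed = {**record, "value_f": record["value_c"] * 9 / 5 + 32}
        send_downstream(transformed)
        out.append(transformed)
    return out

handler({"records": [{"sensor": "s1", "value_c": 20.0}]})
```

Everything else — provisioning, scaling, wiring the function to the queue — is the platform’s problem, which is exactly why this model is so powerful.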
As noted above, in addition to transforming data, these event-based functional “microservices” can be used to send data to another system, alert another system of anomalies (perhaps using machine learning models), or even dump it into storage for later historical analysis. The possibilities are vast and the most innovative ideas are yet to be developed, in my opinion.
I’ll try to cover a lot more about each of these elements in the future, but in my next post I’d like to take the basic concepts outlined here and derive a value chain model which can be used to begin to analyze the state of these technologies and the opportunities they create.
In the meantime, if you have questions or comments, please share them with me here, or on Twitter, where I am @jamesurquhart.