Data Pipelines — From Monolith to Event-Driven Microservices in Azure
This article will describe steps to convert a monolithic data pipeline into a collection of event-driven microservices in Azure. It will define the terms above and present a number of architectures with pro’s and con’s.
I was working on an ETL product for Microsoft Teams data that reported on usage, compliance and governance. The product obtained it’s data from Microsoft Graph via HTTP GET requests on a schedule.
The largest teams deployment we reported on, contained 270,000 users, generating 10’s of millions of new data points each day.
The ETL product consisted of a single solution and was deployed to a single VM per teams deployment. It was running right on the edge of not being able to keep up. Constant optimization’s were implemented to keep up with the unprecedented increase in consumption as a result of the Covid pandemic.
Around the same time, Microsoft introduced a webhook for call performance. This contained the raw data for things like call packet loss, jitter and round trip time. Naturally, in a time when systems were creaking under the strain, and remote communication was critical, customers wanted the ability the report on this.
The solution was only aware of how to pull data, it didn't know how to be pushed data and it certainly wasn't designed for it. The call performance webhook, for the largest customer, would be pushing 16 million call records a day, or 185 a second on average. They also wanted this data in real-time, they wanted to be able to respond to performance degradation as it happened.
Realtime reporting changes 2 things:
- Batching is no longer a viable strategy, we cant wait for data to build up.
- Peak throughput is more important that daily average throughput
The peak load of calls in any given day was about 1,000 a second, that is, 5 x higher than the average load. If the solution was not able to handle the peak load, there would become a backlog and reporting would no longer be real-time. It’s even more important to remain real-time in peak load, when you consider it’s purpose is to identify performance issues, most likely occurring in peak periods.
A solution that Extracts data from a source, Transforms the data, then Loads the data in a destination. This is also known as ETL.
Tightly coupled solutions that are compiled, tested, deployed and installed as a single entity. The monolithic solution contains all the logic.
The monolithic architecture is broken up and things become more loosely coupled. This is achieved by indirectly emitting and consuming events, rather than directly invoking functions. Calls become unchained and the system becomes more asynchronous. This results in the code become more maintainable. An event-driven solution could however, still be a single solution and have to be compiled, tested, deployed and installed as a single entity, like a monolith.
Microservices are loosely coupled. The monolithic solution is broken down into many services that, ideally, service a single context each. Communication between services is possible either directly using Rest, RPC or other direct communication mechanism; or indirectly using events (event-driven). This results in the ability to compile, test, deploy and install the Microservices as unique entities.
A Monolithic Data pipeline
A single executable that extracts data from a source, transforms it, then loads it into a destination.
A monolithic data pipeline is all or nothing, on the surface, this means that if a customer only wants specific features from your pipeline (like calls and not messages in Microsoft Teams), everything still has to be deployed. This also means that a change to one feature, would require the testing and deploying of every feature in the solution, because they are inseparable. It’s possible for bugs to appear in unrequired features that needn’t of been installed in the first place and it’s harder to find bugs in a larger, more tightly coupled solution. It’s possible for a single bug to take down the whole monolithic solution.
A monolith requires a single larger VM than multiple smaller ones. Hardware is more cost effective to scale horizontally than vertically. Redundancy and availability is poor in a single VM deployment.
Feature flags can mitigate some of the cons.
An Event Driven Data pipeline in Azure
In Azure, events can be emitted from an Azure Function to a ServiceBus queue, then consumed by a separate Azure Function.
So what does this look like in terms of a Data Pipeline?
Webhooks start us off, an Azure function, named “webhook” on the diagram below, listens to webhooks from a source. When the webhook function receives an event, it places it straight on a ServiceBus queue, named “Requests”. The idea is the webhook function should do as little as possible, to avoid loosing the event. Only when the event is on a ServiceBus in our pipeline, can we be confident we aren't going to loose anything. Retaining all events at this point is our number 1 priority.
Normally, event data contains references/ids to more static data, where events don’t make sense (users, for example). For this data, we have to go off an get it ourselves via GET calls on the source. A timer trigger function runs on a schedule to queue up those GET calls on a schedule. This is achieved by placing an event on the Requests queue with the specifics of the request that is due to be performed.
The ServiceBus queue “Requests” is consumed by an “Extract” Azure Function. The function takes messages put on the requests queue from the timer trigger. It also takes messages put on the requests queue by the webhook and asks the same source for the full details of the event. This is common, but not always the case. Some webhooks may send you the full message, removing the need to go back and ask for it. Once we have the full message, it is put on the “Responses” ServiceBus queue. Both this, and the previous step can be considered the Extract phase in ETL.
Now we have the source data, a call record, we need to transform it by applying mapping, calculations, aggregations and any other business logic to deal with every scenario we are aware of. Once the message has been transformed it goes two ways; to a “Transformed” ServiceBus queue and a “Transformed” EventHub.
The two paths are known as hot and cold. Hot data is the pre-requisite for real-time reporting. it’s data that is hot off the press. Cold data on the other hand is data that will be aggregated and reported over the days, weeks and months to come. Realtime reporting primarily looks at hot data, but can be enriched by cold data, for example, matching user id’s on a call record up to user names or departments.
The cold data path for Load works in much the same way as the previous Extract and Transform parts. It consumes transformed messages from the queue and then saves them in the destination. Because we’re on the cold path, we can now make use of batching strategies for performance. The load function can be configured in its ServiceBus binding to accept batches of messages and subsequently save them all in a single INSERT statement. This data can be reported on by apps like PowerBI. Aggregation stored procedures can be run on a schedule to reduce the load on the cold data store when appropriate.
The hot data path for Load is different to everything we’ve seen so far. It relies on some Azure magic in the form of EventHub and StreamAnalytics. Both these components together provide real time analytics in PowerBi. StreamAnalytics is the component capable of enriching real time data using the cold data store, with next to no code.
The main customer benefit is real-time reporting, but it also provides a number of benefits to development and operations.
E,T and L are now decoupled, so they can be worked on, and deployed individually if required.
The input and outputs of E,T and L are message queues, therefore the functions performing E,T and L can be integration tested really easily. The message queues have deadletter queues, so if there's an issue in production, the failing messages can be inspected, a bug fix can be deployed and the failures replayed, resulting in a faster time to fix and no lost data.
Finally, the architecture is all based on Azure PaaS, so we reap all of the benefits associated with this, like cost, availability and hyper scalability.
The problems we still have
Although we’ve managed to nicely decouple some executables (the individual functions for E,T and L) everything is still living a single solution file and everything inside the E,T and L functions are internally coupled. They could be broken apart further.
The db is still a tightly coupled schema.
There is still only a single dev pipeline with many exes. This means we have the ability to deploy individual exes, but all exes are rebuilt when we want to. This introduces risks around compatibility of different versions co-existing and knowing which exes have and haven't changed. The pipeline would be geared up to test all the new things together, not 1 new thing with 2 old things that are in production.
To achieve Microservices, a Monolith is broken down into services per context. So what is a context? It’s behavior that shares a common theme. There are two main ways behavior can be common. Firstly, by common domain, that is, unrelated behavior of similar objects . Secondly, by common responsibility, that is, similar behavior of unrelated (generic) objects.
An example of domain context for Microsoft Teams data could be Calls, Messages, Files, Channels or Users, each a separate context.
An example of responsibility context could be the Webhook logic, Schedule logic and Extract logic.
It’s important to note the Transform and Load logic is not a good fit for responsibility contexts, these are specific to the domain, and therefore better suited to be broken into domain contexts, for example, the Calls context would contain the Transform and Load behavior for Calls.
For a sanity check, a context, and thus it’s microservice, should represent boundaries of work. If a new property is introduced to a call, I want to only change a single microservice (for easier development, testing and deployment). I would need to introduce this to the Transform and Load logic for a call, but that's okay, because they both are part of the Calls microservice. Or, if I needed to change the throttling logic for calling Microsoft Graph, I only need change the Extract microservice. Have a think about common development stories and make sure this rule holds true.
One Context at a time
In order to move from monolith to microservices we need to take a methodical approach, starting from the bottom and working up.
- Split out the DB(database per microservice)
- Split out the projects relevant to the microservice
- Extract into a separate solution, moving any shared dependencies (DTOs etc.) into NuGet packages.
- Review, Learn, Improve, Next
Pipeline per context
Now we have microservices separated from the monolith, we can introduce a pipeline per microservice. There would also be separate pipelines for any extracted NuGet packages.
It’s actually quite nice to start the journey of discovering contexts at the pipeline. Have a think about what you want to deploy together, and what things you want to deploy separately. Think about what customers want to install. Can you shape your contexts so that only some of them are required if only some of the features are purchased/desired?
An Event Driven Microservice Data pipeline in Azure
Blue boxes represent microservices and green boxes represent NuGet packages. Here its important to note that if a queue exists between two microservices, the DTO’s (or message schema) should be made available in a NuGet package. This allows the two dependent microservices to be loosely coupled from one another. Furthermore, the NuGet packages can be versioned, dependent microservices can then independently decide which versions to adopt. Updates of DTO’s should not cause a chain reaction, unless of course, it’s a breaking change.
As a developer, I can work against a smaller solution, this results in less cognitive load, less risk, smaller builds for faster feedback, as a result, I’m more productive.
As a developer tester, I can write more focused, more reliable, more isolated tests, with less worry about regression. Integration and Scale tests become easier to define as boundaries are better understood. Feedback is faster and quality is higher.
An example of a integration test suite for the data pipeline. The yellow box represents external integration tests. The orange box represent integration tests between Functions/Responsibility. The blue box represent tests within domain contexts, or, context end-to-end tests. Because the boxes overlap, we can be confident our solution works as a whole, but it would always be prudent to run a smoke/spike test of the entire end-to-end data pipeline before release. These tests should conform to the testing pyramid, an inverse relationship between test size and quantity.
As a support engineer, I am capable of a reduced time to fix due to smaller scopes and better debugging using message queues/deadletters. Because of this, and because of smaller, more incremental releases, there is an increased uptime and reliability. There is also greater configurability of the system.
As a customer, I get all of the above benefits. I get real-time reporting and the solution is considerably cheaper to host than a monolith.
For example, 16 million calls a day, or about 1TB of data a month is capable of being run on Azure infrastructure costing less that $200 a month, the equivalent VM required to host the monolith, a 64 core VM, was costing $5,000 a month. (Prices on April 2021, both exclude 2 x $1,000/month Business Critical 4 core Azure SQL db’s)
- If microservice contexts are split incorrectly, an increase in brittleness and integration regression is likely.
- There are more (smaller) things to think about
- There is more latency in a distributed system, although monoliths have their own performance issues.
- Data is sent between systems, this introduces network and security concerns.
- This is hosted in Azure, some customers aren’t ready for the cloud.
We generally need all of the data for meaningful and insightful reports, but we’ve just split it all up.
Database as an interface
Data pipelines inserts data into the database, Reports read from the database, therefore the database, or more specifically, the schema, becomes the interface between these two components, normally developed by different teams.
The problem with this, is that one size doesn’t fit all. A data pipeline wants quick inserts and loose coupling, whilst reporting wants quick reads and tight coupling. On top of this, changes to the schema need to be carefully managed to avoid breaking the interface.
Data pumps are quite a commonly found thing in this arena, there’s many different types of them, for example, event data pumps, or backup data pumps like Netflix Aegisthus, that takes Casandra backups and use those to pump into their reports. You need to be careful to not introduce any complex mapping, as you’d end up with a second layer of ETL with the same problems all over again, it does give you the flexibility of picking different database engines per context and an entirely different engine for reporting, however. In essence, the data pump bridges the interface, and allows each side to be more flexible.
A Multi Schema DB allows Data pipelines to only care about one schema, and the reports can do cross schema queries to tie everything back together again. Replicas can be setup so Data pipeline inserts go to the primary replica, whilst report reads go to the secondary read only replica, this distributes load and gets away from some of the conflicting performance concerns. There is still the downside of having to share the same database engine (it may be more performant to write in Casandra, but report in SQL, for example). There’s also a risk of conflicting schemas.
Azure Stream Analytics is also able to do a similar thing in tying data back together again. Hot data can be aggregated and tied up with Cold (referential) data, before being passed to reports.
An event driven architecture and Azure PaaS components provide improvement over a monolithic VM deploy. Microservices can then layer further benefits on top. However, remember, context is key, think what you want to deploy and work back. If you split contexts badly, it’s going to be painful. Finally, database coupling is a hard problem to solve, so take care in this area.
- Service bus batching for further performance gains of Event Driven Architectures
- Sagas for transactional integrity
- Event Sourcing for a completely different way to process data and be able to view changes over time
- Kafka for Event Driven architecture on prem.