Open Streaming Data Architecture Using Azure

An open and portable approach for integrating and persisting real-time data on distributed storage, enabling advanced analytics use cases and ad-hoc analysis at low cost

Patrick Pichler
Creative Data
Jul 16, 2021 · 9 min read


Open Streaming Data Architecture (Image by Author)

Introduction

Over the last couple of years, the demand for capturing, processing and analyzing real-time data has grown immensely. Organizations want to benefit from real-time data the same way they have benefited from data at rest for decades. Yet streaming data architectures are not that simple to implement, and organizations often struggle to get started, above all because such solutions are supposed to run 24/7 without interruption. For this reason, organizations often rely on cloud services, which provide this high availability by default. They are generally easy to use and therefore speed up time to value: organizations can quickly gain insights into their real-time data and focus on their business rather than on maintaining servers. The flip side of the coin, however, is the risk of becoming too cloud or vendor dependent, as so often happens with cloud computing, to the point where things get out of control. This article demonstrates a way to make use of the convenience enabled by cloud services while relying solely on mature open standards. The idea is to stay future-proof for the evolution of new technologies and to always be able to move away from the cloud at any point in time.

Architecture

The architecture is therefore proposed in two flavors built on essentially the same underlying technologies. The first one makes use of the Azure cloud platform and keeps only the data source connectivity on-premises; the second one uses the same building blocks, but everything is self-hosted and entirely open source.

Data Integration

Apache NiFi is the only tool used in both approaches, in the self-hosted environment as well as for integrating data into Azure. The reason is that connecting data sources is the architecture’s first point of contact, and keeping this part open in your local environment allows you to route data to any cloud or other downstream system. Using a cloud-provider-specific on-premises suite deeply integrated into your local environment, by contrast, would already mean a certain dependency at the very first stage. Besides, hardly any other real-time data integration tool offers as many of the features vital for streaming data, such as ensuring high throughput without data loss. Apache NiFi’s large number of built-in processors, together with its drag-and-drop experience, makes it incredibly easy to move real-time data between almost any source and any target, whether cloud-based or on-premises.

After the streaming data sources are connected, all incoming data is passed on to a publish/subscribe messaging service so that it is stored temporarily and multiple consumers can subscribe to it. Another advantage of such an in-between service is the ability to replay and reprocess consumed data in case of errors or changes in the downstream logic. The Azure Event Hubs cloud service is used for this, which shares the same semantics as Apache Kafka, the choice for the self-hosted environment. Sending data to the cloud is a kind of gateway where you can control the size and number of messages sent, which affects both costs and latency. For instance, sending multiple events as small batch messages is often cheaper than sending each record separately, whereas gathering data into batches introduces a certain delay.
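In the proposed architecture NiFi does the publishing, but purely to illustrate that batching trade-off, here is a minimal Python sketch using the azure-eventhub SDK to group several events into one batch before sending; the connection string, hub name and payload are placeholders.

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Hypothetical connection details for illustration only.
CONN_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."
EVENTHUB_NAME = "telemetry"

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONN_STR, eventhub_name=EVENTHUB_NAME
)

# Collect many small events into a single batch instead of
# sending each record as its own message.
with producer:
    batch = producer.create_batch()
    for reading in ({"sensor": i, "value": i * 0.1} for i in range(100)):
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
```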

Once the data is queued in the pub/sub system, an Apache Spark Structured Streaming job receives it and persists it in an efficient way; more on this in the next sections. At this point, data can also be run through trained machine learning models in real time. On Azure, all of this can be achieved with Azure Synapse Analytics, which lets you easily spin up an Apache Spark pool and scale it according to your needs.
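A minimal sketch of such a streaming read is shown below. It uses Spark’s Kafka source, which also works against Azure Event Hubs via its Kafka-compatible endpoint; the broker address and topic name are placeholders, and the Event Hubs authentication options are omitted.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("streaming-ingest").getOrCreate()

# Subscribe to the pub/sub layer through Spark's Kafka source.
# Azure Event Hubs exposes a Kafka-compatible endpoint on port 9093,
# so the same code works for the cloud and the self-hosted variant.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<broker-or-namespace>:9093")
    .option("subscribe", "telemetry")          # hypothetical topic / event hub name
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; cast it to a string for downstream parsing.
events = raw.select(col("key").cast("string"), col("value").cast("string"))
```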

You may ask yourself at this point why not use the Azure Event Hubs Capture feature instead of the Spark streaming job, since it does all that persisting and archiving for you with just a few clicks. If you make use of this feature, be aware that you cannot change the partition strategy (year, month, day) and that data is stored in the Avro format, which is a good fit for serializing and deserializing data in messaging systems such as Apache Kafka, but whose row-based nature is not the best option for driving analytics on top.

Data Storage

Therefore, the result of the Apache Spark streaming job is written in a file format appropriate for analytics and partitioned in a way that best satisfies the upcoming query patterns. Of course, data could also be written to any other downstream system or even back to Azure Event Hubs or Kafka. The Hadoop Distributed File System (HDFS) makes it cheap to store and distribute large amounts of data, and Azure Data Lake shares the same semantics and can therefore be used interchangeably.

Coming back to the importance of choosing a file/table format appropriate for the upcoming analytical use cases: there are already various good formats around, such as Delta, Hudi or Iceberg. Together with the Hive metastore in the background, all these formats try to solve problems that have plagued traditional data lakes for a long time, such as schema enforcement and ACID transactions. They also offer many I/O optimizations, such as compression enabled by their column-oriented storage layout, and hence scale much better than traditional CSV files. Storing data this way makes it a much better and more cost-efficient fit for analytical queries (OLAP), which do not always require reading the entire row as would be the case with the row-oriented Avro file format. The table format chosen in this architecture is Delta, which builds upon Parquet and is compatible with many data platforms.
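Continuing the sketch above, the streaming result could be persisted as a partitioned Delta table roughly as follows; the storage paths, the event schema and the derived date partition column are assumptions for illustration.

```python
from pyspark.sql.functions import from_json, to_date, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical schema of the JSON payload carried in each message.
schema = StructType([
    StructField("sensor", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

parsed = (
    events
    .select(from_json(col("value"), schema).alias("e"))
    .select("e.*")
    .withColumn("event_date", to_date(col("event_time")))  # partition column
)

# Write continuously to a Delta table partitioned by date.
query = (
    parsed.writeStream
    .format("delta")
    .partitionBy("event_date")
    .option("checkpointLocation", "abfss://lake@<account>.dfs.core.windows.net/checkpoints/events")
    .outputMode("append")
    .start("abfss://lake@<account>.dfs.core.windows.net/raw/events")
)
```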

Data Serving

The Spark streaming job triggered above is supposed to run indefinitely and continuously passes incoming streaming data through the process. The outcome is persisted directly in a hierarchical folder structure that is created automatically based on the chosen partition columns, as mentioned above. It is now possible to set up a so-called unmanaged, or external, table on top of the root of this folder structure using Spark SQL. With such an unmanaged table, Spark only manages the metadata while you control the underlying data. This gives you the flexibility to store the data at a preferred location and is also useful for integrating preexisting data. If the table is deleted, Spark only deletes the metadata and the underlying data stays untouched. The previously mentioned Apache Spark pool of Azure Synapse Analytics can be used for this, and plain Apache Spark in the self-hosted approach. In the cloud version, unmanaged tables created with the Apache Spark pool are automatically synced to the serverless SQL pool of Azure Synapse Analytics, which allows data to be queried even while the Apache Spark pool is turned off to save money. However, at the time of this writing, this is only supported for tables based on Parquet, not Delta.
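Creating such an external table over the folder written by the streaming job could look roughly like this; the table name and path are placeholders carried over from the sketch above.

```python
# Register an unmanaged (external) table whose data lives at the given path.
# Dropping the table later removes only this metadata entry, not the files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw_events
    USING DELTA
    LOCATION 'abfss://lake@<account>.dfs.core.windows.net/raw/events'
""")
```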

The importance of partitions is that, when querying data, the number of directories and files read can be limited drastically, since Spark pushes down filter predicates involving the partition columns. For instance, with date as the partition key and a query filtered via a WHERE clause on a specific date, Spark reads only the data files within the folder matching that date. Spark SQL hereby acts as a distributed query engine through its JDBC/ODBC or command-line interface. In this mode, external SQL-speaking applications can interactively query data on demand on top of the files through the created table. A good partition strategy can further reduce costs significantly on cloud storage, which usually charges for read transactions: the less data that needs to be scanned to return a query result, the faster and cheaper it is.
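For example, against the hypothetical raw_events table sketched above, a filter on the event_date partition column would only touch the matching partition folder:

```python
# Only files under the event_date=2021-07-16 partition folder are scanned,
# because the predicate is pushed down to the partition pruning step.
daily = spark.sql("""
    SELECT sensor, AVG(value) AS avg_value
    FROM raw_events
    WHERE event_date = '2021-07-16'
    GROUP BY sensor
""")
daily.show()
```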

Data Visualization

To consume the data and drive actual value, it makes sense to use the Power BI service in the case of the Azure deployment. It integrates very well with the entire Azure cloud platform and allows you to connect directly to your Spark tables, providing a live view of the data. You could also import data into Power BI, which then provides much better response times since the data is stored in a high-performance in-memory analytical engine (VertiPaq) made for interactive analysis. Which option fits best depends on the data volume and the required data freshness.

A very lightweight and feature-rich open-source option is Apache Superset. It provides many different visualization types and lets you write your own SQL queries via a nice GUI, which makes it a perfect fit for data exploration and ad-hoc analysis. Among many other features, you can define custom dimensions and metrics, share dashboards, and set up notification alerts and scheduled reports. Both tools also provide intelligent caching mechanisms.

Enhancements

Kappa Architecture

So far, we have focused only on processing streaming data, but what happens if batch data also needs to be integrated? No problem, since Apache Spark Structured Streaming unifies the processing of streaming and batch data by reusing exactly the same code. Streams can also be joined against historical data at rest. This architectural approach is called Kappa, and it addresses some of the pitfalls associated with Lambda, such as having to maintain the entire code base twice. However, Kappa is actually a streaming-first deployment pattern and hence not considered a full replacement for Lambda, since not all use cases can be migrated. Apache NiFi can furthermore be used to connect databases or pick up files from any storage, which can then be passed on to the pub/sub system or persisted directly on the distributed file system.
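A rough sketch of this code reuse, building on the schema and the streaming DataFrame from the earlier sketches: the identical transformation function is applied once to the streaming input and once to a hypothetical batch backfill landed on the lake.

```python
from pyspark.sql.functions import from_json, to_date, col

def enrich(df):
    """Shared transformation, applied identically to streaming and batch input."""
    return (
        df.select(from_json(col("value"), schema).alias("e"))  # schema as defined earlier
          .select("e.*")
          .withColumn("event_date", to_date(col("event_time")))
    )

# Streaming path: the Kafka / Event Hubs source read earlier.
streamed = enrich(events)

# Batch path: e.g. a backfill over JSON line files that NiFi dropped on the lake.
# spark.read.text also yields a single string column named "value",
# so the very same function can be reused without changes.
historical = enrich(spark.read.text(
    "abfss://lake@<account>.dfs.core.windows.net/landing/backfill/"
))
```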

Data Lakehouse

The main goal of the proposed architecture is to persist real-time data in its raw form to enable advanced analytics use cases and ad-hoc querying. But what if you want to build an entire data warehouse to drive interactive operational reporting rather than just occasional ad-hoc queries? If so, you should consider the Bronze, Silver and Gold approach proposed by Databricks’ Delta Lakehouse, known as the medallion architecture. The idea is to refine the data through the different layers so that different personas, such as data scientists and BI analysts, can use correct, up-to-date data for their needs. While the Silver layer is a kind of enriched staging area built on the raw data of the Bronze layer, the Gold layer is made for reporting in the form of aggregated data that is often further separated into dimensions and facts. Databricks recently announced the Change Data Feed (CDF) feature in Delta Lake, which makes this architecture and the journey through the different layers much easier.
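As a small sketch of how CDF supports that layer-to-layer journey, the snippet below enables the change feed on a hypothetical Bronze table and reads its changes incrementally into Silver; the table names and starting version are placeholders.

```python
# Enable the change data feed on the Bronze table (a Delta table property).
spark.sql("""
    ALTER TABLE bronze_events
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the rows that changed since a given table version and
# append them to the Silver layer instead of reprocessing everything.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)   # hypothetical starting point
    .table("bronze_events")
)

(changes
 .filter("_change_type IN ('insert', 'update_postimage')")
 .write.format("delta").mode("append").saveAsTable("silver_events"))
```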

Push Dataset

Once the data is persisted, everyone can look up whatever data they want. In the best case, dashboards are configured with a certain refresh interval so they update automatically. In this case, however, data is always pulled from the storage, which can cause long response times or high read-transaction costs depending on the underlying storage and its partitioning, as mentioned above. Pushing data, on the other hand, allows dashboards to refresh automatically as new data comes in. The Power BI service, for instance, includes such a functionality and still allows you to use all the report-building features of Power BI, such as normal visuals, data alerts, pinned dashboard tiles, etc. This further ensures that everyone involved is aware of a set of core numbers or data points and doesn’t have to ask for them. Pushing data to Power BI can be achieved in various ways, such as via its REST API using either Apache NiFi or Apache Spark, or through a direct integration with Azure Stream Analytics.
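A minimal sketch of the REST variant, assuming a push-enabled dataset already exists in the Power BI service and a valid Azure AD bearer token is at hand; the dataset ID, table name and payload are placeholders.

```python
import requests

# Hypothetical identifiers of an existing Power BI push (streaming) dataset.
DATASET_ID = "<dataset-id>"
TABLE_NAME = "events"
ACCESS_TOKEN = "<azure-ad-bearer-token>"

rows = [{"sensor": "s-01", "value": 0.42, "event_time": "2021-07-16T12:00:00Z"}]

# Post new rows; tiles bound to this dataset refresh as the data arrives.
resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/datasets/{DATASET_ID}/tables/{TABLE_NAME}/rows",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"rows": rows},
)
resp.raise_for_status()
```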

Conclusion

The benefits of cloud computing certainly make a lot of things easier, especially when it comes to hardware setup, configuration and maintenance. When setting up distributed systems in local environments, it is a good idea to consider container technologies such as Docker and Kubernetes. They allow you to move faster and to test and deploy software more efficiently using standardized configurations. However, running distributed systems stably and resiliently in containerized environments still requires a considerable amount of knowledge. Therefore, you can get started with the proposed approach in the cloud and then choose whether to recreate the architecture in a local environment or to keep profiting from the convenience enabled by the cloud. This also depends a little on the size of the project: the bigger the project, the more it makes sense to build your own environment. Integrating just a few streaming data sources might not justify the effort and expenses put into staff and infrastructure.

