What is Microsoft Azure Event Hubs?!

Sreeram Garlapati
7 min read · Mar 16, 2019


— a reliable, pub-sub data stream

What is Microsoft Azure Event Hubs (or similar systems like Apache Kafka or AWS Kinesis)?

In short, Event Hubs is a partitioned, reliable, ordered, pub-sub data stream.

  • partitioned — to support highly-available sends and parallel readers
  • reliable — agnostic to failures, zero data loss
  • ordered — the order of data is preserved within a partition, not across partitions
  • pub-sub — topic-based publish-subscribe, which implies no per-message filtering
[Figure: a variety of data sources producing data to, & data sinks / analytics engines consuming data from, Event Hubs]

Too complex a definition! Hmm, we say that the stream is partitioned & then ordered & reliable, and that it also supports the notion of publish-subscribe!

What is all of this? Where is this needed? What is this invented for!?

Well, then, let’s try answering this question &, in the process, maybe you will invent one such data stream too!

Let’s say there’s an online retail store’s website on which customers explore product specs, navigate through the products, compare several competing products, etc., & of course, eventually buy ’em; the website could be running on a couple of servers — or on a couple hundred or even thousands of them. Now, think about designing a system to:

  • detect & alert the admins if any critical errors happen on the website; for ex: in a 10-minute window, the site is crashing on more than 25% of the servers (application monitoring; a sketch of this windowed check appears right after this list)
  • alert if there are suspicious logins — for ex: in the last 5-minute window, many customers tried to log in from a different country than they usually do, and then purchased above a certain limit (like > $1000) for the first time — maybe the website was hacked!! (anomaly detection)
  • show the number of devices sold in the last 5-minute, hour, day, month & year windows, per category of devices (customer telemetry)
  • notify the customers who bought the devices (say a refrigerator) when the device is ready for maintenance (device telemetry, aka the Internet of Things!)
  • for those customers who landed on the website and did not end up buying the product, record where in the UI experience the customer left the website (click streams)
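
To make the first scenario concrete, here is a minimal sketch (plain Python, purely illustrative) of the kind of 10-minute windowed check that monitoring logic would run once every server’s error events are available in one place; the event shape, the fleet size and the 25% threshold below are assumptions, not anything Event Hubs prescribes:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)   # the 10-minute window from the scenario
FLEET_SIZE = 200                 # assumed total number of web servers
THRESHOLD = 0.25                 # alert when > 25% of servers report a crash

def should_alert(crash_events, now):
    """crash_events: iterable of (timestamp, server_id) tuples gathered
    from all servers; returns True when the windowed rule trips."""
    crashed_servers = {
        server_id
        for timestamp, server_id in crash_events
        if now - timestamp <= WINDOW
    }
    return len(crashed_servers) / FLEET_SIZE > THRESHOLD

# e.g. should_alert(events, datetime.utcnow()) == True would trigger the alert
```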

What is the common denominator of all of the above problems:

a. can all of them be solved locally (local to application logic)?

No! Simply because the data is just not present there! For ex: take the case of alerting admins based on critical errors from websites — each website instance is only aware of its own errors &, even worse, in some cases it might not be aware of the error at all! For instance, the website might have crashed due to that critical error; now, forget reporting, it’s no longer even running! Extrapolate to the other problems mentioned above as well… you will soon understand that all of them need some sort of aggregation of data across all the machines — outside the execution boundary of where the data is being produced!

b. all of them need additional resources for processing

additional, or more precisely, their own compute and IO resources: compute to run code, and IO for actions like alerting the admin — which would translate to operations like sending an email, which in turn needs a network call.

Simply put, to solve all of these problems, we will need another decoupled system to process/crunch the data — one which can aggregate data from all of these producers and also scale independently! All good! Now, the question is, how will this decoupled processing system get the data that is produced at various places — machines/applications/websites? That won’t happen unless all of these data producers SEND that data to this decoupled processing system, somehow!

For all of those disparate systems to send data to this decoupled processing system — how would the senders know what data needs to be sent, and to which processor? Let’s say they do; then, what would happen if one of these decoupled processors goes down for maintenance? Should the data producers wait until it is up again? Definitely not! We cannot stop the functioning of a website just because reporting isn’t working! Thinking further, having direct communication from producers to the processing systems (a.k.a. the consumers) doesn’t seem feasible here!

What if that data was somehow reliably staged for this decoupled processing system to read and process — in a way that the real application can just push the data to that staged location & continue running its logic, & the decoupled system is then enabled, somehow, to read the data at its own pace!

Problem solved!

We call this reliable staged location - Event Hubs! Read on!

What exact features would this staged location need, again?

Now, take a second. Pause and Think…

  • first & foremost — it needs to accept data from many publishers — remember, many of these data producers should be able to send to a place per their intent. Take user login information: all web roles should be able to send to one location, like a user_logins stream — so that the data from all of these different instances, which is related, is available for processing at a single processor, for correlation! This notion of publishing data per intent is commonly referred to as publishing to a topic (btw, this topic is the same as the Event Hub name!). Ex: take computing aggregations like 25% of machines down in a given 10-minute window — to compute this aggregation, all critical machine events need to be available at one processor! Right! So, here comes the first feature — publish to topic (a minimal producer sketch appears right after this list)!
  • now that the Event Hub has data, what next? Perhaps, yes, we needed this data stream for someone to consume & process it! In fact, that is what we invented this staged location for! Didn’t we! So, our natural next feature is that it should support the notion of consumers interested in reading that specific topic. If one consumer reads the data, is that enough!? No! We saw several cases above where reading the same data in parallel is needed, i.e., there were different interpreters for the same data! For ex: take the user_logins stream again; one of the consumers could be an analytics system, trying to compute simple aggregations like the number of users logged in — in the last 10-minute window, etc. — and push to a dashboard; while the second consumer could be an ML-based system, reading off of the same logins stream & alerting if there is an anomaly in the login sequence. Clearly, this dashboarding solution & the anomaly detection system, which could be completely decoupled systems, need to read the same data at the same time. To summarize, we need these capabilities for the consumers: (a) processors should be able to tell that they are interested in a specific topic, for ex: in user_logins information, & then (b) multiple processors should be able to read the same data in parallel from a given topic. This notion of multiple decoupled systems being able to express interest in reading data from a published data stream — is what we call subscribers in the pub-sub pattern.
  • & the next important characteristic of this staged location is that it should be highly available for other systems to push data to it! Remember — the actual application logic, for ex: in the case of collecting user clicks on the product specifications website, is to show the actual product — and the data that is sent to this data stream is the metadata/information on a series of clicks — which will later be used to understand where the user left without buying! Now, the webpage cannot fail due to an error that the click metadata could not be stored — something the user did not even know or care about in the first place! That’s why Event Hubs is partitioned — to be highly available for publishers!
  • highly available for senders! Fair! Wait, is that enough!? What happens if a burst of events lands on Event Hubs and the consumers cannot keep up! Yes, this system should also allow consumers to scale per need — i.e., when the application wants to push data at faster rates, it should facilitate the processors consuming & crunching data at faster rates! To support this, Event Hubs allows each consumer to create direct readers to partitions, making it possible for multiple readers to read data in parallel from the respective Event Hubs partitions (see the consumer sketch after this list). To summarize, the partition is the unit of scale while reading from Event Hubs. Now you know why it is a partitioned data stream!
  • & it cannot have data loss! Once the Event Hub acknowledges that it got the data, it cannot lose the data until the consumers read & process it; can it! reliable!
  • last but not least, it should preserve order — so that the processing system can see the events happening in order — for ex: the data stream has a server crashed event and then, 2 mins later, a health OK heartbeat event — now, based on the order, the alerting system can infer that the server is up & running, so no alert is needed! ordered stream!
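
To ground the publish-to-topic, partitioning and ordering points above, here is a minimal producer sketch using the Python azure-eventhub SDK; the connection-string placeholder, the user_logins Event Hub name, the partition key and the JSON payloads are illustrative assumptions, not values the service prescribes:

```python
# pip install azure-eventhub
from azure.eventhub import EventHubProducerClient, EventData

# The Event Hub name plays the role of the topic; the connection string
# below is a placeholder for your namespace's connection string.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT-HUBS-NAMESPACE-CONNECTION-STRING>",
    eventhub_name="user_logins",
)

with producer:
    # Events that share a partition key land on the same partition, so their
    # relative order is preserved for whoever reads that partition.
    batch = producer.create_batch(partition_key="customer-42")
    batch.add(EventData('{"event": "login", "country": "NL"}'))
    batch.add(EventData('{"event": "purchase", "amount": 1200}'))
    producer.send_batch(batch)  # once acknowledged, Event Hubs holds the data
```

And a matching consumer sketch, again with the Python azure-eventhub SDK and the same assumed placeholders; it reads every partition of the user_logins Event Hub through the default consumer group (a production reader would typically also plug in a checkpoint store so it can resume where it left off):

```python
from azure.eventhub import EventHubConsumerClient

consumer = EventHubConsumerClient.from_connection_string(
    conn_str="<EVENT-HUBS-NAMESPACE-CONNECTION-STRING>",
    consumer_group="$Default",   # each decoupled reader group is a "subscriber"
    eventhub_name="user_logins",
)

def on_event(partition_context, event):
    # Each event arrives tagged with the partition it came from; to scale out,
    # pin a partition_id per reader or add a checkpoint store for load balancing.
    print(partition_context.partition_id, event.body_as_str())

with consumer:
    # starting_position="-1" asks to read from the beginning of the stream.
    consumer.receive(on_event=on_event, starting_position="-1")
```

Running a second copy of this client with a different consumer group gives you exactly the dashboarding-plus-anomaly-detection pattern described above: two decoupled readers over the same data.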

Look at you! You just invented the partitioned, ordered, reliable, pub-sub data stream!

Now go ahead and create your first Event Hub!
