Experiences with Kafka and exactly-once processing in IoT apps
Some context on message brokers and delivery guarantees
(If you have fair amount of experiences with message processing and delivery guarantees please skip to the next part of this post.)
Message delivery guarantee is one of the canonical requirements for message brokers and they are very relevant for all types of brokers: the ones based on queue semantics and the ones based on Pub/sub semantics. There has been lot of questions floating about Kafka’s support for exactly-once delivery guarantee. [tagline] The scope of this post is to examine if we can offer efficient, end-end duplicate events processing given the current feature-set of Kafka especially for some breed of apps [/tagline]
Lets look at in 2 contexts: In the context of distributed non-IoT apps and in the context of Web-of-things or Internet-of-things apps. (Just hold on to that and I will refine this scope further and make it even more crisp as we move forward).
But before starting let us refresh our memories on the basics of delivery guarantees: At least once, At most once, exactly once are the 3 guarantees available and the definitions are below.
- At most once: The consuming client may or may not get a message from the broker.
- At least once: The consuming client may get a duplicate of message but never miss one.
- Exactly once: The consuming client will get a message and will get it only once.
As you can see, the choice of the message delivery guarantee and hence the message broker depends on the type/domain of your application and/or user experience. For instance, you will need stronger ‘exactly-once’ guarantees for a banking application processing debits and credits where duplicates cannot be tolerated, whereas for a m2m application where remote sensors report telemetry readings like location (lat, lon) to the server once in every few second samples, its ok to miss out a couple of events, because you could always trace the path from other available (lat, lon).
Message brokers and Delivery guarantees for IoT apps
Until the IoT apps came along there weren’t a lot of concerns raised related to possibility of duplicate messages being generated from message sources. Because typically message sources are either other systems trying to talk or remotely deployed devices like smart phones or sensors in the old school m2m installs. But in all the above cases the message sources still were part of a reliable network, are capable of speaking IP protocol (some flavour of TCP or UDP) and are not constrained by any means. Hence the chances of duplicate messages due to network issues is virtually non-existent.
But in the case of IoT applications things changed. The telemetry information is reported by remote ‘things’ which are part of a unreliable (flaky) PAN (personal area network) and the ‘things’ we are talking about are constrained sensors, In what protocol they can speak and how they can reach to the internet as shown below. Due to the nature of the network, the design and the architecture of the IoT apps should take these constraints into consideration and design for duplicate messages. (Now, before I start, I think it will be accurate to call it as an ‘event’ as opposed to a ‘message’.)
[caption id=”attachment_1149" align=”aligncenter” width=”300"]
Image credit: IoT magazine[/caption]
Ok, lets talk use-case
Kafka forms the heart of the most modern streaming data platforms, lets focus on a streaming data platform of an IoT/m2m app. I would like to show you the overall flow of an use-case which I implemented to demonstrate that a message broker offering an ‘exactly-once’ delivery doesn’t really solve the problem of duplicate events. For simplicity lets start with the case of m2m apps. (As I said before, the key difference between m2m and IoT is that in m2m installs, the remote sensors installed to collect telemetry data aren’t really constrained in anyway unlike in the case of the ‘things’ in the IoT installs).
We built a data platform for an m2m app. As shown below the data flow is straightforward, multiple remote sensors were ingesting near-real time events for processing. (Even though its an m2m app the same device gateway, marked as IoT gateway in the figure below was used)
Its business-as-usual, events are sent to the IoT gateway directly (We used AXEDA, now Thingworx) and the gateway code (a Groovy script) called the “Events API” to push the events to Event firehose (Kafka) and sends an ACK to the IoT gateway. The gateway in turn sends the ACK to the respective sensor. The sensor is programmed to resend events until it receives an ACK for each event.
So when for some reason if the ACK fails to reach the sensor, it starts re-sending the same event for a preconfigured number of times. There is no way the system knows that these are duplicates, especially Kafka.
This use case proves that expecting exactly-once guarantee from one component i.e message broker in large distributed apps supporting IoT/m2m doesn’t make sense as it cannot solve a system-wide problem like the one shown.
Now lets extend this use-case to more complicated IoT installs with a little background on the ‘things’ in IoT
As mentioned earlier in IoT installs things’ are weaved together into a remote personal area network (PAN) capable of talking with a low foot print protocols due to their constrained nature. As a best case scenario they may use a Zigbee or UDP variant like CoAP. some advanced installs do use modern protocols 6LoWPAN. Brushing away the topology details these ‘things’ chirp using a store and forward mechanism. As such these things don’t have the ability to talk to the Internet directly as shown in (figure 1). Hence they forward the events to an IoT device or bridge which bridges the PAN to WAN. Here are couple of different IoT topologies.
As for as IoT installs, I will leave it to the your discretion to imagine what are the possible reasons that can complicate things for a message broker to guarantee an exactly-once semantic.
Ok, Don’t get me the wrong way, Kafka is a great product, I have used it in production. It is very popular and in fact I cannot imagine a full-blown streaming data platform without Kafka. The commercial offerings of Kafka like confluent is trying to position Kafka as the go-to product for building streaming data platforms in IoT installs, I just wanted to chip-in and share my perspective on why for certain kinds of apps Kafka alone cannot solve the problem of duplicate events by offering a ‘exactly-once’ delivery guarantee only during consumption part. Some time back I wrote about achieving ‘exactly-once’ delivery guarantee in a streaming data pipelines, here it is Exactly once semantic with a Kafka and storm integration. I proposed to ‘finger print’ every event to identify and avoid duplicate events using deterministic / traditional data structures say a distributed <K,V> store. I do understand that it is exactly for these kind of problems the sketching / probabilistic data structures like bloom filter were invented. But I have contentions in using probabilistic data structures for de-duplicating events in streaming pipelines and I will write another post about it. But when I proposed the finger printing solution for ‘exactly-once’ guarantee, not lot agreed or understood the rationale and I hope this post justified the need for an end-to-end solution to offer ‘exactly-once’ guarantee.