Localz meets Kafka: message durability and latency

Jesse Cameron
Localz Engineering · Apr 26, 2018
kafka loves localz, and localz loves kafka

At Localz our SaaS ecosystem runs as a fleet of microservices: lots and lots of containers running on cloud VMs. We needed a way to manage the communication between them.

Enter Apache Kafka. Kafka has been around in the industry for a while now, so if you aren't familiar I'd suggest checking out these amazing blog posts.

As a refresher, it can be helpful to think of Kafka as a large distributed log file (as in, distributed across many machines). When a new message is received, it is appended to the end of that log file. There's some additional magic that allows data to be partitioned, but that's beyond the scope of this blog post.
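In code, that mental model looks something like this. A minimal sketch using the kafka-python client; the broker address, topic name and payload are placeholders rather than anything from our actual setup:

```python
# Minimal sketch of the append-only log model, using the kafka-python client.
# Broker address, topic name and payload are placeholders.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Each send appends the message to the end of the topic's log (one log per partition).
producer.send("driver-locations", b'{"driver": "123", "lat": -37.81, "lon": 144.96}')
producer.flush()

# Consumers simply read the log forward from an offset of their choosing.
consumer = KafkaConsumer(
    "driver-locations",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for record in consumer:
    print(record.offset, record.value)
```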

the owls are not what they seem

Kafka at Localz

Localz is a micro-location start-up that delivers thousands of little parcels of happiness every day. Consequently, our system runs hot with drivers updating their location and customers requesting updates thousands of times a day. While a common usage of Apache Kafka is as an event sourcing mechanism, here at Localz the use case is a little different. We aren't interested in using it as a log of events so much as an inter-service messaging platform, facilitating the communication of all events between APIs. Still, it is fair to say that the service forms a key backbone of our infrastructure.

However, this imposes some interesting constraints on our cluster. We need to ensure that all our events will reach their destination, which means at-least-once messaging semantics. We also need our end-to-end latency to be sufficiently low, preferably under 2 seconds. If our drivers and customers aren't sent updates within a few seconds, neither party will be happy. Yet, those with a little bit of experience can tell you that building a blisteringly fast messaging solution that can also guarantee delivery is a little at odds with itself. Kafka's configurability, however, offers a way to balance the two.

Tuning Kafka

Let’s talk about how Kafka relates to the CAP theorem. If you have multiple nodes in your cluster, it will handle partition tolerance through its internal consensus algorithms, so the decision falls on whether the service should favour Availability or Consistency. Backtracking for a moment: you can achieve consistency by holding back consumer reads until all the nodes have been updated, and availability by allowing reads on a node before all of them are updated.

Fig 1. Visualisation of production client settings in relation to the CAP theorem. Note: this is in the context of a Kafka cluster with a default replication factor of 3.

One of the strengths of Kafka is that you can tailor the CAP emphasis at a per-topic level. For example, you can set the number of node acknowledgements you want to receive before considering a message committed. If you desire full consistency you can wait for all nodes to return an ack (acks=all). Or, if you want to favour higher availability, you can wait for a single node to acknowledge (acks=1) or for none at all (acks=0). This is useful where you require some topics in your Kafka cluster to be tuned for durability [1] and others for low latency. And because these settings are all passed through by the producing client, it empowers our developers to make their own decision based on their use case, without submitting a ticket to the team responsible for the Kafka cluster.
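To make that concrete, here is roughly what the choice looks like from the producing side. This is a sketch using the kafka-python client; the broker addresses and topic names are made up for illustration:

```python
from kafka import KafkaProducer

BROKERS = "kafka-1:9092,kafka-2:9092,kafka-3:9092"  # placeholder addresses

# Durability-focused producer: wait for every in-sync replica to acknowledge.
durable_producer = KafkaProducer(bootstrap_servers=BROKERS, acks="all")

# Latency-focused producer: the leader's acknowledgement alone is enough.
fast_producer = KafkaProducer(bootstrap_servers=BROKERS, acks=1)

# Fire-and-forget: don't wait for any acknowledgement at all.
fire_and_forget_producer = KafkaProducer(bootstrap_servers=BROKERS, acks=0)

# Each team picks the producer config that suits its topic and use case.
durable_producer.send("order-events", b"order 42 delivered")
fast_producer.send("driver-locations", b"driver 123 moved")

durable_producer.flush()
fast_producer.flush()
```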

Fig 2. A simple P.O.C. latency test. Single topic replicated across three t2.medium EC2 instances in separate AZs. Kafka v1.0.0. [2]
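For the curious, a back-of-the-envelope version of that test is easy to put together. The sketch below isn't the actual harness we used, just an illustration of the idea, with a placeholder broker and topic:

```python
# Rough sketch of a produce-latency measurement, similar in spirit to the
# P.O.C. above. Broker address and topic name are placeholders.
import time
from statistics import median
from kafka import KafkaProducer

def measure_produce_latency(acks, n=100, topic="latency-test"):
    producer = KafkaProducer(bootstrap_servers="localhost:9092", acks=acks)
    samples = []
    for _ in range(n):
        start = time.monotonic()
        # Block on the returned future so we time the full round trip to the ack.
        producer.send(topic, b"ping").get(timeout=10)
        samples.append(time.monotonic() - start)
    producer.close()
    return median(samples)

for acks in (0, 1, "all"):
    print(f"acks={acks}: median produce latency {measure_produce_latency(acks) * 1000:.1f} ms")
```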

As you can see from Fig 2, the stronger your durability constraints, the higher the latency. For most of our topics we sit in the middle as a compromise, opting for acks=1 and min.insync.replicas=2 [2]. This ensures that at least some form of write has completed before an acknowledgement is returned to the producing client. Alternatively, for messages whose delivery we are especially concerned about, we have the ability to set acks=all, meaning the message may take slightly longer to reach its destination, but in the event of the cluster losing a node the message won't ever be lost. For messages where latency is our main concern, we only require the first node to acknowledge the write. This does introduce the potential for message loss, though: if the acknowledging node fails before the message is replicated to any of the other nodes, that message is gone.
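Since min.insync.replicas is a topic-level setting, it is applied when the topic is created (or altered) rather than on the producer. Here is a sketch of what that could look like with kafka-python's admin client; the topic name, partition count and broker address are purely illustrative:

```python
# Sketch of creating a topic with min.insync.replicas=2 via kafka-python's
# admin client. Topic name, partition count and broker address are illustrative.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="order-events",
        num_partitions=3,
        replication_factor=3,
        # min.insync.replicas is enforced for producers using acks=all: the broker
        # rejects writes when fewer than 2 replicas are in sync (see footnote [2]).
        topic_configs={"min.insync.replicas": "2"},
    )
])
```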

Addendum

It’s worth taking a moment to mention that there currently isn’t a way to fine-tune network and processing throughput at a per-topic level. Where a team's Kafka usage is sufficiently high, they'll often run two separate clusters: one for messages that are considered high priority, and another for messages that aren't as latency-sensitive, allowing the latter to chew through large amounts of bandwidth with reckless abandon. As our usage isn't yet at this level, the built-in controls will suffice for now.

At Localz, Kafka has ultimately become a critical piece of our infrastructure, but it wouldn't have been as powerful without its in-built flexibility and delivery semantics. On top of this, the ability to write our own producing/consuming client APIs certainly makes it a great tool.

Thanks for reading!
And thanks to Hugo and Brett Rann for the feedback!

Footnotes

[1] You may be wondering why I used the term durability and not reliability. I see reliability as the chance of a service not failing, whereas durability is the chance of a service performing its expected output even if there are system failures.

[2] It’s worth taking a moment to discuss how min.insync.replicas affects the time to acknowledge. In short, it doesn't. The official documentation can be slightly confusing, but if acks=all, Kafka will wait for all of the in-sync replicas to commit before returning an ack, regardless of the min.isr. It does still affect your durability, though: if the number of nodes in sync with the leader is less than min.isr, the client will fail to produce, which still affects the consistency and durability model.
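To make that failure mode concrete, here is roughly how it surfaces to a kafka-python producer using acks=all. This is an illustrative sketch, not our production code; the broker address and topic name are placeholders:

```python
# Illustrative sketch: with acks="all" and min.insync.replicas=2, a produce
# request fails when fewer than 2 replicas are in sync.
from kafka import KafkaProducer
from kafka.errors import KafkaError, NotEnoughReplicasError

producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all", retries=5)

try:
    producer.send("order-events", b"order 42 delivered").get(timeout=10)
except NotEnoughReplicasError:
    # The cluster has fewer in-sync replicas than min.insync.replicas;
    # the write was refused rather than accepted with weaker durability.
    print("produce rejected: not enough in-sync replicas")
except KafkaError as err:
    print(f"produce failed: {err}")
```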
