One really interesting aspect of our work at Life360 is supporting the flow of location messages from millions of devices. Our current peak message rates max out around 50k/s, which is large enough to be challenging without touching the red line where absolutely terrifying lives. While we are in the process of rebuilding our services to handle much higher message rates, we are working out an approach to scale the underlying message-handling infrastructure as well.
Distributed queues, NSQ, and one or two issues
We depend on NSQ throughout our setup, and get a lot of value from its simplicity and ease of use. It is really simple to deploy and operate, and has yet to show us any capacity limitations. The standard deployment pattern is to colocate an
nsqd with any service that publishes messages, which distributes work very nicely and makes scaling a no-brainer.
However, there is an aspect of its reliability model that can be problematic for us. When a message cannot be processed due to an error or hangup in a Consumer, that message gets requeued for another Consumer to handle. The requeue operation does not preserve ordering, so the requeued messages go to the end of the line for processing. This isn’t often a problem, but it is an issue that we have known we would need to address at some point.
We have been considering a move to an Event Sourcing model, with a distributed, durable commit log. There’s a great set of posts on this subject that I’ve linked down below. This introduces a different scaling model, as the log becomes an infrastructure entity in its own right, capable of handling potentially many different consumer applications. There are some nice properties that come along with such a log, including an explicit ordering of the messages in the log. If a consumer fails and another needs to pick up the failed workload, it is easy to start up at the point in the stream where the previous consumer stopped, without needing to do any special handling of messages to ensure ordering.
That “durable” property yields some other benefits as well, since we can choose how long we want to retain the messages in the log. Being able to look back in time enables applications to retry messages from a certain point in time, or warm up caches, or look back at WTF just happened after a fault. One of my favorite discussions of this is Martin Kleppman’s Strange Loop presentation linked on his blog post on the subject.
There are two candidates we considered seriously: Kafka and AWS Kinesis. In our case, Kinesis won out for ease of implementation and operation. Once we got started, we were able to get a prototype stream operating in a few minutes through the AWS Console, and then back it up with Terraform once we developed a better understanding of the configuration we wanted to run. It was gratifying to get the underlying infrastructure running quickly and turn attention to the messaging challenges right away.
One of the considerations for any team for a new infrastructure component is the availability of client libraries and abstractions. In the Kinesis space, the two key libraries are the Kinesis Producer Library (KPL) and Kinesis Consumer Library (KCL). Both live under the AWSlabs project on Github, along with a bunch of useful code. There are a few links at the end of this article with examples we found helpful getting up to speed on KCL and KPL applications.
With a working test environment in place, the next priority was a full KPL producer, intended to satisfy our production message rates and latency requirements. The best example for what we needed was the java client in the AWS Big Data Blog posting on the KPL. That code required only minor changes to consume data from an NSQ topic and write it to Kinesis.
The structure of the KPL example is simple: the intake-consumer and stream-writer each run on their own threads, and communicate via a BlockingQueue. Using the JavaNSQClient library, the NSQ consumer setup is a short lambda posting each new message to the shared queue:
The producer is a lightly-edited revamp of the final producer from the blog post, to extract our partition key from the message, pass the message to Kinesis, and then report back the success or failure cases to NSQ:
Our final implementation has a few other niceties (collecting/publishing metrics for Prometheus, parameter handling for runtime constants such as NSQ topic/channel and Kinesis stream, etc) but the changes above are the only substantive ones required to work with NSQ.
Our current throughput at message peak is about 50k/s, with a gentle oscillation over the day as the bulk of our users go through their commute hours. Using the KPL’s message aggregation, we never get close to the records/sec thresholds even though we routinely use most of the provisioned stream bandwidth.
The initial implementation focused on operation and correctness rather than latency. Once up and running, our baseline setup was delivering an average delivery latency of ~80ms. Note that since the KPL batches records together, the average may not be a good indication of the distribution. However, average latency is the metric AWS delivers in CloudWatch, making it the easiest measurement to use for broad-brush tuning.
The most obvious parameter to tweak is
MaxBufferedTime, but therein lies an interesting issue. We initially assumed that our message rate would keep the KPL turning over buffers quickly, so we began with
MaxBufferedTime of 500ms. With that setting we were seeing latency for our
PutRecords calls averaging 80ms in CloudWatch. Cutting the value to 200ms reduced our average latency to 20ms or less, which is sufficient for our current needs.
To date, neither memory nor CPU have been an issue for us. We run a trio of
c4.xlarge instances, which hover around 50% of RAM and 40% CPU usage. The cluster may be a bit network-constrained but not worrying yet. The KPL is not particularly memory-hungry, but we do notice that the KPL bandwidth outbound is about 25% more than our incoming. There is probably another set of experiments to figure out optimal tuning and sizing for this application.
We have been running our NSQ-to-Kinesis messaging bridge at production volume for several weeks now, and our full streaming service for a bit less than that. Overall it is working well for us. In a future post we’ll cover some of the challenges we faced as we implemented our streaming service, and share a few gotchas (and workarounds) we stumbled over in our efforts.
Come join us.
Life360 is creating the largest membership service for families by developing technology that helps managing family life easier and safer. There is so much more to do as we get there and we’re looking for talented people to join the team: check out our jobs page.