Upgrading Kafka at Strava

Danny Schofield · strava-engineering · Dec 18, 2017

About Me:

My name is Daniel Schofield and I am a senior at the University of Maryland. I have had the pleasure of interning at Strava on the Infrastructure team for the past three months. While I am an avid runner who has used Strava for a while, I was also drawn to working here by the chance to see how Strava's engineering projects work under the hood.

Interning:

For the first couple of weeks of my internship, I was assigned several different tasks to familiarize myself with Strava's codebase. The biggest of these projects was making updates to the matched runs service, Toucan. This was a great start to my internship: I could immediately see the product impact my changes made while becoming more comfortable at Strava. I was then assigned my main project: implementing a new system for publishing events with Apache Kafka.

Project Background:

Kafka is a distributed streaming platform that lets you publish and subscribe to streams of records. Kafka is heavily used at Strava: it logs the majority of events that happen on the platform, such as changes to activities, segment efforts, and kudos. It also plays an important role in logging client behavior events.
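
To make the publish side concrete, here is a minimal sketch of producing a record with the official Java Kafka client. The broker address, topic name, and payload are hypothetical placeholders, not Strava's actual configuration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        // Minimal producer configuration; the broker address is a placeholder.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // try-with-resources closes the producer, flushing any buffered records.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Append one record to a hypothetical "kudos" topic; any consumer
            // subscribed to that topic receives it as part of the stream.
            producer.send(new ProducerRecord<>("kudos", "activity-123", "{\"athlete_id\": 456}"));
        }
    }
}
```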

When I arrived at Strava, the majority of events were produced by our Rails front end, which serves API and web traffic. We wanted to improve how we published events to Kafka. Most importantly, we were producing events using an unmaintained Kafka client, poseidon. In addition, as Strava has grown, so has the number of services that want to produce events to Kafka. When upgrading Kafka versions, the brokers need to be updated first, followed by the clients, and as the number of producers grew, making all of the associated upgrades became more tedious. Event logging is critical to many pieces of Strava's products, and we could not tolerate a loss of events, even while updating the logging system itself.

In theory it is easy to upgrade Kafka brokers, but in practice it can be difficult. For example, attempting to upgrade the Kafka brokers while supporting older clients resulted in an increased web error rate: poseidon, the client we were using, did not handle reduced-availability scenarios well. We also wanted to be able to use the official Java Kafka client. Overall, when I arrived there were many reasons to restructure the architecture of event logging to Kafka.

The Solution:

We explored several solutions to improve the way events are logged, and putting event logging behind a service emerged as the clear choice. With a service in front of Kafka, we could leverage Thrift client code already present in the Rails codebase, which makes it easy to configure retries and other settings when sending requests to the service.
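
The real client lives in Ruby inside the Rails app, but the retry idea translates directly. Below is a Java-flavored sketch, assuming a hypothetical generated Thrift service named EventService with a logEvents method; the host, port, and timeout values are made up.

```java
import java.util.List;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class RetryingEventClient {
    // Sends a batch of events, retrying on transient Thrift failures.
    // EventService and Event stand in for the real generated Thrift code.
    public static void logWithRetries(List<Event> events, int maxAttempts) throws TException {
        TException lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            TTransport transport = new TSocket("event-service.internal", 9090, 2000);
            try {
                transport.open();
                EventService.Client client =
                        new EventService.Client(new TBinaryProtocol(transport));
                client.logEvents(events); // one request carries a whole batch
                return;                   // success: stop retrying
            } catch (TException e) {
                lastError = e;            // transient failure: try again
            } finally {
                transport.close();
            }
        }
        throw lastError;                  // out of attempts: surface the last error
    }
}
```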

The design of the service centers on a single Kafka producer that exposes a Thrift endpoint. Rails forms a request containing a batch of events; when the service receives them, it logs them to Kafka. This architecture is simple and allows the Java Kafka client to be used when publishing. Additionally, any Thrift client can be used to publish events. Finally, having a single producer, and therefore a single event sink, makes it easier to update the Kafka version and configuration.
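
Here is a minimal sketch of what the handler at the heart of such a service might look like, again assuming the hypothetical EventService Thrift interface and Event struct; the post does not show the real IDL, so the field names below are invented.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical core of the event-logging service: a Thrift handler backed by
// one long-lived Kafka producer. EventService.Iface and Event stand in for
// the generated Thrift code.
public class EventServiceHandler implements EventService.Iface {
    private final KafkaProducer<String, String> producer;

    public EventServiceHandler(Properties producerConfig) {
        // A single producer for the whole service: one place to upgrade the
        // Kafka client version and tune producer configuration.
        this.producer = new KafkaProducer<>(producerConfig);
    }

    @Override
    public void logEvents(List<Event> batch) {
        for (Event event : batch) {
            // Each event names its destination topic; the payload is
            // forwarded to Kafka as-is.
            producer.send(new ProducerRecord<>(event.getTopic(),
                                               event.getKey(),
                                               event.getPayload()));
        }
    }
}
```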

By standardizing on a service, Kafka becomes accessible to any new service that wants to produce events: a simple Thrift request replaces per-service Kafka client configuration. The service can also be extended with features such as message validation and any additional logic we introduce in the future. We deployed the service with Marathon so it could leverage pre-existing infrastructure such as autoscaling.

Lessons Learned:

Owning the development of a service end to end was valuable for my growth as an engineer. Building it taught me about the open source software we use at Strava. For example, the application is deployed on Marathon in Docker containers, which gave me insight into container orchestration. Strava also uses linkerd and Zookeeper for load balancing and service discovery, along with Graphite for metrics collection. One of the best parts of developing this service was gaining a sense of how these technologies interact with one another. Working at Strava, I've had the opportunity to learn about technologies I didn't know existed before I arrived.

Overall Experience:

Interning at Strava was an amazing experience, and I am extremely thankful that I had the opportunity to work here. I learned about software engineering and industry best practices. It was great to see that even though people have strong opinions on the product, they still make data-driven decisions focused on what is best for the athlete. While here I also ran my first marathon, the California International Marathon, and felt extremely encouraged by the positive culture that prevails at Strava. Working on a real project that is deployed in production and actually has an impact on Strava made my work feel important. I appreciated the iterative development process that allowed me to grow and become a better developer and a better person.

I would like to thank Jeff Pollard for being my technical mentor, Steve Lloyd for being my manager, and all of the Infrastructure team for supporting me throughout my time here. Go Terps.
