This summer, I helped implement streak challenges at Strava by working on the new challenges infrastructure as an intern on the API team. This entailed implementing and deploying two new Scala services, which introduced me to Scala itself, to how Kafka works, and to deploying a Docker image with Mesos. It was particularly exciting to be on this project because my team was in the unique position of being able to pilot and iterate on best practices for microservice development at Strava.
A Little History of Scala Services at Strava
Although Strava has been migrating from a monolithic Rails app to a microservices-oriented architecture for the past couple of years and now has around 50 services written in Scala, the first services were primarily written by a small handful of engineers and initially targeted a limited set of use cases. As the number of engineers in our Scala ecosystem has continued to grow, and more complex services are emerging, there has been a need for developing conventions and best practices.
My project was in a position to influence these best practices because of its technical complexity and the timing of its conception. One service I worked on is responsible for performing and persisting calculations of arbitrary stats. This stat calculation service powers the other service I worked on, which handles challenge creation and determines an athlete’s progress in their joined challenges. Both services read from and write to multiple data stores, with operations in one service often being triggered by operations in another.
Because these services are more sophisticated, we had more opportunities to design complex solutions and to experiment with design patterns not used in older, simpler services. Moreover, our project unfolded right as our platform vision for Scala services was converging on a consistent development practice. As a result, our services were early adopters of this platform, and the design decisions we made helped validate and refine it.
New Services, New Practices
The following anecdotes summarize some of the new engineering practices we tested and introduced when working on the stats and challenges services.
Finite State Machines
One thing my team spearheaded at Strava was the use of a finite state machine (FSM) in a microservice. As I mentioned, one of the services I worked on is responsible for managing the life cycle of an athlete’s challenge progress. The many different states an athlete could be in with respect to a challenge complicated this task: athletes could be unjoined, joined, completed, not completed, or completed manually by a Strava support representative. Moreover, since streak challenges are presented to users as a series of milestones to complete, we had to distinguish challenge progress that completed a milestone from progress that did not.
We chose to model the problem with a finite state machine because the semantics of the service were a perfect fit. An athlete’s journey while in a challenge can be represented as a possibly cyclic sequence of distinct states (unjoined, progressing, and completed) with nuanced conditions for transitioning between those states. Adopting this paradigm ended up being extremely helpful for identifying edge cases and limiting undefined behavior for an athlete’s state during a challenge. For example, one of the use cases we were asked to support was completion of a challenge for athletes by a Strava support representative. The FSM helped enforce that an athlete could not have the challenge force-completed for them if they hadn’t joined the challenge.
Additionally, the FSM cleaned up CRUD semantics of our service by restricting read and write access to the challenge stores based on the state of the athlete. For instance, the FSM implementation naturally prevented clients from reading an athlete’s challenge progress after they had left the challenge.
Codifying the FSM first involved creating and iterating on the state machine diagram until our team was satisfied that we had covered all potential transition events. We then accessorized the FSM to suit the required functionality of our service. For instance, we defined an athlete’s challenge progress in the FSM as an object with a state that reacts to events by performing the necessary CRUD operations, potentially transitioning to a new state. Other properties of the FSM determined whether a transition was possible, as well as which database rows needed to be updated. For example, transitioning from a Progressing state to a Completed state depended on the challenge configuration (e.g. three activities a week for three weeks), and the rows updated to carry out the transition depended on the athlete in the challenge.
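To make the idea concrete, here is a minimal sketch of what such a state machine might look like in Scala. The state names follow the ones described above, but the event names, the `requiredActivities` parameter, and the `ChallengeFsm.transition` function are hypothetical illustrations, not the actual Strava implementation. In particular, the rule that a completed athlete can leave is an assumption.

```scala
// Hypothetical sketch: states an athlete can be in with respect to one challenge.
sealed trait State
case object Unjoined extends State
final case class Progressing(activities: Int) extends State
final case class Completed(bySupport: Boolean) extends State

// Hypothetical events that can drive a transition.
sealed trait Event
case object Join extends Event
case object Leave extends Event
case object ActivityLogged extends Event
case object ForceComplete extends Event // issued by a support representative

object ChallengeFsm {
  // Returns the next state, or None when the event is not allowed in the current state.
  // requiredActivities stands in for the challenge configuration.
  def transition(state: State, event: Event, requiredActivities: Int): Option[State] =
    (state, event) match {
      case (Unjoined, Join) => Some(Progressing(0))
      case (Progressing(n), ActivityLogged) =>
        if (n + 1 >= requiredActivities) Some(Completed(bySupport = false))
        else Some(Progressing(n + 1))
      case (Progressing(_), ForceComplete) => Some(Completed(bySupport = true))
      case (Progressing(_), Leave) | (Completed(_), Leave) => Some(Unjoined)
      case _ => None // e.g. ForceComplete while Unjoined is rejected
    }
}
```

Because every unlisted (state, event) pair falls through to `None`, a force-completion for an athlete who never joined is rejected without any special-case code.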
Our implementation funneled every write to the challenge progress store through the FSM, which meant that no progress row was ever in an unexpected or unknown state. Finally, because our service publishes events when challenge progress values are updated, we added a callback to our FSM that triggers the event publishing component of our service whenever a write to the progress store occurs.
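The single write path and the publishing callback could be sketched roughly as follows. Everything here is an assumption made for illustration: the `ProgressMachine` class, the simplified `Status` values, the `ProgressUpdated` event, and the in-memory map that stands in for the real progress store.

```scala
import scala.collection.mutable

// Hypothetical, simplified statuses for this sketch.
sealed trait Status
case object Joined extends Status
case object Done extends Status

// Hypothetical event handed to the publishing component after a write.
final case class ProgressUpdated(athleteId: Long, status: Status)

class ProgressMachine(onWrite: ProgressUpdated => Unit) {
  // In-memory stand-in for the progress store; private, so it cannot be written directly.
  private val store = mutable.Map.empty[Long, Status]

  def read(athleteId: Long): Option[Status] = store.get(athleteId)

  // The only write path. Returns false, and writes nothing, for an invalid transition.
  def write(athleteId: Long, next: Status): Boolean = {
    val allowed = (store.get(athleteId), next) match {
      case (None, Joined)       => true // joining
      case (Some(Joined), Done) => true // completing
      case _                    => false
    }
    if (allowed) {
      store.update(athleteId, next)
      onWrite(ProgressUpdated(athleteId, next)) // e.g. hand off to an event publisher
    }
    allowed
  }
}
```

Since the callback runs only after a validated write, the publisher in this sketch never sees an event for a transition the machine rejected.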
Designing Clear Error Responses
Another issue our team ran into as we implemented the stat calculation and challenge services was how to communicate errors to clients. Specifically, we were returning empty responses to clients when a read or write to one of the service’s underlying databases was not possible, usually due to an invalid request. For example, I implemented an endpoint that forces recalculation of an athlete’s challenge progress. An empty response in this case could mean that the athlete had not joined the challenge or that the specified challenge did not exist. This was not ideal because the client would likely want to handle these errors differently. If the athlete had not joined the challenge, the client could add the athlete to the challenge and retry; if the challenge didn’t exist, the client had probably sent a malformed request and could retry with a different challenge.
Beyond just changing the logged error message, we wanted to give the client the specific context of its failure, to enable flexible error handling and minimize confusion. Our interface definition language, Thrift, allows us to define custom exceptions, but we were hesitant to use them because we thought they counted against our error metrics. If they did, an accidentally malformed client request with the wrong athlete ID could alert an on-call engineer in the middle of the night.
After doing some research, we discovered that Thrift exceptions actually counted as successful responses from the server and failures for the client. We were then able to define a different Thrift exception for each type of empty response and return it to the client, which can handle it however it wants. Addressing this problem in the streaks project helped drive more concrete error-handling guidelines for services with Thrift interfaces, and our services were the first at Strava to take advantage of Thrift exceptions in this way.
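As a rough illustration of the difference this makes for a client, consider the sketch below. The exception classes are hand-written stand-ins for what would really be declared in the Thrift IDL and generated into Scala, and `ProgressService`, `Client`, the IDs, and the use of `Try` (rather than an asynchronous result type) are all simplifying assumptions.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-ins for exceptions that would be declared in the Thrift IDL.
final case class ChallengeNotFound(challengeId: Long)
  extends Exception(s"challenge $challengeId does not exist")
final case class AthleteNotJoined(athleteId: Long, challengeId: Long)
  extends Exception(s"athlete $athleteId has not joined challenge $challengeId")

object ProgressService {
  // Toy data standing in for the real stores.
  private val challenges = Set(10L)
  private val joined = Set((1L, 10L))

  // Server side: fail with a specific exception instead of returning an empty response.
  def recalculateProgress(athleteId: Long, challengeId: Long): Try[Int] =
    if (!challenges.contains(challengeId)) Failure(ChallengeNotFound(challengeId))
    else if (!joined.contains((athleteId, challengeId)))
      Failure(AthleteNotJoined(athleteId, challengeId))
    else Success(0) // placeholder progress value
}

object Client {
  // Client side: each failure mode can now be handled differently.
  def handle(athleteId: Long, challengeId: Long): String =
    ProgressService.recalculateProgress(athleteId, challengeId) match {
      case Success(progress)             => s"progress=$progress"
      case Failure(_: AthleteNotJoined)  => "join the athlete, then retry"
      case Failure(_: ChallengeNotFound) => "fix the challenge ID in the request"
      case Failure(other)                => throw other
    }
}
```

With an empty response, the two failure branches in `Client.handle` would have been indistinguishable.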
Centralizing Database Access
Another issue we encountered while implementing the challenge deletion use case was the need to delete rows in multiple tables transactionally, which required the deletion method to have access to multiple database tables. This was a problem because of the way we modeled table access in our application: the model removed direct access to the tables, which made SQL operations that combine multiple tables impossible. The deletion logic had to delete both the challenge configuration and the challenge milestones, which were stored in separate tables and managed by separate store classes. The first iteration of this method was implemented in the challenge configuration store, so we had to grant the config store write access to the challenge milestone store, breaking encapsulation.
Our solution moved the deletion logic to a store manager class that managed access to all the stores in the service. This meant refactoring our server logic to make all database calls through the manager and moving the deletion logic (the only method as of now that writes to multiple databases at once) to this class. Semantically, it made more sense for a store manager to have write access to the stores it instantiates than for one store class to have write access to another.
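A minimal sketch of this arrangement, under stated assumptions: `ConfigStore`, `MilestoneStore`, and `StoreManager` are hypothetical names, the stores are in-memory maps rather than database tables, and the database transaction that the real deletion relies on is not modeled.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-ins for the two stores; real stores would wrap database tables.
class ConfigStore {
  private val rows = mutable.Map.empty[Long, String]
  def put(challengeId: Long, config: String): Unit = rows.update(challengeId, config)
  def get(challengeId: Long): Option[String] = rows.get(challengeId)
  def delete(challengeId: Long): Unit = { rows.remove(challengeId); () }
}

class MilestoneStore {
  private val rows = mutable.Map.empty[Long, List[String]]
  def put(challengeId: Long, milestones: List[String]): Unit = rows.update(challengeId, milestones)
  def get(challengeId: Long): List[String] = rows.getOrElse(challengeId, Nil)
  def delete(challengeId: Long): Unit = { rows.remove(challengeId); () }
}

// The manager instantiates both stores, so neither store needs a reference to the other.
class StoreManager {
  private val configs = new ConfigStore
  private val milestones = new MilestoneStore

  def createChallenge(challengeId: Long, config: String, steps: List[String]): Unit = {
    configs.put(challengeId, config)
    milestones.put(challengeId, steps)
  }

  def configFor(challengeId: Long): Option[String] = configs.get(challengeId)
  def milestonesFor(challengeId: Long): List[String] = milestones.get(challengeId)

  // The one operation that spans stores. A real implementation would run both
  // deletes inside a single database transaction; that is not modeled here.
  def deleteChallenge(challengeId: Long): Boolean =
    configs.get(challengeId) match {
      case Some(_) =>
        milestones.delete(challengeId)
        configs.delete(challengeId)
        true
      case None => false
    }
}
```

Server code in this sketch only ever talks to `StoreManager`, so the cross-store deletion lives in the one class that legitimately owns both stores.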
Finally, we adopted a practice that the infrastructure team recently began for production services: ToYS, or Turn off Your Stuff, which entails literally shutting down all the server instances of your service and observing what happens. The purpose of ToYS is to ensure that we know how clients of a service will behave when it is down and can mitigate the impact of server failure. In the last few weeks before streak challenges became public, we turned off the challenges service and found that the empty responses we were returning broke the challenge gallery page on our website and mobile apps. The team that owned the challenge gallery was then able to improve the error handling on that page to avoid that particular side effect should our service go down unexpectedly in the future.
I really appreciated the opportunity to be part of an innovative effort as an intern, especially on a project with as tight a deadline as ours. It’s really important to me to be on a team that prioritizes best practices, and the initiatives we took while developing the backend infrastructure for streak challenges clearly illustrated that quality. Beyond this, I also loved being treated like a full-time employee; I worked off the same backlog as my manager, got assigned high-priority tasks, and was trusted to accomplish things I had no prior experience with. I also developed an appreciation of Scala during my time at Strava. I’d never worked with it before but found that it was very easy to move quickly after developing even a minimal level of proficiency with the language. As the following graph illustrates, the rate at which we were able to accomplish tasks grew along with the team.
A lot of the logic we wrote involved chaining different processes together, and the prominence of Futures and flatMap in Scala made it easy to reason about concurrent code. In addition, we were able to leverage unit test libraries similar to Java’s that made verifying code correctness painless. Finally, I am super lucky to have gotten my code reviewed by some incredibly smart engineers who helped me feel comfortable and productive in my first experience working with distributed systems. I’ll miss Strava and the people I’ve worked with a lot and am excited to see streaks launch this September!
Many thanks to Mindy de Hooge (my manager) and Zack Isaacs (my mentor) for coaching me on service development and life, to Lulu Ye for teaching me about AWS and infrastructure at Strava, to the rest of the Streaks team: Jeff Pollard (pilot of the FSM), Mike Kasberg, and Noa Levi, and to the rest of the API team: Yudi Fu, Mateo Ortega, and Rachel Harrigan. I learned a lot from y’all — thanks for a fun and rewarding summer :)