Decomposing the Turo Monolith: Activity Feed

Andre Sanches
Turo Engineering

Challenges, results, and lessons learned from this backend feature extraction project.

Introduction

Like many startups, Turo built its backend system into a single application backed by a single database — this is known in the software industry as a monolith. New features are constantly added to this application, substantially increasing its size and complexity.

Over more than a decade, this monolith has accrued enough complexity and tech debt that maintaining it became a significant effort and risk. With Turo’s growing popularity, a challenge we are addressing is scaling our backend to keep up with the increased traffic.

Not all traffic is the same, however, as different features in the application have different traffic patterns. For example, traffic for vehicle search is orders of magnitude larger than for booking a vehicle, or for a simple profile update. Since all features are built into the same codebase, when one specific part of the application experiences a spike in traffic, all other parts suffer from higher latencies, resulting in an overall poor customer experience.

For a couple of years now, Turo engineers have been hard at work to decompose our monolith into microservices with distinct load profiles that scale individually. Features like vehicle search, conversations, and payments were already decomposed in previous projects with excellent results.

This blog post details our journey on decomposing the Activity Feed feature.

Activity Feed

Push notifications to mobile apps are ephemeral in nature. A notification disappears when you either tap on it or dismiss it altogether.

Turo sends important notifications with promotion codes, links to bookings, etc, which our customers may need to reference long after the notification widget is gone. The purpose of the Activity Feed feature is to keep a history of all such push notifications and present them to our customers in a user-friendly way.

The Activity Feed view, under the Notifications tab in the user’s Inbox

Traffic for this feature accounted for upwards of 15% of the average write load on the monolith database at any given time, with observed peaks as high as a staggering 45%.

P99 query latencies perceived by average-volume customers ranged between 250 ms and 450 ms. For high-volume customers with dozens of vehicles listed, latencies were even higher, causing frequent timeouts. This was exacerbated when other features in the system were also under high load.

Microservice Extraction

The project kicked off with research to establish key metrics quantifying the problem space that the new system would improve upon.

We created a dashboard that tracked traffic load, latencies, and the correlation between them for both write and read workloads. For latencies, the focus was on the 99th percentile to ensure we were addressing even the worst-case scenarios.

Read throughput and latency
Write throughput and latency

Another key metric we tracked was the read and write load on the database cluster. Under heavy traffic, such as during large marketing blasts, Activity Feed writes comprised as much as 45% of the entire database load and affected other, unrelated features of the backend system.

Database time slice load of activity feed reads and writes

Defining these key target metrics helped us design a new system that can handle the observed load and traffic patterns.

Enter the Notifications Microservice

We created a new microservice into which we extracted the Activity Feed logic. The near-term plan is for this service to encapsulate the logic for all notifications, not just the Activity Feed, so we called it the Notifications Microservice.

Three significant aspects of the Activity Feed feature were considered when designing the new system:

  • Write traffic has sudden spikes at unpredictable times.
  • Activity data is not time critical — no need for strong consistency.
  • Persistent data has no immediate relation to any other tables in the monolith database, making it a prime candidate for migration to a non-relational data store.

In order to gracefully absorb spikes in write workloads, we chose a queue-based approach using SQS. This effectively makes data writes eventually consistent, which fits the second aspect listed above.
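To make the write path concrete, here is a minimal sketch of what the producer side could look like using the AWS SDK for Java v2. The queue name and payload shape are illustrative assumptions for this post, not our actual implementation:

```java
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.GetQueueUrlRequest;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

/** Enqueues activity-feed events; a separate consumer persists them later. */
public class ActivityEventPublisher {

    // Hypothetical queue name used for illustration only.
    private static final String QUEUE_NAME = "activity-feed-events";

    private final SqsClient sqs;
    private final String queueUrl;

    public ActivityEventPublisher(SqsClient sqs) {
        this.sqs = sqs;
        // Resolve the queue URL once and reuse it for every publish.
        this.queueUrl = sqs.getQueueUrl(
                GetQueueUrlRequest.builder().queueName(QUEUE_NAME).build()).queueUrl();
    }

    /** Fire-and-forget publish; the write becomes visible once a consumer drains it. */
    public void publish(String eventJson) {
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(queueUrl)
                .messageBody(eventJson)
                .build());
    }
}
```

Because the producer only enqueues a message, a traffic spike simply grows the queue depth instead of overloading the data store; consumers drain it at their own pace.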

We chose DynamoDB for data persistence, and crafted a table schema with a composite partition key strategy that optimizes for query operations.
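As a rough illustration of that schema, the sketch below creates a table keyed by a composite partition key plus a timestamp sort key. The table name, attribute names, and the sort key are assumptions made for the example; the design we describe in this post only commits to the composite partition key strategy:

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.*;

public class ActivityFeedTable {

    /** Creates an illustrative activity-feed table (names are placeholders). */
    public static void create(DynamoDbClient ddb) {
        ddb.createTable(CreateTableRequest.builder()
                .tableName("activity_feed")
                .attributeDefinitions(
                        AttributeDefinition.builder()
                                .attributeName("feedKey")   // composite partition key, e.g. "{userId}#{role}"
                                .attributeType(ScalarAttributeType.S)
                                .build(),
                        AttributeDefinition.builder()
                                .attributeName("createdAt") // sort key for time-ordered feed queries
                                .attributeType(ScalarAttributeType.S)
                                .build())
                .keySchema(
                        KeySchemaElement.builder()
                                .attributeName("feedKey").keyType(KeyType.HASH).build(),
                        KeySchemaElement.builder()
                                .attributeName("createdAt").keyType(KeyType.RANGE).build())
                .billingMode(BillingMode.PAY_PER_REQUEST)
                .build());
    }
}
```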

Reads are served via standard REST calls, using reusable HTTP connection pools to avoid repeated TCP handshake overhead.
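There are many ways to achieve this on the JVM; one common pattern is a single shared, pooled client, as in this sketch using Apache HttpClient (the pool sizes are arbitrary placeholders, not our production settings):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public final class NotificationsHttpClient {

    // A shared connection manager keeps established TCP (and TLS) connections
    // alive between requests, so repeated calls skip the handshake.
    private static final PoolingHttpClientConnectionManager POOL =
            new PoolingHttpClientConnectionManager();

    static {
        POOL.setMaxTotal(50);            // total connections across all routes
        POOL.setDefaultMaxPerRoute(20);  // connections per target host
    }

    // One client instance reused by all callers in the process.
    public static final CloseableHttpClient CLIENT = HttpClients.custom()
            .setConnectionManager(POOL)
            .build();

    private NotificationsHttpClient() {}
}
```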

Below is a simplified component diagram for the new service.

Activity feed component diagram

Composite Partition Keys in DynamoDB

Turo customers can have two roles: Guest and Host, each with distinct activity feeds. Metrics for read traffic showed that queries were divided almost evenly between the two roles, suggesting this could be leveraged when defining the Partition Key for the table schema to increase the probability of well-balanced partitions.

The challenge is that DynamoDB does not natively support composite keys — there can only be a single field for the Partition Key — so we had to get creative.

We solved this by producing composite values, concatenating attributes with a delimiter character, e.g. {ATTR1}#{ATTR2}#{ATTR3}#.... The delimiter prevents collisions when two concatenated numeric attributes would otherwise result in the same value: without a delimiter, concatenating 123 and 456 would collide with concatenating 12 and 3456.
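In code, composing a key is just a delimiter join. The helper below is a simplified sketch (attribute names and order are illustrative), including the collision example above:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public final class CompositeKeys {

    private static final String DELIMITER = "#";

    /** Joins attribute values into a single partition key value. */
    public static String compose(Object... attributes) {
        return Arrays.stream(attributes)
                .map(String::valueOf)
                .collect(Collectors.joining(DELIMITER));
    }

    public static void main(String[] args) {
        // e.g. a per-user, per-role feed partition (values are made up)
        System.out.println(compose(8675309L, "HOST")); // "8675309#HOST"

        // Without the delimiter, both of these would collapse to "123456":
        System.out.println(compose(123, 456));  // "123#456"
        System.out.println(compose(12, 3456));  // "12#3456"
    }

    private CompositeKeys() {}
}
```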

Composite keys are used for both the main table and Global Secondary Indexes.

Examples of composite keys.

Results

The system was rolled out on June 10th, 2024. Below are the observed improvements, perceived by customers when navigating to their activity feed views:

  • P99 latencies dropped by a staggering 71%, from 450 ms to 130 ms.
  • Peak average latencies dropped by 45%, from 90 ms to 50 ms.
  • Far more stable latencies regardless of traffic, resulting in predictable query responses at any given time.
  • Queue-based write workloads gracefully absorbed all spikes in traffic with zero impact to the underlying systems.
  • Completely offloaded the monolithic DB a few days after the rollout, eliminating any future interference between the two systems.
P99 query latency drop
Average query latency drop
Daily ebb-and-flow of traffic
Monolith database offloaded on June 17th, a few days after the rollout.

Key Takeaways

  1. It was hugely helpful to kick off the project with a data-driven approach; establishing key metrics up front enabled us to clearly measure the impact of the new system.
  2. Understanding traffic patterns helped us design the composite key approach, bringing tangible benefits to the service’s scalability.
  3. Eventually consistent data writes unlock massive scalability and should always be used unless strong consistency is a critical requirement of the system.

Thank you for reading! Feel free to reach out if you have any questions.

If you enjoyed this post and are interested in putting the world’s 1.5 billion cars to better use, Turo is hiring!


Opinions are my own and not the views of my employer.