Scheduled Messages at Turo: Challenges and Successes in Scale

Mackenzie Bligh
Turo Engineering
Oct 13, 2023

This blog post shares insights from building Turo’s scheduled messaging system, along with the subsequent iterations we have made to further improve the product experience for our Hosts and Guests. Turo uses a job scheduler called Quartz to schedule “jobs.” In simple terms, Quartz can be thought of as a calendar that lets us run specific code at specific times. For example, we use Quartz for jobs like sending push notifications, keeping our data current, and for user-defined jobs like sending scheduled messages. Our Scheduled Messages system effectively functions as a finite state machine, which is essentially a fancy way of describing a flow chart of “states” that an object can move through over its life cycle.

Scheduled Messages Finite State Machine
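
To make this more concrete, here is a minimal, illustrative sketch (not our production code) of what a message-status enum and a Quartz job skeleton might look like in Java. The state names beyond PENDING and the class names are hypothetical, invented for this example.

    import org.quartz.Job;
    import org.quartz.JobExecutionContext;
    import org.quartz.JobExecutionException;

    // Hypothetical life-cycle states for a scheduled message; only PENDING is
    // referenced later in this post, the others are illustrative.
    enum ScheduledMessageStatus { PENDING, SENT, FAILED, CANCELLED }

    // A bare-bones Quartz job: Quartz calls execute() whenever a trigger fires.
    public class SendScheduledMessagesJob implements Job {
        @Override
        public void execute(JobExecutionContext context) throws JobExecutionException {
            // The work to run at the scheduled time goes here, e.g. finding
            // messages that are ready to send.
        }
    }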

The First Iteration

When we originally built Scheduled Messages, Turo Engineering had concerns about Quartz’s ability to execute what we call “ad-hoc” style jobs in a timely and reliable manner. These are essentially one-time jobs that perform a small task with a specific, limited scope. For example, ad-hoc jobs could be used to start and end a trip on Turo, or to send trip-related reminders to users. These jobs are well suited to small workloads that need to run at a specific time. Their main drawback is that if they don’t succeed on the first try, it is difficult to identify and re-attempt the backlogged jobs; when issues have arisen, we have had to abandon failed ad-hoc jobs so that our systems could recover. For this reason, we typically reserve “ad-hoc” style jobs for less critical features, where a failed execution has no significant negative impact.
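
For illustration, an “ad-hoc” job is typically registered with a one-shot trigger that fires at a single point in time. The sketch below uses the standard Quartz API; the job class and identifiers are made up for this example.

    import org.quartz.*;
    import java.time.Instant;
    import java.util.Date;

    public class AdHocScheduling {
        // Schedule a hypothetical one-time reminder job to fire once at `fireAt`.
        public static void scheduleTripReminder(Scheduler scheduler, long tripId, Instant fireAt)
                throws SchedulerException {
            JobDetail job = JobBuilder.newJob(SendTripReminderJob.class) // hypothetical Job class
                    .withIdentity("trip-reminder-" + tripId)
                    .build();

            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity("trip-reminder-trigger-" + tripId)
                    .startAt(Date.from(fireAt)) // fire exactly once, at this time
                    .build();

            scheduler.scheduleJob(job, trigger);
        }
    }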

Alternatively, Quartz provides a different style of job that runs at repeated intervals and can have a larger scope while being more reliable. For Scheduled Messages, there is a job that runs every minute (sometimes referred to as a “cron” job) and checks for Scheduled Messages with a PENDING status and a “send at” time in the past. The main advantage of this style of job is that it is easy to identify backlogged messages when something isn’t functioning properly, and to have the job automatically clear that backlog once functionality is restored. The main disadvantage is that, unless the job is carefully structured, a single message can prevent other messages from being sent.
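
The recurring job, by contrast, is registered with a cron trigger that fires every minute. This is a hedged sketch reusing the job class from the earlier example; the query in the trailing comment is pseudocode standing in for whatever data-access layer actually looks up PENDING messages.

    import org.quartz.*;

    public class CronScheduling {
        // Register a job that Quartz fires at second 0 of every minute.
        public static void scheduleEveryMinute(Scheduler scheduler) throws SchedulerException {
            JobDetail job = JobBuilder.newJob(SendScheduledMessagesJob.class)
                    .withIdentity("send-scheduled-messages")
                    .build();

            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity("send-scheduled-messages-every-minute")
                    .withSchedule(CronScheduleBuilder.cronSchedule("0 * * * * ?")) // every minute
                    .build();

            scheduler.scheduleJob(job, trigger);
        }
    }

    // Inside the job, the lookup is roughly:
    //   SELECT * FROM scheduled_message WHERE status = 'PENDING' AND send_at <= now()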

Our Hosts identified two main areas of concern they would like to see improved: Scheduled Messages sometimes failed to send on time, and sometimes failed to send at all. Both issues had the same underlying cause: we were using our job scheduler (Quartz) both to find Scheduled Messages that are ready to send and to do the work of actually sending them. If any Scheduled Message fails to send for a reason our code doesn’t account for, it can cause the job itself to fail, thereby blocking all messages from sending. This coupling between finding scheduled messages that need to be sent and sending them is a flaw we had to address before making any further improvements to the system.

The Next Iteration

To support the businesses of our Hosts, we decided we needed to invest further in the reliability of our Scheduled Messaging system. We examined converting the existing system to run on “ad-hoc” jobs, but ultimately decided that this would put too much load on Quartz and could hurt the reliability of features outside of Scheduled Messages as the business, and adoption of Scheduled Messages, continues to grow. We also examined adopting different workflow engines like Temporal and Copper, which represent a fundamentally different approach to scheduling workloads than Quartz. Ultimately, we decided that these technologies would require a significant engineering effort to adopt, and would take too long to complete given the immediate needs of our Host community.

Our breakthrough came when we realized that the issues we had been seeing were the result of using Quartz as the answer to both the “when” and the “how.” By decoupling the job that finds Scheduled Messages that are ready to send from the mechanism that actually sends them, we could enhance the reliability and scalability of the Scheduled Messaging system. We now use our existing Scheduled Message job to place each individual Scheduled Message into an Amazon SQS queue when it is ready to send, instead of trying to send it directly from that job, and a separate service running in our backend consumes that queue and sends the messages. This achieved the decoupling we needed in a minimal amount of time. The main benefit is that a failure in any one Scheduled Message is isolated from the sending of other Scheduled Messages, making it impossible for a single Scheduled Message to block the queue. It has also allowed us to enhance our automatic retry logic by pushing messages that failed into a separate queue, so we can process failed messages separately while giving priority to upcoming scheduled messages so they are sent on time. Finally, in terms of scalability, we can now horizontally scale the scheduled messaging system automatically as load on the system grows.
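
Conceptually, the Quartz job now only enqueues identifiers, and a separate consumer does the actual sending. The sketch below is an illustration using the AWS SDK for Java v2; the queue URL and class names are hypothetical stand-ins for our internal services.

    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

    public class ScheduledMessageEnqueuer {
        private final SqsClient sqs = SqsClient.create();
        // Hypothetical queue URL, for illustration only.
        private final String queueUrl =
                "https://sqs.us-east-1.amazonaws.com/123456789012/scheduled-messages";

        // Called by the every-minute Quartz job: enqueue the ID of each message that is
        // PENDING and past its "send at" time, instead of sending it inline.
        public void enqueue(long scheduledMessageId) {
            sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .messageBody(Long.toString(scheduledMessageId))
                    .build());
        }
    }

    // A separate backend service polls the queue and sends one message at a time.
    // A failure there affects only that message, and failed sends can be routed to a
    // secondary retry queue (or an SQS dead-letter queue) for later processing.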

As a byproduct of having a function that consumes the queue of Scheduled Messages one at a time, we gain the ability to better instrument and monitor our code and to be alerted more quickly. Reliability and responding to issues in our Scheduled Messaging system in a timely fashion are extremely high priorities for our on-call team and the Host Efficiency Team.
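
Because the consumer now handles exactly one message per invocation, that single code path is easy to wrap with metrics. Here is a hedged sketch using Micrometer; our actual metrics and alerting stack may differ, and the names are illustrative.

    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;

    public class ScheduledMessageConsumer {
        private final MeterRegistry registry;

        public ScheduledMessageConsumer(MeterRegistry registry) {
            this.registry = registry;
        }

        // Process one scheduled message, recording latency plus success/failure counts.
        public void handle(long scheduledMessageId) {
            Timer.Sample sample = Timer.start(registry);
            try {
                // ...send the message here (omitted)...
                registry.counter("scheduled_messages.sent").increment();
            } catch (RuntimeException e) {
                registry.counter("scheduled_messages.failed").increment();
                throw e;
            } finally {
                sample.stop(registry.timer("scheduled_messages.handle_duration"));
            }
        }
    }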

Scaling Issues Arise

As it turns out, implementing a queue for Scheduled Messages wasn’t sufficient to meet our performance target of at most a 2-minute delay in sending a scheduled message. Moving to the new system brought our worst-case performance down from 45 minutes late to 8 minutes late. This was a significant improvement, but still not sufficient to meet the expectations and needs of our Hosts.

As we investigated, we came to realize that Quartz was once again the source of the issue. As Turo has grown over the years, we have always relied on Quartz to execute workloads at specific times. Due to a multitude of factors, Quartz was having trouble running high-frequency workloads at the top of every hour, which is when it is most loaded down by the other jobs that use it as their scheduler.

Quartz Executions Under Load

From this graph, it is clear that Quartz struggles during higher-traffic times of the day. As we discussed the issue with our team, it became clear that we would have to seek an alternative, as Quartz would not be able to reliably initiate the workflow of sending Scheduled Messages. Thankfully, during this time our API Services engineering team had been working on standing up Temporal (the alternate workflow engine we had elected not to use earlier due to the effort required).

Due to our commitment to make Scheduled Messages a highly performant, first-class feature at Turo, we began collaborating with the API Services team to determine the effort and investment it would take to migrate Scheduled Messages to Temporal. The primary method of scheduling with Temporal is through what they call “workflows.” A good way to think of a workflow is as a set of tasks that run at specific times, for a specific “item,” in a deterministic fashion. They are similar to how “ad-hoc” Quartz jobs run, but on a much more scalable external SaaS platform that is solely responsible for orchestrating when the tasks are run and does not handle running the tasks themselves (a critical lesson learned). However, this presented a problem: it is fundamentally different from the “cron” style paradigm we currently use for Scheduled Messages and would require significant engineering investment to adopt.
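
In the Temporal Java SDK, a workflow is declared as an annotated interface, and the work itself is delegated to activities. A minimal, hypothetical sketch (the names are ours for illustration, not Turo’s production code):

    import io.temporal.workflow.WorkflowInterface;
    import io.temporal.workflow.WorkflowMethod;

    @WorkflowInterface
    public interface ProcessScheduledMessagesWorkflow {
        // Temporal executes this method deterministically and can retry or resume it;
        // the actual sending work would happen in activities the workflow invokes.
        @WorkflowMethod
        void run();
    }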

An Issue of Timing

As we explored ways to move Scheduled Messages off of Quartz, we examined the performance of our “Send Scheduled Messages” job, which was supposed to run every minute. We found that, as a result of introducing a queuing system (SQS) into Scheduled Messages, we could process up to 1,800 messages per minute while currently enqueuing around 500–700 messages per minute, giving us plenty of room to grow before needing any optimizations. As an example, we could easily add Scheduled Messages to the queue in parallel, rather than sequentially, to 2–3x our processing capacity before taking more drastic steps. With this in mind, we realized that the sole issue we needed to solve was reliably triggering our backend every minute to process Scheduled Messages.
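
As a rough illustration of that headroom, enqueueing could be parallelized by batching up to ten SQS messages per request (the per-request batch limit). Again, this is a sketch of one possible optimization rather than something we shipped.

    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.SendMessageBatchRequest;
    import software.amazon.awssdk.services.sqs.model.SendMessageBatchRequestEntry;
    import java.util.List;
    import java.util.stream.Collectors;

    public class BatchedEnqueuer {
        // Enqueue up to 10 message IDs in a single SQS call.
        public static void enqueueBatch(SqsClient sqs, String queueUrl, List<Long> messageIds) {
            List<SendMessageBatchRequestEntry> entries = messageIds.stream()
                    .map(id -> SendMessageBatchRequestEntry.builder()
                            .id(Long.toString(id))          // must be unique within the batch
                            .messageBody(Long.toString(id))
                            .build())
                    .collect(Collectors.toList());

            sqs.sendMessageBatch(SendMessageBatchRequest.builder()
                    .queueUrl(queueUrl)
                    .entries(entries)
                    .build());
        }
    }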

Scheduled Messages Sent Per Minute Under Load

In conjunction with our API Services team, we realized that we could easily set up a Temporal workflow that runs every minute to trigger processing of Scheduled Messages, and that we could create this workflow and make the necessary changes with only a few days of effort.
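
Registering a workflow that fires every minute is largely a matter of attaching a cron schedule to the workflow options. The sketch below builds on the hypothetical workflow interface shown earlier; the task queue name, workflow ID, and connection setup are assumptions for illustration.

    import io.temporal.client.WorkflowClient;
    import io.temporal.client.WorkflowOptions;
    import io.temporal.serviceclient.WorkflowServiceStubs;

    public class RegisterCronWorkflow {
        public static void main(String[] args) {
            // Connects to a local Temporal service; real connection config would differ.
            WorkflowClient client =
                    WorkflowClient.newInstance(WorkflowServiceStubs.newLocalServiceStubs());

            ProcessScheduledMessagesWorkflow workflow = client.newWorkflowStub(
                    ProcessScheduledMessagesWorkflow.class,
                    WorkflowOptions.newBuilder()
                            .setTaskQueue("scheduled-messages")           // hypothetical task queue
                            .setWorkflowId("process-scheduled-messages")  // hypothetical workflow ID
                            .setCronSchedule("* * * * *")                 // every minute
                            .build());

            // Start the cron workflow; Temporal will then invoke it once per minute.
            WorkflowClient.start(workflow::run);
        }
    }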

Results

As a result of these improvements, we no longer permanently fail to send any scheduled messages in a 24-hour period, and we are able to successfully resend any attempts that do fail.

Lessons Learned (TL;DR)

  1. Decoupling the “when” from the “how” is critical to improving the resiliency and scalability of systems. It makes code easier to understand, easier to break up into microservices, and easier to instrument. At a more hands-on level, a function that processes a single item is much easier to instrument automatically than a function that processes many items.
  2. Getting a piece of code to run at a precise time is challenging at scale.
  3. Alerting and instrumentation should be first-class citizens of any project. We would have saved ourselves and our Hosts a lot of headaches if we had had a better picture of what our code was doing.
  4. Don’t be hasty about trying to migrate to a new system. Use what data you have to investigate issues, and identify weak points before you start selecting tools to address the problem.

Conclusion

In summary, our journey from the old scheduling system to the new, more resilient one has taught us valuable lessons about system architecture, performance optimization, and the importance of thorough investigation before migrating to new technologies. As we continue to evolve and enhance our systems, these lessons will remain at the forefront of our development process, ensuring that Turo’s services remain dependable and efficient for our valued community of hosts and guests.
