Quartz to Temporal: Modernizing Job Scheduling at Turo

Ayush Mudgal
Published in Turo Engineering
Jul 10, 2024

Introduction

At Turo, maintaining a robust and scalable job scheduling system is critical to the seamless operation of our car-sharing platform. As our user base and reservation volume grew, our reliance on Quartz for scheduling began to show significant scalability and reliability issues.

Due to the numerous challenges with Quartz, we began to explore alternative scheduling platforms. After a thorough evaluation, we adopted Temporal, marking a significant turning point in our operations. This decision not only addressed our pain points but also propelled us towards a more seamless, efficient, and scalable system.

In this article, we will walk through the technical challenges we encountered with Quartz, the criteria we used to evaluate new systems, the reasons behind choosing Temporal, and the steps we took to implement it successfully. This detailed account highlights the strategic decisions and collaborative efforts that enabled us to overhaul our delayed job platform, setting a new standard for operational excellence at Turo.

The Challenge: Scaling Our Delayed Job Platform

At Turo, we’ve always strived to offer seamless experiences to our users. As our platform grew, so did the complexity and scale of our operations. This growth led to significant challenges with our delayed job platform. Initially, Quartz was our go-to scheduling system, perfectly fitting our Java-centric tech stack. However, as our reservations and associated tasks grew, Quartz started to falter under the pressure. Here are some of the critical issues we encountered:

  • Misfires: Jobs were not firing at the expected time, causing delays and inconsistencies.
  • Stuck Jobs: Misfired or regular jobs would enter error states, requiring manual intervention.
  • Death Loops: Quartz, in trying to recover from misfires, often prioritized them, leading to more misfires and system inefficiency.
  • Scalability Issues: As the number of job instances increased, we couldn’t simply add more pods, because the Quartz scheduler threads contended with one another. Per Quartz’s own documentation: “The scheduler makes use of a cluster-wide lock, a pattern that degrades performance as you add more nodes (when going beyond about three nodes - depending upon your database’s capabilities, etc.).”
  • Lack of Prioritization: Quartz treated all jobs equally, with no distinction between critical and non-critical tasks.
  • Execution Complexity: We had no straightforward way to decouple job scheduling from execution, increasing system complexity.

This problem was made worse by our use of the same Quartz cluster for both long-running, resource-intensive jobs and numerous short-running jobs.

A Quartz incident in which a stuck job that kept misfiring sent Quartz into a “misfire death loop,” causing all scheduled jobs to misfire.

Attempts to Improve Quartz Performance

In an attempt to mitigate these issues, we experimented with several enhancements to Quartz:

  1. Decoupling Job Firing from Execution: We tried separating the job scheduling logic from its execution. This gave us resiliency but added complexity to our system, requiring manual intervention every time a job failed.
  2. Increasing Job Execution Threads: We increased the number of threads that executed jobs, aiming to improve job handling capacity. While this provided marginal gains, we were still bottlenecked by the Quartz scheduler threads.
  3. Introducing Batch Modes: We enabled Quartz’s batch mode functionality, which resulted in some performance gains. However, it led to a live-site incident in which jobs that were not supposed to fire concurrently started doing so, overwhelming our database and scheduling threads.
  4. Job Prioritization: We attempted to prioritize critical jobs over less critical ones using Quartz’s built-in priority feature. This provided some gains, particularly for our critical jobs during heavy load times. However, it was unsuccessful in meaningfully improving the overall performance.
  5. Horizontal Scaling: We added more instances to handle the increased load, which also contributed to increasing scheduler threads. Although we initially saw improvements, as we continued to scale, performance degradation occurred, leading to more misfires.

Despite these efforts, the improvements were marginal, and the complexity of our system continued to grow. We also considered introducing another Quartz cluster to handle the load, but realized this would risk significantly increasing the complexity of our system for only marginal returns.

“The Homer,” a car designed by the character Homer Simpson in “The Simpsons,” is a comically over-the-top vehicle featuring a bubble dome, multiple horns, tailfins, and an array of unnecessary gadgets. It’s a satire of excessive customization for marginal returns.

The Need for a Better Solution

Realizing the limitations of our current system, we decided to look externally for a platform that could not only meet our current needs but also scale with us into the future. We established a set of evaluation criteria based on our experience with Quartz:

  • High Scalability: The new system must handle increasing workloads effortlessly.
  • Fault Tolerance: It should be resilient to failures, ensuring system reliability.
  • Cost-Effectiveness: The solution should be economical, both in terms of infrastructure and operational costs.
  • Full Management: A fully managed service to reduce operational overhead.
  • Developer Experience: It should offer a positive and productive developer experience with easy onboarding and local testing.
  • Editing and Signaling Capabilities: The ability to edit and signal workloads as needed.
  • Concurrency Control: Efficiently manage multiple concurrent tasks.

Exploring Quartz Replacement Solutions

We explored several platforms, each with its unique strengths and weaknesses:

  • Dynein (Airbnb’s delayed job execution system)
  • Camunda
  • AWS Step Functions
  • Netflix Conductor
  • EventBridge Scheduler
  • Temporal

After thorough evaluation, Temporal and Dynein emerged as the top contenders. Temporal stood out for its comprehensive feature set that matched all our requirements, while Dynein was compelling due to its champion, Airbnb. Since Airbnb hosts homes and we host cars, their reservation lifecycle was very similar to ours. If they had seen success with a certain scheduling platform, we were interested in understanding it further as well.

Consulting with Airbnb

After connecting with the Temporal team and starting our conversation with them, we learned that Airbnb was also a significant Temporal user. This piqued our interest further. After explaining our doubts and curiosity to the Temporal team, they graciously agreed to connect us with engineers from Airbnb. Their journey resonated with ours: starting with Quartz, transitioning to Dynein, and finally adopting Temporal. The Airbnb bookings team recommended Temporal, especially its cloud offering, to avoid the complexities of in-house management.

Planning and Developing the MVP

With Temporal identified as our ideal platform, we embarked on developing a Minimum Viable Product (MVP) to validate its capabilities.

What is Temporal?

Temporal is an open-source, distributed, and scalable workflow orchestration engine designed to manage complex business logic and long-running processes in a reliable and fault-tolerant manner. It provides a framework for building, running, and monitoring workflows, allowing developers to focus on business logic rather than dealing with infrastructure concerns such as state management, retries, and coordination. It introduces several core concepts:

  • Workflows: Durable, reliable, and scalable functions that represent business processes.
  • Activities: Well-defined actions within workflows, which can be short or long-running.
  • Workers: Processes responsible for executing the workflow and activity code.
  • Temporal Cluster: Services that orchestrate the execution of workflows and activities.

We opted for Temporal Cloud, a fully managed solution, to accelerate our transition from Quartz and offload the operational burden.

Understanding Cron vs Dynamic Jobs

Our scheduling system supported two kinds of jobs:

  1. Cron Jobs: Scheduled to run at regular intervals.
  2. Dynamic Jobs: Scheduled individually based on specific events at specific times. Each of these jobs would usually have thousands of instances, one for each time the event was fired. Each instance would be deleted once the job was done executing.

We needed to translate these job types into Temporal workflows. We started with 160 types of Quartz jobs: roughly 140 cron-style jobs and 20 dynamic jobs. Each dynamic job had thousands of triggers in production, which signaled that we were in for a massive lift. To narrow the scope of our MVP, we initially focused on jobs related to trip/reservation events.

Upon looking a little deeper, we noticed that about 20 of the 160 jobs were related to trip/reservation events and could be logically connected. For the first time, we were seeing a reservation workflow definition.

Anchors: Simplifying Workflows

As a clearer picture emerged, unfortunately, so did overlaps. We had originally thought of having a single workflow for our reservation lifecycle. However, we quickly realized certain events within a reservation might conflict. Consider the ambiguous workflow definition below.

An example of an ambiguous workflow definition. In this example, the workflow involves 5 actions with S as the reservation start time and E as the end time. The third action occurs 12 hours after S, and the fourth action 24 hours before E. This setup functions well for long reservations but leads to overlaps when the reservation duration is shorter. The second example highlights the ambiguity in placing the third and fourth events within the workflow.
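The ordering problem in this example can be checked with a little date arithmetic. The sketch below uses only the two offsets named in the caption (12 hours after S, 24 hours before E); the function name and the sample dates are illustrative assumptions, not part of our codebase.

```python
from datetime import datetime, timedelta

# Offsets taken from the example above: action 3 fires 12 hours after the
# start (S); action 4 fires 24 hours before the end (E).
AFTER_START = timedelta(hours=12)
BEFORE_END = timedelta(hours=24)


def third_precedes_fourth(start: datetime, end: datetime) -> bool:
    """Return True when action 3 (S + 12h) fires before action 4 (E - 24h)."""
    return start + AFTER_START < end - BEFORE_END


start = datetime(2024, 7, 1, 10, 0)           # arbitrary sample start time
long_trip_end = start + timedelta(days=5)     # long reservation
short_trip_end = start + timedelta(hours=30)  # short reservation

print(third_precedes_fourth(start, long_trip_end))
print(third_precedes_fourth(start, short_trip_end))
```

With these offsets, the expected order only holds when the reservation is longer than 36 hours. For anything shorter, a single linear workflow has no well-defined sequence for the two middle actions.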

This led us to the concept of “anchors”: specific events around which workflow actions revolve. For example, the “start trip time” anchor includes jobs like adding an additional driver, sending a reminder to download the app, and initiating the trip, while the “end trip time” anchor includes jobs like sending a trip extension reminder and initiating the end of a trip.

This approach enabled us to create simpler, more focused workflows.

Workflows with specific anchors. In the examples above, “Start Trip” and “End Trip” are anchors, while the rest of the actions are dependent on them. This would mean we’d have to create two different workflows with 3 actions each.
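One way to picture anchored workflows is as lists of actions, each defined by an offset from a single anchor time. The following is a minimal sketch under that assumption. The action names follow the jobs mentioned above, while the offsets, class name, and function name are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AnchoredAction:
    name: str
    offset: timedelta  # relative to the anchor; negative means "before"


# Action names come from the article; the offsets are made-up placeholders.
START_TRIP_WORKFLOW = [
    AnchoredAction("add_additional_driver", timedelta(hours=-24)),
    AnchoredAction("remind_to_download_app", timedelta(hours=-2)),
    AnchoredAction("start_trip", timedelta(0)),
]
END_TRIP_WORKFLOW = [
    AnchoredAction("trip_extension_reminder", timedelta(hours=-3)),
    AnchoredAction("end_trip", timedelta(0)),
]


def plan(anchor_time: datetime, actions: list[AnchoredAction]) -> list[tuple[str, datetime]]:
    """Resolve each action to an absolute time, sorted in firing order."""
    resolved = [(a.name, anchor_time + a.offset) for a in actions]
    return sorted(resolved, key=lambda item: item[1])
```

Because every action in a workflow depends on one anchor only, the firing order inside that workflow stays the same no matter how long or short the reservation is.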

MVP Actions

We focused on two critical jobs with a history of challenges: Start Trip and Trip Extension Reminder. These jobs revolved around different anchors, resulting in two distinct workflows: the booked start and booked end workflows.

Ensuring Idempotency

Idempotency, ensuring repeated operations produce the same result, is a crucial part of Temporal workflows. Thankfully, our job execution code already incorporated idempotency, developed in response to Quartz’s past failures and restarts.

Example of an idempotent system: pressing the car unlock button over and over again produces the same result.
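In code, the same idea amounts to checking whether the work has already been done before doing it. This is a toy sketch with an in-memory set; the class and method names are assumptions for illustration, and a real handler would check persistent state instead.

```python
class TripStarter:
    """Toy job handler: starting an already-started trip is a no-op."""

    def __init__(self) -> None:
        self.started = set()   # reservation IDs whose trip has begun
        self.side_effects = 0  # counts how often real work was done

    def start_trip(self, reservation_id: str) -> str:
        if reservation_id in self.started:  # idempotency check
            return "already_started"
        self.started.add(reservation_id)
        self.side_effects += 1
        return "started"
```

Calling `start_trip` repeatedly for the same reservation performs the real work once, which is what makes retries and replays safe.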

Temporal in the Monolith vs Microservice

We decided to place our reservation workflows and workers in a separate microservice. This decision aligned with our broader goal of decomposing our monolith and avoided further bloating our existing services. It also contributed to a better developer experience.

MVP Design

We created a reservation workflow microservice to house all reservation-related workflows and workers. Temporal activities within these workflows called a single RPC endpoint in our monolith, executing the job code when triggered. The microservice listened to queued events from the monolith, allowing us to create, edit, or delete workflows based on reservation events dynamically.

Design of our MVP
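The event-handling side of this design can be sketched as a mapping from reservation events to create, edit, and delete operations on a workflow keyed by reservation. The code below is an in-memory illustration only: the event type names, the dictionary shape, and the workflow ID format are assumptions, and a real service would call the orchestrator rather than mutate a dict.

```python
from datetime import datetime


class WorkflowRegistry:
    """In-memory stand-in for the orchestrator: one workflow per reservation."""

    def __init__(self) -> None:
        self.workflows = {}  # workflow ID -> anchor (booked start) time

    def handle(self, event: dict) -> str:
        # Hypothetical ID format; a stable ID lets later events find the workflow.
        workflow_id = f"booked-start-{event['reservation_id']}"
        kind = event["type"]
        if kind == "reservation_booked":
            self.workflows[workflow_id] = event["start_time"]
            return "created"
        if kind == "reservation_changed" and workflow_id in self.workflows:
            self.workflows[workflow_id] = event["start_time"]
            return "edited"
        if kind == "reservation_cancelled":
            removed = self.workflows.pop(workflow_id, None)
            return "deleted" if removed is not None else "ignored"
        return "ignored"
```

Deriving the workflow ID from the reservation ID is the design choice worth noting: it makes edits and cancellations addressable without a separate lookup table.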

Implementation, Testing and Ramping Up

To ensure a smooth transition, we used the Temporal CLI for local testing. We also employed feature flags in our RPC endpoint, allowing us to run Quartz and Temporal simultaneously and compare their executions.

We gradually ramped up from logging to executing jobs over a day, monitoring everything closely. We paid special attention to metrics around worker performance and resource consumption. Temporal significantly outperformed Quartz, as evidenced by an increase in the number of idempotency checks hit on the Quartz side: by the time Quartz fired, Temporal had already done the work.

This example illustrates that Temporal was outpacing Quartz in executing the trip start job. All Quartz was doing was running the idempotency check; that is, it had missed the mark and was merely re-executing code that the Temporal workflow had already executed on time.
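The dual-run setup can be sketched as a single endpoint that both schedulers call, with a flag controlling whether Temporal calls are only logged or actually executed. This is a simplified model, not our endpoint: the flag values, class name, and counters are assumptions chosen to show how idempotency hits reveal which scheduler arrived second.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    LOG_ONLY = "log_only"  # record the call, do no work
    EXECUTE = "execute"    # run the job code


class JobEndpoint:
    """Toy RPC endpoint shared by both schedulers during the ramp."""

    def __init__(self, temporal_mode: Mode) -> None:
        self.temporal_mode = temporal_mode
        self.done = set()                 # job keys already executed
        self.executed = Counter()         # real executions per scheduler
        self.idempotent_hits = Counter()  # calls that found the work done

    def run(self, source: str, job_key: str) -> str:
        if source == "temporal" and self.temporal_mode is Mode.LOG_ONLY:
            return "logged"
        if job_key in self.done:          # the other scheduler got here first
            self.idempotent_hits[source] += 1
            return "skipped"
        self.done.add(job_key)
        self.executed[source] += 1
        return "executed"
```

In this model, a rising `idempotent_hits["quartz"]` count means Quartz is consistently arriving after Temporal, which matches the pattern described above.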

Worker Tuning and Analyzing Results

Over the next few days, we monitored our Temporal dashboards and noticed a slight increase in worker execution latency and a drop in sync match ratios, indicating a task backlog was forming. This was due to long-running workflows, with most executions occurring at the 30-minute mark.

We consulted with Temporal Solution Architects Keith and Tiho (thank you, Keith and Tiho) and discovered that during replay, the workflow was fetching workflow event history from the Temporal cluster, which contributed to the increased latency. To address this, we increased the workflow cache size, allowing the event history to be cached within the worker itself. Additionally, we scaled up the number of worker pods and task slots to ensure our workers had sufficient headroom. These adjustments led to more efficient workflow executions and a successful MVP. 🎉

Results of our worker tuning session. Notice the latencies and the drastic drop after our tuning session.
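To see why a larger workflow cache helps, consider a toy model in which a worker must refetch a workflow's event history whenever that workflow is no longer cached. The sketch below is not Temporal SDK code and uses no SDK option names. It is a plain least-recently-used cache, with hypothetical class and function names, counting how many history fetches a given cache size causes.

```python
from collections import OrderedDict


class CachingWorker:
    """Toy worker: a workflow evicted from the cache must refetch its history."""

    def __init__(self, cache_size: int) -> None:
        self.cache_size = cache_size
        self.cache = OrderedDict()  # workflow ID -> cached state (unused here)
        self.history_fetches = 0

    def process_task(self, workflow_id: str) -> None:
        if workflow_id in self.cache:
            self.cache.move_to_end(workflow_id)  # cache hit: no replay needed
            return
        self.history_fetches += 1                # cache miss: fetch and replay
        self.cache[workflow_id] = None
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)       # evict least recently used


def fetches_for(cache_size: int, workflows: int, rounds: int) -> int:
    """Process every workflow once per round and count history fetches."""
    worker = CachingWorker(cache_size)
    for _ in range(rounds):
        for i in range(workflows):
            worker.process_task(f"wf-{i}")
    return worker.history_fetches
```

With four long-running workflows processed over three rounds, a cache of two entries misses on every task, while a cache of four misses only on the first round. The real tuning involved more than cache size, but the model shows why an undersized cache turns every wake-up into a history fetch.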

MVP Success and Looking Beyond

With the completion of our MVP, our focus shifted toward our post-MVP goal: retiring Quartz by year-end.

Analyzing Quartz Jobs for Migration

Pausing our reservation workflow microservice development temporarily, we undertook a comprehensive analysis of the remaining 140 Quartz jobs. We categorized these by domain, identified key events, and engaged stakeholders to understand each job’s purpose. This meticulous process resulted in a comprehensive catalog.

Lift and Shift Strategy: Embracing Temporal Scheduled Workflows

While the potential for workflow improvements was clear, our primary objective remained retiring Quartz. We realized that creating comprehensive workflows for each job could result in maintaining Quartz indefinitely. Therefore, we opted for a pragmatic “lift and shift” approach. This involved developing a new centralized scheduled workflows microservice to manage all recurring jobs.

During the service design phase, we chose an asynchronous approach using RPC calls to our monolith for job execution. Each workflow included signaling mechanisms for status updates (success/failure) and defined timeouts for non-responsive executions. We ensured that all these workflows were powered by Temporal’s schedules, which executed on a cadence and prevented concurrent execution of identical workflows.

Design of our scheduled workflows microservice.
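The three behaviors described above (asynchronous dispatch, a completion signal, and a timeout with no overlapping runs) can be modeled in a few lines. This is a toy state machine driven by explicit timestamps, not Temporal code. The class, its fields, and the job name in the usage example are hypothetical, and here the timeout is only noticed on the next tick, whereas a real workflow would detect it as soon as the deadline passes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScheduledJob:
    """Toy model of one scheduled workflow; times are plain seconds."""
    name: str
    timeout_s: int
    started_at: Optional[int] = None   # None means no run is in flight
    last_status: Optional[str] = None

    def tick(self, now: int, dispatch: Callable[[str], None]) -> str:
        """Called on each cadence tick; dispatch() stands in for the async RPC."""
        if self.started_at is not None:
            if now - self.started_at < self.timeout_s:
                return "skipped_overlap"    # previous run still in flight
            self.last_status = "timed_out"  # no signal arrived in time
        self.started_at = now
        dispatch(self.name)
        return "dispatched"

    def signal(self, status: str) -> None:
        """Completion signal from the job executor: success or failure."""
        self.last_status = status
        self.started_at = None


calls = []
job = ScheduledJob("nightly_cleanup", timeout_s=300)  # hypothetical job name
job.tick(0, calls.append)    # dispatches the RPC
job.tick(60, calls.append)   # skipped: the first run has not reported back
job.signal("success")        # executor reports completion
```

Skipping a tick while a run is in flight is the simplest overlap policy; it trades a missed run for the guarantee that identical workflows never execute concurrently.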

Config-Based Migration: Simplifying the Transition

Faced with the daunting task of migrating over 140 jobs, we innovated with a config-based approach. Recognizing commonalities across jobs — such as RPC action names, cadences, and timeouts — we created a reusable configuration model. This streamlined our migration timeline from months to weeks, often migrating multiple jobs in a single day.

A sample YAML config. Each of these configs would create a Temporal schedule in the background with the listed configuration. Since we were only migrating the orchestration layer for the majority of our cron jobs, this config made our migration simpler.

We were able to migrate all 140 Quartz jobs using this approach in a matter of weeks, wrapping up our migration in record time.
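A config-driven model like this can be sketched as a small validation step that turns each entry into a schedule definition. The fields mirror the commonalities named above (RPC action name, cadence, timeout), but the key names, job names, cron strings, and ID format below are all hypothetical, and the entries are written as Python dicts rather than YAML to keep the example self-contained.

```python
from dataclasses import dataclass

# Hypothetical config entries; the real config keys are not reproduced here.
JOB_CONFIGS = [
    {"name": "expire-stale-quotes", "rpc_action": "ExpireStaleQuotes",
     "cron": "*/15 * * * *", "timeout_minutes": 10},
    {"name": "daily-payout-report", "rpc_action": "BuildPayoutReport",
     "cron": "0 6 * * *", "timeout_minutes": 45},
]

REQUIRED_KEYS = {"name", "rpc_action", "cron", "timeout_minutes"}


@dataclass(frozen=True)
class ScheduleSpec:
    schedule_id: str
    rpc_action: str
    cron: str
    timeout_seconds: int


def to_schedule(config: dict) -> ScheduleSpec:
    """Validate one config entry and turn it into a schedule definition."""
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError(f"{config.get('name', '?')}: missing {sorted(missing)}")
    if len(config["cron"].split()) != 5:
        raise ValueError(f"{config['name']}: expected a 5-field cron expression")
    return ScheduleSpec(
        schedule_id=f"scheduled-{config['name']}",
        rpc_action=config["rpc_action"],
        cron=config["cron"],
        timeout_seconds=config["timeout_minutes"] * 60,
    )


schedules = [to_schedule(c) for c in JOB_CONFIGS]
```

With a model like this, migrating another cron job means adding one config entry rather than writing a new workflow, which is what lets several jobs move in a single day.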

The Future with Temporal at Turo

Temporal now plays a pivotal role in many of Turo’s engineering initiatives. We aim to:

  • Workflowize When Possible: Integrate cron jobs (currently powered by Temporal schedules) into robust, scalable dynamic workflows.
  • Domain-Specific Workflow Adoption: Collaborate with domain owners to implement reliable systems akin to our reservation workflow service within their own domains.
  • Expanded Adoption: Continue to pave the way for broader adoption of Temporal.

Key Takeaways

We learned a lot from this migration. Here are a few key learnings worth mentioning:

  1. Planning is Crucial: Dedicating time to plan each step, especially during the evaluation of the Quartz landscape, enabled us to set clear and achievable goals that addressed both short-term needs and long-term objectives.
  2. Validate Your Plans: Identifying similar processes we needed to repeat, implementing them and validating against our initial estimate gave us confidence that our plans were accurate and feasible within the estimated timeframe.
  3. Focused Goals Drive Success: Maintaining a strong focus on our primary goal of retiring Quartz by year’s end guided our decisions. We evaluated each action based on whether it helped achieve this goal, ensuring we stayed on track.
  4. Incremental Changes for the Win: Since our platform wasn’t initially ready for a complete adoption of workflows, we concentrated on making numerous improvements to enhance reliability. This prepared the platform for future workflow integration. Incremental changes ensured steady progress without overwhelming the system.
  5. Engage with the Community: Engaging with the broader engineering community is invaluable. The Temporal community, in particular, is vibrant and collaborative. Our partnership with Airbnb exemplifies the benefits of such collaborations.

Acknowledgements

This was a team effort, and I’d like to thank everyone involved, both from Turo and Temporal.

I would, however, like to give specific shoutouts to:

  1. Members of Turo Engineering: Adam Safran, Moaj Musthag, Alice Chen, Andre Sanches, Josh Wickham, Minglun Gu, Russell Rogers, Shruti Goel, Victor Mora, Yohan Belval, Adam Bovill, Doug Gschwind, Brian Pham and Harrison Wang.
  2. Members of the Temporal Team: Maxim Fateev, Samar Abbas, Alex Cort, Kevin Martin, Keith Tenzer, Tihomir Surdilovic, Taylor Khan.

Thank you for reading! If you have any questions or want to discuss further, feel free to reach out.

If you enjoyed the post and are interested in putting the world’s 1.5 billion cars to better use, Turo is hiring!

Also, this post is a text version of a talk we gave at Temporal’s annual conference! You can find a link to the talk here.
