Optimize Your Laravel Application’s Performance: Effective Strategies for Implementing Job Retries

Dmitriy Kotikov
Insider Engineering
5 min read · May 29, 2023

Laravel Queues is a powerful feature of the Laravel framework that allows developers to handle time-consuming tasks and improve the performance of web applications by running those tasks in the background. It provides an easy and efficient way to defer the processing of long-running, resource-intensive tasks such as sending emails, processing large files, and interacting with external APIs. Instead of waiting for these tasks to complete, we can push them to a queue and return a response to the user immediately, freeing up server resources and improving the user experience.

However, it’s important to acknowledge that sometimes our queued tasks can fail for various reasons, such as running out of memory or disk space, hitting external API rate limits, experiencing connection issues, encountering temporarily unavailable resources, and more. In such cases, it may be necessary to implement a retry mechanism to attempt to run the job again.

How can we retry failed jobs?

Laravel provides the following tools for retrying failed jobs.

Manual retry

We can run the following Artisan command:

php artisan queue:retry --queue=name

We can specify individual job IDs or ask to retry all failed jobs at once. We can also call this command from our code and trigger job retries programmatically.
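
For example, here is a minimal sketch of triggering retries programmatically via the Artisan facade (the queue name "emails" is just a placeholder):

use Illuminate\Support\Facades\Artisan;

// Retry every failed job, equivalent to `php artisan queue:retry all`
Artisan::call('queue:retry', ['id' => ['all']]);

// Retry only the failed jobs that belonged to a specific queue
Artisan::call('queue:retry', ['--queue' => 'emails']);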

If we use the Laravel Horizon package, we can also retry failed jobs manually from its dashboard.

Retry for a time period

We can add the following method to our job:

public function retryUntil(): DateTimeInterface
{
    return now()->addDay();
}

This way, our job will be retried again and again for the next 24 hours, until it either finishes successfully or the time period expires.

Pros

  • Rate-limited jobs are handled easily: they simply keep retrying throughout the period
  • If we have network trouble for hours (e.g. our REST API server is down), the job will be retried again once the issue is fixed

Cons

  • If a job repeatedly fails due to an exception that is not related to network or rate limit issues, retrying it indefinitely can result in unnecessary resource consumption and a heavy load on the queue worker.
  • If the queue worker is heavily loaded or down for a period of time longer than the configured expiration time for the job, the job will not be retried and will be marked as “expired”.

The main problem is that if the queue worker is overloaded and cannot start the job within the retryUntil period, we will get this error:

{{ job_class }} has been attempted too many times or run too long. The job may have previously timed out.

Retry for a fixed number of attempts

We can add the following property to our job:

public $tries = 10;

This way, our job will be retried no more than 10 times and will fail after that.
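
Once the attempts are exhausted, Laravel marks the job as failed and, if defined, calls its failed() method. Here is a minimal sketch of using this hook for alerting (the log message is just a placeholder):

use Illuminate\Support\Facades\Log;
use Throwable;

/**
 * Handle a job failure after all retries have been exhausted.
 */
public function failed(Throwable $exception): void
{
    // Placeholder: report the permanent failure so the team can investigate
    Log::error('Job permanently failed', ['exception' => $exception->getMessage()]);
}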

Pros

  • If we hit an exception not related to network or rate limit issues, the job won’t stay stuck retrying for a whole period
  • Our job can be retried successfully even a year later, as long as it has remaining tries

Cons

  • We have to keep rate-limited jobs under control: each skipped attempt consumes one try. This can be partly mitigated by linear or exponential backoff, but with a huge number of jobs and a small rate limit, problems can still occur.
  • If we have network problems, the fixed number of retry attempts can be exhausted during the downtime. This can be solved by exponential backoff or by a manual retry later.

Optimization for jobs retrying

If the job failed due to a coding mistake, it’s likely that the job will keep failing when retried, leading to a large number of failed jobs and a potentially overflowing queue. In such cases, it’s essential to identify and fix the root cause of the failure before retrying the job.

To avoid overloading the queue with a large number of failing jobs, it’s important to set reasonable limits on the number of retries and timeouts for each job. Depending on the scale of the system and the expected failure rate, it may also be necessary to implement a backoff mechanism that gradually increases the time interval between retries to avoid overwhelming the system.

Furthermore, it’s crucial to monitor the system’s health and performance continuously and to alert the team in case of excessive job failures or queue overflow. With proper monitoring and alerting in place, the team can quickly identify and resolve issues before they lead to significant downtime or service disruptions.

To avoid retrying jobs that are unlikely to be fixed by simply running them again, we can use the $maxExceptions property in our jobs:

/**
 * The maximum number of unhandled exceptions to allow before failing.
 *
 * @var int
 */
public $maxExceptions = 3;

It’s important to selectively catch only the types of exceptions that can potentially be solved by job retrying. For example, catching connection exceptions, rate limit exceptions, shared resource access limitations, and similar errors can allow for more efficient retrying of failed jobs.
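
Here is a minimal sketch of this idea inside a job’s handle() method, assuming the job uses the InteractsWithQueue trait and the external call throws Illuminate\Http\Client\ConnectionException on network failures; any other exception is left unhandled and counts against $maxExceptions. The endpoint URL is a placeholder:

use Illuminate\Http\Client\ConnectionException;
use Illuminate\Support\Facades\Http;

public function handle(): void
{
    try {
        // Placeholder call to an external API (hypothetical endpoint)
        Http::get('https://api.example.com/resource');
    } catch (ConnectionException $e) {
        // Transient network failure: put the job back on the queue
        // instead of letting it count as an unhandled exception.
        $this->release(30);
    }
}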

This works well together with a delay between retry attempts:

/**
 * Calculate the number of seconds to wait before retrying the job.
 */
public function backoff(): int
{
    return 3;
}

or even an exponential-like schedule:

/**
 * Calculate the number of seconds to wait before retrying the job.
 *
 * @return array<int, int>
 */
public function backoff(): array
{
    return [1, 5, 10];
}

The delay algorithm for retrying failed jobs can be more complex than a simple fixed time interval. For example, if we receive a rate limit error from an external server, we can use the rate limit data provided by the server to implement a backoff mechanism.

In this case, the backoff mechanism would use the retry-after time provided by the server to determine the appropriate delay before retrying the job. The delay could be gradually increased for each failed attempt, using a formula such as exponential backoff or truncated binary exponential backoff.
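
Here is a minimal sketch of this approach, assuming the HTTP client throws Illuminate\Http\Client\RequestException for error responses and the external server sends a standard Retry-After header; the endpoint URL and the 60-second fallback are arbitrary placeholders:

use Illuminate\Http\Client\RequestException;
use Illuminate\Support\Facades\Http;

public function handle(): void
{
    try {
        // Placeholder call to a rate-limited external API (hypothetical endpoint)
        Http::get('https://api.example.com/resource')->throw();
    } catch (RequestException $e) {
        if ($e->response->status() === 429) {
            // Respect the delay suggested by the server, with a fallback
            $delay = (int) $e->response->header('Retry-After') ?: 60;
            $this->release($delay);

            return;
        }

        throw $e;
    }
}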

By implementing a more intelligent delay algorithm that takes into account external factors such as rate limits, network conditions, and resource availability, we can reduce the likelihood of overwhelming the system with a large number of failed job retries. This can improve the overall system reliability and reduce the risk of downtime or service disruptions.

Extras

There are various rate limit middlewares and custom retry logic that can be used to enhance the flexibility of the Laravel Queues retry mechanism. For example, implementing a rate limit middleware can help automatically return jobs back to the queue if we have reached the limit for the current period. This can help prevent job failures caused by rate limits and allow for more efficient job processing.
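
For instance, Laravel ships with a RateLimited job middleware that does exactly this. Here is a minimal sketch, assuming a rate limiter named "external-api" has already been defined via RateLimiter::for() in a service provider:

use Illuminate\Queue\Middleware\RateLimited;

/**
 * Get the middleware the job should pass through.
 */
public function middleware(): array
{
    // Jobs that exceed the "external-api" limiter are released
    // back onto the queue instead of failing.
    return [new RateLimited('external-api')];
}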

In addition, using custom logic to decide whether or not to retry a job can help optimize the retry mechanism for specific use cases. For example, we may want to retry a job indefinitely if it failed due to a temporary network error, but not if it failed due to a coding mistake. By implementing custom retry logic, we can fine-tune the retry mechanism to better suit the needs of the system and improve overall reliability.

Overall, the Laravel Queues retry mechanism is flexible and versatile, allowing for various customizations to meet the needs of different use cases. By leveraging the different features and options available, we can design a robust and reliable job processing system that can handle various failure scenarios and ensure the smooth operation of our application.

You may also consider reading about the Laravel Pipeline pattern or how to deploy a Laravel application to Kubernetes.
