Background Job and Queue — Practical Application Use Cases
In my previous article, I introduced the concepts of queues and background jobs through an illustrative pharmacy example, along with their practical applications, advantages, and limitations. This article is part two, reviewing the different job types and how they come into the picture in several common applications.
Job Definition and Categories
There are a couple of kinds of background jobs, depending on how they are triggered and what kind of data they operate on. I suggest you find the details in . Here I summarize some important aspects and practical use cases so that you can imagine what background jobs look like.
In general, background jobs are triggered in 2 ways:
- By event (event-driven triggers): The task is started in response to an event. It could be that a request is sent to an API endpoint, a log is generated and needs to be saved to the database, and so on.
- By schedule (schedule-driven triggers): The task runs on a timer-based schedule. It can occur periodically (hourly, daily, etc.) or as a one-time run scheduled for a later time.
Event-driven jobs are invoked when events with unknown occurrence times are detected. Typically, these jobs are created via message queues and performed by worker nodes: upon detecting an event, a message is pushed to the queue while the worker nodes keep listening and process the messages one by one. Some examples of these jobs include:
- Sending an activation email to new users
- Processing images/videos
- Generating a report on request
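As a minimal sketch of this pattern, the following Python snippet uses `queue.Queue` as a stand-in for a real message broker (RabbitMQ, Redis, etc.); the job payload and the "email sending" are purely illustrative:

```python
import queue
import threading

jobs = queue.Queue()   # stands in for a message broker such as RabbitMQ or Redis
results = []           # stands in for the side effect of the job (sent emails)

def worker():
    # Each worker blocks on the queue and processes messages one by one.
    while True:
        job = jobs.get()
        if job is None:          # sentinel: stop the worker
            break
        results.append(f"sent activation email to {job['email']}")
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# An API endpoint would push a message on each signup event:
jobs.put({"email": "new.user@example.com"})
jobs.join()                      # wait until the queue is drained
jobs.put(None)                   # shut the worker down
print(results[0])
```

The important property is that the producer (the API endpoint) only enqueues a message and returns; all the slow work happens on the worker side.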
Scheduled jobs are generated for periodic tasks whose start times are pre-determined or recurring. Technically, these jobs are invoked using crontab, an interval timer, or a while-true loop. For example:
- Publishing a scheduled story
- Cleaning the system daily
- Sending the weekly report :)
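A minimal sketch of the schedule-driven trigger, here as a plain interval loop (in production, crontab or a scheduler library would replace this; the task and the tiny interval are made up for illustration):

```python
import time

runs = []

def daily_cleanup():
    # Stand-in for the real cleanup work.
    runs.append(time.monotonic())

def run_on_interval(task, interval_s, repeats):
    # A simple while-true-style scheduler: run the task, sleep, repeat.
    for _ in range(repeats):
        task()
        time.sleep(interval_s)

run_on_interval(daily_cleanup, interval_s=0.01, repeats=3)
print(len(runs))
```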
Use Cases (UC)
By going through some use cases, I will explain how the jobs are implemented and orchestrated.
UC1- Pageview counting
The problem statement is quite simple: precisely display the number of views of a web page (or a product, a post). Intuitively, one view of a page increases the pageview value by 1.
1. Non-queue version
The implementation is visually explained via the following figure:
This approach faces some issues, namely:
- Blocking IO: The user needs to wait for the database to be updated, even though he should not have to.
- Performance: Imagine the workload at the database if your system is a large-scale one that needs to serve 1000 users at a time, who might also be playing around with your website.
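To make the blocking issue concrete, here is a toy sketch of the non-queue version: a dict stands in for the database and a sleep simulates write latency (all names are hypothetical):

```python
import time

pageviews = {}  # stands in for a database table

def handle_view(page_id):
    # Non-queue version: the request blocks until the database update finishes.
    time.sleep(0.005)  # simulated database write latency
    pageviews[page_id] = pageviews.get(page_id, 0) + 1
    return pageviews[page_id]  # the response is sent only after the write

start = time.monotonic()
for _ in range(20):          # 20 sequential page views
    handle_view("product-42")
elapsed = time.monotonic() - start
print(pageviews["product-42"], round(elapsed, 3))
```

Every request pays the full write latency before the user gets a response, and the latencies add up with the number of views.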
2. Queue-adopted version
Instead of directly sending an update query to the database, the API engine sends the queries to a job queue with several worker nodes waiting to process these jobs (increase the pageview value by 1).
- Non-blocking IO: Since the latency of pushing a task to a job queue is negligible compared to the job processing time, the response to the user takes place almost immediately, which improves the user experience.
- Throttling: Now if there are 10 workers, each processing only 1 job at a time, you will always have at most 10 jobs running in parallel. Even though 1000 users are viewing the same web page, only 10 update queries are performed at a time. In other words, the database will likely appreciate this approach.
- Performance: As a result of the throttling, there may be a large number of jobs still in the queue while all 10 workers are too busy to take them. So while the database appreciates this approach, the rest of the system is complaining. That’s a trade-off.
- Busy IO: Although the database only runs 10 queries at a time, it stays busy until all 1000 requests are done. This can adversely affect the performance of other tasks.
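A runnable sketch of the queue-adopted version with 10 workers, again with a dict as the database; note how `handle_view` only enqueues and returns immediately (all names are illustrative):

```python
import queue
import threading
import time

updates = queue.Queue()   # the job queue
pageviews = {}            # stands in for the database
db_lock = threading.Lock()

def worker():
    # At most 10 workers run, so at most 10 "queries" hit the DB at once.
    while True:
        page_id = updates.get()
        if page_id is None:            # sentinel: stop the worker
            break
        time.sleep(0.001)              # simulated database write
        with db_lock:
            pageviews[page_id] = pageviews.get(page_id, 0) + 1
        updates.task_done()

workers = [threading.Thread(target=worker, daemon=True) for _ in range(10)]
for w in workers:
    w.start()

def handle_view(page_id):
    updates.put(page_id)  # near-instant: enqueue and respond immediately

for _ in range(1000):     # 1000 concurrent-ish page views
    handle_view("product-42")

updates.join()            # wait for the queue to drain
for _ in workers:
    updates.put(None)
print(pageviews["product-42"])
```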
3. Combine different jobs
To mitigate the database bottleneck, caching is the first candidate, since a cache is much better at processing IO-heavy jobs than most databases. The idea of this version is to shift the burden to the cache and use a scheduled job to retrieve the cached data and write it to the database.
As you can see, the event job of increasing the pageview by 1 only interacts with the cache, and the scheduled jobs periodically fetch the uncounted views (-n views) from the cache and save them to the database (+n views).
What has been resolved:
- Performance: Since pageview data is temporarily cached, we can add more workers to leverage the IO processing ability of the cache and thereby lower the jobs’ processing time. In other words, the queue’s length is shortened.
- Further throttling: Since the database is updated by scheduled jobs, we can tune the number of update queries to an optimal value. For instance, if a scheduled job runs every 10s, there will be 6 update queries per minute instead of the 1000 queries of version 2.
Unfortunately, there is still another issue (sorry, I keep hitting you with issues, even this far in):
- Delayed data: With caching and scheduled jobs, the pageview is no longer updated in real time but with a delay depending on how frequently the scheduled jobs run. But this delay is worth tolerating, considering what the approach gains.
- It is very important to ensure atomic operations (please Google for more info) when building a system with background jobs. As you can see in the figure, the operations -n and +n are used because they are atomic in most databases and caches. One tip: do not get the counter and then set it to 0; instead, get the counter and then subtract its current value (which might already differ from the value you just read from the cache), to avoid losing pageview data during the reset.
- In Redis, we have GETSET (Redis <= 6.2) and SET with the GET option (Redis >= 6.2) to reset the counter atomically.
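The get-then-subtract tip can be sketched as follows. An in-memory class with a lock stands in for Redis here (in real Redis, INCRBY/DECRBY and GETSET give you this atomicity); the key names are made up:

```python
import threading

class Cache:
    # In-memory stand-in for Redis; the lock gives INCRBY/DECRBY-style atomicity.
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def incrby(self, key, n=1):
        with self._lock:
            self._data[key] = self._data.get(key, 0) + n
            return self._data[key]

    def decrby(self, key, n):
        return self.incrby(key, -n)

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)

cache = Cache()
database = {"page:1": 0}

def flush_pageviews(key):
    # Scheduled job: read the current count, then subtract exactly that
    # amount (-n) instead of resetting to 0, so views counted between the
    # read and the reset are not lost.
    n = cache.get(key)
    if n:
        cache.decrby(key, n)      # -n in the cache
        database[key] += n        # +n in the database

for _ in range(7):
    cache.incrby("page:1")        # event jobs: cheap cache increments
flush_pageviews("page:1")
print(database["page:1"], cache.get("page:1"))
```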
UC2- Sending a multicast email
This is a common requirement for news, subscription-enabled, reporting systems where specific content needs to be periodically summarized and sent to a set of users.
1. Non-queue version
There is only one scheduled job, which obtains the list of users from the database and sends the email to each of them. This approach suffers from:
- Latency: It grows as the number of users increases.
- Hard to retry: I really don’t want to imagine a situation in which that job fails and needs to start over, handling blah blah (sorry, too many things to list here).
2. Queue-adopted with jobs-combined version
In this version, the scheduled worker is the manager responsible for creating tasks for multiple event workers in parallel via the job queue.
- Scalability: The scheduled worker has only one task, that is, creating jobs and pushing them to the job queue.
- Reduced latency: Since the tasks can be processed by parallel workers, the total completion time is significantly reduced.
- Retryable: Since each job handles sending email to one user or a subset of users, a failure of one job does not affect the others. Also, only the failed job needs to start over, not the whole run.
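A sketch of this manager/worker split: the scheduled manager only slices the user list into batches and enqueues them, while the event workers do the sending in parallel (the batch size and addresses are illustrative):

```python
import queue
import threading

job_queue = queue.Queue()
sent = []                 # stands in for the "emails sent" side effect
sent_lock = threading.Lock()

def email_worker():
    while True:
        batch = job_queue.get()
        if batch is None:               # sentinel: stop the worker
            break
        for user in batch:              # a failed batch could simply be re-enqueued
            with sent_lock:
                sent.append(user)
        job_queue.task_done()

def scheduled_manager(users, batch_size=25):
    # The scheduled job's only task: partition the user list and enqueue batches.
    for i in range(0, len(users), batch_size):
        job_queue.put(users[i:i + batch_size])

workers = [threading.Thread(target=email_worker, daemon=True) for _ in range(4)]
for w in workers:
    w.start()

scheduled_manager([f"user{i}@example.com" for i in range(100)])
job_queue.join()
for _ in workers:
    job_queue.put(None)
print(len(sent))
```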
UC3- ETL Process
ETL stands for Extract, Transform, Load, a term for the procedure of moving data from a source system to a destination system. Typically, it copies data from operational systems to analytics and reporting systems.
Although there are many tools designed for this job, e.g. Apache Kafka and Spark, I mention this use case to give you another insight into how background jobs and queues are adopted. Note that during the ETL process, all jobs need to be retryable or re-runnable without errors or redundant data.
1. Simple queue version
I believe this is what comes to your mind first. The idea is simple: every, say, 1 hour, you start a job that pulls all the data from the source database, does some blah blah things with that data, and pushes it to the destination database. That’s it.
The issues are the same as with version 1 of UC2, only more severe:
- Latency: It takes both time and resources to move data between the 2 databases. If you set a long interval, the data at the destination database can become outdated; with a short interval, the previous job may not have finished when a new one starts.
- Hard to retry: yeah, here it is again.
2. Multiple scheduled-jobs version
Here the ETL process is divided into 3 sub-processes handled by 3 different scheduled jobs. With only a few data records, these 3 sub-processes will likely run one after another. However, as the data volume grows, they can run in parallel and, as a result, the convergence time can be considerably reduced.
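The 3 scheduled sub-processes and the status-tracking concern can be sketched like this; the staging dict, the stage names, and the toy transformation are assumptions for illustration, not a prescribed schema:

```python
# Three scheduled jobs, each moving records one stage forward and recording
# the stage in a status field so a failed or repeated run stays idempotent.
source = [{"id": i, "value": i} for i in range(5)]   # source database
staging = {}                                         # id -> record with current stage
target = []                                          # destination database

def extract():
    for rec in source:
        if rec["id"] not in staging:                 # re-runnable: skip known ids
            staging[rec["id"]] = {**rec, "stage": "extracted"}

def transform():
    for rec in staging.values():
        if rec["stage"] == "extracted":
            rec["value"] *= 10                       # stand-in transformation
            rec["stage"] = "transformed"

def load():
    for rec in staging.values():
        if rec["stage"] == "transformed":
            target.append({"id": rec["id"], "value": rec["value"]})
            rec["stage"] = "loaded"                  # re-running load adds no duplicates

for scheduled_job in (extract, transform, load):
    scheduled_job()
load()   # re-run: idempotent, nothing is loaded twice
print(len(target))
```

Because each stage only picks up records in the preceding status, re-running any job after a failure neither loses data nor produces duplicates.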
The only issue you need to handle is tracking the data status: at which stage is the data being processed, is that stage done, did it succeed or fail, and, most importantly, is it in order?
3. Hybrid version
This approach adds scalability to version 2 by adding event jobs for each process step, i.e. E, T, and L. It resolves most of the concerns about performance, scale, and delay, and is applicable to systems that need to synchronize data with low latency, thanks to the reduced duration of each ETL cycle.
- Tracking data status: you need to know at which stage each piece of data is being processed.
- Data paging: The scheduled jobs should be able to partition the data and assign it to event jobs for parallel processing. Such paging information can be stored in a table as shown in the following figure. Just note that efficient data paging is really not a simple task :)
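As a toy sketch of such a paging table (the column names, statuses, and sizes are hypothetical, not taken from the figure):

```python
# The scheduled job partitions pending rows into pages; each event job
# then claims one page and processes only that slice.
paging_table = []

def schedule_pages(total_rows, page_size):
    for offset in range(0, total_rows, page_size):
        paging_table.append({
            "offset": offset,
            "limit": min(page_size, total_rows - offset),
            "status": "pending",
        })

def claim_page():
    # An event worker picks the first pending page and marks it running,
    # so no two workers process the same slice.
    for page in paging_table:
        if page["status"] == "pending":
            page["status"] = "running"
            return page
    return None

schedule_pages(total_rows=95, page_size=30)
first = claim_page()
print(len(paging_table), first["offset"], paging_table[-1]["limit"])
```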
I hope this helped you develop an overview of the 2 types of background jobs and how they can be applied in practice to improve system performance.
I would like to send my big thanks to Quang Minh (a.k.a Minh Monmen) for the permission to translate his original post.
Originally published at https://emerging-it-technologies.blogspot.com.