Building a Distributed, Fault-tolerant scheduler with Go and Redis.

Aarthik Rao
3 min readMay 11, 2019

--

Photo by freestocks.org on Unsplash

The aim is to build a distributed scheduler that can handle thousands of jobs in a small time frame.

Why Golang:

It uses Go-routines for concurrency. They are lightweight. You can spin up thousands of Go-routines for processing the jobs. Due to their very small initial stack, memory consumption is less. The Go runtime manages the routines throughout from creation to scheduling to teardown. All we have to do is run go process(some_job) a thousand times. We can use channels for communications between Go-routines.

Why Redis:

Redis is an in-memory data store that supports very fast writes and reads. The pop operation on a Redis list will guarantee that only one client will get the job at a time.

The Plan:

Build a job scheduler that:

  • Accepts a request to schedule a job at a certain time.
  • is fault tolerant.
  • is distributed.

Concept :

Sorted sets are similar to sets in which every member has a score associated with them. The set is sorted based on this score. While members are unique, scores may be repeated. This data structure is used to store jobs.

On a request to schedule a job(through a REST API), the application will add the task to the scheduleRedisKey and backupRedisKey sorted sets.

Processing a job :

The processor routine will use BZPOPMIN to fetch the member with the least score (in our case time) from scheduleRedisKey. If there are no members, the connection will block for a certain time. This will reduce the polling frequency. If the time for the job fetched from the scheduleRedisKey set is less than maxTimeWindow(a fixed time), the routine will wait for the specified time and execute the job.

Since Go-routines are scheduled cooperatively, we use time.Sleep(500 * time.Millisecond) so that the current Go-routine will block. We will execute the job by publishing the job details over RabbitMQ, Kafka or HTTP to the client (a microservice).

Photo by JESHOOTS.COM on Unsplash

The Feedback loop:

To make this application fault tolerant, a feedback routine is running in every node. Note that we have added the job to scheduleRedisKey and backupRedisKeywhile scheduling it. In the processor routine, we pop the job from the scheduleRedisKey and on the successful execution of the job, we remove the job from the backupRedisKey. In case a node fails during execution, the job will remain in the backupRedisKey. The feedback routine will fetch these jobs and reschedule it.

Putting it all together :

The full source code will be found in this GitHub repo:

Improvements:

  • In case the next job lies in the future (say 2 hours), our processor routines will repeatedly fetch and reschedule the job. We can add a watcher instead, and wait for it to notify us of new incoming jobs so we save all those network calls.
  • We can synchronize the feedback routines with locks so that every node doesn’t reschedule the same job again. Note that one instance of a feedback routine is running in every node.
  • We could use a database to store the backupRedisKey set. However, this might increase the write latency.

P.S : This is a just a concept and needs work before it is production ready

--

--

Aarthik Rao

Backend, Game developer. Interested in technology, philosophy and politics. Twitter: @aarthikrao