Celery at Scale at 23andMe

23andMe Engineering
6 min read · Mar 2, 2022

Scanner Luce, Technical Lead Engineer at 23andMe

The shape of the problem

At 23andMe, we maintain customer-facing websites as well as backend compute pipelines that process a staggering amount of data.

This requires a system that can execute tasks quickly and reliably, both asynchronously and on a schedule. We need to know how reliably these tasks executed and how to deal with any kind of failure. Some tasks may execute quickly; others may take hours or days to complete. Task definitions therefore need to be extremely flexible.

The shape of the solution

We are using Celery with RabbitMQ to execute tasks quickly and reliably. For the execution of scheduled tasks, we use celery-beat. These are well-understood, mature, and reliable systems.
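For illustration, a scheduled task wired up with celery-beat might look roughly like the sketch below; the app name, task, schedule, and queue are placeholders rather than our actual configuration.

# A minimal sketch of a celery-beat scheduled task; all names and the schedule
# are illustrative placeholders, not 23andMe's real configuration.
from celery import Celery
from celery.schedules import crontab

app = Celery("pipeline", broker="amqp://guest@localhost//")

@app.task
def ingest_lab_results():
    """Placeholder for work driven by data arriving from the labs."""

app.conf.beat_schedule = {
    "nightly-lab-ingest": {
        "task": ingest_lab_results.name,
        "schedule": crontab(hour=2, minute=0),  # run once a day at 02:00
        "options": {"queue": "batch"},          # send to a batch-style queue
    },
}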

Here, we lay out how we deploy Celery and celery-beat at 23andMe, and how they serve as the glue between our compute pipelines and our customers, handling millions of requests per day.

Understanding the types of work to be done

The tasks we are asking this system to accomplish fall into three main categories:

  1. Responsive requests that are too slow or involved to handle within a web request, but still need to be completed as quickly as possible for a user of our services.
  2. “Batch” or “cron” processing, which is work that is driven either by a schedule or by the ingestion of data into the system from sources such as our genetic data processing labs.
  3. High-priority tasks that monitor, control, dispatch, and report administration and status of the entire system. These must not be blocked by interactive or batch requests.

Do more without working harder

We need a system that we can maintain, deploy, debug, and monitor without needing an expansive staff. There are better things to do with our time than micromanage Celery workers to keep up with an asynchronous release process. The release process itself should remain independent of how the underlying Celery workers go about their daily lives.

Along with that, we strive to use as many off-the-shelf solutions as we can. As such, we are using Amazon Web Services (AWS) for a majority of our services, because it offers us a lot of tools we can leverage out of the box.

The bits and pieces

Splitting work into AWS Auto Scaling groups

One of the things that Celery offers is the ability to dispatch work to various queues and have workers pull tasks off of those queues. Creating a queue is cheap, and we use many more queues than the handful described here.

We use an AWS Auto Scaling group (ASG) for each kind of work we need to do. The ASGs are all defined from the same template, with custom parameters controlling which kind of EC2 instance they use, how instances are allocated, and the minimum number of instances. This makes it easy to create a new queue when we need one. The `fast` queue keeps a certain minimum number of instances to respond to user traffic and grows as it needs to. The `batch` and `spot` queues have less need to grow quickly and can take pricing pressure into account. The administrative `priority` queue does not need many instances, but a certain number must always exist. The `celery` queue is the default: it handles traffic that is interactive but does not have the same response-time requirements as the `fast` queue.
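As a rough illustration (the task names and module paths are made up, not our real code), Celery's task_routes setting is how tasks end up on queues like these:

# A hedged sketch of routing tasks to named queues with Celery; the task
# names and glob patterns are illustrative placeholders.
from celery import Celery
from kombu import Queue

app = Celery("pipeline", broker="amqp://guest@localhost//")

# Declare the queues that workers will consume from.
app.conf.task_queues = (
    Queue("celery"),    # default queue
    Queue("fast"),
    Queue("batch"),
    Queue("priority"),
)
app.conf.task_default_queue = "celery"

# Route tasks to queues by name or pattern; anything unmatched falls back
# to the default queue.
app.conf.task_routes = {
    "reports.generate_user_report": {"queue": "fast"},
    "ingest.*": {"queue": "batch"},
    "admin.*": {"queue": "priority"},
}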

Since our workloads always have work to do, we have not had to worry about an ASG that needs to scale up from zero when work arrives.

The CloudFormation template used for these ASGs is tied to the behavior of the Celery worker processes themselves, which report back to AWS how active the ASG is.

The key elements of this CloudFormation template look like this:

AWSTemplateFormatVersion: '2010-09-09'
Description: Celery worker stack
Parameters:
  ASGMaxSize:
    Description: Maximum size of the ASG
    Type: Number
  ASGMinSize:
    Description: Minimum size of the ASG
    Type: Number
  MaxInstanceLifetime:
    Description: MaxInstanceLifetime of the ASG
    Type: Number
    Default: 5400

Resources:
  WorkerCluster:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      Cooldown: 30
      MinSize: !Ref ASGMinSize
      MaxSize: !Ref ASGMaxSize
      MaxInstanceLifetime: !Ref MaxInstanceLifetime

  WorkerCPUScalingPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WorkerCluster
      PolicyType: TargetTrackingScaling
      Cooldown: 30
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 70
        # Scale-in is disabled; the workers themselves decide when an instance goes away.
        DisableScaleIn: true

  WorkerTasksScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName: !Ref WorkerCluster
      Cooldown: '30'
      ScalingAdjustment: !FindInMap [ CeleryInstanceType, ScaleUpAdjustment, !Ref CeleryInstanceType ]

The Celery broker (RabbitMQ)

Among message brokers, RabbitMQ is the gold standard for high throughput, low latency, reliability, and high-availability clustering.

It may seem like a daunting system to set up, but both RabbitMQ and AWS have out of the box solutions for this, such as Automatic Peer Discovery plugins and detailed guides for AWS EC2.
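As a minimal sketch, assuming a RabbitMQ cluster reachable at placeholder hostnames, Celery can be handed a list of broker URLs and will fail over between nodes:

# A hedged sketch of pointing Celery at a RabbitMQ cluster; the hostnames,
# credentials, and vhost are placeholders, not our actual deployment.
from celery import Celery

app = Celery("pipeline")
app.conf.broker_url = [
    "amqp://worker:secret@rabbit-1.internal:5672/prod",
    "amqp://worker:secret@rabbit-2.internal:5672/prod",
    "amqp://worker:secret@rabbit-3.internal:5672/prod",
]
# Keep retrying the broker connection rather than giving up.
app.conf.broker_connection_max_retries = None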

The life cycle of a Celery worker

Traditionally, Celery workers are long-lived processes that keep running until they are restarted for a new release (or simply terminated due to lack of work to do). We have instead decided that Celery workers will run for around one hour. After that, they will stop accepting new work, and when all currently processing work is finished they will quietly exit. The ASGs are set to scale up when existing workers are above a certain level of activity.
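A rough sketch of that behavior, assuming the worker requests its own warm shutdown through Celery's control interface after a timer expires (the names and timeout are illustrative, not our production implementation):

# A hypothetical sketch of a worker that arranges its own warm shutdown
# after roughly one hour; not 23andMe's actual worker code.
import threading

from celery import Celery
from celery.signals import worker_ready

app = Celery("pipeline", broker="amqp://guest@localhost//")

WORKER_LIFETIME_SECONDS = 3600  # roughly one hour; illustrative value

@worker_ready.connect
def schedule_warm_shutdown(sender=None, **kwargs):
    # `sender` is this worker's consumer; it knows its own node name.
    hostname = sender.hostname

    def request_warm_shutdown():
        # Warm shutdown: stop accepting new tasks, finish in-flight ones, then exit.
        sender.app.control.shutdown(destination=[hostname])

    timer = threading.Timer(WORKER_LIFETIME_SECONDS, request_warm_shutdown)
    timer.daemon = True  # do not block an earlier shutdown on this timer
    timer.start()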

Also, the ASGs are configured to only scale up, never down. We rely on the Celery workers themselves to tell AWS whether a new instance should be created to replace the terminating one. Each Celery worker is aware of how active it has been, and if a worker reaches the end of its one-hour run while idle, it instructs AWS not to launch a replacement instance when it exits. With this, the ASG quietly scales down when there is little work to do and automatically scales up as more work arrives. We set minimums for the number of workers to handle instantaneous load and give the ASG enough time to scale up to handle new work.
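The "do not replace me" step can be sketched against the AWS APIs, assuming boto3 and the EC2 instance metadata service are available; the helper below is hypothetical, not our actual code:

# A hedged sketch of terminating the current instance while telling the ASG
# whether to launch a replacement. (IMDSv2 would additionally require a token.)
import urllib.request

import boto3

def terminate_self(decrement_capacity: bool) -> None:
    # Look up this instance's ID from the EC2 instance metadata service.
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()

    autoscaling = boto3.client("autoscaling")
    # ShouldDecrementDesiredCapacity=True tells the ASG not to launch a replacement.
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=decrement_capacity,
    )

# An idle worker reaching the end of its hour would call terminate_self(True);
# a busy one exits normally and lets the ASG replace the instance.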

Furthermore, with workers constantly restarting, a new release automatically propagates to all the workers within an hour. The only configuration needed is to set which Amazon Machine Image (AMI) new instances will use. Rolling back is just as easy. If we need to cycle instances quickly, we use the Celery remote command system to tell existing workers to finish their tasks and exit. When a worker exits, its EC2 instance in the ASG terminates. If there are then not enough workers, the ASG scales up with the configured AMI.

Monitoring and Logging

Failure is an unavoidable part of software engineering. Being aware of failures, seeing stack traces, analyzing performance, and detecting bottlenecks are day-to-day concerns for any large modern system. Luckily, there is a plethora of third-party solutions that can report on and analyze such failures. The task for us becomes how to funnel this information to those providers.

Celery has a facility for handling events. This lets us capture high-level events as the Celery workers process tasks. Relaying these to a system such as Sumo Logic lets us debug, alert, and monitor unusual conditions and failures. New Relic provides us with performance and availability monitoring. We are armed with powerful tools that let developers understand failures and performance issues without needing access to the production systems themselves.
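For illustration, a small event monitor along the lines of the real-time monitor example in the Celery documentation might look like the sketch below; forwarding to a provider such as Sumo Logic is only hinted at, and the names are placeholders.

# A hedged sketch of consuming Celery worker events; the handler just prints,
# where a real system would ship the event to a log/alerting provider.
from celery import Celery

app = Celery("pipeline", broker="amqp://guest@localhost//")

def run_event_monitor(app):
    state = app.events.State()

    def on_task_failed(event):
        state.event(event)
        task = state.tasks.get(event["uuid"])
        # In practice this would be forwarded to a log aggregator or alerting tool.
        print(f"task failed: {task.name}[{task.uuid}] {task.exception}")

    with app.connection() as connection:
        recv = app.events.Receiver(
            connection,
            handlers={
                "task-failed": on_task_failed,
                "*": state.event,  # keep the in-memory state up to date
            },
        )
        recv.capture(limit=None, timeout=None, wakeup=True)

if __name__ == "__main__":
    run_event_monitor(app)

Note that workers must be started with task events enabled (the -E flag or the worker_send_task_events setting) for these events to be emitted.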

Conclusions

  1. Using these components, we can handle millions of asynchronous requests per day. Scaling up becomes a matter of tweaking parameters in AWS CloudFormation templates.
  2. Deploying a new release does not require a system that manually shuts down and restarts workers to cycle them onto the new code. Once a release has been promoted, the Celery workers’ one-hour life cycle means it flows out automatically.
  3. Leveraging Celery’s event and worker system lets us collect logs, metrics, and instrumentation that any developer on the team can browse, monitor, and investigate.

About the Author

Scanner Luce is a software developer who has been working on queueing and computation systems for over 30 years. Their favorite motto regarding programming is “treat all code as if it were production code, because if you do not, it will be.” Their guiding principle when writing code is that readability is often more important than performance.

23andMe is hiring! Check out our current openings!
