Configuring the KubernetesExecutor to Hum at Etsy
By Victor Chiapaikeo and Aldo Orozco Gómez Serrano
The postings on this site are our own and represent our own views and opinions, not those of Etsy.
In 2022 when we finally decided to move our Airflow platform from 1.10.X to 2.X, we weren’t sure what to expect. We were moving from 7 LocalExecutor instances each on their own VM to a single production instance running KubernetesExecutor. Would it scale? Would it stay reliable? And what about performance?
Fortunately, today we are running over 32k tasks on any given execution date on a single production instance of Airflow. It took many months for us to get here, and there were plenty of obstacles along the way. In this article, we’ll discuss some of the main knobs that we turned and how we ultimately configured Airflow to run at scale, reliably and performantly at Etsy.
The Database
One of the most important components to get right off the bat is the metadata database. You’ll hear a lot of talk in the community that Postgres, fronted by a pgbouncer proxy for connection pooling, is the way to go. However, we ultimately settled on MySQL because of Etsy’s in-house expertise there. That said, MySQL out of the box did not perform well for us. Early on, we saw repeated locking issues and the occasional deadlock.
We approached these problems from both the database and the application end. On the database side, we changed the transaction isolation level from its default of REPEATABLE READ to READ COMMITTED. While Airflow’s SQLAlchemy configuration does change this at the session level, any other db connection made outside of Airflow might not do so. We also bumped innodb_lock_wait_timeout from the default of 50s to 100s to give stubborn locks more time to complete, and set long_query_time to 10s so that we could investigate any slow-running query further.
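In my.cnf terms, the database-side changes above look roughly like this. This is a sketch assuming MySQL 8.0 variable names (on older versions the isolation variable is tx_isolation), and it enables the slow query log, which long_query_time depends on:

```ini
# Illustrative my.cnf fragment for the database-side tweaks described above
[mysqld]
# Airflow sets READ COMMITTED per session; this covers connections made outside Airflow
transaction_isolation    = READ-COMMITTED
# Give stubborn locks more time before erroring out (default is 50)
innodb_lock_wait_timeout = 100
# Log anything slower than 10s for later investigation
slow_query_log           = 1
long_query_time          = 10
```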
On the application side, there were more levers to pull. A number of Airflow’s default configurations end up adding extra stress on the db. For example, AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK, which defaults to 30, causes the scheduler to perform deletes on the rendered_task_instance_fields table, which has been known to result in deadlocks. We changed this to 0 to skirt those deletes. Additionally, we made xcom pushes opt-in instead of opt-out. do_xcom_push is a parameter on BaseOperator that defaults to True. To disable it cluster-wide, you can use a task policy.
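A minimal sketch of such a task policy, placed in an airflow_local_settings.py that the scheduler can import (the file location shown in the comment is one common convention, not necessarily ours):

```python
# airflow_local_settings.py -- must be importable by the scheduler,
# e.g. dropped into $AIRFLOW_HOME/config/.

def task_policy(task):
    """Cluster-wide policy Airflow applies to every task at DAG parse time."""
    # Make XCom pushes opt-in: flip the BaseOperator default off here;
    # tasks that actually need XCom can set do_xcom_push=True themselves.
    task.do_xcom_push = False
```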
DAG Syncing / Processing
On DAG syncing, we tried a number of different options like Git sync, GCS Fuse, and even using a gsutil cp / rsync in an initContainer. None worked quite as well as using network attached storage, particularly GCP’s Filestore. To get dags onto the file system, we deployed a simple Flask server with a read/write volume mount that accepted REST API calls from our CI builds. And to read dags, we simply mounted a read-only volume mount to our worker pods within the base pod-template.yaml.
On DAG processing, we followed advice in this FAQ on speeding up DAG processing by setting AIRFLOW__SCHEDULER__FILE_PARSING_SORT_MODE to modified_time. We also raised AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL to 300 seconds and AIRFLOW__SCHEDULER__PARSING_PROCESSES to 6. However, we decided to lower AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL to 60 seconds so that users with new dags would see them show up more quickly. We had difficulties adopting the standalone dag processor, so our scheduler replicas perform DAG processing. We therefore also wanted to ensure they weren’t being overwhelmed while still parsing dags in a timely manner.
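Expressed as environment variables, the DAG-processing settings above look like this (these are our values; tune them for your own DAG count and parse times):

```shell
# Parse most-recently-modified DAG files first
export AIRFLOW__SCHEDULER__FILE_PARSING_SORT_MODE=modified_time
# Re-parse an unchanged file at most every 5 minutes
export AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=300
# Parallel parsing processes per scheduler replica
export AIRFLOW__SCHEDULER__PARSING_PROCESSES=6
# Scan the dags folder for brand-new files every minute
export AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=60
```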
K8’s Configs and Resources
Getting Airflow to operate nicely within Kubernetes required even more late evenings. This section is very particular to how you run your K8’s cluster, whether it is managed externally, on AWS’s EKS, on GCP’s GKE, or something else. Luckily, at Etsy, we have an adept Kubernetes SRE team and a few members on our team with prior experience. We were able to spot unusual behavior in system and audit logs like a Calico pod causing Airflow worker pods to get preempted, a downscaler / descheduler killing our low-resource utilizing pods in the middle of the night, or Gatekeeper mutations that caused unusual downstream behavior during pod adoption steps. A bit of a grab-bag here that is highly dependent on your K8’s setup.
However, there are some configurations on the application side that we feel apply to all Airflow K8’s environments operating at scale. Some of these are emphasized in the documentation and others are less obvious. Here are a few to ensure you can eke out better performance from your cluster, all the while making sure you don’t hose it:
- AIRFLOW__CORE__PARALLELISM — this is easily one of the most important settings and, while it isn’t part of the CNCF Kubernetes subcategory, it does have a direct effect on how tasks are scheduled by each scheduler replica. You’ll find that if you bump this too much, you will run into DB / K8’s issues, and if you don’t bump it enough, your thundering herd of tasks at midnight UTC will be at a standstill. Coming from LocalExecutor, we made the mistake of having this set at 0 (unlimited), which caused instability and locking issues on our db. We settled on setting this to about 10x the default (of 32).
- AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY — we also bumped this in tandem with parallelism. You can set it to 0 which will cause it to inherit the value of parallelism.
- AIRFLOW__KUBERNETES_EXECUTOR__API_CLIENT_RETRY_CONFIGURATION — there are times when KubeAPI operations will fail, often due to network issues. We want them to retry, even the POST requests. This non-obvious urllib3 retry configuration does that for us: {"total": 3, "backoff_factor": 0.5, "allowed_methods": ["DELETE", "GET", "HEAD", "OPTIONS", "PUT", "TRACE", "POST"]}.
- AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE — this defaults to 1, which means only 1 pod gets created per scheduler loop! The configuration notes strongly suggest increasing it. Ours is set to 300.
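Pulling the list above together as environment variables (the parallelism value is illustrative, about 10x the default of 32 as described; raise it gradually while watching your DB):

```shell
# Roughly 10x the default of 32 -- too low starves the midnight herd,
# too high destabilizes the DB and the KubeAPI
export AIRFLOW__CORE__PARALLELISM=320
# 0 means "inherit the parallelism value"
export AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY=0
# urllib3 Retry kwargs; note POST is included so failed pod creations retry too
export AIRFLOW__KUBERNETES_EXECUTOR__API_CLIENT_RETRY_CONFIGURATION='{"total": 3, "backoff_factor": 0.5, "allowed_methods": ["DELETE", "GET", "HEAD", "OPTIONS", "PUT", "TRACE", "POST"]}'
# Worker pods created per scheduler loop (default is 1)
export AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE=300
```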
One final consideration: make sure your base pod-template.yaml has sufficient resources. We give it at least 500m CPU and half a GB of memory. Under-resourced requests / limits can lead to K8’s throttling a worker pod or, worse, the OOM killer terminating it.
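As a fragment of pod-template.yaml, the resource floor above might look like this (the worker container in a KubernetesExecutor pod template is conventionally named base; set limits to whatever headroom suits your workloads):

```yaml
# pod-template.yaml fragment (illustrative)
spec:
  containers:
    - name: base  # the Airflow worker container in the pod template
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
```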
More Airflow-Specific Configuration Changes
There are a few other configuration changes that you should check to ensure high performance. For example, the config AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION turns on the mini-scheduler. This defaults to True but we had set it to False initially because there were db locking problems when it was introduced. On our current version, 2.6.3, it runs fine.
Another non-obvious setting is AIRFLOW__DATABASE__SQL_ALCHEMY_CONN_CMD — note the CMD at the end. On certain SQLAlchemy models, we found that the connection string was not being cached and the db connection commands needed to be re-eval’d on every call, and for every record! This issue (https://github.com/apache/airflow/issues/33485) was eventually fixed in 2.7.1. We simply moved to using the AIRFLOW__DATABASE__SQL_ALCHEMY_CONN (non-cmd) alternative to work around this.
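That workaround is just a one-line swap; the host and credentials below are placeholders, not our real values:

```shell
# Use the static connection string instead of re-evaluating a command per call.
# user / pass / mysql-host are placeholders.
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN='mysql+mysqldb://user:pass@mysql-host:3306/airflow'
# And drop the _CMD variant from your deployment entirely.
unset AIRFLOW__DATABASE__SQL_ALCHEMY_CONN_CMD
```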
One last setting to check is the CONNECTION_CHECK_MAX_COUNT environment variable, which lives in the base spec of pod-template.yaml and also in the deployments. A value greater than 0 means that the container will issue a pre-ping to the database to ensure the connection is alive before proceeding, which increases startup time. We have paging alerts around our DB, so we disabled this check by setting the count to 0.
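In the pod template and deployments, that is a one-line env entry. A fragment for illustration; only disable the check if, like us, you have independent alerting on database health:

```yaml
# pod-template.yaml / deployment fragment: skip the DB pre-ping on startup
env:
  - name: CONNECTION_CHECK_MAX_COUNT
    value: "0"
```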
Upgrade, Reduce Your Image Size, Image Stream on your K8’s Cluster (if available), and Pre-Compile Modules in your Image
If you haven’t yet, consider upgrading to at least Airflow 2.6.3 and Python 3.11. Both Airflow version upgrades and Python version upgrades lowered our intertask latency times. We estimate that by doing these upgrades, we were able to squeeze another 2–3 seconds off our average intertask latency.
Additionally, find ways to reduce your image size and, if your K8s cluster has the option, turn on image streaming. One of the biggest bottlenecks we found occurred during autoscaling of the K8’s cluster: new nodes would not have a cached image and would need to pull the Airflow image from our image registry. Check out steps online for reducing a Docker image’s size so that pull times are reduced.
One final non-obvious optimization was given to us by Jarek (@potiuk). He mentioned that pre-compiling our Python modules could give us a performance boost. Instead of having each pod generate byte code from source files lazily and independently of each other, we perform bytecode compilation once during image build. We added this line at the very end of our Dockerfile and saw a notable improvement in our task times.
# Pre-compile python modules so tasks spin up faster
RUN find ~/.local -name '*.py' | xargs python -m py_compile
Conclusion
Getting Airflow to perform at the level we wanted to was a challenging but rewarding adventure. We saw average intertask latencies reduced from over 10 minutes to just under 12 seconds. Not only did this provide a vastly better developer experience for our users; it also allowed us to reduce our cloud spend.
It’s worth noting that your mileage may vary with any one of these recommendations. We suggest performing controlled experiments to ensure the configuration change suits your specific workload, database size, and cluster. Some of these trade-offs may not make sense for you. For example, maybe extra minutes of build time is not worth the seconds saved in runtime task latency.
Lastly, we’d be remiss if we didn’t mention the rest of the team that worked tirelessly to get us to the state we are in today. Our teammates Matthew Hall, Eric Soares, Matt Usifer, Brent Hagany, and Dan Winter all made this possible. Additionally, we would be nowhere without the incredible cast of open source contributors humoring us with our Github issues and PRs.

