A Distributed Job Scheduler Story | Part I

Dogukan Tuna
Published in Trendyol Tech
7 min read · Dec 23, 2022

Hello everyone! In this blog post, we will introduce the Developer Experience team at Trendyol, explain why we chose Argo Workflows as our job scheduling platform, and share our experience at ArgoCon. The work covered in this Distributed Job Scheduler series was presented at CNCF ArgoCon 2022. Before we get to the story, let’s start with CNCF, ArgoCon, and Trendyol’s participation.

P.S.

Before getting started, I would like to thank all DevX and SRE Tools members for their contributions to the topics shared in this article, especially İsmail Bülbül, Erkan Zileli, and Hüseyin Celal Öner for their amazing work. This work was done equally by me, İsmail Bülbül, Erkan Zileli, and Hüseyin Celal Öner.

CNCF and Its Relation to ArgoCon

The Cloud Native Computing Foundation (CNCF) is the open source, vendor-neutral hub of cloud native computing, hosting projects like Kubernetes and Prometheus to make cloud native universal and sustainable. There are hundreds of hosted projects under CNCF. The Argo projects are among them: they provide Kubernetes-native tools to run workflows, manage clusters, and do GitOps right. Argo was accepted to CNCF on March 26, 2020 and is at the Graduated project maturity level with its growing community. You can visit the Argo website to see all Argo projects and get more information.

You can see the complete landscape of CNCF projects below:

Interactive Website: https://landscape.cncf.io/

The Argo Project organizes community events every year, hosted by CNCF. ArgoCon is designed to foster collaboration, discussion, and knowledge sharing on the Argo Project, which consists of four projects: Argo CD, Argo Workflows, Argo Rollouts, and Argo Events.

This event is aimed at audiences that are new to Argo as well as providing depth to those currently using Argo within their organization. The event is vendor-neutral and is being organized by the Cloud Native Computing Foundation (CNCF) Argo Community. Topics in the past have included getting started with Argo, scaling and managing Argo, lessons learned from production deployments, technical sessions, and thought leadership.

As a recap of 2022, we took our place at ArgoCon 2022 as Trendyol Group. When the call for proposals (CFP) opened for ArgoCon, we thought it would be valuable to share our internal developments. The CFP application process was easy and clear. After our topic was accepted, we presented our distributed job scheduler and RBAC infrastructure story with Argo Workflows for more than 100 teams. The event was held both in person and online. My colleague İsmail Bülbül and I attended as virtual speakers.

Our presentation was mentioned in the ArgoCon 2022 Wrap as follows: “İsmail and Doğukan from Trendyol flexed massive scale in a Distributed Job Scheduler Story: CronWorkflows integration with RBAC infrastructure for Over 100+ Teams.”

Here you can read the whole wrap. Kudos to all DevX team members!

You can also view our presentation on the CNCF YouTube channel.

So, let’s start!

Table of Contents

  1. About Developer Experience on Trendyol
  2. What’s the current status of Trendyol with Argo Projects?
  3. Path to Being A Kubernetes Native with Argo Workflows

About Developer Experience on Trendyol

Let’s start with introducing the DevX team.

As part of the DevX team at Trendyol, we embrace a culture of constantly searching for emerging technologies to augment our platform stack. The main focus of the DevX team is to increase developer productivity and promote engineering best practices while providing a smooth, highly available, and consistent platform tools experience to our internal technology customers.

If you want to learn more about the DevX team, please check this out: Gokhan Karadas comprehensively explains “Why Developer Experience Is Critical?”

In a nutshell, we try to provide unified solutions to common problems across Trendyol technology teams, in collaboration with many other teams such as SRE, networking, and other platform teams. That is exactly why we decided to create a working group to adopt Argo Workflows at Trendyol at scale in 2022. But the question is: why did we need a job scheduler, and how did we arrive at a solution built on Argo Workflows?

If you wish, let’s look at the details and see how and why.

Current Status of Trendyol with Argo Projects

First of all, we can dive into our current tech status.

The image above shows a small portion of our infrastructure metrics. We run more than 9000 services, and more than 8000 of them are already registered on Argo CD. We operate at large scale and deliver our core services to millions, so we have to be sure of our needs. Our current technology stack relies heavily on the Kubernetes ecosystem at the cloud-native layer; we currently have more than 400 clusters. As Trendyol, we are one of the fastest adopters of Argo Projects and maintain nearly the whole Argo product family internally: Argo CD, Argo Workflows, Argo Rollouts, and Argo Events.

All of our infra metrics are publicly available for those who are curious about Trendyol’s infrastructure, so please feel free to explore them. There is a lot of real-time information about regions, memory usage, CPU utilization, VMs, clusters, and many more infrastructure-related things that power Trendyol’s platforms.

You can find all the metrics in real time here: inframetrics.trendyol.com/

Path to Being a Kubernetes Native with Argo Workflows

But how did we get here and how did we reach the Argo Workflows path?

We all know that maintaining and scheduling service jobs in big technology companies can be pretty challenging. Under such conditions, innovation is surely needed. In our situation, at first, teams kept their jobs on different platforms and environments. Failures and downtime were constantly rising, runners were getting blocked, CPU utilization was climbing, and there was no visibility. Jobs lacked stability, manageability, and effectiveness. We had more than 1000 scheduled jobs distributed across different tools and solutions, so handling CronJobs this way was an overhead for us. This increased our error rates and eventually caused outages. In time, developers’ needs landed in our backlog because the developer experience had to be improved. Eventually, our search for a new platform started.

Evaluation Criteria

The first criterion was to handle and track jobs without issues or complex configuration. We also wanted a unified solution for our distributed job scheduling needs, easy scalability, and support for disaster recovery scenarios. We identified the platforms that could provide this and examined them in detail for comparison. We compared and scored various platforms for days, then analyzed which of them could manage workflows without pain for more than 100 teams and more than 1000 jobs.

Airflow was the second platform we weighed in the comparison. It is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. Airflow was built primarily for finite batch workflows. While the CLI and REST API do allow triggering workflows, Airflow was not built for infinitely running event-based workflows or smooth containerization. Airflow supports Kubernetes as well as other third-party integrations, but it is not directly part of the Kubernetes ecosystem. Also, it is not as easy to scale in parallel, and deployment needs extra effort, which varies with the cloud environment you choose.

Reasons for Argo Workflows

  • As a Kubernetes-native workflow tool, it lets you run each step in its own Kubernetes pod
  • Easy to scale because steps can be executed in parallel
  • Workflow templates offer reusability
  • Similarly, artifact integrations are also reusable
  • DAG is dynamic for each run of the workflow
  • Low Latency Scheduler
  • It has event-driven workflows
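To make a few of these points concrete, here is a minimal sketch in Python that builds an Argo `Workflow` manifest as a plain dictionary. It is not taken from our platform: the task names, the `alpine:3.18` image, and the helper functions are all hypothetical and only illustrate how a DAG of container steps is described. The two `extract` tasks have no dependencies, so Argo is free to run them in parallel, each in its own pod, while `merge` waits for both.

```python
import json


def container_template(name, image, command):
    """Reusable template: every task that references it runs in its own pod."""
    return {"name": name, "container": {"image": image, "command": command}}


def dag_task(name, template, dependencies=None):
    """A DAG node; tasks without dependencies are eligible to start together."""
    task = {"name": name, "template": template}
    if dependencies:
        task["dependencies"] = list(dependencies)
    return task


def build_workflow():
    """Hypothetical three-task DAG: two parallel extracts, then a merge."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "example-dag-"},  # assumed name prefix
        "spec": {
            "entrypoint": "main",
            "templates": [
                # Assumed image and command, for illustration only.
                container_template("say", "alpine:3.18", ["sh", "-c", "echo step done"]),
                {
                    "name": "main",
                    "dag": {
                        "tasks": [
                            dag_task("extract-a", "say"),
                            dag_task("extract-b", "say"),
                            dag_task("merge", "say", ["extract-a", "extract-b"]),
                        ]
                    },
                },
            ],
        },
    }


# JSON is accepted wherever a YAML manifest is, so the output can be saved and applied.
print(json.dumps(build_workflow(), indent=2))
```

The single `say` template is referenced by all three tasks, which is the kind of reusability the list above refers to.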

Reasons for Airflow

  • It enables users to connect with various technologies
  • It offers rich scheduling and easy-to-define pipelines
  • Pythonic integration is another reason to use Airflow
  • You can create custom components as per your requirements
  • Allows rollback to earlier versions because workflows are stored
  • Has a well-tuned UI
  • Several users can prepare a workflow for a specific project

Pros of Argo Workflows

Here we highlight some of the advantages that Argo Workflows offered us. Below you can see some of the benchmarks and scoring metrics we produced at the beginning of our working group. After considering all these technical comparisons and our needs, it was an obvious choice for us to go with Argo Workflows.

So, in the end, our first priority was to find a Kubernetes-native platform, because our entire infrastructure is Kubernetes based. Argo leverages the Kubernetes engine for workflow synchronization, and its configuration files use the same syntax as Kubernetes manifests. Finally, we decided to go with the Argo Project because we were already part of the Argo ecosystem with Argo CD, so the mindset is the same. With this platform, we gained all of these benefits, and the rest is beauty, thanks to the Argo community!

Conclusion and Closing

At Trendyol, our tech stack keeps growing constantly, so we want to keep our footprint smaller every day. We want to minimize the effort needed to maintain these tools. Because Argo Workflows is based on CRDs on Kubernetes, we can keep all the configs as YAML and have its state stored on Kubernetes. We also wanted to run CronJobs in containers to use resources effectively.
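As a rough illustration of what "configs as manifests" can look like for a scheduled job, the sketch below builds a `CronWorkflow` resource in Python. The job name, schedule, image, and command are hypothetical, and the helper function is our own invention for this example. The `schedule` field follows the form used by Argo Workflows releases from around the time of writing; newer releases may prefer a different field, so treat this as an assumption and check the documentation for your version.

```python
import json


def build_cron_workflow(name, schedule, image, command):
    """Build a CronWorkflow manifest that runs one container on a cron schedule."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "CronWorkflow",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,            # standard five-field cron expression
            "concurrencyPolicy": "Forbid",   # skip a run if the previous one is still going
            "successfulJobsHistoryLimit": 3,
            "failedJobsHistoryLimit": 3,
            "workflowSpec": {
                "entrypoint": "run",
                "templates": [
                    {"name": "run", "container": {"image": image, "command": command}}
                ],
            },
        },
    }


# Hypothetical job: every value passed here is an illustrative assumption.
manifest = build_cron_workflow(
    name="nightly-cleanup",
    schedule="0 2 * * *",
    image="alpine:3.18",
    command=["sh", "-c", "echo cleanup"],
)
print(json.dumps(manifest, indent=2))
```

Because the whole job is a single declarative resource, it can live in Git next to the service it belongs to and be reviewed and versioned like any other Kubernetes manifest.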

In this part, we tried to give a brief introduction to who we are as DevX, our need for an internal job scheduler, and our comparisons and scoring. In the follow-up part of our series, we will discuss how we prepared this architecture, how it succeeded, how many internal development teams migrated to Argo Workflows, and how we involve other teams.

Until then, take care!
