Airflow alternative for databases: Kaldea’s single DAG scheduler

Yaekyum Lee
Published in Kaldea
5 min read · Oct 3, 2022
Kaldea’s scheduler, no-code, SQL, and UI based

Why create a single DAG based scheduler for databases?

When we set out to design Kaldea, including its single DAG based scheduler, we interviewed 100+ data scientists, data analysts, and data engineers to understand the pain points in their day-to-day data analysis workflow. We identified several pains that Kaldea sets out to solve:

  • Lack of context
  • Disconnection between data assets
  • Dependency on multiple tools, which becomes a bottleneck to fast, simple, and even fun analysis

At the same time, we also spotted a gap in creating schedules.

Today inside companies

We learned about two major approaches companies have chosen to manage schedules:

  • An internal process-based ticketing system, in which analysts and scientists make a request to data engineers
  • A more distributed system, in which data engineers empower data analysts and scientists to use Airflow (or other workflow manager) to create and manage schedules

The ticketing system: analyst and scientist requests to data engineering

The first approach, the internal ticketing system, was mainly chosen:

  • for governance
  • because there were enough data engineers to support it
  • because it was difficult to enable teams outside of data engineering to manage schedules through Airflow

Empowering the analyst and scientist to use Airflow

The second approach was mainly driven by:

  • a lack of data engineering resources
  • a data leader willing to let a larger audience take over the data engineers’ existing responsibility of managing schedules

The second approach allows data engineers to focus more heavily on work related to extraction and load, while scientists and analysts manage transformation. In fact, the data engineers we interviewed viewed job scheduling as a secondary part of their role: it does not contribute to their career and is essentially rote labor, neither difficult nor creative.

So why would the same work be of value to data scientists and analysts, even if it is more directly related to their work? How is it valuable for data scientists and analysts to write better Python to manage multi-DAG dependencies, when it isn’t for data engineers?

Automating commoditized administrative scheduling for databases

The more we looked into it, the more job scheduling seemed like complex but commoditized administrative work that should be automated as much as possible. So we created a DAG job scheduler for databases that is both SQL and UI based. Because it is designed around a single DAG, dependencies across multiple jobs are managed automatically.

Comparison between Airflow and Kaldea’s single DAG based scheduler

In the rest of this article, then, we’ll compare job scheduling in Airflow to Kaldea’s single DAG based scheduler. There are several alternatives to Airflow, but Airflow is the most commonly used and the most representative.

Airflow for your data work

Pros

  1. General purpose: Airflow can be used for any type of scheduling.
  2. Well-known: Airflow has a large support community, and the learning curve is gentle for new employees, who have probably already used it at their previous companies.
  3. Large ecosystem of drivers and plugins: Airflow has a wide variety of libraries that allow you to implement scheduling with multiple systems (e.g. BigQuery).
  4. Large coverage: Airflow can be utilized to cover not just scheduling but also data quality management.

Cons

  1. All code based: All of the pros above depend on you writing code, and sometimes on writing it very well.
  2. High barrier to entry for non-coders: For non-coders, the barrier to entry can be quite steep, and supporting them is resource intensive.
  3. Designed for developers: This matters less within Airflow’s own focus area, but it is clear that Airflow is, and will continue to be, made for developers. It isn’t designed for data scientists or analysts.
  4. Quality variance: Airflow itself is not the source of quality variance, but the quality of a schedule depends on who writes the code.
  5. Dependency management: Airflow is a great tool for managing dependencies within a single job, but dependencies across multiple jobs are extremely difficult to understand, and varying code quality and styles make this even harder. In addition, Airflow itself does not guarantee that dependencies across multiple jobs work according to plan; your code has to guarantee it.
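
To make the last point concrete, here is a minimal sketch of the kind of wait-for-upstream check a team ends up writing and maintaining when two separately defined jobs depend on each other. It is plain Python rather than Airflow code, and the names (`wait_for_upstream`, `upstream_done`, `poke_interval`) are hypothetical, chosen only for illustration. In Airflow this role is typically played by a sensor task.

```python
import time
from typing import Callable


def wait_for_upstream(
    upstream_done: Callable[[], bool],
    poke_interval: float = 60.0,
    timeout: float = 3600.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the upstream job reports success or the timeout elapses.

    Returns True if the upstream job finished in time, False otherwise.
    `clock` and `sleep` are injectable so the logic can be tested
    without actually waiting.
    """
    deadline = clock() + timeout
    while True:
        if upstream_done():
            return True           # safe to start the downstream job
        if clock() >= deadline:
            return False          # give up: upstream never finished
        sleep(poke_interval)      # wait before checking again
```

Every downstream job needs a check like this, pointed at the right upstream job and tuned with a sensible interval and timeout. That per-dependency bookkeeping is where varying code quality tends to show up.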

Kaldea’s scheduler

Ready to offload administrative scheduling for your database and outsource multiple dependency management to Kaldea?

Pros

  1. Ease of use and SQL base: Kaldea’s scheduling is based on SQL — no other programming is needed. In addition, scheduling is managed through our friendly user interface.
  2. Average quality for scheduling is high: Ease of use also means fewer errors! Fewer errors mean that the average quality of jobs created is high. (Check out the 10-second demo!)
  3. Multiple job dependency management: Kaldea does the tracking for you when it comes to multiple job dependencies and, therefore, makes job validation easy.
    - We view all of a company’s database-related jobs as one large DAG. Each separate job on Kaldea is a sub-DAG.
    - The one-large-DAG design allows Kaldea to guarantee a total order of execution that is consistent with the DAG.
    For example, if you have Job A scheduled for 10:00 and a dependent Job B scheduled for 10:15, Kaldea will detect the dependency and ensure Job B runs after Job A, even if that means that Job B starts at 10:20. In Airflow, you are responsible for creating the validation logic (sensor). In Kaldea, we do it for you.
Multiple schedule dependency, automatically managed!
  4. Performance: Kaldea relies on your DWH or data lake, so you are spared the setup and get high reliability.
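
The Job A / Job B example above can be sketched in a few lines of Python. This is an illustration of the single-DAG idea, not Kaldea’s implementation. It assumes that dependencies are inferred from the tables each job reads and writes and that run times are known in advance. The `Job` class, the table names, and the 20-minute duration are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from graphlib import TopologicalSorter


@dataclass
class Job:
    name: str
    scheduled: datetime                        # requested start time
    duration: timedelta                        # assumed run time (illustrative)
    reads: set = field(default_factory=set)    # tables the job's SQL reads
    writes: set = field(default_factory=set)   # tables the job's SQL writes


def build_single_dag(jobs):
    """Merge all jobs into one graph: job name -> names of jobs it depends on."""
    writer_of = {table: job.name for job in jobs for table in job.writes}
    return {
        job.name: {writer_of[t] for t in job.reads if t in writer_of} - {job.name}
        for job in jobs
    }


def plan(jobs):
    """Return {job name: actual start}; no job starts before its upstreams finish."""
    by_name = {job.name: job for job in jobs}
    dag = build_single_dag(jobs)
    starts, finish = {}, {}
    # static_order() yields every upstream job before the jobs that depend on it.
    for name in TopologicalSorter(dag).static_order():
        job = by_name[name]
        upstream_done = max((finish[up] for up in dag[name]), default=job.scheduled)
        starts[name] = max(job.scheduled, upstream_done)
        finish[name] = starts[name] + job.duration
    return starts


day = datetime(2022, 10, 3)
job_a = Job("A", day.replace(hour=10), timedelta(minutes=20),
            writes={"orders_clean"})
job_b = Job("B", day.replace(hour=10, minute=15), timedelta(minutes=5),
            reads={"orders_clean"}, writes={"daily_revenue"})

for name, start in plan([job_b, job_a]).items():
    print(name, start.strftime("%H:%M"))
# A 10:00
# B 10:20
```

Job B was requested for 10:15 but is pushed to 10:20 because it reads a table that Job A is still writing. Neither job’s author wrote a sensor. The ordering comes from treating both jobs as parts of one graph.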

Cons

  1. Not general purpose: Kaldea currently supports database-related jobs only. We do have plans for a broader product, but today you will have to use Airflow for other purposes.
  2. New to the market: Kaldea’s jobs are easy to learn and use, and their average quality is high; however, it is still a new tool to adopt. In addition, compared to existing tools like Airflow, Kaldea lacks a wide base of reference users.
  3. Data validation: Kaldea does not yet have data validation embedded.

Our roadmap to better support our customers

  1. Data validation
  2. Airflow integration: We plan to develop a Kaldea library that lets you use Kaldea through Airflow, so that you can seamlessly use Kaldea for database-related jobs and Airflow for general and other purposes.

If you are considering Kaldea’s single DAG based scheduler

For data scientists and analysts

  1. Utilize Kaldea’s scheduler to finish your analysis independently, quickly, and with
  2. Create schedules on your own, without the learning curve otherwise needed to guarantee the total order of jobs and to pick up the tribal knowledge behind your company’s existing schedules

For data engineers

  1. Use Kaldea’s scheduler to enable your stakeholders, such as data scientists and analysts, to create and manage Transformation while you focus on Extract and Load
  2. Do less translation of someone else’s SQL script to Python
  3. Outsource the multiple job dependency to Kaldea and free yourself from the pain of tracking.
