Introducing Kestra: Finally a Viable Airflow Alternative?

A basic hands-on guide to a new data orchestration platform — Kestra.

Dario Radečić
Geek Culture


I’m a huge Apache Airflow fan. I mean, I’ve written 10 articles about it and I’m using it at my day job. Still, that doesn’t mean I think it’s perfect. It has a couple of pain points, which is why I’m constantly looking for a new viable Airflow alternative.

Kestra is an open-source data orchestration platform that might be what I’m looking for, and this article will show you how to install Kestra and how to create and schedule a couple of data flows. The upcoming articles will compare Kestra to other data orchestration platforms, so stay tuned.

Will Kestra be the perfect Airflow alternative? Let’s find out.

What is Kestra and Why Should You Care?

First things first, what is Kestra, and why should you consider it if you’re already using Airflow?

In the simplest terms, Kestra is an open-source declarative data orchestration platform that aims to make data workflows accessible to more than just data engineers. It comes with a declarative YAML interface, which means almost anyone in your organization can participate in the data pipeline creation process.

The tool packs a wide range of plugins, which means you can easily work with different cloud providers (AWS, Azure, GCP), databases, file systems, Git, Kafka, Spark, Kubernetes, Power BI, and pretty much anything else you can imagine.

Kestra addresses many Airflow shortcomings, such as:

  • Scalability from a developer standpoint
  • API and event-driven workflows
  • Tasks failing under heavy workloads
  • Challenging Python environment management for non-Python users
  • General team-level isolation and sensitive data management

That doesn’t mean Kestra is perfect, though. The biggest potential drawback for some users is that data pipelines are written in YAML — not in Python. You can write individual Python tasks, of course, but you’ll have to paste the Python code into the YAML document. It’s a different convention you’ll have to get used to.

It’s not technically a big issue because you’re only supposed to write simple Python calls directly in YAML. For complex scripts, you can combine the Git plugin and Docker images. It’s out of the scope of today’s article, but definitely something worth exploring in the future.

On the other hand, moving from Python to YAML means a lower barrier to entry for less technical users or for engineers working in different tech stacks.

To put it simply — it’s just different. You’ll see how everything works later in the article, so you’ll be able to make the final judgment for yourself.

But first — let’s see how you can install Kestra on your system.

How to Install Kestra

The easiest way to install Kestra is through Docker, so make sure to have it installed before proceeding.

Once Docker is installed and running, open a new Terminal/CMD window and run the following command:

curl -o docker-compose.yml https://raw.githubusercontent.com/kestra-io/kestra/develop/docker-compose.yml

It will download a docker-compose.yml file that instructs Docker on how to build your containers. You can take a look at the contents of this file by running cat docker-compose.yml if you’re on a Unix-like system.

Here’s what you’ll see:

Image 1 — Kestra docker-compose file (image by author)

Overall, think of this file as a step-by-step recipe for creating and configuring the environment.
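If you’re curious, here’s a rough, abbreviated sketch of the kind of services the file defines. Treat it purely as an illustration: the actual service names, images, and settings depend on the version you download, so defer to the file you just pulled:

services:
  postgres:
    image: postgres              # metadata database Kestra relies on
    environment:
      POSTGRES_DB: kestra
      POSTGRES_USER: kestra
      POSTGRES_PASSWORD: k3str4  # placeholder credentials, yours may differ

  kestra:
    image: kestra/kestra:latest  # the Kestra server itself
    command: server standalone   # runs every Kestra component in a single container
    ports:
      - "8080:8080"              # the Web UI you'll open in a moment
    depends_on:
      - postgres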

The best part? It only takes a single command to bring everything up:

docker-compose up -d

The optional -d flag runs the containers in detached mode, so Kestra keeps working in the background. Here’s what you’ll see:

Image 2 — Starting Kestra (image by author)

You can see that pulling and running the various Docker images took a couple of minutes on my machine. How long it takes on yours depends on your internet speed.

As soon as the Docker command finishes, Kestra is running in the background. You can access the Web UI by opening localhost:8080.

This is what you will see:

Image 3 — Kestra UI (image by author)
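By the way, when you eventually want to shut everything down, running docker-compose down in the same directory will stop and remove the containers.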

Configuration finished! Let’s now write your first data flow.

How to Create Your First Kestra Flow

The sidebar menu in Kestra UI has a lot to offer. We’ll only focus on the “Flows” and “Executions” tabs today.

To start, click on the “Flows” tab — here’s what you’ll see:

Image 4 — Your Kestra flows (image by author)

This one is all about creating and managing data flows. You can create flows in different namespaces, but that’s a topic for another time.

For now, click on the big purple “Create” button at the bottom right corner of your screen.

You’ll be presented with the YAML editor. Don’t think much about it and just copy the following code — I’ll explain everything in a bit:

id: first-flow
namespace: dev

inputs:
  - name: firstname
    type: STRING
    defaults: User
    required: false

tasks:
  - id: hello-task
    type: io.kestra.core.tasks.log.Log
    message: Hello, {{ inputs.firstname }}

This is how your screen should look:

Image 5 — Writing the flow YML file (image by author)

Now onto the explanations. Let’s go over the code line by line:

  • id — This is the identifier of your flow. You can’t have multiple flows with identical IDs in the same namespace, so keep that in mind. It’s perfectly valid to have identical data flows across multiple namespaces, such as dev, test, and prod.
  • namespace — Exactly what the name suggests. Use it for flow organization.
  • inputs — Parameters or variables you can change inside the execution context. These need a name and a type at the minimum. You can also specify other key-value pairs, as you can see in Image 5.
  • tasks — A runnable task that handles computational work in your flow. Each task must have an id and a type. Think of these as parts of a DAG in Airflow. By default, the tasks execute sequentially, but that’s something I’ll show you how to change in the upcoming articles.

The task you see implemented in the code has the job of logging a “Hello World” type of message. But instead of “World”, it will log the name of the user passed as a parametrized input.

If you ever get lost or don’t know what’s possible to do with a certain task type, you can always open a side-by-side documentation view, as shown below:

Image 6 — Documentation split screen view (image by author)

But we’ve kept things pretty simple, and the entire YAML file is pretty much self-explanatory, so there’s no need to reference the documentation.

The next step is to Save the flow, which you can do by clicking on the big purple “Save” button at the bottom right corner:

Image 7 — Saving the flow (image by author)

Only after you save the flow will you be able to run it, so keep that in mind.

Run Your First Kestra Flow

Your Kestra flow is now saved, which means you can run it. Click on the “New execution” button located at the bottom right corner.

You’ll see the following modal window appear — this one is here because we have an input variable. Enter your name, or anything else you want, and click on “Execute”:

Image 8 — Running the flow (image by author)

You’ll be redirected to the flow execution page (note the left sidebar menu). The Gantt view shows how long each task in the flow took to run.

Overall, the entire flow finished successfully in 3.06 seconds (the green bar indicates success):

Image 9 — Flow execution Gantt view (image by author)

You can go into the Logs tab to inspect the output of your tasks. Click on the arrow icon to the left of the white LOG rectangle to expand the contents:

Image 10 — Flow execution Log view (image by author)

And that’s what it takes to write and run Kestra flows.

Up next, we’ll explore what happens when you introduce Python into the mix.

Python Tasks in Kestra Flows — How to Get Started

As mentioned earlier, Kestra doesn’t run Python by importing functions you’ve defined elsewhere, the way Airflow does. Instead, you need to put your Python code right into the YAML file.

That’s exactly what you’ll do next. We’ll build a simple data pipeline that downloads the data, parses it, saves it, and prints it to the console. Let’s go over the logic next.

The Logic Behind Our Simple Data Pipeline

Here’s a list of tasks our data pipeline will implement:

  1. Download the data — We’ll use a free REST API to download dummy user data in JSON format.
  2. Load and process the data — Once downloaded, we’ll use Python to load the JSON file, keep only a selection of attributes, and write it back to disk in CSV format.
  3. Print the data — We’ll access the outputFiles property from the previous task to get the location of the CSV file and then use the cat command to print it.

Sounds simple, but let’s take a minute more to talk about the data. It’s coming from a completely free REST API and represents a list of 10 dummy users:

Image 11 — Contents of the JSONplaceholder free API (image by author)

From this array of objects, here’s what we want to keep for each user:

  • ID
  • Name
  • Email
  • Address (City — Street, Suite)
  • Phone
  • Website
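For reference, here’s roughly what a single user object from the API looks like, trimmed to the fields we care about (the real response also includes extras such as username, company, and geo coordinates):

{
  "id": 1,
  "name": "Leanne Graham",
  "email": "Sincere@april.biz",
  "address": {
    "street": "Kulas Light",
    "suite": "Apt. 556",
    "city": "Gwenborough"
  },
  "phone": "1-770-736-8031 x56442",
  "website": "hildegard.org"
}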

Let’s now dive deeper into the code.

Writing Python Code in a YML File

As before, create a new flow and copy the following code snippet into it:

id: python-task
namespace: dev

tasks:
  - id: downloadData
    type: io.kestra.plugin.fs.http.Download
    uri: https://jsonplaceholder.typicode.com/users

  - id: processData
    type: io.kestra.core.tasks.scripts.Python
    outputFiles:
      - usersCsv
    inputFiles:
      data.json: "{{outputs.downloadData.uri}}"
      main.py: |
        import json
        import pandas as pd
        from kestra import Kestra

        # Read the downloaded JSON file
        with open("data.json", "r") as f:
            data = json.load(f)

        # Keep only certain attributes
        df_src = []
        for r in data:
            df_src.append({
                "id": r["id"],
                "name": r["name"],
                "email": r["email"],
                "address": f"{r['address']['city']} - {r['address']['street']}, {r['address']['suite']}",
                "phone": r["phone"],
                "website": r["website"]
            })

        # Convert to pd.DataFrame and save as CSV
        df = pd.DataFrame(df_src)
        df.to_csv("{{outputFiles.usersCsv}}", index=False)
    runner: DOCKER
    dockerOptions:
      image: ghcr.io/kestra-io/pydata:latest

  - id: printData
    type: io.kestra.core.tasks.scripts.Bash
    inputFiles:
      data.csv: "{{outputs.processData.files.usersCsv}}"
    commands:
      - cat data.csv

Here are the explanations for each task:

  • downloadData — It uses a Download plugin from Kestra to download a file from the web. Simple!
  • processData — A Python task that loads the downloaded file (the path is obtained from the uri output of the previous task), uses Pandas to process the dataset, and writes it locally as a CSV file. For the file location, we’re using outputFiles to declare a single entry, usersCsv. Think of it as a variable that will keep track of the file location.
  • printData — It uses the Bash task type from Kestra to print the contents of a CSV file to the console. The path to the CSV file is passed from the previous task, as shown in the inputFiles property list.

Just like before, make sure to save the flow by clicking on the “Save” button:

Image 12 — Contents of a Python task flow (image by author)

That’s everything we need, so let’s run the flow next.

Running the Kestra Flow with a Python Task

To run the flow, simply click on the “New execution” button located in the bottom right corner. You’ll be redirected to a Gantt view, where execution time is shown for each task:

Image 13 — Workflow execution Gantt view (image by author)

As you can see, all of the tasks are green, which means no errors were raised during execution.

Switch to the Logs tab and expand the output of the last task. You can see the 10 parsed users printed out:

Image 14 — Workflow execution Log view (image by author)

That’s all great, but do you have to run the flows manually every time? There has to be a better way.

Kestra allows you to schedule your flow runs, and that’s a topic we’ll explore next.

How to Schedule Kestra Flows with Cron

Kestra comes with a simple Cron scheduler out of the box. There are other ways to schedule flow runs, but this is the simplest one by far.

Simply add the triggers property at the end of your YAML file and declare a new trigger of type Schedule. Then, specify a Cron expression to determine the schedule. For example, the one below will run the flow at minute 0 of every hour:

triggers:
  - id: schedule
    type: io.kestra.core.models.triggers.types.Schedule
    cron: 0 * * * *
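If you need a different cadence, any standard five-field Cron expression will do. A few common patterns (plain Cron syntax, nothing Kestra-specific):

cron: "0 0 * * *"     # every day at midnight
cron: "*/15 * * * *"  # every 15 minutes
cron: "0 9 * * 1"     # every Monday at 09:00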

The entire YAML file now looks like this:

id: python-task
namespace: dev

tasks:
  - id: downloadData
    type: io.kestra.plugin.fs.http.Download
    uri: https://jsonplaceholder.typicode.com/users

  - id: processData
    type: io.kestra.core.tasks.scripts.Python
    outputFiles:
      - usersCsv
    inputFiles:
      data.json: "{{outputs.downloadData.uri}}"
      main.py: |
        import json
        import pandas as pd
        from kestra import Kestra

        # Read the downloaded JSON file
        with open("data.json", "r") as f:
            data = json.load(f)

        # Keep only certain attributes
        df_src = []
        for r in data:
            df_src.append({
                "id": r["id"],
                "name": r["name"],
                "email": r["email"],
                "address": f"{r['address']['city']} - {r['address']['street']}, {r['address']['suite']}",
                "phone": r["phone"],
                "website": r["website"]
            })

        # Convert to pd.DataFrame and save as CSV
        df = pd.DataFrame(df_src)
        df.to_csv("{{outputFiles.usersCsv}}", index=False)
    runner: DOCKER
    dockerOptions:
      image: ghcr.io/kestra-io/pydata:latest

  - id: printData
    type: io.kestra.core.tasks.scripts.Bash
    inputFiles:
      data.csv: "{{outputs.processData.files.usersCsv}}"
    commands:
      - cat data.csv

triggers:
  - id: schedule
    type: io.kestra.core.models.triggers.types.Schedule
    cron: 0 * * * *

Or in Kestra UI:

Image 15 — Adding a schedule trigger (image by author)

The benefit of scheduling is that you don’t have to run the flows manually. They’ll run based on the schedule rule you’ve specified.

Go to the Executions tab in the Flows menu to verify. You’ll see the letter “S” under “Triggers”, which means the flow was triggered by a scheduler:

Image 16 — Successful run of a scheduled flow (image by author)

And that’s how you can schedule flows in Kestra.

We’ve explored the basics today, so let’s make a brief recap and a pro/con list next.

Kestra Impressions — Pros and Cons

This article showed you how to install Kestra and how to get started by building two data flows. You’ve also seen how to schedule your flows with Cron, which is something you’ll likely apply to all of your flows.

That being said, what are some pros and cons of using Kestra? Is it a viable Airflow alternative? Here are my thoughts.

Pros:

  • Workflow management is accessible to the entire organization (doesn’t depend on knowing Python like with Airflow).
  • The user interface is fantastic, easy to use, and intuitive — superior to the one provided by Airflow.
  • Kestra is free to use but has paid scalable plans for enterprise users.
  • Easy flow separation through namespaces, which makes testing code in different environments a breeze.
  • You can expose workflow components via the built-in REST API, which means third-party systems can interact with Kestra (a minimal example follows this list).
  • Having complex Python scripts on GitHub means you can change the business logic on the fly — no need to redeploy your Kestra flow.
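To illustrate that REST API point, triggering our first flow from a third-party system could look roughly like the call below. Consider the endpoint path and payload an assumption on my part, since the API has changed between Kestra versions; check the API reference of your own instance before relying on it:

curl -X POST \
  -F "firstname=Dario" \
  http://localhost:8080/api/v1/executions/trigger/dev/first-flow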

Cons:

  • Writing Python inside a YAML file isn’t the most pleasing thing in the world.
  • It takes time to get accustomed to the tool and get familiar with all the options it has to offer.
  • The documentation could cover the concepts in more depth and provide more advanced examples.

Will it replace Airflow in your organization? As with everything, it depends. Kestra is amazing if your company isn’t centered around Python and if people from other departments will help you design the workflows. You can also interact with it through the built-in REST API, which is something Airflow doesn’t offer. The decision is ultimately up to you.

Stay tuned for a more in-depth comparison to other data orchestration platforms, such as Airflow and Prefect.
