Data Orchestration is evolving

Why code-first, GUI-driven orchestration with Coalesce is the future

--

Preface

This article covers how Orchestra can seamlessly schedule and run your Coalesce data pipelines. It walks through how to connect Orchestra with Coalesce, how to create a pipeline, and the impressive amount of metadata that Orchestra exposes to end users via lineage. The entire integration can be completed in less than 10 minutes, without having to write a single line of code.

To learn more about Coalesce, visit here and their docs here ❄️

To learn more about Orchestra, visit here and our docs here 🚀

What is Orchestra?

The modern data stack has seen many advancements over the past few years. These advancements sometimes take the form of new solutions entering the market, or new patterns and frameworks for working with data, such as data mesh. As data stacks and practices evolve, there is one variable within the modern data stack that often gets left out of the conversation: how do I orchestrate my data stack?

While orchestration may not be top of mind when procuring other MDS solutions, such as Airbyte, or simpler building blocks on AWS like EC2, failing to consider how the tooling and data within your stack will work together can be a critical oversight. For example, you would never put several talented musicians in the same room without giving them the sheet music they'll be playing from, effectively having them freestyle or, worse, guess how they are going to play together. You want a conductor who can ensure the group plays together at its highest potential.

Photo by Samuel Sianipar on Unsplash

Enter Orchestra. Orchestra is a turnkey solution for orchestration and observability that acts as the conductor for the tooling and data within your data stack. While some MDS solutions try to bolt on orchestration as part of their offering, Orchestra lets you fully manage each piece of your data stack, end to end. Not only that, but Orchestra offers a low-code, serverless solution, which means saying goodbye to the headaches of managing hosted solutions like Airflow.

Orchestra allows users to focus on scalable, granular control of orchestrating each piece of the data stack, rather than spending time writing code and managing Kubernetes clusters. Because all of this happens within Orchestra, users benefit from rich metadata and complete, end-to-end lineage across all the tooling within their data stack.

Setting up Orchestra and Coalesce

Let’s take a look at this in more detail. Throughout this article, we’ll walk through setting up the Coalesce <> Orchestra integration and learn how Orchestra can seamlessly run all of your Coalesce data pipelines.

Coalesce is a data transformation solution built uniquely for Snowflake that helps you visually transform your data with the full power and flexibility of code. You will need Snowflake, Coalesce, and Orchestra accounts in order to recreate what you see in this article.

Configuring the integration

Within Orchestra, under the integrations section of the side menu, we can select Coalesce as the solution we want to connect to:

Orchestra requests a connection name and an access token that will be generated by Coalesce. The documentation provided by Orchestra is easy to follow, and even includes a link directly to the Coalesce documentation.

Orchestra requires OAuth to be configured for any Coalesce Environment you are connecting to. Coalesce provides an example script of what you can run within Snowflake to configure OAuth — I was able to set it up in under three minutes.
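
For reference, the heart of that script is a Snowflake security integration. Here is a minimal sketch of running it via the Snowflake Python connector; the integration name, redirect URI, and token validity below are placeholders, so use the exact values from the Coalesce documentation:

```python
# A minimal sketch of configuring OAuth for Coalesce in Snowflake using
# snowflake-connector-python. All values below (integration name,
# redirect URI, validity window) are illustrative placeholders; use the
# exact script from the Coalesce documentation for your environment.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<your_account>",    # placeholder
    user="<your_admin_user>",    # needs ACCOUNTADMIN (or equivalent) rights
    password="<your_password>",  # placeholder
    role="ACCOUNTADMIN",
)

CREATE_OAUTH_INTEGRATION = """
CREATE SECURITY INTEGRATION IF NOT EXISTS COALESCE_OAUTH
  TYPE = OAUTH
  OAUTH_CLIENT = CUSTOM
  OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'
  OAUTH_REDIRECT_URI = '<redirect URI from the Coalesce docs>'
  ENABLED = TRUE
  OAUTH_ISSUE_REFRESH_TOKENS = TRUE
  OAUTH_REFRESH_TOKEN_VALIDITY = 7776000
"""

# Create the security integration that Coalesce will authenticate against.
with conn.cursor() as cur:
    cur.execute(CREATE_OAUTH_INTEGRATION)
```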

With OAuth configured in Coalesce and your Coalesce access token added, just like that, Orchestra will validate your connection and, if successful, your integration is ready to have a pipeline added to it!

Creating a Pipeline

With the integration to Coalesce set up, we can now create a pipeline within Orchestra. You will need to name your pipeline before configuring anything:

Next, you will select the type of trigger you want to use to kick off the pipeline you are configuring. Orchestra gives you four options to choose from: Manual, Webhook, Cron, or Pipeline (using another Orchestra pipeline to trigger the pipeline you are creating).
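
To make those options concrete: a Cron trigger takes a standard cron expression (for example, 0 6 * * * runs the pipeline daily at 06:00), and a Webhook trigger exposes a URL that any external system can call. Here is a minimal sketch of firing a webhook trigger from Python; the URL shape and token below are hypothetical, so copy the real values from your trigger's configuration screen:

```python
# A minimal sketch of kicking off an Orchestra pipeline via its webhook
# trigger. The URL and token below are hypothetical placeholders, not
# Orchestra's real endpoint shape; copy the actual values from the
# trigger configuration in the Orchestra UI.
import requests

WEBHOOK_URL = "https://<your-orchestra-host>/webhooks/<your-webhook-id>"  # hypothetical
API_TOKEN = "<your-api-token>"  # placeholder

# POST to the webhook URL to start a pipeline run.
response = requests.post(
    WEBHOOK_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print(f"Pipeline run triggered: HTTP {response.status_code}")
```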

Adding a Task

With your chosen trigger selected and configured, you’ll be taken to the pipeline builder interface. You’ll need to add a task to the empty task group to set up a Coalesce job to run.

When adding a task, select your Coalesce integration and give the pipeline task a name. You’ll then need to provide Orchestra with your Coalesce environment ID. This environment ID can be found in the “Deploy” interface within Coalesce, next to the name of any of your environments.

Once you have provided Orchestra with your environment ID, you can optionally specify the connection you want Orchestra to use, as well as a specific job ID from Coalesce.
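
Conceptually, the task form boils down to a handful of fields, summarized below. The field names here are illustrative only, not Orchestra's actual schema:

```python
# Conceptual summary of what the Coalesce task form collects.
# Field names and values are illustrative, not Orchestra's actual schema.
coalesce_task = {
    "integration": "Coalesce",
    "task_name": "Run nightly Coalesce refresh",
    "environment_id": "3",          # from the Deploy interface in Coalesce
    "connection": "coalesce-prod",  # optional: which connection Orchestra uses
    "job_id": "17",                 # optional: run one specific Coalesce job
}
```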

Run the Pipeline

With the task configured, you can exit the pipeline builder and navigate to “Pipelines” within Orchestra. You’ll see your newly created pipeline. In the pipeline’s additional options section, you can trigger the pipeline manually.

With the pipeline triggered, Orchestra automatically kicks off the refresh job configured in Coalesce. Once the job completes successfully, Orchestra provides rich metadata surrounding the run, in the form of lineage.

When looking through the lineage Orchestra provides for a refresh job in Coalesce, we can see exactly how the job was run, the types of queries executed, and how all of the nodes in the pipeline were materialized. Not only that, but Orchestra is able to determine how to optimally run the DAG using metadata.

My Coalesce DAG rendered perfectly in Orchestra

In the lineage included here, Orchestra provides detailed metadata around my job. First, we can see that Orchestra validates the source data for every node in the pipeline. During this, Orchestra also runs any data tests configured in Coalesce alongside each step of the pipeline, surfacing the results within each step.

Then, Orchestra automatically infers which nodes should run first based on the dependencies between them, optimizing the pipeline for efficiency. Orchestra also surfaces the insert strategy of each node (e.g. truncate or incremental) as well as its materialization.
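
If those insert strategies are new to you, the simplified Snowflake-style SQL below shows the difference: a truncate strategy reloads the whole table, while an incremental strategy merges only new or changed rows. These are stand-ins with hypothetical table names, not the SQL Coalesce actually generates:

```python
# Simplified stand-ins for what "truncate" and "incremental" insert
# strategies typically compile to. Coalesce generates its own SQL;
# the table and column names here are hypothetical.

# Truncate strategy: wipe the target and reload it in full.
TRUNCATE_RELOAD = """
TRUNCATE TABLE analytics.dim_customer;
INSERT INTO analytics.dim_customer
SELECT customer_id, name FROM staging.stg_customer;
"""

# Incremental strategy: merge only new or changed rows into the target.
INCREMENTAL_MERGE = """
MERGE INTO analytics.dim_customer AS tgt
USING staging.stg_customer AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET tgt.name = src.name
WHEN NOT MATCHED THEN INSERT (customer_id, name)
  VALUES (src.customer_id, src.name);
"""
```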

Manage Data Orchestration Better

In less than 10 minutes, I was able to set up the Coalesce integration, build a pipeline, and run my Coalesce data model. I didn’t need to write any code. I didn’t need to add additional .py files to an already complex Airflow environment. And in return, I received end-to-end data lineage that included all of the metadata around my data pipeline.

Orchestra gives users focus and complete control over the orchestration of data within their data stack, through a serverless, low-code interface. This means no more managing additional infrastructure, piecing together complex DAGs in Airflow, or spending engineering hours writing boilerplate code. Orchestra allows you to focus on all the things you need to care about, and none of the things you don’t.

Find out more about Orchestra

Orchestra is a data orchestration and data observability platform. You can also build any DAG in Orchestra that you could build in Airflow. Use it for many use cases, like swiftly building data products, preserving data quality, and even supporting data governance. Our docs are here, but why not also check out our integrations? These are managed, so you can get started with your pipelines instantly. There’s also a blog, written by the Orchestra team + guest writers, and some whitepapers for more in-depth reads.
