Airflow’s dashboard was testing our patience, so we made our own
TL;DR: Existing tools don’t let you create or edit a pipeline visually, which makes data engineering a chore. This new approach, built on top of a lightweight API, opens up the possibility of rapidly iterating on pipelines. Also, it loads really fast.
Shortly after creating our first product, Houston, we realised it would need a GUI. Houston is a tool for orchestrating workflows or ‘pipelines’ via an API, which makes it a serverless, platform-agnostic alternative to Apache Airflow or Luigi. Just like those tools, we needed a GUI to help data engineers make sense of the often monstrously complex pipelines they create, and to let them actually control what they’re doing. A good GUI is a more important part of an engineer’s toolkit than most people give credit for, and one whose true potential hasn’t been fully realised.
The Existing Tools
Luigi is a Python module that comes with a self-hosted web dashboard you can use to track the progress of multiple pipelines.
The dashboard doesn’t give you the ability to control your pipelines, i.e. start, stop, or edit them. When my team was using Luigi we ended up building our own application for triggering and monitoring, and we tended to read each pipeline’s logs rather than the dashboard to get a more informative view of its progress.
More recently I’ve been using Apache Airflow via Google Cloud Composer. This wasn’t by choice per se, but rather because it’s the only orchestration tool offered by Google Cloud. Composer does provide the GUI with a persistent database that stores the progress of each DAG (directed acyclic graph; pipeline), as well as the full run history for each one. Unfortunately, this reliance on the database causes Airflow’s UI to load painfully slowly and takes a very long time to update the visual with task progress. Navigating to the DAG you need to check on takes a long time, and often you’re met with a ‘504 GATEWAY TIMEOUT’ or ‘Ooops.’ screen with some nice ASCII art of an explosion.
Airflow always seemed disconnected from the actual processes it was running, in part because it didn’t update itself in real time, which is a shame since the processes all run within the same cluster and should be very easy to monitor.
It’s also an expensive tool, since its web container and database run constantly in a Google Kubernetes cluster, which seems unnecessary for something as simple as displaying a DAG; a serverless solution would be far more convenient.
Despite these tools providing graph visualisations for pipelines we often found ourselves drawing them out by hand to better understand the structure, and to explain what was going on to project managers.
There was a point where a senior manager was very keen to see our pipeline growing in complexity and asked for more and more annotated diagrams of it, so he could present them to his boss as evidence of the huge amount of work we’d done. It was also the primary visual used to explain how our project worked. The pipeline had about 60 separate tasks at that point and required a one-hour session to explain to any new joiner on the project. It was a monster.
Looking back, I can’t believe we somehow created the entire dependency graph in the code itself. Luigi and Airflow have no way of editing the DAG visually. You just stare at your code until you think you’ve figured it out, then run it with your fingers crossed!
Building Houston’s GUI
Even before Houston was created I was imagining better ways of visualising, controlling, and monitoring pipelines. I had a few goals for what this solution would do:
- Load quickly
- Update itself live
- Be intuitive for technical and non-technical users
- Look good enough that we won’t want to draw it out separately
- Provide the user with actual control over what’s happening
I decided to build the GUI with a combination of React, React Spring, and D3.
After suffering through countless ‘gateway timeout’ errors with Airflow, I was determined to make this the speediest GUI possible. The app waits until after the initial load to make any API calls. It then makes as many asynchronous calls as it needs and renders every plan and mission it finds, even before they’re completely loaded. This reduces the apparent load time to about 220ms, though it takes anywhere from 800–2500ms to finish loading all the data.
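As a rough sketch of that load pattern (the fetch helpers below are stand-ins I’ve invented, not Houston’s real client API), the idea is to paint skeleton entries immediately and fill in details as each asynchronous call resolves:

```typescript
// Hypothetical fetchers standing in for Houston API calls; the names
// and shapes here are illustrative, not the actual client API.
type Mission = { id: string; loaded: boolean };

async function fetchMissionIds(): Promise<string[]> {
  return ["m1", "m2", "m3"]; // stub response
}

async function fetchMissionDetail(id: string): Promise<Mission> {
  return { id, loaded: true }; // stub response
}

// Render placeholders straight away, then re-render as each parallel
// detail call lands -- no single call blocks the first paint.
async function loadMissions(
  render: (missions: Mission[]) => void
): Promise<Mission[]> {
  const ids = await fetchMissionIds();
  // First paint: every mission appears as an unloaded skeleton.
  const missions: Mission[] = ids.map((id) => ({ id, loaded: false }));
  render(missions);
  // Fire all detail calls in parallel; update the list as each resolves.
  await Promise.all(
    ids.map(async (id, i) => {
      missions[i] = await fetchMissionDetail(id);
      render(missions);
    })
  );
  return missions;
}
```

The apparent load time is then just the cost of the first render, while the real data trickles in behind it.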
I used React Spring to animate transitions for the many state changes, which really helps to make the GUI feel fluid and responsive. I’d recommend checking out the examples just to see what it can do. I was able to make the icons representing missions animate into different positions depending on how the user is viewing them, and have missions in a list smoothly reorder themselves as more data was loaded.
The graph itself is drawn with a Frankenstein combination of dagre.js and a lot of D3.js. Airflow uses dagre as well, but I’ve opted not to let dagre draw the links between stages, instead drawing a sigmoid curve with D3. I’ve also excluded the stage names from the graph drawing process, which makes the graphs much more uniform and tidy-looking. In Airflow even the smallest DAGs look huge because each stage icon is sized to contain its name, so I was keen to avoid that here.
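One common way to draw that kind of ‘sigmoid’ link (a sketch of the technique, not Houston’s actual code) is a cubic Bézier curve whose two control points sit at the horizontal midpoint between the stages, giving flat tangents at both ends:

```typescript
// Build an SVG path string for an S-shaped link between two stage
// icons. Each control point shares its endpoint's y-coordinate and
// sits at the horizontal midpoint, producing the classic DAG-edge curve.
function linkPath(
  x1: number, y1: number,
  x2: number, y2: number
): string {
  const mx = (x1 + x2) / 2;
  return `M${x1},${y1} C${mx},${y1} ${mx},${y2} ${x2},${y2}`;
}
```

The resulting string can be assigned directly to a `<path>` element’s `d` attribute via a D3 selection.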
I wasn’t sure how I would handle a D3 component within a React app. I’ve seen a few different approaches, but decided to keep it as simple as possible, aiming to reduce the number of unnecessary updates performed by D3, since re-drawing the graph every time something small changes could get very CPU intensive. I did this by creating a React ref for the view container and re-using that ref for every graph component I created, meaning the container never needs to be removed by React.
| Plan
| | Mission
| | | Graph
| SVG Container
| | (Graphs will render inside here)
^ A simplified view of the component hierarchy.
You could see this as added complexity, but the code itself is much easier to manage, since the flow of information is very simple: Plan → Mission → Graph.
The other benefit is that I can render any other components on top of the SVG container, meaning I get to use the entire screen for the graph and don’t have to worry about fitting other components next to it, or being forced to put them (such as the tooltip) inside the SVG. This really simplifies the CSS layout, because I can use Flexbox and Grid instead of manually positioning elements as I’d have to if everything lived in SVG. After countless D3 projects I’ve learnt that it’s best to use as few SVGs as possible.
When it comes to the timing of stages, a Gantt chart provides all the required information, but the user loses the view of how stages depend on each other. I opted to have the stage icons from the graph layout transition into the Gantt chart, which makes it easier to keep track of where in the plan each stage sits.
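A minimal sketch of how such a Gantt layout might be computed; the stage shape and field names below are my own assumptions, not Houston’s schema:

```typescript
// Map each stage's start/end timestamps to a horizontal bar,
// scaled so the whole run spans the chart width.
type StageTiming = { name: string; start: number; end: number };
type GanttBar = { name: string; x: number; width: number };

function ganttLayout(stages: StageTiming[], chartWidth: number): GanttBar[] {
  const t0 = Math.min(...stages.map((s) => s.start));
  const t1 = Math.max(...stages.map((s) => s.end));
  const scale = chartWidth / (t1 - t0);
  return stages.map((s) => ({
    name: s.name,
    x: (s.start - t0) * scale,
    width: (s.end - s.start) * scale,
  }));
}
```

Because each bar keeps its stage name, the same icons from the graph view can be animated into these bar positions with React Spring.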
Reaping the Benefits of the Serverless Approach
Houston takes a fairly different approach from Luigi or Airflow in that it’s an API, traversing your DAG independently of whatever its constituent stages are actually doing. This is a huge benefit for the front end: the pipeline can be controlled directly through API calls, without any need to access or edit the user’s code. This means the front end can be used to:
- Edit a plan (the JSON that defines the DAG)
- Create new plans
- Select stages to skip/ignore (even during a mission run)
- Trigger stages (if they have webhooks)
- Stop missions early
[sorry, some of these features are in beta at the moment]
This is huge! Editing a DAG from the front end wasn’t possible in Luigi or Airflow because the DAG is defined in the user’s code itself. In Houston this information sits in the API, so it can be edited visually, which is far easier and allows for the rapid creation and iteration of pipelines. You can also see the resulting DAG immediately, without needing to run any code.
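To illustrate why keeping the DAG as data matters, here is a hypothetical plan shape (the field names are my invention, not Houston’s actual plan format) together with the kind of instant validation a visual editor can run after every edit, using Kahn’s algorithm:

```typescript
// Illustrative plan JSON shape -- field names are assumptions.
type Plan = {
  name: string;
  stages: { name: string; upstream: string[] }[];
};

// Return a valid execution order for the plan's stages, or null if an
// edit has introduced a cycle (i.e. the graph is no longer a DAG).
function topologicalOrder(plan: Plan): string[] | null {
  const indegree = new Map<string, number>();
  const downstream = new Map<string, string[]>();
  for (const s of plan.stages) {
    indegree.set(s.name, s.upstream.length);
    for (const up of s.upstream) {
      downstream.set(up, [...(downstream.get(up) ?? []), s.name]);
    }
  }
  // Start from stages with no upstream dependencies.
  const ready = [...indegree].filter(([, d]) => d === 0).map(([n]) => n);
  const order: string[] = [];
  while (ready.length > 0) {
    const n = ready.shift()!;
    order.push(n);
    for (const d of downstream.get(n) ?? []) {
      const left = indegree.get(d)! - 1;
      indegree.set(d, left);
      if (left === 0) ready.push(d);
    }
  }
  // If some stages were never reached, a cycle exists.
  return order.length === plan.stages.length ? order : null;
}
```

Because the check is pure data manipulation, a GUI can run it on every drag-and-drop edit and reject an invalid plan before it ever reaches the API.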
Visualising pipelines is underrated, because it makes them so much easier to understand, monitor, and modify. Data engineering is much less of a chore when you can clearly see and influence all of your processes, and the more control you have the faster you can build and improve your workflows!
Houston is available now at callhouston.io, and has a free tier! I haven’t even mentioned the many other benefits Houston has over these tools in terms of simplicity, cost savings, and cross-platform capabilities. You can read more about them here.