Your Code Will Fail (but that’s ok)
Even if you’re the best engineer on the planet, your code will fail, and it’s probably going to fail in production. It’s ok. It’s just the way of the world; it’s why we have debuggers, try/except blocks, and SREs. It might be because of an errant data type, a network problem, or missing values. It might be your fault or it might be someone else’s fault. But it’s going to happen, and you’re going to spend a lot of time fixing it.
What happens when your code fails? If you’re lucky, it’s a small UI bug or it only affects some of your users. More often than not, though, it’s going to impact other code: applications are built on cascading, interwoven functions, and if one fails, chances are something else is going to break. Handling this infinity of potential failures — testing, combing through logs, debugging network problems — is called negative engineering. It’s often ugly, frustrating, and repetitive, so engineers rarely enjoy doing it, but it’s a critical part of the job.
Your code is a workflow
One of the reasons that negative engineering exists — and why it eats up so much of the productive time engineers have — is that a lot of code is really a workflow, or a sequence of steps. For example, systems like warehouse ETL, ML pipelines, billing and invoicing, and CI/CD are based on tasks that must be triggered in order. If one task fails, the rest are usually bound to fail with it. Engineers don’t usually think of their code as a workflow, and that’s part of why errors can be so costly and hard to fix: workflow semantics are quite different from traditional engineering ones.
Let’s say we’re DigitalOcean, and we rent out servers to developers. At the end of every month, we need to invoice our clients. We’ll need to start by figuring out how much to charge based on product usage, then create invoices, and finally send them out to our customers. The code might be split across many smaller functions, but we can describe the process with a straightforward workflow:
- Aggregate usage data to calculate amounts owed — dig into our usage data, find out which resources our customers used, and calculate how much they cost over the course of the month
- Create invoices from aggregated data — split out line items, aggregate total cost, and add in user details like address and email
- Send those invoices to clients via the app and email — send invoices out to customers and make them available in the web UI
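The three steps above can be sketched as three plain Python functions, each consuming the previous step’s output. All names and data shapes here are illustrative — this is not DigitalOcean’s actual billing code:

```python
# A minimal sketch of the invoicing workflow as three chained functions.

def aggregate_usage(usage_rows):
    """Sum each customer's resource costs for the month."""
    totals = {}
    for row in usage_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["cost"]
    return totals

def create_invoices(totals, customer_details):
    """Turn aggregated totals into invoice records with user details."""
    return [
        {"customer": c, "amount": amount, "email": customer_details[c]["email"]}
        for c, amount in totals.items()
    ]

def send_invoices(invoices):
    """Deliver each invoice (here we just record the delivery)."""
    return [f"sent {inv['amount']} to {inv['email']}" for inv in invoices]

# Each step depends on the previous one's output:
usage = [
    {"customer": "acme", "cost": 120.0},
    {"customer": "acme", "cost": 30.0},
    {"customer": "globex", "cost": 75.0},
]
details = {"acme": {"email": "billing@acme.test"},
           "globex": {"email": "ap@globex.test"}}
receipts = send_invoices(create_invoices(aggregate_usage(usage), details))
```

Note how the dependency is structural: `create_invoices` simply cannot run without the output of `aggregate_usage`, which is exactly what makes a failure in the first step fatal to everything after it.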
What happens if any of these steps doesn’t go smoothly? The entire process fails. If someone changed the name of the product_usage_facts table, the first task will throw an error and no invoices will get generated or sent out. That’s why handling these errors is so critical: they’re rarely self-contained, and almost always cause something else — something important — to fail too.
Billing and invoicing isn’t the only kind of code-driven business process that resembles a workflow. If you squint hard enough, most of the code that powers the applications we use — especially data and ETL related tasks, but also beyond that — is really a workflow behind the scenes. But the tools that we use to build and manage our code are rarely built to take advantage of that fact.
Workflow tools, but for code
When you think about good workflow tools, you should think about how they work when things go wrong. Workflows are only interesting when things fail; they’re kind of like insurance, or risk management for code. Assuming things will go wrong — and they will — the right workflow tool should make it quick and simple to handle those “wrong things” and direct you towards how to fix them.
Let’s go back to our billing system example. Our first task starts by pulling data from our product usage database via SQL, and then aggregating and processing it in Pandas. But last week, an engineer renamed the product_usage_facts table and forgot to notify the data team. Our SQL throws an error because it’s targeting the wrong table name. What happens downstream?
Since the billing system tasks are all dependent on previous steps, a failure in the first task (data aggregation) cascades down to the remaining two tasks, and they fail too — one because of missing data, and one because of missing invoices. In practice, that means you’re getting bombarded with multiple failure notifications, you’re getting paged, and your stakeholders aren’t happy. Because your code is just a single block, you’re stuck finding where the error happened before you can even get to why it happened.
This is where workflow management shines. If your code is separated into discrete, interacting blocks organized as a workflow, it’s much easier to isolate where the error occurred and distinguish code failures from workflow failures (i.e. the original error from the subsequent failures it caused). This kind of organization can save precious time and get your critical systems fixed faster when you need them most.
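A toy version of that idea: run each step as a discrete, named block, stop at the first failure, and report where it happened — instead of letting every downstream step blow up and page you three times. The step names and the simulated table error are illustrative:

```python
# A sketch of workflow-style error isolation.

def aggregate(_):
    # Simulate the renamed-table bug from the example above.
    raise LookupError("table 'product_usage_facts' not found")

def invoice(data):
    return [("invoice", d) for d in data]

def send(invoices):
    return len(invoices)

def run_workflow(steps, payload):
    """Run steps in order; stop at the first failure and report where it was."""
    for name, step in steps:
        try:
            payload = step(payload)
        except Exception as exc:
            # Downstream steps are skipped rather than failing noisily too.
            return {"failed_at": name, "error": str(exc)}
    return {"failed_at": None, "result": payload}

report = run_workflow(
    [("aggregate", aggregate), ("invoice", invoice), ("send", send)],
    payload=["row1", "row2"],
)
```

Real workflow tools do far more than this (retries, notifications, scheduling), but the core benefit is the same: one failure report that names the failing unit, instead of three cascading ones.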
From that perspective, risk management for your code is really about three increasingly deep levels of needs. A completely different person or team might be responsible for handling each one, and the team-to-team communication required — like a data scientist finding a bug and raising it to engineering — makes things even harder.
- Locate the error — which unit of business logic caused the problem? Where is it?
- Identify and understand the error — what went wrong and why?
- Fix and avoid the error — what’s the solution? How can we avoid this in the future?
This is the process that pretty much everyone goes through when something fails. Figuring out what went wrong can be like searching through a forest at night; the right workflow software won’t do it for you, but it can give you a flashlight.
It’s absolutely critical that engineers and data scientists invest in risk management for their code and get that flashlight. If you’re the only person involved in writing and using your code, you might be so lucky as to find and debug your own errors; but more often than not, the person who discovers a code failure isn’t the same person who wrote it. You need to locate, identify, and fix your errors as quickly and efficiently as possible to maintain uptime and keep your stakeholders happy, and organization is the first step.
The problem with modern workflow tools
Software for automating workflows has existed for literally decades (remember SAP?), but it’s usually built for business users. Cron was released by Bell Labs all the way back in 1975, and people like you and me have been struggling with it ever since. Things really started to change when Airflow picked up steam (pun… sort of intended). Airflow (or Luigi, but that’s a story for another time) pioneered the idea of workflows and workflow configuration as code.
This idea meant that the code and its real-world impact could share a similar structure. Instead of one inscrutable file containing all three steps of the invoice workflow, the code itself could finally mirror the workflow that it represented: three discrete blocks, whose interactions were governed by a workflow management system.
Today, Airflow is one of the most popular solutions for data teams to schedule and monitor their ETL workflows, and is a top-level Apache project. Airflow is great, but far from perfect; most critically, it’s not really a workflow tool. To understand why, think about how you’d draw a workflow. It would probably look something like this:
The boxes represent the work, your code, that needs to be done. In our original example, the first box aggregates product usage data, the second box generates invoices, and the third box sends them. The arrows represent “moving” to the next step: the sequence and rules. Workflows are just combinations of tasks and sequences; of objectives and rules.
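The boxes-and-arrows picture can be made literal: represent the tasks (boxes) and their dependencies (arrows) as plain data, and let a tiny runner honor the arrows before executing each box. The task names and bodies are placeholders:

```python
# A sketch of "boxes and arrows" as data: tasks plus the edges between them.

tasks = {
    "aggregate": lambda: "usage totals",
    "invoice":   lambda: "invoices",
    "send":      lambda: "emails sent",
}
# Arrows: each task lists the tasks that must finish before it can run.
arrows = {"aggregate": [], "invoice": ["aggregate"], "send": ["invoice"]}

def run(tasks, arrows):
    """Execute every task whose dependencies are satisfied, until all are done."""
    done, order = set(), []
    while len(done) < len(tasks):
        for name in tasks:
            if name not in done and all(dep in done for dep in arrows[name]):
                tasks[name]()          # run the box
                done.add(name)
                order.append(name)     # record the sequence the arrows imply
    return order

execution_order = run(tasks, arrows)
```

The boxes here are trivially swappable — any callable works — while all the interesting logic lives in how the arrows are interpreted. That split is exactly the division of labor the next paragraphs argue workflow software should respect.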
Great software is about the boxes — giving you new tools and abilities to build great things. Great workflow software should focus on the arrows: handling what happens in between, in the ugly parts and spaces, and when things go wrong. The negative engineering.
But most modern workflow tools don’t do that: instead, they dictate how you should build your boxes. This misplaced understanding of their purpose has led to enormous user frustration. For example, building systems in Airflow requires you to construct your boxes in specific, Airflow-native ways, limited by the toolkit Airflow exposes. This means the benefit of having software that handles arrows well is outweighed by having to compromise the utility of your boxes.
A great workflow tool should adapt to how you work, not force you to adapt to it. That’s why we built Prefect. We spent years designing a lightweight API that works with your existing code. We focus on the arrows — the elements unique to workflow orchestration — and invite our users to bring any boxes they want. Our job is to make sure everything works well together, not to dictate how you build your software. That’s why we built a flexible system that can generalize to your use cases without compromise.
Our users range from small startups to giant corporations, from professional baseball teams to national space agencies. No two workflows are alike, even within the same organization, but Prefect lets them all run with confidence. Our users write amazing code, and we’re privileged to assist them.