Positive and Negative Data Engineering
Prefect has a culture of transparency, which involves sharing news, both good and bad, with our employees and investors. In keeping with those values and the spirit of open source, we’d like to also include the broader community when possible. This post is excerpted from our August 2018 investor update.
Hello, world! Our team is excited to announce Prefect, a (soon-to-be) open-source framework for building robust data infrastructure. Prefect was inspired by observing frictions between data engineers and data scientists, and solves these problems with a functional API for defining and executing data workflows. Prefect is currently being used by our Lighthouse Partners — email us if you’re interested!
We’ve got a lot to share, but I want to kick off this blog by revisiting our team’s motivation and what we’re so excited about.
The biggest problem data engineers face is a Sisyphean task that we call “negative data engineering.”
- Positive data engineering is what we typically think engineers do: write code to achieve an objective.
- Negative data engineering is when engineers write defensive code to make sure the positive code actually runs. For example: what happens if data arrives malformed? What if the database goes down? What if the computer running the code fails? What if the code succeeds but the computer fails before it can report the success? Negative engineering is characterized by needing to anticipate this infinity of possible failures.
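To make the distinction concrete, here is a minimal sketch in plain Python. It is not Prefect's API; every name in it (`parse_amount`, `safe_parse_amount`, `run_with_retries`, `FlakySource`) is hypothetical and chosen only for illustration. One short function does the positive work, and everything else exists only to defend it against malformed data and an unavailable source.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def parse_amount(raw: str) -> float:
    """Positive engineering: the objective itself, turning a raw field into a number."""
    return float(raw)


def safe_parse_amount(raw: str) -> Optional[float]:
    """Negative engineering: guard the positive code against malformed data."""
    try:
        return parse_amount(raw)
    except (TypeError, ValueError):
        return None  # assumption: malformed values are skipped rather than fatal


def run_with_retries(task: Callable[[], T], max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> T:
    """Negative engineering: retry a task when its source is temporarily down."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


class FlakySource:
    """Hypothetical stand-in for a database that fails a few times before answering."""

    def __init__(self, failures_before_success: int, rows: List[str]) -> None:
        self.remaining_failures = failures_before_success
        self.rows = rows

    def fetch(self) -> List[str]:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("source unavailable")
        return list(self.rows)


if __name__ == "__main__":
    source = FlakySource(failures_before_success=2, rows=["1.5", "oops", "3"])
    rows = run_with_retries(source.fetch, max_attempts=3)
    print([safe_parse_amount(row) for row in rows])
```

Even in this toy version, the defensive code is several times longer than the single line that does the positive work, and it still ignores most of the failures listed above, such as the machine dying before it can report success.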
Engineers tell us they typically spend 90% of their time on negative or defensive issues, and just 10% on the positive solutions they were hired to build. That means there’s extraordinary leverage in focusing on negative engineering: if we can reduce the negative share from 90% to just 80%, the positive share rises from 10% to 20% of engineers’ time, effectively doubling their positive productivity.
However, if you look across the data landscape, you’ll see hardly any acknowledgment of this problem. Most people will tell you it’s simply too hard for a third party to solve these issues because, by their very nature, they are so specific to a company’s unique business practices.
At Prefect, we know better. Thinking about a multitude of implausible but critical negative outcomes is something I’ve done my entire career as a risk manager. In risk, one doesn’t attempt to predict and hedge every possible result; instead, one develops a repertoire of concepts and tools that is specific enough to be useful, but generalized enough to robustly handle the unknown. One of the key requirements for doing that effectively is being able to gather relevant experience as fast as possible.
For the past three years, I’ve been a PMC member of Apache Airflow, the most widely used open-source software for data engineering workflows. In addition to giving me valuable insight into a variety of technical challenges, that role also means I’ve received thousands of emails from data engineers and scientists looking for help with their problems. Through those conversations, I gained particular insight into the negative engineering problem. In isolation, each issue does, in fact, appear unique. But in aggregate, striking patterns appear: the same universal problems manifesting over and over. For a long time, I attempted to solve these issues within the confines of Airflow; when I reached Airflow’s limits, I started designing Prefect. That was almost two years ago.
Today, Prefect is the codification of the patterns we observe in modern data engineering. We’ve worked very hard to build a system that can automatically enable best practices, even for data applications it’s never seen before. To see how this works, consider how you can immediately recognize that “the sky is blue” is right and “sky is blue the” is wrong without memorizing every combination of words in the English language. Just as your brain has broad rules for language, Prefect can detect when something is wrong even when we can’t pinpoint exactly what or why. This capability makes negative engineering much easier and saves our users unbelievable amounts of time and headache.
Prefect is an exercise in simplicity. Negative engineering problems are not always complex, or sophisticated, or difficult. On the contrary: they are often minor, annoying, and repetitive. Consequently, they fall through the sieve we use to identify major issues, even though their aggregate impact is extraordinary. At Prefect, our discovery has been that most data applications can be decomposed into a simple vocabulary, and by focusing on those basic building blocks, we can solve negative issues without sacrificing any of the power or sophistication that positive engineering demands. Our users have creative license to combine those blocks in fascinating and unexpected ways, and Prefect serves as the lighthouse keeping them safe.
At our core, we provide two things. One is our open-source framework, which operates like a hardware store: stocked with all the necessary components for building great data applications. The other is our platform logic, which we think of as the store manager: guiding users to the right tools and making sure their projects are successful. With these two things working together, we can offer a compelling solution for both positive and negative engineering problems.
We’ve posted a brief technical introduction to Prefect here, and can’t wait to share more very soon.