Stories by David O'Keeffe on Medium

The Foundation of Modern DataOps with Databricks

David O'Keeffe — Mon, 16 Dec 2024 21:50:01 GMT

The DataOps Workflow for Databricks. Essentially, ELT into your data model, then push through the lakehouse medallion architecture. – [Source: Author]

The Three Ways: The Foundation of Modern DataOps with Databricks

Unlocking Databricks’ full potential requires a new approach to data management.

Discover how “The Three Ways” — the core principles of DevOps — form the foundation of modern DataOps, and will revolutionize your data engineering practice.

In DataOps the assumptions we make in development have to be continuously verified in production — [Source: Author]

Imagine you’re the leader of a data team and you’re sure that the time has come. Enough is enough: you need a fresh data platform.

Yet you can’t quite take the leap. While it promises to solve problems and deliver substantial value, the prospect of adoption feels rather daunting. The benefits don’t appear to outweigh the pain of the disruption to your team and to your operations.

True, if you bought into every vendor’s pitch, you’d spend all your time moving platforms and never delivering.

However, some technologies stand out from others, and some become essential over time. This is the lifecycle of technology adoption. There is always some friction when it comes to doing something new. It feels hard, and perhaps it is hard to “cross the chasm”, both on a personal and a professional level, but change is what is necessary for growth.

The Technology Adoption lifecycle from Crossing the Chasm – [Source: Wikipedia]

It’s why the leaders in every field are often there because they dared to be disruptive. They embrace and create change. Healthy disruption is at the heart of Databricks. For over a decade, Databricks has been stacking S-curves on top of each other and releasing them freely to help solve the world’s data problems. Spark, Delta Lake, and MLflow, (and increasingly even Unity Catalog) are ubiquitous throughout most data stacks. It’s why over 10,000 customers to date have chosen to go with Databricks.

S-Curves follow the same Crossing the Chasm pattern but can be stacked on top of each other — [Source: Author]

How then, can you take the leap with Databricks?

You need to change the bond between technology, people, and process. In other words, you need a DevOps transformation, but for data. You need DataOps.

There are many articles out there claiming that DataOps is not ‘DevOps for Data’. But I beg to differ: to know DataOps means to know DevOps deeply, and herein lies most of the problem.

What is DevOps?

My favourite definition of DevOps comes from The Phoenix Project, which states that DevOps is the act of taking the best practices from physical manufacturing and leadership and applying them to the technology systems we build to achieve our means.

The principles the above quote alludes to come from some of the below movements:

Lean Manufacturing
The Theory of Constraints
The Toyota Kata
The Scientific Method
The Agile Manifesto
(Lean) Six Sigma

DevOps is, at some level, about what we all typically think it is: automating the tasks of IT operations, writing “Infrastructure as Code,” and bringing the “Dev teams” closer to the “Ops teams.”

But these things are the consequences of applying the teachings of these movements to deliver a more robust, reliable, and productive software product.

The trick to DataOps, then, is to think about how these principles and this culture can be applied to “data products” as they relate to the IT systems we build to deliver them. It’s about working from first principles and leveraging the lessons of the past. In Gene Kim’s definition, we can embody this through “The Three Ways of DevOps”.

So, let’s go through each of The Three Ways and apply them to Databricks to get — The Three Ways of DataOps for Databricks.

The First Way of DataOps and Databricks

Make analytics flow from left to right, from Dev to Ops, as quickly as possible.

The key to thinking around the first way is to think of all work in data as analogous to a physical factory pipeline: data goes in as the raw material, and valuable insights come out. Just like a manufacturing plant producing refined oil, we make data products, and we must aim to optimise the throughput of these products without sacrificing their quality.

I’ve found this to be a brilliant metaphor to guide our overall approach to data work. Think of developers as literally the engineers on the factory floor who construct and orchestrate the machinery needed to create analytics.

Consider how their work gets done. When a requirement comes in, what actually happens from start to finish? How many people are involved? How many teams, and how many different technologies? How do they test if the requirement was met? After it’s deployed, who ensures it continues to run? What Service Level Objectives are essential to this workload?

The DataOps Workflow for Databricks. Essentially, ELT into your data model, then push through the medallion architecture. – [Source: Author]

Through this process, I guarantee you’ll find that every handoff between individuals, platforms, and especially teams incurs a considerable loss in time and productivity. This is why DevOps exists: to merge the roles and responsibilities of Developers and Operations into one, to prevent information loss, to make technology reproducible (think IaC, VMs, and Docker), and to relieve the constraints that cost us time.

This is where Databricks steps in. There are many reasons to buy Databricks, but in my opinion, you are primarily purchasing Databricks for the ability to streamline your processes and maximize analytical throughput. Business use cases are a dime a dozen, but the machinery to produce them doesn’t just magically appear. For example, there are lots of painful problems to solve before you do anything actually useful with data, such as:

A subset of all the different problems you need to solve to safely work with data at scale — [Source: Author]

Databricks has an answer for all these challenges. You might think the answers to these questions are trivial, but to build them yourself in a fragmented data landscape is not at all. It’s work that weighs on you. To put it in SRE terms, Databricks is a means to eliminate toil. In practice, we typically see this translates to an average doubling in developer productivity. That’s twice as much work flowing from Dev to Ops!

This is why Databricks is a “data hyperscaler.” It is a uniquely vertically and horizontally integrated data stack; everything is in one place, reducing the need for handoffs and data movement. It relies on universal cloud services and open data formats and integrates with almost every data tool on the market.

With that in mind, if you examine how your analytics factory works today, I can almost guarantee that the flow of work from Dev to Ops will be bottlenecked at a point where technology and people intersect. This is the real value of Databricks: it brings your team together on one platform, increasing analytical throughput by reducing handoffs while using open data formats to allow for heterogeneity of ways to interact with your data.

The Second Way of DataOps and Databricks

Make your Lakehouse a safer and more resilient system by receiving feedback as soon as possible.

In the second way of DataOps, your goal is to enable your engineers to make changes to your analytical pipelines without fear. Imagine a new intern joining the team. Could they push a simple change to production, like renaming a column in a table, in less than one day? Does the prospect of that change send a shiver down your spine?

The only way to do this safely is to create consistent, fast, and reliable feedback loops. It is the key to building resilient and adaptable data products.

Fortunately or unfortunately, there is typically only one way to get that feedback, and it’s a word many data engineers fear… testing and lots of it.

When a faulty car is produced at the factory, it’s real and tangible; we can see it (hopefully). But in software systems, a fault is invisible unless you specifically test for it, which doesn’t mean the consequences can’t be as dire. As the great Edsger Dijkstra said: “Testing can be used to show the presence of bugs, but never to show their absence”.

This is critical today as data analytics isn’t simply about Tableau dashboards anymore. Data products are almost always operational in nature. A simple forecasting engine breaches SLAs for one day, and millions of dollars can be lost. A bad input goes into a risk lending model, and someone’s insurance can be rejected.

In reality, all developers, even data developers, test all the time, just with their eyes. How do they know what they built worked otherwise? Tests of any kind are the formal verification of requirements. The tricky part of testing is the effort required to codify and automate your expectations.

Testing can be simple in practice: data pipelines are straightforward, one-way processes. Imagine a mathematical function: f(x) = x * 2. If I put 2 in, I should get 4 out. A business rule is just the specific application of a function; it’s hardly different. A supermarket might double the reward points it gifts customers when they buy its in-house brand.
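As a minimal sketch of that idea, the supermarket rule can be written as a plain function and checked in the same way as f(x) = x * 2. The function name reward_points, the in_house flag, and the base rate of one point per dollar are assumptions made purely for illustration.

```python
def reward_points(amount_spent: int, in_house: bool) -> int:
    """Hypothetical business rule: one point per dollar, doubled for the in-house brand."""
    base_points = amount_spent  # assumed base rate: 1 point per whole dollar
    return base_points * 2 if in_house else base_points


def test_reward_points_doubles_for_in_house_brand():
    # Same shape as f(x) = x * 2: put 2 in, expect 4 out.
    assert reward_points(2, in_house=True) == 4
    # Other brands keep the base rate.
    assert reward_points(2, in_house=False) == 2


# Called directly so the sketch runs without a test runner.
test_reward_points_doubles_for_in_house_brand()
```

A test runner such as pytest would discover the test function by its name; calling it by hand here just keeps the example self-contained.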

Blue-Green pipelines on the lakehouse – [Source: Author]

The beauty of owning a data lake and adopting the medallion architecture is that storage is decoupled from compute. This means that, in practice, it is simple to deploy two different versions of your pipelines and run them side by side with the same inputs. This is because every data pipeline is essentially a black box function: data goes in, and different data comes out. You have a space to try new things out safely and to check that those transformations behave as expected. This can be done with a basic pytest suite combined with chispa and dbldatagen, or with something more out of the box such as Delta Live Tables expectations.
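The side-by-side idea can be sketched without Spark at all. In the example below, plain Python lists stand in for tables, and the two pipeline functions, the column names, and the diff_outputs helper are all hypothetical. On Databricks the same comparison would run on DataFrames, with a library such as chispa doing the equality check.

```python
from collections import Counter

# Shared, read-only input standing in for a bronze table that both versions read.
BRONZE_ROWS = [
    {"customer": "a", "brand": "in_house", "spend": 10},
    {"customer": "b", "brand": "other", "spend": 5},
]


def pipeline_blue(rows):
    """Current production logic (hypothetical): one point per dollar."""
    return [{"customer": r["customer"], "points": r["spend"]} for r in rows]


def pipeline_green(rows):
    """Candidate logic (hypothetical): double points for the in-house brand."""
    return [
        {
            "customer": r["customer"],
            "points": r["spend"] * (2 if r["brand"] == "in_house" else 1),
        }
        for r in rows
    ]


def diff_outputs(blue, green):
    """Return (rows only in blue, rows only in green), ignoring row order."""
    def as_key(row):
        return tuple(sorted(row.items()))

    blue_counts = Counter(as_key(row) for row in blue)
    green_counts = Counter(as_key(row) for row in green)
    only_blue = [dict(key) for key in (blue_counts - green_counts).elements()]
    only_green = [dict(key) for key in (green_counts - blue_counts).elements()]
    return only_blue, only_green


only_blue, only_green = diff_outputs(
    pipeline_blue(BRONZE_ROWS), pipeline_green(BRONZE_ROWS)
)
```

Here the only difference is customer "a", whose points change from 10 to 20. An empty diff would mean the green version is a safe drop-in replacement for these inputs.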

This is where we start to encounter one of the core problems in data as a whole. Data is a living being. It is constantly morphing and changing. In fact, if we weren’t continually checking for change, there would be little value in data at all.

It’s a painful reality of working with data that when we develop, we have to make assumptions about the nature of the data we are expecting as input. These assumptions can’t constantly shift; if they did, the code would have to change with it and, therefore, the code never gets done. Likewise, in production, the opposite applies: the data changes constantly, but the code itself is static. That way, we get those expected new insights with fresh data inputs, but without regressions in functionality.

This is the duality of data engineering. Our process is always divided into two pipelines: one for creating the code to do new things with static inputs, and the other for realizing the value of that innovation by feeding it fresh data. Reconciliation of these two pipelines is a challenge for all data teams, large and small.

The duality of DataOps means that we have to constantly check our assumptions in production and reconcile them with development — [Source: Author]

Databricks, Unity Catalog and Delta Lake can help you solve this dilemma. Many tradeoffs need to be considered to solve this problem in full, which will be discussed in another post, but given storage is cheap, and compute is decoupled in a Lakehouse, a simple solution could be to take an entire database and Delta Clone it. This would give an isolated representation of production in a lower environment for all your developers to work with but at the cost of working with production volumes and compute (with potential security implications). You can do this with either deep or shallow clones too.
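As a rough sketch of what cloning a whole schema could look like, the helper below only builds the SQL text, using the Delta Lake CREATE TABLE ... CLONE syntax. The catalog, schema, and table names are placeholders, and the statements are not executed here. A shallow clone copies only metadata and keeps pointing at the source data files, while a deep clone also copies the data.

```python
def clone_statements(tables, source_schema, target_schema, shallow=True):
    """Build one CREATE TABLE ... CLONE statement per table.

    All schema and table names passed in are hypothetical placeholders.
    """
    clone_kind = "SHALLOW" if shallow else "DEEP"
    return [
        f"CREATE OR REPLACE TABLE {target_schema}.{table} "
        f"{clone_kind} CLONE {source_schema}.{table}"
        for table in tables
    ]


statements = clone_statements(["orders", "customers"], "prod.sales", "dev.sales")
deep_statements = clone_statements(["orders"], "prod.sales", "dev.sales", shallow=False)

# On a Databricks cluster each statement could be run with spark.sql(statement);
# that call is left out so the sketch runs anywhere.
```

Generating the statements from a list of tables keeps the clone repeatable, which matters if the lower environment is refreshed on a schedule.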

Once you have cloned your production data, the next step is to analyse and understand it thoroughly. You need to profile the inputs using tooling like Lakehouse Monitoring or in-built profiling and align them with expectations. You would repeat this process for every stage of your pipeline to get feedback on the behaviour of your code.
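A toy version of that profiling step might look like the following. It is not the Lakehouse Monitoring API: the profile_column and check_profile helpers, the spend column, and the thresholds are all invented for illustration.

```python
def profile_column(rows, column):
    """Compute a few basic statistics for one column of a list-of-dicts table."""
    values = [row.get(column) for row in rows]
    non_null = [value for value in values if value is not None]
    return {
        "row_count": len(values),
        "null_fraction": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def check_profile(profile, expectations):
    """Compare a profile against expectations and return readable violations."""
    violations = []
    if profile["null_fraction"] > expectations["max_null_fraction"]:
        violations.append(
            f"null fraction {profile['null_fraction']:.2f} exceeds "
            f"{expectations['max_null_fraction']:.2f}"
        )
    if profile["min"] is not None and profile["min"] < expectations["min_value"]:
        violations.append(
            f"minimum {profile['min']} is below {expectations['min_value']}"
        )
    return violations


# Made-up rows standing in for a cloned production table.
cloned_rows = [{"spend": 10}, {"spend": 25}, {"spend": None}, {"spend": -3}]
profile = profile_column(cloned_rows, "spend")
violations = check_profile(profile, {"max_null_fraction": 0.1, "min_value": 0})
```

With this sample, both expectations are violated: a quarter of the values are null and one value is negative. The same check could be repeated at every stage of the pipeline.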

Now if you had absolute certainty of what those inputs are and you automated those checks to a high test coverage, the change from your intern is pretty straightforward. The new pipeline is deployed in isolation and it either passes or it doesn’t. The beauty of Unity Catalog, too, is that if it doesn’t pass, you have all the lineage upstream and downstream of that change, and every query that has followed it, all tracked, so you can quickly identify and remediate what broke.

Turn the same process on in production and this solves the second part of the equation: knowing our inputs and continuing to check that they are as expected when actual data is flowing through your production pipeline. After all, valuable analytics can only be created in production when new data is fed into it.

The Third Way of DataOps and Databricks

To combat the entropy of data systems, foster a culture of constant experimentation and improvement that is aligned with the Databricks roadmap.

There are two keys to living in the third way. The first is having a scientific mindset, and the second is being able to reserve time to build for the future. They go hand in hand, allowing you to form hypotheses about improving your processes while giving you the breathing space to act on them and see the results.

Ward Cunningham wisely said — ‘Shipping first-time code is like going into debt.’ This rings true for data teams too. Tech debt silently accumulates, and most organizations lack the resources to repay it. Addressing it early is key to maintaining agility and efficiency. I personally have seen time and time again how the burden of keeping the lights on leads to perpetual toil. If you spend 100% of your time putting out fires to keep production alive or attempting to maximise developer throughput, there’s no time to improve your system to prevent the fires from occurring in the first place, or to lift the constraints that are hampering the amount of work that can flow from Dev to Ops. It is a perpetual ball and chain around your ankles.

For example, in the past two years, if you take all the repeating BI queries that have ever run through DBSQL over all Databricks customers, they are running 73% faster on average than they used to. Given that time is money in the cloud, and reading that figure as 73% less runtime, that is close to a three-quarter reduction in cost for nothing. If two years ago you planned to adopt DBSQL, you essentially just got this benefit for free and fully realised it.
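The phrase “73% faster” can be read in two ways, and the text does not say which one applies, so the arithmetic is worth spelling out. The sketch below assumes cost is directly proportional to query runtime; both function names are made up for this example.

```python
def cost_reduction_if_runtime_drops(percent: float) -> float:
    """Reading 1: queries take `percent` % less time, so billed time drops by the same share."""
    return percent / 100


def cost_reduction_if_speed_rises(percent: float) -> float:
    """Reading 2: queries run at (1 + percent/100) times the old speed."""
    speedup = 1 + percent / 100
    return 1 - 1 / speedup


reading_1 = cost_reduction_if_runtime_drops(73)  # 0.73 of the old cost is saved
reading_2 = cost_reduction_if_speed_rises(73)    # roughly 0.42 of the old cost is saved
```

Under the first reading the saving is about three quarters, which matches the claim above. Under the second it is closer to two fifths, which is still a substantial saving for no engineering effort.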

Databricks is constantly optimizing and improving performance on behalf of its customers – [Source: Author]

On the flip side, if you decided to bolt on a complicated YAML metadata-driven framework over Databricks, you might find yourself locked into an old Databricks Runtime, and unable to realise these improvement benefits without a substantial refactor. You need in-depth knowledge of how this abstraction works, or you’ve now essentially locked yourself into an outdated version of Databricks.

Whatever situation you happen to be in, a future target state is always there to be found. Databricks is a rapidly evolving product. Rarely do three months pass without a significant feature being released on the platform. At the time of writing, we are approaching Spark and Delta versions 4.0, and Databricks will be the first platform to integrate the two seamlessly.

You could form a hypothesis that with Delta 4.0, your Spark SQL queries on Delta 3.1, given they parse a lot of unstructured JSON, will speed up by 30% through the use of the new Delta Variant Type. So you action a plan to set aside the time to identify 5–10 representative queries and set up a test environment to compare their performance with and without Delta Variant. If the average improvement is close to 30%, you then plan to upgrade to Delta 4.0 and identify the next obstacle in the way.
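The evaluation step of such an experiment could be as small as the sketch below. The timings are made-up numbers, not measurements, and the 5-point tolerance used to decide what counts as “close to 30%” is an assumption.

```python
from statistics import mean


def improvement(baseline_seconds, candidate_seconds):
    """Fractional runtime reduction for one query (0.30 means 30% less time)."""
    return (baseline_seconds - candidate_seconds) / baseline_seconds


def evaluate_hypothesis(timings, target=0.30, tolerance=0.05):
    """timings maps a query name to (baseline_seconds, candidate_seconds)."""
    per_query = {
        name: improvement(baseline, candidate)
        for name, (baseline, candidate) in timings.items()
    }
    average = mean(per_query.values())
    return {
        "per_query": per_query,
        "average": average,
        # Assumed decision rule: accept if the average is within `tolerance` of the target.
        "supported": average >= target - tolerance,
    }


# Hypothetical timings for five representative queries, without and with the new type.
timings = {
    "q1": (100.0, 70.0),
    "q2": (80.0, 60.0),
    "q3": (50.0, 30.0),
    "q4": (200.0, 150.0),
    "q5": (40.0, 28.0),
}
result = evaluate_hypothesis(timings)
```

Keeping the per-query numbers alongside the average is deliberate: one query that gets slower is worth knowing about even when the average looks healthy.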

The key takeaway here is that staying current with Databricks isn’t just about adopting new features. It’s about embracing a mindset of continuous improvement. By carving out time to experiment, test hypotheses, and implement platform advancements, you’re setting yourself up to leapfrog ahead. In the world of DataOps, standing still is essentially moving backwards.

This post has covered a lot of ground. DataOps is all about the art of taking the best practices from physical manufacturing and software engineering and applying them to the data software domain. It is about using first principles to address the distinct challenges inherent in working with data.

Data comes from a different world. It is the realm of statisticians and bookkeepers. The rigour that was instilled in software engineering is only now coming to data engineering. DataOps isn’t merely about adopting new tools or platforms; it’s a fundamental shift in how we approach data work. It involves breaking down silos, embracing automated testing, and cultivating a culture of continuous improvement.

Picture our hypothetical intern again. A true DataOps environment should empower them to rename that production column within a day. Does this seem implausible in your current setup? If so, you have room to grow.

Databricks gives you a head start on this journey. It’s only your team’s specific application of the features Databricks offers (such as Unity Catalog and Delta Lake) that determines how robust your DataOps practice can become. Let’s frame this in The Three Ways:

The First Way: Databricks as a whole streamlines your processes, reducing handoffs and improving the flow of work from Dev to Ops. The Lakehouse architecture and Unity Catalog simplify your data infrastructure and eliminate toil.
The Second Way: With Delta Lake and Unity Catalog, you can enable safe and reliable feedback loops for your developers.
The Third Way: Databricks constantly innovates on behalf of its customers (like the 73% speed improvement in DBSQL queries over two years). To capture these benefits, it is imperative to have a culture of continuous experimentation and improvement. This helps you stay aligned with the Databricks roadmap while combating the gradual degradation that comes with accumulating tech debt.

Your path to DataOps on Databricks is unique. It requires commitment, experimentation, and a willingness to challenge the status quo. Focus on how Databricks can help you streamline the machinery that produces analytics so that your engineers can spend time doing the things that matter.

The payoff is substantial: faster insights delivery, more reliable data products, and a more agile, responsive data team.

I encourage you to apply what you’ve learned here. Start small if needed: explore how Unity Catalog can streamline your data governance, or perhaps implement a simple testing framework. Take that first step toward DataOps.

In the next chapter, we will explore how Databricks can dramatically improve the flow of work from Dev to Ops. Databricks Asset Bundles are an obvious candidate here, but the answer might surprise you. Hint: It involves leveraging one of the best properties of a data lake.

The Foundation of Modern DataOps with Databricks was originally published in DBSQL SME Engineering on Medium.

The Future Of Test Data

David O'Keeffe — Tue, 26 Oct 2021 23:16:03 GMT

The Data Engineering Testing Series

Welcome to Part 5 of the Data Engineering Testing Series! Hopefully, by now, you have a relatively in-depth understanding of the concept behind this process, and how it works; if you don’t, it’s okay to read on but feel free to use the links below to learn more.

Today we’ll go through the potential of ML to revolutionize the test data management space. While I may brandish the terms ML and revolution around as buzzwords, in this case, there’s some serious merit behind the idea. Everything ML right?

Data Testing Is Painful

For the first-timers, a quick breakdown of the whole series: you should use test-driven development (TDD) as a quality framework for your data engineering workflow. To do that, you should generate data with code and pump it through your test suite of choice.

A major sticking point of the whole process thus far is that the coding bit (Part 2 of the series) is an annoyance most would rather forgo. The thing is, I view it as a necessary evil: no one is solving the problems of test data management (TDM) overnight (again, see Part 2), and the ability to create your own data is a valuable technique for any data engineer to have. You can always create data that represents the “golden path” of your workflow and pump it through your pipelines, and that’s enough to scrape by in the beginning.

However, that doesn’t change the facts; it’s tedious work that requires a decent understanding of production data and a skillset perhaps few have. So anything we can do to reduce the pain of this process will remove a massive blocker for the whole idea. Furthermore, the more accurate the data you use for your generation, the better your development workflow, which equals fewer production problems at the end of the day.
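For readers who have not seen the earlier parts, generating “golden path” data with code can be as simple as the sketch below. The column names, value ranges, and the total_spend_by_brand transformation are invented for illustration; a fixed seed keeps the generated rows reproducible between test runs.

```python
import random


def generate_golden_path_orders(n_rows, seed=42):
    """Generate well-formed order rows; every value is valid by construction."""
    rng = random.Random(seed)  # seeded so every run produces the same rows
    brands = ["in_house", "other"]
    return [
        {"order_id": i, "brand": rng.choice(brands), "spend": rng.randint(1, 100)}
        for i in range(1, n_rows + 1)
    ]


def total_spend_by_brand(rows):
    """Hypothetical transformation under test: sum spend per brand."""
    totals = {}
    for row in rows:
        totals[row["brand"]] = totals.get(row["brand"], 0) + row["spend"]
    return totals


orders = generate_golden_path_orders(20)
totals = total_spend_by_brand(orders)

# An invariant that must hold for any input: no spend is lost or invented.
assert sum(totals.values()) == sum(order["spend"] for order in orders)
```

Checking an invariant rather than hard-coded totals means the test still passes if the generator is later changed to produce more rows.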

ML To The Rescue!

Here comes the revolutionary idea: what if we don’t generate data in the traditional sense using code (like in Part 3), but instead rely on a machine learning model to learn what the dataset is? This model then spits out a dataset of n rows whenever you call it. A true expression of software 2.0, as Andrej Karpathy would put it. The advantages are obvious:

Technically you don’t need to touch the production dataset anymore to do your development. A massive bonus for organizations that don’t necessarily want developers handling sensitive data.
The result is considerably more realistic because a computer can pick up on the statistical properties of the data you’re using better than you can.
The model maintains referential integrity among many sets of tables for you—a difficult task to undertake with bespoke tools.
It is theoretically a quicker and more reproducible workflow than relying on human intuition to produce realistic data for development.

The drawbacks:

How can you guarantee production data is not going to leak between environments? Is it possible to reverse engineer production data from the model?
It potentially could require a ton of computing power, and therefore time to produce the models. You could end up with the same problem of stale data in development but on a slightly different and perhaps more expensive scale.
The models themselves have to be maintained and verified somehow.
It limits your ability to narrow down the scope of your tests or create custom data. For example, you may want to test exception handling for a rare case that’s never been seen in production yet. There’s no way around this as far as I can tell.
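To make the idea of a model that learns a dataset and returns n rows concrete, here is a deliberately naive stand-in. It learns only per-column statistics and samples each column independently, so it is not how the products discussed in this post work; the class name ToyTableModel and the sample data are invented for this sketch.

```python
import random
from collections import Counter
from statistics import mean, pstdev


class ToyTableModel:
    """Toy generator: learns per-column statistics, then samples new rows."""

    def fit(self, rows):
        self.columns = {}
        for name in rows[0]:
            values = [row[name] for row in rows]
            if all(isinstance(value, (int, float)) for value in values):
                # Numeric column: remember the mean and standard deviation.
                self.columns[name] = ("numeric", mean(values), pstdev(values))
            else:
                # Categorical column: remember each value and how often it occurred.
                counts = Counter(values)
                self.columns[name] = ("categorical", list(counts), list(counts.values()))
        return self

    def sample(self, n_rows, seed=0):
        rng = random.Random(seed)
        sampled = []
        for _ in range(n_rows):
            row = {}
            for name, spec in self.columns.items():
                if spec[0] == "numeric":
                    row[name] = rng.gauss(spec[1], spec[2])
                else:
                    row[name] = rng.choices(spec[1], weights=spec[2])[0]
            sampled.append(row)
        return sampled


# Made-up rows standing in for a production table.
production_like = [
    {"country": "AU", "holdings": 1.5},
    {"country": "AU", "holdings": 2.5},
    {"country": "NZ", "holdings": 4.0},
]
model = ToyTableModel().fit(production_like)
synthetic = model.sample(100)
```

Even this toy shows two of the drawbacks listed above. It can only emit categories it has already seen, so a rare case absent from production never appears, and because columns are sampled independently, any relationship between them is lost.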

The Market Is Booming

With the pros and cons out of the way, unfortunately for me, I didn’t come up with this idea 😅. In fact, this technique has been around since 2016, but it appears to be taking off this year, with the main players in the game being:

The Synthetic Data Vault (“Put synthetic data to work!”). Soon to be DataCebo, a clever name that’s for sure (placebo, datacebo… get it?)

Tonic.ai, The Fake Data Company

Gretel

In my ideal world, which is somewhat security conscious, my wish-list for these systems would be:

It absolutely cannot leak sensitive data to lower environments, so it needs options to detect PII/PHI data and obfuscate it. This would most likely need human intervention to get 100% correct.
It needs to have a solid authorization and permissions model to support the above as it will require human access to production data.
It runs on a service that is elastic, so I don’t have to worry about scaling compute or storage.
It can pick up relationships between tables and even between columns in the same table. This is to keep the fake data consistent.
It has orchestration and alerting built-in.
It should have an SDK (preferably in Python) to interact with.
The models should have metrics around their reliability and accuracy.
It keeps a history of the models created so we can replicate the outputs of our pipelines for debugging purposes.
The models should produce data quickly enough not to slow down my test suite or my development workflow significantly.

Data Quality Testing Reigns Supreme

If any of these products get it right, it can be a game-changer for our TDD technique. I suspect what we would start to do is pivot towards testing on the level that is a bit more like the “data quality testing” we talked about in Part 4 and less of the DataFrame to DataFrame comparisons. This is because, in return for development speed, we have traded off the formality of rigidly defined tests (as our test input is more unknown now). I’m sure most businesses are willing to accept a slightly higher error rate in production in return for a shortened time to value.

Again, thanks for taking the time to get this far and please stay tuned for Part 6, where I’ll go through SDV, Tonic.ai, and Gretel, to assess their strengths and weaknesses.

As usual, you can reach out to me on LinkedIn or in the comments below.

The Data Engineering Testing Series

Part 1: Why Great Data Engineering Needs Automated Testing
Part 2: The Keys To Unlock TDD For Data Engineering
Part 3: The Test Pyramid and Data Engineering (with Julia)
Part 4: What Is Data Quality Really?
Part 5: The Future Of Test Data
Part 6: The New Kids In The Test Data Game

The Future Of Test Data was originally published in Cognizant Servian on Medium.

What Is Data Quality Really?

David O'Keeffe — Wed, 25 Aug 2021 05:27:09 GMT

The Data Engineering Testing Series

What Really Is Data Quality?

I Have Great Expectations…

Welcome to Part 4 of the Data Engineering Testing Series. So far, this series has mostly been concerned with how we ensure that our data transformations are correct. The solution we came up with is borrowed from software engineering and relies on data generation with code, a concept quite foreign to a lot of professionals I’ve come across.

The reason I suspect it is so uncommon is that the idea of having to mimic changes in data with code can seem ludicrous. First I have to understand what the changes are, then I’ve got to change a bunch of code and fix the broken tests? Sounds like a lot of work for little gain.

Other common reasons against going for this kind of approach are:

What if the changes are happening rapidly? I’ll never catch up.
It’s SQL, the transformations are so obvious, we don’t need tests for it.
It’s not testing the real data, I don’t see the point.

I won’t lie, this line of thinking is tough to combat, because the costs behind these decisions are hidden. At the end of the day the business does not care about the means, it only wants the ends, and it will accept whatever costs come with that. Because of this, the standard actually has to be enforced by the professionals. They should know better; they should know that not automating testing for critical workloads, when you can, is a form of professional negligence.

Test you fools!

But say for whatever reason, you can’t sell the data generation approach to your peers. I guarantee you the one thing you can always sell is data quality. It’s the next best thing because the word quality sounds better, and it’s basically testing anyway.

Check the E and L in ELT

Most of the time when we talk about data testing, it’s common to immediately think about testing the data itself, rather than testing the correctness of the transformations we are doing to that data to make it useful. It makes total sense, it’s the data… and we’re testing it. But this is where the terminology of “data quality” and “data testing” becomes blurred.

To me, the primary purpose of data quality tests is to check that the E (extract) or the L (load) in ELT is as expected. It is about ensuring that we detect issues with data at the source, or issues with extracting that data. This is because, with the data generation method, you should already know what your code is doing (i.e., what you do to the data). If the input isn’t as expected, well, you can be sure of unexpected results.

Hence data quality checks are geared to answer a slightly different set of questions than those used to validate the correctness of transformations, such as:

Is my schema as expected?
Do all the values in the rows have the correct type?
Are there any weird values in my data today?
Are there duplicate values?
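Each of those four questions maps onto a small check. The sketch below uses plain Python rather than a data quality framework; the schema, the allowed range for net_worth, and the sample batch are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical expectations for an incoming batch.
EXPECTED_SCHEMA = {"id": str, "country": str, "net_worth": int}
ALLOWED_RANGES = {"net_worth": (0, 10**12)}


def check_batch(rows, expected_schema, allowed_ranges, key="id"):
    """Return a list of readable issues found in a batch of rows."""
    issues = []
    for index, row in enumerate(rows):
        # Is my schema as expected?
        if set(row) != set(expected_schema):
            issues.append(f"row {index}: columns {sorted(row)} do not match the schema")
            continue
        for column, expected_type in expected_schema.items():
            value = row[column]
            # Do all the values have the correct type?
            if not isinstance(value, expected_type):
                issues.append(f"row {index}: {column} is not {expected_type.__name__}")
            # Are there any weird values?
            elif column in allowed_ranges:
                low, high = allowed_ranges[column]
                if not low <= value <= high:
                    issues.append(f"row {index}: {column}={value} is outside [{low}, {high}]")
    # Are there duplicate values in the key column?
    key_counts = Counter(row.get(key) for row in rows)
    issues += [f"duplicate {key}: {value}" for value, count in key_counts.items() if count > 1]
    return issues


batch = [
    {"id": "1", "country": "AU", "net_worth": 100},
    {"id": "2", "country": "NZ", "net_worth": "unknown"},  # wrong type
    {"id": "2", "country": "NZ", "net_worth": -5},         # weird value, duplicate id
    {"id": "3", "country": "AU"},                          # missing column
]
issues = check_batch(batch, EXPECTED_SCHEMA, ALLOWED_RANGES)
```

This batch produces four issues, one per question. Returning issues rather than raising on the first one is a design choice: in production it is usually more useful to see everything that is wrong with a batch at once.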

In this way, data quality is complementary to data testing. It does not have the formality of the data testing approach, and it lacks the same level of guarantees since the input is not fixed. However, it is fantastic operationally, as it acts as a mechanism to check how your pipeline is working in the wild. I see the two as entirely synergistic.

Define “data quality”

Broadly speaking, data quality simply means the data is fit to use. Our preference is that ‘quality’ comes from the source, but as we know, it doesn’t always pan out that way. Thus, data quality, in the sense that most engineers think of it, means checking whether the data is fit to use.

These checks are inevitably a variety of methods that make assertions on columns or rows. The assertions can be the same as those used in the data generation approach, except this time, the data is real. They can be quite sophisticated: you have the common nullability and uniqueness constraints, but you can also apply regexes and assert that values fall within a confidence interval, among many others.
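Two of the more sophisticated assertions mentioned above, a regex check and a confidence interval check, can be sketched as follows. The helper names are made up, the sample values are invented, and the interval uses a simple normal approximation (z = 1.96 for roughly 95%), which is an assumption rather than what any particular tool does.

```python
import re
from statistics import mean, stdev


def values_not_matching(values, pattern):
    """Return the values that do not fully match the regular expression."""
    compiled = re.compile(pattern)
    return [value for value in values if not compiled.fullmatch(value)]


def mean_within_interval(values, expected_mean, z=1.96):
    """True when expected_mean lies inside the approximate confidence interval
    around the sample mean (normal approximation, assumed for simplicity)."""
    sample_mean = mean(values)
    standard_error = stdev(values) / len(values) ** 0.5
    return abs(sample_mean - expected_mean) <= z * standard_error


# Regex assertion: country codes should be two upper-case letters.
bad_codes = values_not_matching(["AU", "NZ", "aus"], r"[A-Z]{2}")

# Interval assertion: today's holdings should be centred where we expect.
holdings = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2]
looks_normal = mean_within_interval(holdings, expected_mean=10.0)
looks_shifted = mean_within_interval(holdings, expected_mean=11.0)
```

The statistical check is what makes these assertions suitable for real data: it tolerates ordinary day-to-day variation while still flagging a batch whose values have drifted.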

Providing you with a framework to define these assertions is the crux of most data quality tools, with Great Expectations being the popular open source option of today.

Theoretically you can even generate data, pump it through your data quality framework, and that forms the test specification part of your test suite (i.e., the part that defines what to check). That way you are maintaining only one set of specifications instead of two. If you combine this with test markers you can even feature toggle tests depending on whether they rely on real data or not.
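One way to picture a single specification serving both purposes is sketched below. The SPEC list, the sample rows, and the RUN_REAL_DATA_CHECKS flag are all hypothetical; the flag stands in for a test marker, which in pytest would be a custom marker selected or deselected with the -m option.

```python
# One specification, written once, as (description, row-level check) pairs.
SPEC = [
    ("id is never null", lambda row: row["id"] is not None),
    ("net_worth is non-negative", lambda row: row["net_worth"] >= 0),
]


def run_spec(rows, spec=SPEC):
    """Return the descriptions of every check that at least one row fails."""
    return [name for name, check in spec if not all(check(row) for row in rows)]


# Generated data: valid by construction, so the spec doubles as a test specification.
generated = [{"id": str(i), "net_worth": 100 * i} for i in range(5)]

# Made-up stand-in for a real batch that contains a bad record.
real_batch = [{"id": "a1", "net_worth": 250}, {"id": None, "net_worth": 90}]

# Stand-in for a test marker: toggle the checks that depend on real data.
RUN_REAL_DATA_CHECKS = True

failed_on_generated = run_spec(generated)
failed_on_real = run_spec(real_batch) if RUN_REAL_DATA_CHECKS else []
```

The generated rows pass every check, while the real batch fails the null check. That is the behaviour described above: the same specification stays green in development and breaks when the real data changes.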

It’s worth showing this test suite diagram again.

The true beauty of this method is you now have a mechanism to make tests break when the real data changes. It’s easy to see how this can feed back into writing more formal tests with code, as you discover more tests to do based off the cases you discover.

In conclusion

Let me finish with a few easy dot points…

Data Quality DOES NOT

Give you guarantees on the behaviour of your transformation code (as the input can change)
Buy you testing rigorous enough for good continuous delivery pipelines (because you can’t be sure of what your code does with input you don’t know)
Have the benefits of developing with static data (because the data is constantly shifting from underneath you)

Data Quality DOES

Give you another kind of testing you don’t necessarily think about when you’re generating data for tests (because it’s obvious what the inputs are since you generated them)
Stop bad data ruining your day (since data in the real world is messy and you NEED to check it)
Test the real data, and therefore give you a mechanism to make your tests for transformation correctness break in production

Data Quality FRAMEWORKS

Can double as the specification of your data test framework (therefore removing the need to maintain two sets of checks)

Later in the series, I’m going to experiment with implementing a data quality pipeline with lakeFS, a tool for Git-like integration with data lakes.

Again, thanks for taking the time to get this far, and please feel free to leave any comments below or reach out to me on LinkedIn.

The Data Engineering Testing Series

Part 1: Why Great Data Engineering Needs Automated Testing
Part 2: The Keys To Unlock TDD For Data Engineering
Part 3: The Test Pyramid and Data Engineering (with Julia)
Part 4: What Is Data Quality Really?
Part 5: Machine Learning Is The Future Of Test Data

What Is Data Quality Really? was originally published in Cognizant Servian on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Test Pyramid, Data Engineering, and You!

David O'Keeffe — Wed, 10 Feb 2021 01:58:08 GMT

The Data Engineering Testing Series

Writing your first test can be a daunting moment. Every test comes with a cost and some more than others. You will find yourself asking questions such as:

What do I test?
Do I test the whole system?
Do I test every function?
How many tests do I need?
How many failure scenarios am I satisfied with? If any?

Unfortunately, I can’t explicitly tell you the answer to what is right or wrong for your particular scenario, but hopefully, this post can help you out. I’m going to walk you through the process of creating tests for a simple data pipeline in Julia. Why Julia you ask? Because it’s cool.

The rest of the series continues as below:

Part 1: Why Great Data Engineering Needs Automated Testing
Part 2: The Keys To Unlock TDD For Data Engineering
Part 3: The Test Pyramid and Data Engineering (with Julia)
Part 4: What is Data Quality Really?

The Scenario

In our fictional scenario, we work for a cryptocurrency platform that helps investors track their trades. As part of our platform we’ve hired an external vendor to deliver the following timestamped Arrow DataFrame to us at some regular interval:

20200114_151100_new_investors.arrow

DataFrame(
    names = String[], 
    countries = String[], 
    net_worth = Int64[], 
    holdings = Float64[], 
    y = Int64[], 
    z = Int64[], 
    id = String[]
)

The boss upstairs is chomping at the bit to know how many people from each country are in this DataFrame so they can focus their marketing efforts. They want to keep track of how the number of new investors per country changes over time.

Our data is luckily being delivered in a nice structured format (being Arrow), the timestamp is in UTC, and it should only contain new investors, thus we should not have to worry about deduplication, time zone issues, or deserialization.

Choosing your test scope

Before we jump in, let’s gather an understanding of the theory behind test pyramids. Simply put, there are several levels of testing, and as you move up the levels, the tests become increasingly flaky and expensive. Take building a table, for example: the legs, the bolts, the brackets, and the tabletop are the individual parts, which you then put together to eventually form a whole system.

To test this table, first, you can look at all the parts and make sure they are structurally sound. This would be a unit test; it’s quick and easy to do. You can then start putting the parts together, forming a larger system, and testing that they also work together. Perhaps you pull on the legs to make sure they’re attached to the bolt and the bracket. This is an integration test, which is a slightly more labour-intensive exercise. Finally, you put the tabletop on and make sure you can stand on top of the entire table. This is the final end-to-end test, and the most expensive.

The basic philosophy behind test pyramids is that unit tests are superior to end-to-end tests as they are much faster, have obvious failure points, and are reliable. Thus you should invest in having lots of unit tests, a few integration tests, and even fewer end-to-end tests. In our table scenario, if we’re confident that all the parts are structurally sound and they fit well together, we probably don’t need to stand on it much. Likewise, if our end-to-end test fails, we don’t know whether it’s the bolts, or the bracket, or any other piece that’s the problem until closer inspection.

Pivoting back to data testing, we should look at the levels we can test it in a slightly different way to software. This is because instead of three levels of tests (unit, integration, and end-to-end) you usually have five. These levels are highlighted by layers of the dotted lines in the diagram below.

Full system, including client
Pipeline, including service
Multiple jobs
Single job
Unit/component

In this scenario, we have an application that interacts with a microservice, that streams data to a pipeline with several jobs, which eventually populates a database that the service returns a response with. This pattern is typical of a machine learning pipeline for recommendation engines.

One of my favourite resources on this subject is Lars Albertsson (it’s also his diagram above). He suggests that with data it’s only worthwhile testing at the job to pipeline level. His argument is that unit tests are too volatile because ‘the data’ in data pipelines is often rapidly changing. This level of testing works because each job in a data pipeline is essentially a black-box function, free from external factors, so it shares many of the same advantages of unit testing.

Pure data pipelines are free from external factors

However, let us consider our scenario again, this is where I prefer to follow the advice of Kent Beck, one of the leaders in TDD thinking.

I get paid for code that works, not for tests, so my philosophy is to test as little as possible to reach a given level of confidence (I suspect this level of confidence is high compared to industry standards, but that could just be hubris). If I don’t typically make a kind of mistake (like setting the wrong variables in a constructor), I don’t test for it. I do tend to make sense of test errors, so I’m extra careful when I have logic with complicated conditionals. When coding on a team, I modify my strategy to carefully test code that we, collectively, tend to get wrong. — Kent Beck on Stack Overflow

The scope that’s right for me

Looking at our task, we can immediately see that our pipeline has three tasks and should consist of only one job:

Read in an Arrow file as a DataFrame (unit test)
Parse the timestamp and perform the aggregation (unit test)
Append the new data to an existing DataFrame (unit test)

In a real-life scenario, where the disk could be an object store (S3, ADLS Gen 2, Cloud Storage), our system would also have to address the following issues:

How do we keep track of what Arrow files we have already consumed? (unit/integration test)
How do we manage different versions of our country count history DataFrame? (unit/integration test)
What are we going to use for compute? (integration test)
How are we going to trigger (orchestrate) the compute to do our job? (system/integration test)
How are we going to ensure the integrity of the data we’re consuming? (unit test)
What would we do if the data did not arrive when expected? (integration test)
How do we recognize and recover from failures? (unit/integration test)
How do we deliver the data to the dashboard? (integration test)
How do we keep our data safe? (system/integration test)

We don’t have the time to address all these issues, so I’m going to assume that our job will be running inside a container that is triggered manually, and for our MVP, generating the aggregated dataset is enough. I will use storage inside the container as a proxy for the object-store.

This dramatically reduces the scope of what I think I need to test. Remember, we get paid for code that works; producing value at the end of the day is all that matters. With this in mind, I’m going to go with five tests to get started quickly:

Unit test to aggregate the data to perform the country counts
Unit test to parse the DateTime from the filename
Unit test to add that DateTime as a new column to the aggregated dataset
Integration test to make sure that all three units work together (job level)
Manual test to make sure the Julia container runs to completion when triggered (system level)

Notice how I am not going to unit test whether I am reading and writing the data correctly. I also skipped unit testing whether the DataFrame append command works. I am going to assume that the libraries that are doing these functions for me already work. Additionally, I will get test coverage of this with the integration test. I opted for a manual test of the container because Medium articles don’t come with CI/CD pipelines.
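For readers who don't use Julia, the three unit-level pieces in the list above can be sketched roughly as follows in Python. This is only an illustrative analogue, not the article's code (which is in Julia and lives in the embedded gists); the function names and the list-of-dicts representation are assumptions.

```python
from collections import Counter
from datetime import datetime


def parse_timestamp(filename):
    """Parse the leading YYYYMMDD_HHMMSS timestamp from a delivery filename."""
    stamp = "_".join(filename.split("_")[:2])
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S")


def count_countries(rows):
    """Count investors per country; rows are dicts with a 'countries' key."""
    counts = Counter(row["countries"] for row in rows)
    return [{"countries": c, "count": n} for c, n in sorted(counts.items())]


def add_timestamp(counts, timestamp):
    """Return new rows with the delivery timestamp attached to each count."""
    return [{**row, "timestamp": timestamp} for row in counts]


def test_parse_timestamp():
    result = parse_timestamp("20200114_151100_new_investors.arrow")
    assert result == datetime(2020, 1, 14, 15, 11, 0)


def test_count_countries():
    # Only the column the function needs is present in the test rows.
    rows = [{"countries": "AU"}, {"countries": "NZ"}, {"countries": "AU"}]
    assert count_countries(rows) == [
        {"countries": "AU", "count": 2},
        {"countries": "NZ", "count": 1},
    ]


def test_add_timestamp():
    ts = datetime(2020, 1, 14, 15, 11, 0)
    assert add_timestamp([{"countries": "AU", "count": 2}], ts) == [
        {"countries": "AU", "count": 2, "timestamp": ts}
    ]
```

Each test targets exactly one function and uses the smallest input that exercises it, which is the same shape the happy-path tests in this post take.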

Setting up the project

The first thing we will have to do is set up the repository structure. In Julia, you must create a folder with the intended name of the package, navigate to that folder, then use the REPL to activate the environment. You must then create the following two scripts to enable the package and tests.

/src/.jl
/test/runtests.jl

Julia requires these files with these specific names for it to work. At the end we have the following directory structure:

To help split up the tests into their own scripts I am employing the use of the Jive package. This will allow me in the future to add extra scripts in my test folder that can be ignored when running tests. It also makes the runtests.jl file quite succinct.

https://medium.com/media/f1612b13ba79eed50b47b945c9a04cb6/href

The tests

Since this is the first run I will only write tests for the happy path, that is, the form of what I expect the data to be. So let’s start with the unit tests:

https://medium.com/media/e254ff6425f7eac56aec0f81352fa3a4/href

You will notice a few things here:

Each of the unit tests we identified corresponds to functions in the application.
The tests are extremely simple and exclude unnecessary columns.
Creating DataFrames in Julia is quite a bit more succinct than in other languages. Simply declare the column name as the argument.

Now onto the integration test, which is at the job level. Here, we produce files and save them to disk, testing not only that we read the data correctly, but that we write the right data back. I also threw in a few dummy columns to the input to make sure that they didn’t impact the desired result.

https://medium.com/media/6df4934a469858c92ca5bfdc6345aadf/href

Here I write out the input and the expected output to disk (lines 23–24), call my main function (line 28), make my assertion (line 38), then clean up the results. I couldn’t find a neat way in Julia to implement the usual test setup and teardown functions that are available in suites such as pytest, so if the test failed for whatever reason, the files remain on disk (if you know how to do this please leave a comment below).
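For comparison, here is a rough sketch of the same job-level pattern in Python, where a temporary directory handles the teardown even when an assertion fails. It is an analogue under stated assumptions, not a port of the Julia gist: `run_job` is a hypothetical stand-in for the job, and CSV replaces Arrow so the example needs only the standard library.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def run_job(input_path, output_path):
    """Hypothetical job: read investors, write per-country counts."""
    with open(input_path, newline="") as f:
        counts = Counter(row["countries"] for row in csv.DictReader(f))
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["countries", "count"])
        for country, n in sorted(counts.items()):
            writer.writerow([country, n])


def test_job_end_to_end():
    # The directory and everything in it is removed when the block exits,
    # even if an assertion inside it fails.
    with tempfile.TemporaryDirectory() as tmp:
        input_path = Path(tmp) / "20200114_151100_new_investors.csv"
        output_path = Path(tmp) / "country_counts.csv"
        # The extra column 'y' should not affect the result.
        input_path.write_text("names,countries,y\nann,AU,1\nbob,NZ,2\ncat,AU,3\n")

        run_job(input_path, output_path)

        with open(output_path, newline="") as f:
            assert list(csv.reader(f)) == [
                ["countries", "count"],
                ["AU", "2"],
                ["NZ", "1"],
            ]
    # After the block, nothing is left on disk.
    assert not Path(tmp).exists()
```

Julia's `mktempdir` also accepts a function (the `do` block form) and removes the directory once that function returns, which may be one way to get similar cleanup there.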

The Country Count Application

As we have planned out our tests, we now want to match our functions with the functionality we are seeking to achieve.

https://medium.com/media/a0aeba098b7270ef7dea7742a418774a/href

This “job” script contains six different functions. Notice how I could have put all this code inside the main function easily enough, but to test the individual pieces we have to split them up into their own functions.

Finishing Touches

Now let’s create our final manual test, which will be just running the test suite to completion. This will test whether we’ve set up our Julia project correctly.

https://medium.com/media/5affd86376fb0c3550340abba3761519/href

To run, simply navigate to the CountryCountJob folder in your console, then run docker-compose run julia-data-testing. Docker will take care of most of the heavy lifting, and we will be left with the following output.

Beautiful!

The devil of structural coupling

The savvy software architects out there will have noticed that we are coupling the structure of our program to our tests. This is generally considered to be an anti-pattern in software development, as you now need to change your tests when you change your program, which of course contributes to test fragility.

However, keep in mind that we are working with data pipelines, and while they share similarities with software projects, they’re not necessarily the same thing. Remember that at the job level we are essentially working with singular black-box functions, so we are not dealing with an intense amount of coupling. If it is a deal-breaker for you though, you may forgo the unit tests and just have an integration test (Lars’s suggested pattern).

The main contributor to test fragility in data is usually when jobs share the same changing data source. This is where we bump into one of the main sticking points of this approach… when the real data changes, how do we make the right tests break? That will be next time.

Thanks for taking the time to get this far and please feel free to leave any comments below or reach out to me on LinkedIn.

The Test Pyramid, Data Engineering, and You! was originally published in Cognizant Servian on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Keys To Unlock TDD For Data Engineering

David O'Keeffe — Tue, 22 Dec 2020 03:03:23 GMT

The Data Engineering Testing Series

Testing can be evil.

In Part 1 I hope you were convinced that automated testing is something worthwhile investing in for your long term sanity. If you agree with me that putting testing in the front seat is a good idea, and that our jobs are easier when we write the tests first (which is Test Driven Development or TDD), now we have to give some thought to how to actually do it, which, to be honest, can be quite daunting.

A Deeper Dive into the Data Testing Problem

Before continuing further, it’s helpful to get an understanding of why working with data adds more complexity to the testing process. The main issue is that data is, quite literally, the essence of state. It is for this reason that the programs that we are producing are always tightly coupled to the data. If the data changes in any significant way, so will your code.

This poses the greatest challenge in regards to testing because the test data has to be representative of production data. A change in production will cause a change in test that needs keeping in sync. In a normal “given this input, assert this output” test scenario, propagating such changes across your test suite is time consuming at any decent scale. This is because you have to modify three parts upon a change: the input data, the output data, and your code.

Now let’s consider how this complicates things through three different lenses:

Traditional databases
Big data
Security

Traditional Databases

When working with monolithic data pipelines every change causes the entire system to be rebuilt. This makes keeping up with state changes in multiple environments especially time consuming and costly. Given it can take a long time (usually hours) to replicate state, you can encounter issues with it going out of date quickly by the time it switches from the development to the testing stage.

Databases are commonly shared in these environments, which means that the changes are not in isolation, which can be a nightmare to manage with more than one developer. Most modern data platforms use EL(T) instead of ETL partly for this last reason, since it solves the problem of coupling the extract and the transform stages together. Pipelines are free to exist in isolation and therefore can be updated and created easily by multiple developers.

Copying multiple databases across environments has its issues unfortunately.

Big data

Big data primarily suffers from a replication problem, as it’s generally infeasible to duplicate all of production in test, not to mention that if we did, running the tests would be astronomically expensive and slow. We also have to contend with the fact that it’s “big”, and thus there are potentially so many test cases that it’s a huge task to get decent test coverage.

The problem is not intractable though, as at the end of the day, when you’re developing on big data you always distill it down into forms that can be rationalized. For example, you might aggregate a field based on its name and type, which are truths about that data which rarely change. When you encounter a new case in this scenario, you debug it in production, and then create the appropriate test. That way with time you build coverage.

Security

Usually a level of isolation is required for production data. This means that you probably have to obfuscate and encrypt the data in some way to copy it across. The process isn’t foolproof as you run the risk that production data could leak into non-production systems. Often this is actually a show stopper because many teams would prefer just to develop in production rather than go through this process on a regular basis.

Once you’ve copied the data across, you now have to deal with the pain of developing with fields that look nothing like they do in production. This is more of a problem for the data scientists of this world, as the exploratory process is one they revel in, but for data engineers it is more difficult to verify whether certain transformations have been applied correctly.

Test Data Is Code

Unfortunately there is no way around the fact we need to keep state around to test our workloads. As alluded to before, keeping state in file or database based systems is difficult, mainly because you can’t easily keep track of how the data should change when it’s hidden behind a layer of abstraction (i.e., you need to open the file). But we can avoid the traps of checking in files to source control or copying our data from one environment to another by generating the test data at runtime instead. This comes with a multitude of benefits:

With this method it is a lot easier to refactor your test suite with any number of changes.
Privacy and security concerns are mostly alleviated because the data isn’t real.
You can now shape the data to scale with the type of test you desire. For example, say you’re testing a streaming pipeline, you can generate one event to test that you handle it properly, and one million events to test the system processes it in a timely manner.
You can generate new test cases that you may never see in production, therefore buying you more test coverage.
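As a minimal sketch of what generating test data at runtime can look like, the Python below builds synthetic events from a seeded generator and feeds both a single event and a larger batch through a stand-in `process` function. The event fields and the function names are assumptions for illustration only.

```python
import random
from datetime import datetime, timedelta


def generate_events(n, seed=42):
    """Yield n synthetic events lazily so large volumes stay cheap to hold."""
    rng = random.Random(seed)  # fixed seed keeps test runs reproducible
    start = datetime(2020, 1, 1)
    for i in range(n):
        yield {
            "id": i,
            "timestamp": start + timedelta(seconds=i),
            "amount": round(rng.uniform(0, 100), 2),
        }


def process(events):
    """Stand-in for the pipeline under test: count events and sum amounts."""
    count, total = 0, 0.0
    for event in events:
        count += 1
        total += event["amount"]
    return count, total


# One event checks that a record is handled at all; a larger batch checks
# that volume is handled. Raising n to one million is a one-argument change.
single_count, _ = process(generate_events(1))
bulk_count, _ = process(generate_events(100_000))
```

Because the data is defined in code, changing its shape is a refactor in source control rather than a file or database migration.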

A schematic for a typical testing framework pattern. From Zander, Justyna & Mosterman, Pieter & Schieferdecker, Ina. (2008). Quality of test specification by application of patterns.

With generation, depending on the type, it does make the job of using the test oracle (the part that does the assertion) more difficult. If you are creating random data at runtime, then how can you assert the result of something that is unknown? This can be solved by limiting the scope of the tests and only doing full record comparisons rarely. For example, you could verify that the structure of the data is the way you want it to be, or that the row counts are as expected. Another fancier solution would be to pass the randomly generated data through a function in your test suite that creates the assertion, which should work nicely for testing aggregates like averages or sums.
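The "function that creates the assertion" idea can be sketched as below. This is a hypothetical example: `pipeline_average` stands in for the transformation under test, and the oracle is an independent computation over the same randomly generated rows.

```python
import math
import random
import statistics


def pipeline_average(rows):
    """Stand-in for the transformation under test."""
    total = sum(row["holdings"] for row in rows)
    return total / len(rows)


def test_average_on_random_data():
    rng = random.Random(7)
    rows = [{"holdings": rng.uniform(0, 1000)} for _ in range(500)]

    # Oracle: an independent computation of the expected value.
    expected = statistics.fmean(row["holdings"] for row in rows)

    result = pipeline_average(rows)
    # Compare with a tolerance, since the two sums may round differently.
    assert math.isclose(result, expected, rel_tol=1e-9)

    # Cheaper structural checks that hold regardless of the random values.
    assert len(rows) == 500
    assert all(set(row) == {"holdings"} for row in rows)
```

The trade-off is that the oracle is itself code that can be wrong, so it is worth keeping it simpler than, and independent from, the code it checks.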

That said, it is okay to explicitly generate input and output for certain functions that must have it. This is because it’s far easier to refactor consistently in source control than it is to fiddle around with files or databases. The best practice in general is to try to limit the number of fields in the test record to only those that are needed for the test to pass. In this way you reduce the chances of any one change breaking more tests than it should.

What runs the tests?

Most programming languages have a variety of test frameworks available. Python has pytest, Scala has ScalaTest, Java has JUnit, and Julia and Golang have testing built straight into them. They all have similar features: a series of test oracles, the ability to spin up test fixtures (used to initialize the system for the tests), and various ways to produce a report at the end. I recommend utilizing them as the foundation of your testing suite.

A beautiful test harness diagram. Thanks Lars!

If you’re stuck for ideas on how to implement a testing suite, fortunately most open source projects have test suites built into them. Delta Lake has test fixtures that create the Spark Session, and it also has the patterns to make assertions on Spark DataFrames available in Python and Scala. It’s relatively simple to use the same patterns for your own purposes.

In my experience the most challenging aspect of setting up the test harness is creating the fixtures in a way that is representative of production. Having a decent understanding of containerization is an obvious benefit here. Unfortunately, if you’re using most cloud based PaaS tools, the chances are you won’t have any fake implementations on offer to add the power of mock objects to your suite. I have spent countless hours fiddling with containers that interface with these tools and the result is often a brittle (and slow) test suite that needs constant maintenance. I believe this is mostly by design so that the cloud vendors lock you in, but it does come with the benefit that you’re running your tests on the same metal as production. However, if you are in AWS you are in luck, as there is moto. You can always create your own mocks if you’re working with Azure or GCP.

Keep in mind that any test framework you use will essentially lock you into it. This is because you will have to refactor all your tests and all your fixtures to switch frameworks. It can become a problem as your testing suite should have all the same version dependencies as production. Be wary of this if you choose to use a third party test framework since you will be pinning most of your production dependencies to theirs. For example, if you use spark-testing-base and your plan is to upgrade to Spark 3.0.1 right now, then you won’t be able to until the change is made in that project or you switch frameworks.

A scratch on the surface

In review, the crux of the issue is that data is stateful, which causes a bunch of problems in test representation, therefore making data testing difficult. We can lower the burden of this by representing our test data as code, because code is easier to change than data.

The backbone of your testing suite will be one of the built-in testing frameworks in your language of choice. You then have to contend with how to set up the test fixtures required for your system to run. If you’re using any PaaS tools for your workloads then you will need to make the trade-off between mocking your PaaS objects and actually using them.

This leads nicely into the next post, Part 3, how to deal with the famous (or rather infamous) test pyramid in the data context.

The Data Engineering Testing Series

Part 1: Why Great Data Engineering Needs Automated Testing
Part 2: The Keys To Unlock TDD For Data Engineering
Part 3: The Test Pyramid and Data Engineering (with Julia)
Part 4: What Is Data Quality Really?
Part 5: Machine Learning Is The Future Of Test Data

The Keys To Unlock TDD For Data Engineering was originally published in Cognizant Servian on Medium, where people are continuing the conversation by highlighting and responding to this story.

Why Great Data Engineering Needs Automated Testing

David O'Keeffe — Thu, 26 Nov 2020 00:30:02 GMT

The Data Engineering Testing Series

Welcome to the start of The Data Engineering Testing Series, a symposium of articles on the dilemma of bringing automated testing to the masses. In part one, I’m going to explain why automated testing can be one of the most powerful tools available in a data developer’s toolkit.

Tests are great

Most people who have spent some time as a software developer are familiar with the joy of a freshly passing build. You are now brimming with the confidence that the new piece of functionality you’ve just written is doing what you thought it would and it isn’t breaking anything else either.

Feels so good!

However, for a lot of people I’ve met in the industry, it appears this experience is rather rare. One could suppose this is because the data industry (particularly in Australia) is still lagging behind in terms of DevOps adoption. But this isn’t completely without good reason:

The fact we’re working with data and code adds an extra layer of complexity
It is difficult to get an accurate representation of production data in non-production systems
We often share environments: if we have tests and they fail — is it the data, the code, or the environment that has changed?
Data pipelines in production can be non-deterministic (think ML model training)
Testing skills are typically not taught alongside data skills, so few know this is a thing that even exists

It’s an unfortunate reality that implementing solutions to these problems requires time, and developer time is money. It is difficult enough in a software engineering context to get buy in to build a test suite because the business problem is not immediately apparent. Throw in the fact that it’s considerably harder to test data workflows and it’s easy to see how we end up here.

But inevitably without automated tests, these things start to happen:

You become afraid of changing your code
You forget about prior functionality
The code rot sets in
Production releases turn into rare and often traumatic experiences

Until one day your manager presents this terrific chart to their manager. It’s one that strikes fear into the hearts of higher management. The budget forecast is blown and the rest is a classic story for the ages.

Uh oh…

Think Testing First

The unfortunate situation above is one many organizations find themselves in. Stuck in the mud after years of development, they’re unable to innovate quickly, and the engineers are stressed. Employee turnover is high, new data products take years to produce, and the competitive advantages sitting in their data are unrealized.

It’s arguable that if automated testing were a first class citizen, a large part of this problem could be avoided. This is because automated testing increases code quality dramatically and is a pillar of continuous integration. In the long run the time savings will enable developers to deliver more value.

It is even possible to couple testing with your data strategy. If you are using a data defense and data offense strategy for example, you can write tests to make sure compliance is kept, while also checking that you integrated those sources correctly for your interactive dashboard.

The problem with automated testing is that the up front cost of setting up is high. Furthermore, when working with the cloud, PaaS components do not work locally, and to integrate them as test fixtures can be really challenging. Data and PaaS are a deadly duo in this context: you’re on the cloud because you need to scale, but ultimately you can’t scale properly because you can’t write tests, and you don’t have the infrastructure skills to do it either.

If you believe that most data projects are just like software projects, this cost is completely justified. The long term consequences of not doing so are considerably higher. This is nicely summed up in the graphic below.

From https://www.karllhughes.com/posts/testing-matters

My introduction to testing

I had the pleasant experience early in my career of encountering one of the most gnarly databases known to mankind. I would look through a table, glance at its hundreds of mysterious columns, make an assumption about its contents, then write my ETL code accordingly. I would run it in pre-production, see that things looked about right, and push it, only to find out months down the track that an assumption I had made was incorrect, or that there was an edge case deep within its shadows.

In fact there were so many edge cases lurking about that I could not possibly catch them all. I would spend an enormous amount of time exploring the data trying to find them. Soon enough I would forget one or another. I realized the only way to cover myself was to attempt to document every assumption I made. How did I document it? I discovered Gherkin.

A Typical BDD Scenario

As data pipelines are essentially black box functions I would craft the input and assert the output. No more guess work required and the business could easily confirm what I was doing was correct. Then it was simply a matter of slamming those datasets into a test framework and calling the right functions.
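To give a feel for the approach, here is a small plain-Python sketch of turning Gherkin-style tables into input and expected output for a black-box function. It is not the framework described in this post (that one fed PySpark); `parse_table`, `count_by_country`, and the example tables are hypothetical.

```python
def parse_table(text):
    """Turn a pipe-delimited Gherkin-style table into a list of dicts."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def count_by_country(rows):
    """Stand-in for the pipeline under test."""
    counts = {}
    for row in rows:
        counts[row["country"]] = counts.get(row["country"], 0) + 1
    # Table cells are parsed as strings, so the count is emitted as a string.
    return [{"country": c, "count": str(n)} for c, n in sorted(counts.items())]


# "Given" table: the crafted input.
GIVEN = """
| name | country |
| ann  | AU      |
| bob  | NZ      |
| cat  | AU      |
"""

# "Then" table: the output the business agreed on.
THEN = """
| country | count |
| AU      | 2     |
| NZ      | 1     |
"""

assert count_by_country(parse_table(GIVEN)) == parse_table(THEN)
```

Since the tables are readable by non-developers, they can double as the documented assumptions about the data.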

I knew exactly what to expect. It was a liberating moment for me that alleviated so much of my stress and gave me a lot of confidence. It came at the added benefit that I no longer had to sit there and wait for expensive clusters to churn my data as well. The tests ran fast and I could develop entirely on my local machine. Every new edge case became another test in the suite. Finally, the beast could be tamed.

There was a slight caveat: it took me almost two weeks to implement a test framework that would deserialize Gherkin tables and pass them through PySpark. After that, I spent many more weeks writing tests and refactoring what I had already written so it would work.

It was difficult to convince management to let me temporarily pivot from my usual development activities. Luckily they did and I consider it to be one of the most eye opening experiences I’ve had thus far in my career. I discovered that it’s possible to test your data pipelines and save a ton of time and money in the process.

Where to from here

I am always amazed when I come across SQL scripts that are hundreds of lines long, let alone thousands. Only wizards can manage to concoct such a thing and verify that it works properly. I’m hoping that through this series we can show how to simplify that process and help generate a testing culture in the data industry.

As data engineers and scientists, just like software developers, we should take testing seriously because it can save us a lot of trouble in the long run. It is a key enabler of continuous delivery and thus businesses will appreciate it too.

But there is a dark side to this approach. With my new mentality as a testing zealot, I moved onto a new project, a much bigger and more complicated one. I demanded they needed automated testing, so I wrote a test suite for them. It took a whole month, and what I came up with was ultimately extremely costly. The developers in my team didn’t take it up. In fact, they hated it: it forced them to work in a new way they weren’t ready for, and they went out of their way to throw it in the bin.

The next post will dig deeper into why testing data is so hard, and how you can get a start on the automated testing process. The cliff-notes are: it’s possible, but prepare for a decent amount of pain.

The Data Engineering Testing Series:

Part 1: Why Great Data Engineering Needs Automated Testing
Part 2: The Keys To Unlock TDD For Data Engineering
Part 3: The Test Pyramid and Data Engineering (with Julia)
Part 4: What Is Data Quality Really?
Part 5: Machine Learning Is The Future Of Test Data

Why Great Data Engineering Needs Automated Testing was originally published in Cognizant Servian on Medium, where people are continuing the conversation by highlighting and responding to this story.