Photo by Lily Banse on Unsplash

How Fresh is Your Data?

Why Bad and Stale Data is Keeping You From Delivering More Value

Phil Goerdt


Each year I push aside whatever I’ve been reading and revisit a few books that are important to me personally, as well as to my career. I was recently re-reading (or maybe just skimming and reading the resource guide of) the fantastic book The Phoenix Project. For those of you who are unaware, it walks through what DevOps is, how to implement it, and offers lots of great resources and ideas for practicing it.

The premise of the book really distills down to three points, which the authors call The Three Ways. I’ll avoid giving you a poorly summarized book report, but in essence these ways focus on optimization, automation and measurement, and team culture. Those ways are:

  1. Maximizing the flow of work from Dev teams to IT Ops to the customer.
  2. Constant feedback loops from the customer to IT Ops and Dev.
  3. Creating a culture of experimentation and mastery of skill.

As you can imagine, there is a lot more to it than that, and I highly recommend you check out the book. Until then, we’ll be working with this framework of DevOps throughout this blog post so we’re all on the same page.

While reviewing the resource guide I was reflecting on a number of the client projects I’ve worked on and how we could have worked better by implementing some of these ideas, along with the ones that so commonly complement DevOps, such as Agile/Scrum/Kanban, Continuous Integration/Continuous Delivery (CI/CD), and Infrastructure as Code (IaC). In some instances there were clear-cut places where we could have made improvements. In others… it was a bit more challenging to see how we could have done things differently. Why was that the case?

I’m doing this right… right?

Many organizations have taken the plunge on Agile, DevOps and even Continuous Integration and Continuous Delivery (CI/CD) when building their software products. This is generally a good movement, and I think we can all agree that implementing some degree or flavor of these concepts is a good thing. Why, then, is it so hard to develop anything when it’s data related? What are the challenges that prevent us from really having those types of frameworks for data? Confused as to what I’m talking about? Let’s start with a story…

Not long ago I was working with a client that was running into this situation: they had developed, tested and deployed some stored procedures and ETL jobs to improve performance. Everything was great, and the client was excited that performance in development and test had improved. But once the updated procedures and jobs were deployed to Production, there wasn’t any improvement. In fact, performance got worse!

After some initial research, the plan of action was to refresh the test database with production data so that we could better diagnose the problem and do one-to-one comparisons between the environments. As you might guess, this type of refresh usually requires request forms, coordination between a number of teams, scheduling around conflicts and other commitments, and, of course, a lot of time to actually perform the data migration.

Is this a one-off scenario? Definitely not. I’ve seen situations like this play out numerous times over the years, across a variety of clients and situations. It might be stale data in a dev or test environment used for reporting. Another time it could be rules that haven’t been defined, or that need to be tweaked in production to accommodate other changes. Chances are that if you’ve worked on any data-related project, you have run into something like this yourself.

Yeah, yeah. Data is tough. So what?

The crux of the issue above isn’t that developers don’t know how to do their jobs, or that the business or customer teams aren’t giving requirements that can be met in one shot. The real issue at hand is that (in almost all cases) we aren’t developing with parity. What do I mean by that? Let’s talk through a simple example together.

Let’s pretend we are starting a project together and we have three environments to work with: Development, Test and Production. Like any good developer, we build our code in dev, and then test it, and finally deploy it into Production. Pretty straightforward, and here is what it looks like conceptually:

Three environments? How do you get anything done?!

You can imagine this works pretty well, and it usually does. But we’re only seeing how the code flows throughout these environments. A more complete picture would look like this:

Wait… what?

What is actually happening is that we have two work streams going on: one for application code, and one for database code. These usually happen in tandem, each pointing at the other. As work is completed in Dev, it migrates to Test, and then again to Prod. However, you’ll notice that the data is separate in each environment. Hold on to that for later.

But let’s stop to think about what is actually going on in Dev for a minute. We probably have several developers working on several different things at the same time. Maybe we have all of these types of work being done:

Sign me up for “Other fun stuff” please!

Since we have one database to support all of these different types of work, we can start to imagine (or maybe even remember, for those of us who have been around the block a few times) scenarios where we have some conflicts. We may have the data integration team delivering new data into the database, and that data messes up something the report folks are working on. Or maybe the data architects are revamping some table structures, and that has impacts on the metadata folks. Simply put, this is a less than ideal situation.

We also have a problem in that the dataset in Dev isn’t representative of what is happening in Prod. Test has the same issue, as it also is not getting data from Production. Let’s pretend that our refresh schedule is something like this:

Mmm. Sounds delicious… I’ll have the data sampler platter, please!

This refresh schedule would be called aggressive by some of the clients I have worked with. One of the reasons these refreshes happen so infrequently is that they take so long to complete and can require a ton of coordination.
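If you want to put a number on that staleness, a quick check goes a long way. Here is a minimal sketch in Python, assuming each environment is reachable through a standard DB-API connection and that tables carry a load timestamp column; the connection objects and the updated_at column name are illustrative assumptions, not any particular tool’s API.

```python
# Minimal "how fresh is my data?" check across two environments.
# Assumes DB-API connections and an updated_at column (both placeholders).
from datetime import datetime

def max_timestamp(conn, table: str, ts_col: str = "updated_at") -> datetime:
    """Latest load/update timestamp for one table in one environment."""
    cur = conn.cursor()
    cur.execute(f"SELECT MAX({ts_col}) FROM {table}")
    return cur.fetchone()[0]

def report_staleness(prod_conn, dev_conn, tables: list[str]) -> None:
    """Print how far Dev lags behind Prod, table by table."""
    for table in tables:
        prod_max = max_timestamp(prod_conn, table)
        dev_max = max_timestamp(dev_conn, table)
        lag = prod_max - dev_max  # assumes the driver returns datetime objects
        print(f"{table}: prod={prod_max}, dev={dev_max}, lag={lag}")
```

Run something like that against your own environments and you will quickly see how far behind Dev and Test really are.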

So, what have we established by walking through this simple example?

  1. We’re tripping over each other in the Dev environment because we are all trying to accomplish different tasks.
  2. Dev and Test data is usually stale and not representative of what we see in Production. This means our code is not always accurate.

These are pretty big hurdles to tackle. Fortunately we can take some lessons from the DevOps mindset and toolkit.

But change is hard!

Now that we understand the issues holding us back, how can we come up with a solution that solves those problems without creating new ones?

There are many different ways we could try to solve this. Some might say to use our current process for refreshing, but change the refresh schedule to weekly. Others might say to use replication tools to replicate all of the production data. Those are both valid in the sense that they will work and solve some of the issues. But what if we took inspiration from other developers?

Consuming versus preserving

I’ve talked with some developer friends of mine, and one of the things they love about cloud computing is its fluidity. Here is a great example. Let’s say you are creating a VM (or a container, if you insist) that will be used as part of a larger application. You notice that it isn’t working right, but you aren’t sure which of the last few changes you made caused the issue. You can either spend an unknown amount of time trying to debug it, or throw it away and revert to a VM (or container) snapshot that doesn’t have those changes (allowing you to test as you go).

I know which one of those options I would choose in that situation. Nine times out of ten I would throw the broken one away and retry. Why? It’s easier than going through debug hell and back, and because I have a higher confidence that I will get it right the next time around and not mess anything else up.
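To make that concrete, here is a minimal sketch of the “throw it away and start clean” workflow using the Docker SDK for Python. The baseline image name is an illustrative assumption; the point is simply that recreating a known-good environment is one call, not an afternoon of debugging.

```python
# Disposable environments: discard the broken copy, restart from a baseline.
import docker

client = docker.from_env()

def fresh_environment(baseline_image: str = "my-app:known-good"):
    """Start a new container from the last image we trusted."""
    return client.containers.run(baseline_image, detach=True)

# An experiment went sideways? Don't debug the broken copy...
broken = fresh_environment()
broken.stop()
broken.remove()

# ...just start again from the baseline and try the next change.
clean = fresh_environment()
```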

So, why don’t we work that way with data? This is partly because of the way that we’ve historically thought about data. It’s always been hard to move or share data. If you mess it up, it is usually incredibly hard to clean up. In short, the stakes are higher. Well… they were higher. I don’t believe that is the case anymore.

The fact is that a cloud-based reality means we don’t need to be constrained by those old ways of thinking. And if you’re in the cloud and still thinking like that, you’re really missing out on what this paradigm shift is offering.

I’m lost… What do you mean?

Remember all of those different things we were doing in Dev before? Imagine an environment where we all can have the same starting point, and also have our own space to work. Sounds nice, right?

I must be dreaming…

This type of environment partitioning allows us to separate our activities and have better control and understanding of our development impacts. Many developers have traditionally used local environments to build and test their application development work. Once they are done with their development, those features are committed into some kind of source control and merged. Only then are these changes migrated to testing.

This has always been hard for data developers, for the reasons we mentioned above. But with this model, we allow our report builders to be unaffected by the data integration folks, and the application engineers to be unencumbered by database changes being developed by the architects and DBAs. Each of these groups can work within their own swim lane, meaning they don’t need to wait for ETL to finish running, for example.
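As a rough sketch of what those swim lanes could look like, here is one way to stamp out a per-team copy of the shared database, assuming a platform that supports zero-copy cloning (Snowflake’s CREATE DATABASE … CLONE is one example). The team names and the run_sql() helper are placeholders for illustration, not a specific product’s API.

```python
# One isolated copy of the shared database per team, assuming zero-copy clones.
TEAMS = ["reporting", "data_integration", "architecture", "app_dev"]

def run_sql(statement: str) -> None:
    # Placeholder: send the statement through whichever client your platform uses.
    print(f"executing: {statement}")

def create_team_clones(source_db: str = "ANALYTICS_PROD") -> None:
    """Give each team its own swim lane, cloned from production."""
    for team in TEAMS:
        run_sql(f"CREATE OR REPLACE DATABASE {source_db}_DEV_{team.upper()} CLONE {source_db}")

create_team_clones()
```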

The other critical aspect of this is that the data is as close as possible to what we would see and expect in Production. This reduces our chances of developing something that doesn’t reflect reality because we didn’t take into account some edge case that wasn’t present in stale data. All too often the data would be radically different between these environments, leading to a whole host of problems.

That’s nice in theory…

You may think that this is all theoretical, and that there is no way you can actually implement something like this yourself. Again, I’m going to prove that is not the case. Because of the many tools now available to us, in conjunction with cheap storage and compute options in the cloud, this is a relatively simple fix in technical terms. Simply put, we can automate those refreshes on a timed basis with only a marginal impact on overall spend, depending on our tooling. And the case can be made for higher ROI when we allow people to deliver better quality code and data faster and more smoothly.
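To give a flavor of what “automate those refreshes” can mean, here is a minimal sketch that reuses the hypothetical clone statement from above. The database names are assumptions; in practice you would trigger this from cron, an orchestrator such as Airflow, or your platform’s own scheduler (for example a nightly “0 2 * * *” job).

```python
# Automating the refresh on a timed basis rather than by request form.
from datetime import datetime

def refresh_environment(env: str, source_db: str = "ANALYTICS_PROD") -> None:
    """Replace one environment's database with a fresh copy of production."""
    statement = f"CREATE OR REPLACE DATABASE {source_db}_{env.upper()} CLONE {source_db}"
    # Placeholder execution: swap in your platform's client call here.
    print(f"{datetime.now():%Y-%m-%d %H:%M} executing: {statement}")

for env in ("DEV", "TEST"):
    refresh_environment(env)
```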

My next two blogs will show off two examples of how to do this, and how you can move to a more productive and higher quality data development lifecycle.

Phil Goerdt is the founder of Erteso, a consultancy that focuses on cloud and data solutions. You can contact him at phil.goerdt@erteso.com.
