Is Your Data in a Lake or Water Treatment Plant?

Why data isn’t that different from other consumable resources

They say to write about what you know. For the first part of my career and through my formal education, my work was all based in the world of water and sanitation. It’s glamorous stuff- pipes and pumps, fecal coli-forms and treatment options, toilets and activated sludge. It made great dinner conversation (or so says my wife). At some point, I made the transition from a water guy who worked with a lot of data to a data scientist who knew about water. And as I dive deeper into the analogies and lingo of modern data systems and tools, I lean on my engineering training to make sense of it all.

The hilarious (read:cringe-worthy) jokes in the board room by executives about “let’s just send divers into the data lake” really misrepresent what a data lake is — and more broadly the entire process of churning raw data into meaningful insights. And as we strive to improve data literacy for everyone we work with, we really need to find a better analogy.

To start, our data is like water. It brings sustenance to our businesses and the life blood of our organizations. But our water can be full of sh*t and bacteria hidden to the naked eye. Lakes are ecosystems with tributaries and streams that feed into them. They can be used for fishing, water skiing, drinking supply or just a nice thing to look at. Which is true of a lot of our data.

Maybe your data lake is the reservoir where things are cold-stored. However, lakes don’t deliver water to users. They don’t deal with water quality. They can be a single part of the larger delivery network. But what if your data comes from groundwater? What if it’s a rainwater collection system as data drops from the sky into your business? We need to embrace the entire system that takes our raw resource and delivers it through pipes and systems to end users.

Data Lakes are actually water treatment plants.

The boardroom picture of a scuba diver in a lake does no justice to the complexities of making a data lake functional in today’s business intelligence and machine learning world. As @OneAngryPenguin explained during a recent overview of the Azure Data Lake Store, it’s really three components/parts under the hood: Raw Data Storage, Data Cleaning & Transformation (ETL), and Processed Data. Just like a multi-stage water treatment plant. We take in whatever raw data we can, clean it up, and send it out for consumption by households, businesses, or the public. Primary, secondary, tertiary treatment- we scrub out Personally Identifiable information, filter for nulls and missing values, and bring together data from different sources to meet the supply demands downstream.

Created by Ben Mann

Data architects and system engineers are the designers and city planners who help draw up the infrastructure and route water to new neighborhoods. DBAs, SysAdmins, and developers are the O&M utility crew who maintain the pipes and keep the water flowing. Data Scientists are the water quality team, helping clean and pull insights out that customers drink. Everyone has a role to play in keeping data flowing from source to tap.

Tools like Hadoop, SalesForces, or MongoDB may seem complicated to anyone who doesn’t work with them daily, but our data systems really are eerily similar to other municipal infrastructure we’ve dealt with for decades and know all too well. And just like when we don’t plan/govern/regulate water systems, our data systems can end up with price-gouging private wells and data-hogs who tap aquifers dry. Data Governance teams make sure we’re consistent in how we treat the water and ensure equity in access for all. They keep the rest of the process in check. Like the EPA and city inspectors, our data needs quality assurance checks and periodic audits to make sure no one will get sick from our insights and that our businesses won’t die of water borne diseases.

Ultimately, no one wants to drink raw sewage, right? End users want clean water, in their kitchen, available at a moments notice from the tap. They don’t need access to the main trunk line- they need fit-for-purpose data and insights that are accessible and usable. The effectiveness of the water treatment plant is judged by those at the end of the line- be it a VP, project manager, client, or major distribution hub Data Warehouse. Some of our data will be used for cooking or for putting out fires. The route it takes may vary. But the principles to get raw data through the system to safely meet demand are almost always the same.

Because in the end, our data is just like water.