Preventing the Data Lake Abyss

How to ensure your data remains valid over the years.

Scott Haines
97 Things
3 min read · May 22, 2019



Everyone has worked under the wrong assumptions at one point or another in their career, and nowhere have I found this more apparent than with legacy data and much of what ends up in most companies’ data lakes.

The Origins of the Data Lake

The concept of the data lake evolved from the more traditional data warehouse, which was originally envisioned as a means to alleviate the problem of data silos and fragmentation within an organization. The data warehouse achieved this by providing a central store where all data could be accessed through a traditional SQL interface and common Business Intelligence tools, as long as the data had been loaded ahead of time. The data lake, while commonly coupled with data warehouse technology, is simply an elastic storage solution that lets data producers dump raw data, structured or unstructured.

The initial value proposition for the data lake was the belief that “the intrinsic value of an organization’s data is higher than the cost of storing that data in the cloud (or on-premises via HDFS).” However, not all data that ends up in the data lake will be used, and worse, some of it will sit unused for years.

Polluting the Data Lake

Over the course of many years, what begins with the best of intentions can easily become a black hole for your company’s most valuable asset: its data. The most obvious culprit is slowly changing, non-backwards-compatible mutations to the structured data being stored over time. Effectively, a breakdown in data management processes renders older data corrupt and unusable.

This problem seems to arise from three main sources:

  1. A basic lack of ownership on the part of the team producing a given dataset.
  2. A general lack of good etiquette, or data hygiene, when it comes to preserving backwards compatibility with legacy data structures.
  3. Starting off on the wrong foot when creating a new dataset, without support from Data Modeling or Data Engineering experts.

I have found that a simple approach to solving these data consistency issues comes from establishing what I call data contracts.

Establishing Data Contracts

The ever-growing need for quality data within an organization almost demands the upfront establishment of data contracts. This means going further than simply adhering to and providing a schema for your data: it also means generating a plan, or story, for which fields exist when, where, and why.

This contract is your data’s API and should be updated and versioned with every change to your producing code. Knowing when a field will or will not exist saves people time and reduces frustration.
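As a concrete illustration, here is a minimal sketch of what such a contract might look like as an Avro schema kept next to the producing code. The `UserEvent` record, its field names, and the version string are hypothetical; the point is that every field is documented with when and why it exists, and the schema is versioned in lockstep with the library that writes it.

```scala
import org.apache.avro.Schema

// Hypothetical data contract for a "UserEvent" dataset, versioned with the producer.
object UserEventContract {
  // Bump this with every schema change, in lockstep with the producing library's release.
  val version = "1.2.0"

  // Every field documents when it exists and why; new fields carry defaults so that
  // records written under older versions of the contract remain readable.
  val schema: Schema = new Schema.Parser().parse(
    """
      |{
      |  "type": "record",
      |  "name": "UserEvent",
      |  "namespace": "com.example.contracts",
      |  "doc": "Emitted on every user interaction (since v1.0.0).",
      |  "fields": [
      |    {"name": "userId",    "type": "string", "doc": "Stable user identifier (since v1.0.0)."},
      |    {"name": "eventType", "type": "string", "doc": "click | view | purchase (since v1.0.0)."},
      |    {"name": "timestamp", "type": "long",   "doc": "Event time in epoch millis (since v1.0.0)."},
      |    {"name": "sessionId", "type": ["null", "string"], "default": null,
      |     "doc": "Added in v1.2.0; null for records produced before that."}
      |  ]
      |}
      |""".stripMargin)
}
```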

From Generic Data Lake to Data Structure Store

Taking this idea to the next level requires the data-producing team to provide compiled libraries that standardize how their data is produced and consumed. These libraries should follow engineering best practices, including unit tests that ensure full backwards compatibility across changes to the underlying structured data.

Two commonly used (serializable) structured data frameworks are Apache Avro and Google Protocol Buffers. Both allow you to define your data schemas in a platform- and language-neutral way, giving you the type safety you will never have with traditional JSON.
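To make the backwards-compatibility test mentioned above concrete, the producing library can encode a record with the previous release’s schema and read it back with the current one. Below is a minimal sketch using Avro’s generic reader/writer API; the v1 and v2 schemas are hypothetical stand-ins for two consecutive releases of a contract.

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object BackwardsCompatibilityCheck {
  // v1: what the last release wrote. v2: the current contract, with a new defaulted field.
  val v1: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"UserEvent","fields":[
      |  {"name":"userId","type":"string"},
      |  {"name":"timestamp","type":"long"}
      |]}""".stripMargin)

  val v2: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"UserEvent","fields":[
      |  {"name":"userId","type":"string"},
      |  {"name":"timestamp","type":"long"},
      |  {"name":"sessionId","type":["null","string"],"default":null}
      |]}""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Encode a record exactly as the previous library version would have.
    val old = new GenericData.Record(v1)
    old.put("userId", "u-123")
    old.put("timestamp", 1558483200000L)

    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](v1).write(old, encoder)
    encoder.flush()

    // Decode the v1 bytes using v2 as the reader schema: this is the
    // backwards-compatibility guarantee the unit test enforces.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val decoded = new GenericDatumReader[GenericRecord](v1, v2).read(null, decoder)

    assert(decoded.get("userId").toString == "u-123")
    assert(decoded.get("sessionId") == null) // the new field falls back to its default
  }
}
```

In a real project this would live in the library’s test suite (ScalaTest, JUnit, or similar) and run against every historic schema version, not just the previous one.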

By release-versioning these compiled libraries, you ensure (by convention) that every byte of data stored within your data lake adheres to strict validation and is implicitly accountable to the current (and historic) versions of the data contract. This establishes a simple rule: each record is preserved in a versioned format that is at the very least backwards compatible and can easily be extracted and used for years to come.
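A sketch of what that convention might look like on the write path, under the assumption that the producing library exposes the contract schema and version (as in the `UserEventContract` sketch above): every record is validated against the contract before it is allowed into the lake, and landed under a version-tagged location. `LakeWriter`, `write`, and `sink` are hypothetical names for illustration.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

// Hypothetical write path: validate each record against the current contract schema
// and stamp it with the contract version before it lands in the lake.
object LakeWriter {
  def write(record: GenericRecord, contractSchema: Schema, contractVersion: String)
           (sink: GenericRecord => Unit): Unit = {
    require(
      GenericData.get().validate(contractSchema, record),
      s"Record violates data contract version $contractVersion"
    )
    // Land the record under a version-tagged location, e.g. .../user_event/contract=1.2.0/...
    sink(record)
  }
}
```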

This upfront effort will pay off in productivity across all data-consuming teams in your company. Instead of roadmaps being derailed by long games of “let’s find the missing data” (or by literally jettisoning or draining the lake to start over due to a general lack of trust in the data’s integrity), you can get back to having a data lake that is an invaluable shared resource for all of your data needs: from analytics to data science and on into the deep learning horizons.


Scott Haines

Distinguished Software Engineer @ Nike. I write about all things data, my views are my own.