How to make your Data Swamp a Data Lake

What is a Data Swamp?

Reasons why you should avoid it

By Christianlauer. Published in CodeX, Dec 28, 2021. 3 min read.

Photo by Mark Landman on Unsplash

By definition, a Data Swamp is an unmanaged Data Lake that is either inaccessible to its intended users or provides little value. Data Swamps occur when adequate data quality and data governance measures are not implemented. In hybrid models that combine a Data Lake with a Data Warehouse, a Data Swamp can sometimes arise from the Data Warehouse as well.
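To make the governance point concrete, here is a minimal sketch of one possible check: flagging datasets that have no owner or no description recorded. The `DatasetEntry` class, the `ungoverned` function, and the file paths are all hypothetical names made up for illustration; they do not come from any particular catalog tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetEntry:
    """Hypothetical catalog record for one dataset in the lake."""
    path: str
    owner: Optional[str] = None
    description: Optional[str] = None


def ungoverned(entries: List[DatasetEntry]) -> List[str]:
    """Return paths of datasets missing an owner or a description."""
    return [e.path for e in entries if not e.owner or not e.description]


# Illustrative catalog contents (assumed, not real data).
catalog = [
    DatasetEntry("raw/sales/2021.csv", owner="sales-team",
                 description="Daily sales export"),
    DatasetEntry("raw/tmp/export_final_v2.csv"),
    DatasetEntry("raw/crm/contacts.json", owner="crm-team"),
]

flagged = ungoverned(catalog)
print(flagged)
```

The more entries such a check flags, the closer the lake is to the swamp described above. A real setup would track far more than two fields, but the idea is the same.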

What is a Data Lake again?

To explain how a Data Swamp emerges, it is first necessary to understand the concept of a Data Lake. A Data Lake is a large pool of raw data whose purpose has not yet been determined. A Data Warehouse, on the other hand, is a repository for structured, filtered data that has already been processed for a specific purpose [1].

Hybrid Data Lake Concept — Image from Author
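The difference between the two stores can be sketched in a few lines of Python. This is only an illustration under assumed data: the event strings, the `to_warehouse_rows` function, and the "purchase" filter are hypothetical. The lake keeps every record exactly as it arrived, while the warehouse keeps only records that were parsed, filtered, and typed for one specific purpose.

```python
import json

# Raw input as it might arrive (assumed sample data).
raw_events = [
    '{"user": "a1", "amount": "19.99", "type": "purchase"}',
    '{"user": "b2", "type": "page_view"}',
    'not even valid json',
]

# Data Lake: store everything unchanged, purpose still undetermined.
lake = list(raw_events)


def to_warehouse_rows(lines):
    """Keep only well-formed purchase records, with typed fields."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input never reaches the warehouse
        if not isinstance(record, dict):
            continue
        if record.get("type") != "purchase":
            continue
        if "user" not in record or "amount" not in record:
            continue
        rows.append({"user": record["user"],
                     "amount": float(record["amount"])})
    return rows


# Data Warehouse: structured, filtered data for a specific purpose.
warehouse = to_warehouse_rows(lake)
print(len(lake), warehouse)
```

The lake still holds all three records, including the malformed one, which is exactly why it needs metadata and governance to stay usable.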

What problems can occur?

If a Data Lake holds too much poorly organized data without suitable metadata management and reliable data governance, relevant data becomes increasingly difficult to find. The information content of the Data Lake decreases, even though new data is…
