How to Keep Data Lakes Clean and Actionable
Data lakes are a big opportunity to store large amounts of data in an affordable way without having to decide upfront how it must be structured and used. They are typically used to complement traditional data warehouses, which are still better adapted for highly trusted, tightly governed data such as your financial figures, but there are some overlaps between the two repositories.
Data lakes compared to data warehouses are analogous to spreadsheets compared to traditional business intelligence tools. One of the reasons spreadsheets remain so popular is that you can do whatever you like with the data, free of annoying restraints (like not being able to tweak the disappointing sales figures!). Data lakes bring information together from lots of different sources, like traditional data warehouses — but offer more autonomy for users and far fewer constraints.
This independence has some big upsides. Data lakes provide the perfect environment for fast, iterative analytic experimentation with varied data sets. Huge amounts of information can be stored and used to uncover deep correlations to inform product or marketing strategies, and more. And anyone in the business can generate queries for themselves, without IT as a bottleneck.
But the data lake approach can also have some big downsides. Just like spreadsheets, too much freedom can lead to problems. Without governance, people have the flexibility to do things incorrectly, in ways that can be hard to detect and correct. The result can be “data dissonance” — multiple pools of duplicate or erroneous data. Different teams may end up needlessly recreating the same analytics from scratch, using different — and incompatible — definitions of key business terms.
There’s no magic solution to getting high-quality data. Data lakes need to be governed and maintained; if not, they can easily turn into data swamps full of stagnant data of dubious provenance — making it hard to glean useful insights.
Here are some key steps to ensuring your data lake meets your business goals:
Assign data owners. Any valuable resource is liable to be squandered unless ownership is clearly defined. Good governance comes down to people, not technology. It’s important to have a company-wide program to ensure that every important data source has an identified owner with the responsibility, incentives, and resources necessary to maintain high-quality information.
Keep track of what’s in there. Data catalog solutions are emerging that give organizations a clear understanding of what is available in the data lake. These catalogs provide not just metadata and technical information, such as which system a data set originated from and when it was uploaded, but also intelligence about the owner, a data quality rating and more. Newer collaborative solutions also allow for crowdsourcing, a sort of “Yelp for data,” where users can vote and make comments on a catalog of different data sources. Regular data curation is essential to ensure that the information remains relevant and up-to-date and that overlapping data from different teams is minimized.
Establish clear data retention policies. Just because you can keep data forever doesn’t mean you should. As both lawyers and data scientists will tell you, more data is not necessarily better. Every industry and jurisdiction has different requirements for data retention, so be sure to double-check that you are maintaining compliance. Cost matters too: while data storage costs continue to plummet, it can still be prohibitively expensive to store, for example, all the raw data from all your IoT sensors.
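To make these steps a little more concrete, here is a minimal Python sketch of what a catalog record and a routine audit might look like. Everything in it is an assumption made for illustration: the CatalogEntry fields, the 1-to-5 rating scale, the audit function, and the sample data set names are hypothetical and do not correspond to any particular catalog product.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one data set in the lake."""
    name: str
    source_system: str                      # which system the data came from
    uploaded_on: date                       # when it landed in the lake
    owner: Optional[str] = None             # None: nobody is accountable yet
    retention_days: Optional[int] = None    # None: no retention policy set
    ratings: List[int] = field(default_factory=list)  # user votes, assumed 1-5

    def average_rating(self) -> Optional[float]:
        """Crowdsourced quality score, or None if nobody has voted."""
        if not self.ratings:
            return None
        return sum(self.ratings) / len(self.ratings)

    def is_expired(self, today: date) -> bool:
        """True once the data set has outlived its retention window."""
        if self.retention_days is None:
            return False
        return today > self.uploaded_on + timedelta(days=self.retention_days)


def audit(entries: List[CatalogEntry], today: date) -> Dict[str, List[str]]:
    """Map each problematic data set name to the governance gaps found."""
    report: Dict[str, List[str]] = {}
    for entry in entries:
        issues = []
        if entry.owner is None:
            issues.append("no owner")
        if entry.retention_days is None:
            issues.append("no retention policy")
        elif entry.is_expired(today):
            issues.append("past retention window")
        if issues:
            report[entry.name] = issues
    return report


if __name__ == "__main__":
    entries = [
        CatalogEntry("sales_raw", "crm", date(2023, 1, 1),
                     owner="ana", retention_days=365, ratings=[4, 5]),
        CatalogEntry("sensor_dump", "iot", date(2023, 1, 1)),
    ]
    for name, issues in audit(entries, date(2024, 6, 1)).items():
        print(name, issues)
```

A report like this is only a starting point for the people who own the data. Deciding whether an expired data set should be deleted, archived, or kept for compliance reasons remains a human judgment.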
Posted on 7wData.be.