Is Your Data Lake More Like a Used Book Store or a Public Library?

Seshu Adunuthula
Published in Intuit Engineering
4 min read · Jan 12, 2021

Does your data lake resemble a used book store or a well-organized public library?

Here at Intuit, we had to grapple with this question when our senior vice president asked a few simple questions: How fast is the data in our data lake growing? Which areas/domains are growing faster than others? Are there gaps in certain domains we should be aware of? These insights were hard to gather because we had only a limited understanding of the sheer volume of data in our data lake.

Don’t get me wrong, everyone enjoys the occasional weekend perusal for hidden gems in a used book store, but the “used book store experience” should be the exception — not the rule — for data scientists, data analysts and data engineers searching for just the right data set for their business use case.

This started us on a journey to understand what steps we could take to better manage our data lakes: 1) how to catalog the lake, 2) how to manage the inventory, and 3) how to address compliance needs.

Cataloging the Data Lake by “Genre”

image source: https://reference.yourdictionary.com/books-literature/different-types-of-books.html

When entering a library, we may not know the exact book we are looking for, but we know its genre and sometimes its author. This helps us quickly identify the section of the library we need to visit for further exploration. Can we apply the same experience to our data lake users? Luckily, our Intuit “City Map” enterprise architecture gave us an excellent starting point. This structure organizes our technology portfolio through a “capability” lens according to a simple principle: “A place for everything and everything in its place.”

We’re classifying every object in the data lake by its capability group in the City Map. By categorizing billions of objects in our data lake, we’ll streamline the search experience for Intuit data scientists, data analysts and data engineers, just as a well-organized book stack does for library visitors. Ultimately, this will help us measure key performance indicators (KPIs) against the data lake. For example, which “genre” is growing fastest in data volume? Which is most heavily used? What types of workloads (exporting, data-driven applications, experimentation, AI/ML and so on) are using the data? And which areas have gaps?
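
As a rough illustration of the growth KPI, the sketch below tags each object with a capability group and compares data volume per group across two snapshots. Everything in it is an assumption made for the example: the CatalogEntry type, the group names, the paths and the sizes are hypothetical and do not describe Intuit's actual catalog.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One data lake object tagged with a (hypothetical) capability group."""
    path: str
    capability_group: str  # the object's "genre"
    size_bytes: int


def volume_by_group(entries):
    """Sum object sizes per capability group."""
    totals = defaultdict(int)
    for entry in entries:
        totals[entry.capability_group] += entry.size_bytes
    return dict(totals)


def growth_by_group(previous, current):
    """Return fractional growth per group between two volume snapshots.

    Groups with no (or zero) baseline in `previous` map to None,
    because a growth rate is undefined for them.
    """
    growth = {}
    for group, size in current.items():
        baseline = previous.get(group)
        growth[group] = (size - baseline) / baseline if baseline else None
    return growth


# Made-up snapshots, for illustration only.
last_month = [
    CatalogEntry("lake/payments/a.parquet", "payments", 100),
    CatalogEntry("lake/identity/b.parquet", "identity", 200),
]
this_month = last_month + [
    CatalogEntry("lake/payments/c.parquet", "payments", 50),
    CatalogEntry("lake/marketing/d.parquet", "marketing", 10),
]

growth = growth_by_group(volume_by_group(last_month), volume_by_group(this_month))
print(growth)
```

In this toy data the "payments" group grows by half, "identity" stays flat, and "marketing" is new, so it has no baseline to compare against.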

Managing the Inventory

One of the big challenges we have with the data lake is discarding aged or rarely used objects, which clutter up the lake, pollute our searches and make the exploration process inefficient. That’s why we’re beginning to discard obsolete, unused objects in the data lake to make space for new, prominently displayed objects (new books) and frequently used objects (best sellers). To that end, we’re investing in a data management UI and combining data lake usage stats with the data catalog to simplify the search experience.
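
One way to picture combining usage stats with the catalog is the sketch below. It flags objects that have not been read recently and ranks the rest by read count. The function names, the 365-day idle threshold and the sample data are all assumptions for illustration; the article does not state the actual retention policy, and flagged objects would presumably be reviewed before anything is discarded.

```python
from datetime import date, timedelta


def find_stale_objects(catalog, last_access, today, max_idle_days=365):
    """Return catalog paths not read within max_idle_days.

    catalog: iterable of object paths.
    last_access: dict mapping path -> date of the most recent read.
    Objects with no recorded access are treated as stale.
    """
    cutoff = today - timedelta(days=max_idle_days)
    stale = []
    for path in catalog:
        accessed = last_access.get(path)
        if accessed is None or accessed < cutoff:
            stale.append(path)
    return stale


def rank_by_popularity(catalog, read_counts):
    """Order paths so the most frequently read ("best sellers") come first."""
    return sorted(catalog, key=lambda path: read_counts.get(path, 0), reverse=True)


# Made-up catalog and usage stats, for illustration only.
catalog = ["lake/a", "lake/b", "lake/c"]
last_access = {"lake/a": date(2021, 1, 1), "lake/b": date(2019, 6, 1)}
read_counts = {"lake/a": 40, "lake/b": 1}

stale = find_stale_objects(catalog, last_access, today=date(2021, 1, 12))
ranked = rank_by_popularity(catalog, read_counts)
print(stale, ranked)
```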

Access Control and Compliance

image source: https://www.teacherspayteachers.com/Product/Reference-Material-Scavenger-Hunt-4407372

Imagine if you had a separate checkout counter for each genre, and each was governed by a different set of rules. That was our situation: our data lake was governed by an ad hoc set of rules for access control and compliance across multiple data lake accounts. We’re in the process of streamlining access to the data lake and bolstering access control and audit mechanisms for compliance. A central tool enables our data lake users to request access across our distributed data lake, with approval workflows routed to the data owners. Owners of the data sets can define fine-grained data protection policies and attach approval workflows to them.
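
The request-and-approve flow described above might look something like this minimal sketch. It is not Intuit's tool: the AccessRegistry class, its methods, the user names and the data set name are hypothetical, and a real system would also enforce the fine-grained protection policies that this example leaves out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class AccessRequest:
    requester: str
    dataset: str
    status: Status = Status.PENDING


@dataclass
class AccessRegistry:
    """Hypothetical central tool: one place to request and approve access."""
    owners: dict  # data set name -> owner
    requests: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def request_access(self, requester, dataset):
        if dataset not in self.owners:
            raise KeyError(f"unknown data set: {dataset}")
        request = AccessRequest(requester, dataset)
        self.requests.append(request)
        self.audit_log.append(("requested", requester, dataset))
        return request

    def decide(self, request, approver, approve):
        # Only the owner of the data set may approve or deny a request.
        if approver != self.owners[request.dataset]:
            raise PermissionError("only the data owner can decide")
        request.status = Status.APPROVED if approve else Status.DENIED
        self.audit_log.append((request.status.value, approver, request.dataset))

    def can_read(self, user, dataset):
        return any(
            r.requester == user and r.dataset == dataset and r.status is Status.APPROVED
            for r in self.requests
        )


# Made-up owner, user and data set, for illustration only.
registry = AccessRegistry(owners={"payments.transactions": "alice"})
request = registry.request_access("bob", "payments.transactions")
before = registry.can_read("bob", "payments.transactions")
registry.decide(request, approver="alice", approve=True)
after = registry.can_read("bob", "payments.transactions")
print(before, after, registry.audit_log)
```

Every request and decision is appended to the audit log, which is the kind of record a compliance review would need.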

Today, we’re well on our way to answering the simple questions posed by Intuit’s senior vice president and uncovering key insights from our data lake. We’re proud of our progress to date on this monumental, company-wide priority. As usage of our data lake by Intuit data scientists, data analysts and data engineers continues to increase, we’re committed to making it easier than ever to browse, search and access data sets for specific use cases, so that finding the right data set is as straightforward as finding just the right book in the stacks at a local public library.


Seshu Adunuthula is the Head of Data Lake Infrastructure at Intuit. He is passionate about the Next Generation Data Platform built on public cloud.