How Data Flows Through Lakes

Thomas Cardenas
Ancestry Product & Technology
5 min read · Oct 3, 2022

Summary: This article is for data engineers who would like to gain a deeper understanding of how data is stored as it flows through a data lake.

Typical Data Lake Diagram

As Ancestry has been building its Data Lake, questions have come up about how and where data should be stored, as well as what transformations should happen between each phase and how many. We therefore sought out best practices, standards, and strategies that we could adopt in order to iterate faster in the future.

For data engineers, an image similar to the one shown above with Bronze, Silver, and Gold quality levels might come to mind upon hearing the buzzword “Data Lake.” Having experienced building a Data Lake, though, data does not flow from Bronze to Silver to Gold and then to clients as easily as that. Data will typically go from Bronze to Silver rather smoothly. However, there might then be many different transitions before it reaches consumers: Silver data to Gold, Gold to Gold, Gold mixed with Silver, and then back to Gold. This might seem like a given, but it is not always talked about.

Articles from Snowflake and dbt do address this, and it was their architecture and best practices that made it clear what was missing from typical Data Lake architecture. The most critical missing concept in common architecture is a Working Zone or a place to temporarily store data. This is because data transformations can’t always take place in a single step.

In addition, the terms Bronze, Silver, and Gold are intended to convey the quality level in each bucket. Those who have worked with Data Lakes will agree that those terms don’t accurately represent the data. Data that comes from a source system of record should already be high quality; those systems are the source of truth, and all the Data Lake is doing is replicating what they produce. Thus, going forward, Bronze will be referred to as Landed, Silver as Refined, and Gold as Processed.

Explanation of Zones

The core components of a Data Lake can be grouped into something called the Platform Data Zone which contains three areas: Landed, Refined, and Processed. Each of these zones could be database tables or BLOB storage such as S3. These are the central resources for storing data for long-term use.

Landed Zones are for data coming in from the source systems. One or more teams with domain expertise then handle ingesting and transporting the data to the Refined Zone. Clients shouldn’t be pulling from Landed Zones; only the teams managing the data should. Only minimal transformations happen between these two zones: type casting, renaming columns, adding diagnostic columns such as update times or version numbers, formatting, and applying partitioning strategies.
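The Landed-to-Refined step described above can be sketched in a few lines. This is a minimal illustration using pandas; the column names, types, and version string are hypothetical, and a real pipeline would more likely run in Spark or similar.

```python
import pandas as pd

def landed_to_refined(landed: pd.DataFrame) -> pd.DataFrame:
    """Apply only the minimal transformations allowed between Landed
    and Refined: renames, type casts, and diagnostic columns.
    Column names here are hypothetical illustrations."""
    refined = landed.rename(columns={"usr_id": "user_id", "ts": "event_time"})
    # Type casting: normalize raw string types into proper ones.
    refined["event_time"] = pd.to_datetime(refined["event_time"], utc=True)
    refined["user_id"] = refined["user_id"].astype("int64")
    # Diagnostic columns: when this row was refined, and by which job version.
    refined["refined_at"] = pd.Timestamp.now(tz="UTC")
    refined["pipeline_version"] = "1.0.0"
    return refined
```

Note that no business logic appears here: the Refined output stays a faithful, cleaned-up mirror of the source system.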

Refined Zones are a good starting point for finding data sources to build data models. The data found here closely matches that of the root source of the data.

Example of Building a Processed Model from Multiple Refined Sources

Processed Zones are often the most interesting. What is found here is not data from other systems but data built on top of those refined sources. These can be data models that are customized for specific consumers or generic models that multiple systems use.

Example 2 of Building Processed Models

Just because a data model is in the Processed Zone doesn’t mean it contains all the data consumers need or that the transformations are finished. It means the data is managed and marked as available for use. A processed data model can supply multiple downstream models, which avoids duplicate processing, mismatched data, and extra cost.
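A Processed model built from multiple Refined sources often boils down to joins and aggregations. Here is a minimal sketch, assuming two hypothetical Refined tables (`users` and `orders`) with made-up column names:

```python
import pandas as pd

def build_processed_profile(users: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    """Combine two Refined sources into one Processed model.
    Table and column names are hypothetical."""
    # Aggregate one Refined source down to the grain of the model.
    per_user = orders.groupby("user_id", as_index=False).agg(
        order_count=("order_id", "count"),
        total_spend=("amount", "sum"),
    )
    # Left join keeps every user, even those with no orders yet.
    processed = users.merge(per_user, on="user_id", how="left")
    cols = ["order_count", "total_spend"]
    processed[cols] = processed[cols].fillna(0)
    return processed
```

Because the result is managed and marked as available, several downstream models can read this one table instead of each re-joining the Refined sources themselves.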

Working Zones sit outside of the Platform Data Zone, which is where consumers go to find data. It is recommended that every team doing transformations have a working zone; this could be their own S3 bucket. What’s missing in Example 2 above are the intermediate stages. Sometimes the data from the sources is too large for the use case, or the data requires multiple complex transformations that don’t fit into one transformation step.
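A Working Zone can be as simple as a team-owned prefix where multi-step jobs persist their intermediate results. The sketch below uses a local directory and CSV purely to stay dependency-free; in practice the path would be an S3 URI (the bucket name here is invented) and the format would typically be Parquet.

```python
import pandas as pd
from pathlib import Path

# Hypothetical team-owned prefix; in practice an S3 URI such as
# s3://hints-team-working/ rather than a local directory.
WORKING_ZONE = Path("working/hints_ranking")

def stage(df: pd.DataFrame, step: str) -> Path:
    """Persist an intermediate result in the team's Working Zone.

    These files are owned by the transforming team and are safe to
    delete once the final result reaches the Processed Zone."""
    WORKING_ZONE.mkdir(parents=True, exist_ok=True)
    path = WORKING_ZONE / f"{step}.csv"
    df.to_csv(path, index=False)
    return path
```

Each transformation step reads the previous step’s staged file, so no single step has to hold the whole pipeline in memory, and nothing in this zone is ever exposed to consumers.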

Example of using the Data Lake for a ML Pipeline with a Working Zone

Consumer Usage Example

Take a machine learning pipeline, for example. ML can be costly to run, so it makes sense to choose wisely when selecting which customers to include or how many predictions are made. Ancestry Hints is a good example. Ancestry has tens of billions of Hints for customers, and a single user could have over a million hints. Ranking these billions of hints over and over for a large customer base would be costly. A good solution would be a heuristic that selects only newly generated hints, or some other algorithm to narrow down which predictions to make.

With that in mind, one needs to extract that subset, as shown in the diagram, selecting the same subset from all the source data. Then some transformations prep the data for the machine learning endpoint. After sending the request, a result is received. This data should be available to everyone, so the hint identifier is stored with the model result in the Processed Zone.
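The extract-prep-score-store flow above can be sketched end to end. Everything here is hypothetical: the column names, the `is_new` heuristic flag, and the `predict` callable, which stands in for a call to a real model endpoint.

```python
import pandas as pd

def score_new_hints(hints: pd.DataFrame, predict) -> pd.DataFrame:
    """Select a subset, prep features, score them, and keep only
    (hint_id, score) for the Processed Zone.

    `predict` is a hypothetical stand-in for the ML endpoint call."""
    # Heuristic from above: only score newly generated hints, to control cost.
    subset = hints[hints["is_new"]]
    # Working-Zone-style intermediate: features prepped for the model.
    features = subset[["hint_id", "feature_a", "feature_b"]].copy()
    scores = predict(features[["feature_a", "feature_b"]])
    # Only the identifier and the model result land in the Processed Zone.
    return pd.DataFrame({"hint_id": features["hint_id"].to_numpy(),
                         "score": scores})
```

The intermediate `subset` and `features` frames live only in the team’s Working Zone; consumers see just the final identifier-plus-score table.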

One question still remains: where are the extracted hints and model inputs stored while the team is actively transforming the data? In the working area that this team owns and manages. The extracted data is only useful to the team making the predictions, and after the predictions are made, it can be deleted. Importantly, no consumers should come to this extracted dataset for data; they should use the same source this team did.

One of the early adopters of the Data Lake at Ancestry was our data engineering team, which used it to make machine learning inferences. The example described above is very similar to their approach.

Architecture Built During this Article

Final Thoughts

This article has walked through the architecture that’s been built and how all the pieces interact. When a dataset is created, it’s now very clear for Ancestry data teams which criteria it meets and, therefore, where the data should live. Going forward, there is still work to be done on making it easy for consumers of any skill set to run queries and create new datasets.

If you’re interested in joining Ancestry, we’re hiring! Feel free to check out our careers page for more info. Also, please see our Medium page to see what Ancestry is up to and read more articles like this.
