Data Ingestion, a.k.a. Lake Hydration

Denys Tyshetskyy
Jun 30, 2024

Welcome back to our series on building a serverless data platform. If you haven’t seen the previous parts, please check out:
Serverless data lake Part 1 and Serverless data lake Part 2

In this part I am going to talk about the ingestion layer, which is the entry point of your platform. Before we dive in, I’d like to note that I’ve decided to move away from the term “Data Lake” and instead use “Data Platform.” This change reflects my belief that the term “Data Lake” can be quite constraining, while “Platform” suggests a broader, more diverse applicability.

Data sources and corresponding AWS services

Data platform high-level architecture

Above is the high-level architecture of the platform; the first two layers on the left, Data sources and Ingest, are the subject of this discussion.
It is nearly impossible to capture every system that might serve as a data source, so I will stick to the most commonly observed subset and the corresponding AWS ingestion services shown in the figure above.

AWS has done an excellent job implementing functionality that simplifies data ingestion for data engineers. There is often more than one way to implement a given capability in AWS, so your choice of ingestion services will depend on your platform’s SLAs, cost considerations, and other factors.

Implementation of the ingestion functionality

We built our data platforms on the AWS CDK in a modular, multi-stack fashion. For example, the DMS ingestion functionality is implemented as a separate stack, and the same goes for CRM systems, APIs, and so on. This approach allows us to:

  • Apply the separation of concerns principle
  • Reduce the blast radius of potential issues
  • Apply fine-grained access controls
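
As an illustration, below is a minimal CDK app sketch in Python; the stack and construct names (RawLayerStack, DmsIngestionStack, ApiIngestionStack) are made up for this example, and the actual DMS and Lambda resources are omitted:

```python
# app.py -- a minimal sketch of the multi-stack layout (names are illustrative)
import aws_cdk as cdk
from aws_cdk import Stack, aws_s3 as s3
from constructs import Construct


class RawLayerStack(Stack):
    """Shared raw-layer bucket that every ingestion stack writes into."""
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        self.raw_bucket = s3.Bucket(self, "RawBucket")


class DmsIngestionStack(Stack):
    """Ingestion from relational databases via DMS (resources omitted)."""
    def __init__(self, scope: Construct, construct_id: str, *,
                 raw_bucket: s3.IBucket, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # DMS replication instance, endpoints and tasks would be defined here,
        # with the target pointed at raw_bucket.


class ApiIngestionStack(Stack):
    """Lambda-based ingestion from external APIs (resources omitted)."""
    def __init__(self, scope: Construct, construct_id: str, *,
                 raw_bucket: s3.IBucket, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Lambda functions, schedules and secrets for the API sources go here.


app = cdk.App()
raw = RawLayerStack(app, "RawLayerStack")
DmsIngestionStack(app, "DmsIngestionStack", raw_bucket=raw.raw_bucket)
ApiIngestionStack(app, "ApiIngestionStack", raw_bucket=raw.raw_bucket)
app.synth()
```

Each stack carries its own IAM boundaries and can be deployed or rolled back independently, which is what keeps the blast radius of any single ingestion source small.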

Ingestion services operate independently of each other; however, in most cases the ingested data lands in the same data platform layer, Raw. There are use cases when a pre-raw layer, also known as landing, might need to be implemented, for example when data requires a pre-processing step such as quality assessment, PII/PCI removal, or formatting before it reaches the raw layer.
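
To make that concrete, here is a minimal sketch, assuming JSON payloads arriving under a landing/ prefix and a hypothetical list of PII fields, of a Lambda that scrubs records before promoting them to raw:

```python
# landing_to_raw.py -- illustrative Lambda that scrubs PII from JSON files
# dropped into a "landing/" prefix before copying them to the raw layer.
# Bucket layout, prefixes and the PII field list are assumptions.
import json
import boto3

s3 = boto3.client("s3")
PII_FIELDS = {"email", "phone", "credit_card"}  # hypothetical PII/PCI fields


def handler(event, context):
    for record in event["Records"]:          # S3 put notifications
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # e.g. landing/apiname/2024/06/30/file.json

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        rows = json.loads(body)              # assumes a JSON array of records

        # Drop PII/PCI fields from every row before promoting to raw.
        cleaned = [{k: v for k, v in row.items() if k not in PII_FIELDS} for row in rows]

        raw_key = key.replace("landing/", "raw/", 1)
        s3.put_object(Bucket=bucket, Key=raw_key, Body=json.dumps(cleaned).encode("utf-8"))
```

Parquet sources would need a different reader, but the shape of the step stays the same: read from landing, transform, write to raw.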

One best practice when handling data from different sources is to keep the namespace format and nesting levels in the raw layer consistent across all sources.
For example, if we ingest data from a database as well as from an API, it is beneficial for the S3 paths to look similar:

/dbname/year/month/date/filename.parquet
and
/apiname/year/month/date/filename.json

This allows the same Glue crawler to scan both data sources. If the nesting levels differ, the crawler can get confused and fail to identify entities correctly.
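
A tiny helper along these lines (the function name and file names are illustrative) can keep every ingestion stack writing to the same key layout:

```python
# raw_paths.py -- illustrative helper for building consistent raw-layer keys
from datetime import date
from typing import Optional


def raw_key(source_name: str, file_name: str, partition_date: Optional[date] = None) -> str:
    """Build a raw-layer S3 key with the same nesting for every source:
    <source>/<year>/<month>/<day>/<file>."""
    d = partition_date or date.today()
    return f"{source_name}/{d:%Y}/{d:%m}/{d:%d}/{file_name}"


# Both sources end up with identical nesting, so one Glue crawler covers them:
print(raw_key("dbname", "orders.parquet", date(2024, 6, 30)))   # dbname/2024/06/30/orders.parquet
print(raw_key("apiname", "contacts.json", date(2024, 6, 30)))   # apiname/2024/06/30/contacts.json
```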

Another important consideration is maintaining “bookmarks” for the data that has already been ingested, which guarantees that only the delta is ingested on each run. When AWS-native services such as DMS or AppFlow are used, this is taken care of for you, but for custom ingestion implementations, such as Lambda-based ones, it needs to be implemented by the developer.
We also found that the mechanism for a full re-ingestion needs to be thought through in advance.
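
For Lambda-based ingestion, a minimal sketch of such a bookmark, assuming a hypothetical DynamoDB table called ingestion_bookmarks and a source that can be filtered by an updated_at timestamp, might look like this:

```python
# bookmarks.py -- illustrative delta-ingestion bookmark kept in DynamoDB.
# The table name, key schema and the source-specific calls are assumptions.
import boto3

TABLE_NAME = "ingestion_bookmarks"  # hypothetical table with partition key "source"
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)


def get_bookmark(source: str) -> str:
    """Return the last successfully ingested watermark for a source."""
    item = table.get_item(Key={"source": source}).get("Item")
    return item["watermark"] if item else "1970-01-01T00:00:00Z"


def set_bookmark(source: str, watermark: str) -> None:
    """Persist the new watermark only after the batch has landed in raw."""
    table.put_item(Item={"source": source, "watermark": watermark})


def fetch_since(source: str, since: str) -> list:
    """Placeholder: query the source system for records updated after `since`."""
    raise NotImplementedError  # source-specific logic goes here


def write_to_raw(source: str, records: list) -> None:
    """Placeholder: write the batch to the raw layer (e.g. S3)."""
    raise NotImplementedError  # source-specific logic goes here


def handler(event, context):
    source = "apiname"                    # illustrative source identifier
    since = get_bookmark(source)
    records = fetch_since(source, since)  # only the delta since the last run
    if records:
        write_to_raw(source, records)
        set_bookmark(source, max(r["updated_at"] for r in records))
```

A full re-ingestion then comes down to resetting the bookmark and re-running the ingestion, which is one way of thinking that mechanism through in advance.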

What about Zero-ETL?

The functionality described so far requires a certain amount of code for configuration and use, which adds complexity. The Zero-ETL concept appeared with the promise of taking away all, or almost all, of the ingestion pain. A number of implementations exist today, such as the zero-ETL integrations between some RDS/Aurora databases and Redshift. The truth, however, is that data is often complex, unclean, and poorly formatted, and for those cases some amount of custom ingestion still needs to be done. The short answer is that Zero-ETL can only take you so far, but it is certainly worth keeping an eye on, since the area is growing, especially with the widening adoption of GenAI.

Conclusion

  1. Don’t underestimate the importance of data ingestion, or the time it takes to get it right.
  2. Build repeatable ingestion patterns and use native services whenever possible.
  3. Pay attention to the quality of incoming data.

In the next part we will talk more about data orchestration, ELT, and the security of the platform.
