Serverless Data Lake in AWS — Part 2

Base lake and CI/CD

3 min read · Sep 12, 2023

In the first part I talked about the benefits of the Data Lakehouse architecture, AWS's commitment to continuous development in the Data and Analytics space, and why Serverless is a good thing.

In this part I will cover the CyberCX Serverless Data Platform in detail, as well as its deployment.

You might ask why it is called a Data Platform and not a Data Lake or Lakehouse. The Platform includes all the characteristics of a Lakehouse, but it is more than that. It goes beyond the tasks a Lakehouse would perform and adds capabilities such as data mesh, where multiple lakes can be combined through a hub, and custom capabilities such as data masking and data serving.

CyberCX Serverless Data Platform (CSDP) has been designed and implemented with scalability, extensibility, security and operational excellence in mind, with Modern Data Architecture principles at its core. It is built using AWS-native services such as Glue, Lake Formation, Redshift Serverless and more. The main idea is that the Platform brings individual AWS services together according to the use case in a secure, cost-effective and performant way. As a result, it shortens the time to market from inception to a production-ready state.
For the base platform, we are looking at 10–15 days to provision it through to the production environment. It includes core components like S3, Lake Formation, Glue templates, the Data Catalog, and ingestion of a number of data sets.

In the diagram above we can see that data from multiple sources is ingested into the lake, makes its way through the storage layers (raw, processed, curated) and is finally used for various analytics purposes. Depending on the particular use case, the Data Platform can utilize other services such as EMR, Airflow, OpenSearch, RDS databases, DynamoDB etc.
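To make the layered flow a little more concrete, here is a minimal sketch of how objects might be addressed in each storage layer. Only the layer names (raw, processed, curated) come from the description above; the prefix layout, the date partitioning and the function names are assumptions made for illustration, not the Platform's actual conventions.

```python
from datetime import date

# Layer names come from the article; their order mirrors the data flow.
LAYERS = ("raw", "processed", "curated")


def layer_prefix(layer: str, source: str, dataset: str, day: date) -> str:
    """Build a hypothetical S3 key prefix for one dataset in one layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    # Hive-style date partitions are an assumption, not a Platform rule.
    return (
        f"{layer}/{source}/{dataset}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/"
    )


def next_layer(layer: str):
    """Return the layer the data moves to next, or None after 'curated'."""
    idx = LAYERS.index(layer)
    return LAYERS[idx + 1] if idx + 1 < len(LAYERS) else None
```

For example, `layer_prefix("raw", "crm", "orders", date(2023, 9, 12))` would address the landing area of a hypothetical `orders` dataset, and `next_layer("raw")` names the layer a processing job would write to.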

Now let’s talk a bit about the deployment of all this. One of the challenges with cloud-based systems is consistency between different environments (dev, test, prod). Automated and reliable deployment is the foundation of seamless transitions between those environments. Following the best practices of Infrastructure as Code (IaC), the Data Platform implementation is fully code-based. It is written in Python, and we support both AWS CDK and Terraform.
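One way code-based infrastructure helps with consistency is that every environment is produced from the same definition and differs only in its parameters. The sketch below shows that idea in plain Python, without any CDK or Terraform calls. The `EnvConfig` class, the `csdp-` bucket naming scheme, the account id and the region are all placeholders assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    """Parameters that are allowed to differ between environments."""
    name: str        # "dev", "test" or "prod" (from the article)
    account_id: str  # placeholder AWS account id
    region: str      # placeholder region

    def bucket_name(self, layer: str) -> str:
        # Hypothetical naming scheme: same shape in every environment.
        return f"csdp-{self.name}-{layer}"


# One definition, three parameter sets; all values are placeholders.
ENVIRONMENTS = {
    name: EnvConfig(name=name, account_id="000000000000", region="ap-southeast-2")
    for name in ("dev", "test", "prod")
}


def planned_buckets(env: EnvConfig) -> list[str]:
    """List the storage buckets the same template would create for env."""
    return [env.bucket_name(layer) for layer in ("raw", "processed", "curated")]
```

Because `planned_buckets` is the only place where the bucket set is defined, dev, test and prod cannot drift apart in structure; they can only differ in the values held by `EnvConfig`.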

The CyberCX Serverless Data Platform consists of multiple stacks, which provides modularity and extensibility. For example, the storage layer of S3 buckets and all related resources is provisioned as one component, Glue jobs as another, Redshift as one more, etc. This composition allows new components to be added when required, without any disruption to the existing infrastructure.
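The modular composition can be pictured as a small dependency graph of stacks. The following sketch uses Python's standard `graphlib` module to derive a valid deployment order. The stack names and their dependencies are assumptions chosen to match the examples in the text; they are not the Platform's real stack list.

```python
from graphlib import TopologicalSorter

# Hypothetical stacks mapped to the stacks they depend on.
STACKS = {
    "storage": set(),                     # S3 buckets and related resources
    "catalog": {"storage"},               # Data Catalog on top of storage
    "glue_jobs": {"storage", "catalog"},  # Glue jobs read and write the lake
    "redshift": {"storage"},              # Redshift Serverless
}


def deployment_order(stacks: dict) -> list:
    """Return an order in which every stack follows its dependencies."""
    return list(TopologicalSorter(stacks).static_order())


def with_new_stack(stacks: dict, name: str, depends_on: set) -> dict:
    """Add a component without modifying any existing stack definition."""
    if name in stacks:
        raise ValueError(f"stack {name!r} already exists")
    return {**stacks, name: set(depends_on)}
```

Adding, say, an `opensearch` stack with `with_new_stack(STACKS, "opensearch", {"storage"})` leaves the existing entries untouched, which mirrors the idea of extending the platform without disturbing what is already deployed.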

In the diagram above you can see the CI/CD process. The Platform supports all popular code repositories such as Git, Bitbucket and Azure DevOps. Development is no different from usual software development. When a change is needed, the developer pushes the local change to the corresponding branch in the repository, which triggers the AWS CodeCommit pipeline, and the change is deployed to the matching environment. In this way the change can be tested before it is eventually merged into the prod branch, and the prod release completes the cycle. This approach keeps all environments consistent and makes the development process a breeze.
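The branch-per-environment flow described above boils down to a mapping from the branch that received a push to the environment that gets deployed. Here is a minimal sketch of that decision, assuming branches are named after their environments; the branch names and the `target_environment` helper are hypothetical and not taken from the actual pipeline.

```python
# Hypothetical branch names; the article only says each environment
# has a corresponding branch, with prod released last.
BRANCH_TO_ENV = {"dev": "dev", "test": "test", "prod": "prod"}


def target_environment(ref: str):
    """Map a pushed Git ref to an environment name.

    Returns None for branches that should not trigger a deployment,
    for example short-lived feature branches.
    """
    branch = ref.removeprefix("refs/heads/")
    return BRANCH_TO_ENV.get(branch)
```

With this mapping a push to `refs/heads/dev` would deploy to dev, while a push to a feature branch would deploy nothing until it is merged into one of the environment branches.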

In the next part we will talk in detail about the data ingestion stage.

Thank you!

Written by Denys Tyshetskyy