Serverless Data Lake in AWS — Part 1

Introducing Data Platform

Jul 24, 2023

In the CyberCX data practice, we work with customers to help them achieve their data goals and drive their business forward.

Over the last 2.5 years we came to the conclusion that, even though the business requirements of every data project are often unique, the underlying foundation is very similar. That observation guided us towards developing a framework that can be used for a multitude of data lake and lakehouse solutions in AWS.

We called it the CCX Serverless Data Lake Platform (CSDLP). In this series of articles I will describe the main features of the platform as well as the use cases it has helped us solve.

Why companies need a Data Lake/Lake House

Needless to say, we as individuals consume and generate a dazzling amount of information. One of my favorite illustrations is that the 74 GB we each consume daily is what top scientists consumed over their lifetime in the 1500s. The same trend applies to companies across all sorts of business verticals. They collect and generate information on an unprecedented scale, and it is only going to grow. The issue is that a lot of that information is not used to its potential. It might be looked at or analyzed for a bit, and then it goes to die somewhere in siloed storage. There is an alternative, though: data can serve the company for a long time and help uncover all sorts of interesting insights.

Search for Data nirvana

There is an ongoing search for a solution that helps organisations harness their data, and one of the first trends was Big Data.

Looking at Google Trends, it is clear that interest in that area has started to cool down. A few reasons I see are that Big Data stack technologies don't necessarily answer the questions of effective data access, governance, and optimized cost. That's when new terms started making their way into the mainstream: data lake and data lakehouse.

Architecture and Tools

As a developer, I like tangible concepts. It is good to have a unifying term and a high-level architecture, but at the end of the day the workload needs to run somewhere and the data needs to be stored using certain mechanisms.

A few years ago, AWS introduced the concept of Modern Data Architecture. It sets the expectation for what a future-proof data platform needs to look like. At the same time, AWS is on a mission to deliver services that fit into the puzzle of that architecture. On top of that, they added the serverless concept into the mix, which so many customers rely on. The result of that endeavor is:

Adam Selipsky at re:Invent 2022 talking about the D&A Serverless family

With CyberCX being an AWS Premier Partner, we work with the services mentioned above on a daily basis. We know that it's good to have a good toolset, but it is even better to have a platform where those services can be easily mixed and matched so that the best tool for the job is just a click away. Developers need to be able to easily provision the required services into their AWS infrastructure and support them going forward with the same ease. The tools also need something to work on. In our case that is data, which can flow in from all sorts of different sources. We need to organize the ingestion process and make it as repeatable and automated as possible. This is where the CCX Serverless Data Lake Platform comes in.

Serverless in nature

Our aim to use only serverless services from the AWS stack in our platform comes from the desire to reduce the need to manage the underlying infrastructure as much as possible. We prefer to concentrate on solving business challenges and leave the management of servers to AWS, who are pretty much unmatched at that. The customer only pays for what is used.

In the next article I will talk about the CSDLP in more detail.

References:

https://kids.frontiersin.org/articles/10.3389/frym.2017.00023