Organizing S3 as a Data Lake: Insights from My Latest Project

Efficient Folder-based Organization for Data Management and Lifecycle on Amazon S3

Published in

Art of Data Engineering

2 min read · Apr 9, 2024

I recently worked on a real-time pipeline that uses S3 as a data lake to store raw and analytic data. You can find more information on the project by following this link:

Building a Real-Time End-to-End Streaming Pipeline Project

A Step-by-Step Guide to Building a Streaming Pipeline

blog.det.life

One of my colleagues challenged me on the structure of the data lake and on how newly arriving data was processed.

That was a big eye-opener for me. With the architecture I used for the project, substantial data issues could arise over time as the project grows. Those issues could be:

Double processing of data arriving in the data lake: I added an orchestration step that triggers and runs the Glue job whenever data lands in an S3 bucket, but nothing separates already-processed data from new data, so the job can process the same data more than once.
No information on error data: The transformation may fail while the Glue job runs, and I currently keep no record of which data caused the error.

My project does not yet handle either of these two critical points.

The solution is to change how the project manages data in the S3 data lake, so that the data has a clear lifecycle and I keep control over what has already been processed.

I decided to follow a folder-based data lake architecture by creating three folders:

Raw folder: All newly arriving data from Firehose lands in this folder.
Archive folder: After processing, I will move the processed data from the Raw folder to the Archive folder (on S3, a move is a copy followed by a delete of the original). This way, when new data is added to the Raw folder, the Glue job will not re-run old data but only the newly added data.
Error folder: In case of error, for example if the Glue job fails at some point, the data that could not be processed will be sent to this folder.
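To make the lifecycle above more concrete, here is a minimal sketch of how the three folders could be wired together. It is not the project's actual code: the prefix names (`raw/`, `archive/`, `error/`), the helper functions, and the `transform` callback are all hypothetical, and `s3` is assumed to be a boto3 S3 client (or any object exposing the same `get_paginator`, `copy_object` and `delete_object` calls).

```python
from typing import Callable, Iterator

# Hypothetical prefix names for the three folders (assumptions, not from the project).
RAW_PREFIX = "raw/"
ARCHIVE_PREFIX = "archive/"
ERROR_PREFIX = "error/"


def list_raw_keys(s3, bucket: str) -> Iterator[str]:
    """Yield every object key currently waiting under the Raw prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=RAW_PREFIX):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith("/"):  # skip "folder" placeholder objects
                yield obj["Key"]


def move_object(s3, bucket: str, key: str, target_prefix: str) -> str:
    """Move one object out of Raw. S3 has no rename, so copy, then delete."""
    new_key = target_prefix + key[len(RAW_PREFIX):]
    s3.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": key},
    )
    s3.delete_object(Bucket=bucket, Key=key)
    return new_key


def process_raw(s3, bucket: str, transform: Callable[[str], None]) -> dict:
    """Run `transform` on each Raw object, then archive it or send it to Error."""
    results = {"archived": [], "errors": {}}
    # Materialize the listing first so moving objects does not affect iteration.
    for key in list(list_raw_keys(s3, bucket)):
        try:
            transform(key)
        except Exception as exc:
            # Keep the failing object and the reason, so errors are not silent.
            error_key = move_object(s3, bucket, key, ERROR_PREFIX)
            results["errors"][error_key] = str(exc)
        else:
            results["archived"].append(move_object(s3, bucket, key, ARCHIVE_PREFIX))
    return results
```

In practice, `s3` would be created with `boto3.client("s3")` and `transform` would wrap whatever the Glue job does with each object. Because every object leaves the Raw folder after one pass, a second run only sees data that arrived in the meantime, and the returned `errors` mapping gives a first record of what failed and why.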

Final Thoughts

In this article, I shared a quick reflection on how a folder-based data lake architecture can support my project's data lifecycle management.

Let me know what you think about this. Is there a better way to do so?

I hope it helps; thanks for reading :)

Please reach out via LinkedIn, GitHub and Medium. All comments are appreciated.

Written by Lorena Gongang