Data Lakes on AWS (Part 2/5) — Data Storage and Cataloging

Joffrey Escobar
TrackIt
Aug 22, 2023 · 5 min read

This article is the second installment in a series of five articles dedicated to implementing data lakes on AWS. Readers who are new to the series are advised to start with the first article, which focuses on data ingestion.

The sections below explore how to store the accumulated data, how to organize it, and how to establish a catalog that makes it easy to reference.

Object Storage for Data Lakes

The storage layer of a data lake must address a range of challenges:

  • Ensuring cost-effectiveness given the substantial data volume and extended storage duration.
  • Accommodating various file types, including structured formats (e.g. CSV, Excel), semi-structured formats (e.g. JSON), and unstructured formats (e.g. images, PDFs).
  • Scaling in accordance with data volume expansion.
  • Providing both durability and prompt accessibility for operational requirements.

Amazon S3 (Simple Storage Service) addresses all of these requirements. S3 is a storage service capable of holding a virtually unlimited amount of data. Designed for 99.999999999% durability and 99.99% availability of objects over a given year, it meets the stated requirements. It offers robust security features, including encryption and access control, and supports compliance with regulations and programs such as SOC, HIPAA, and GDPR. S3 also simplifies data administration through lifecycle policies, automatic storage-class tiering (S3 Intelligent-Tiering), and tight integration with a wide range of AWS services.

File Management

With the object store chosen, the next consideration is how to classify, organize, and manage the lifecycle of the files it holds.

Within S3, files are stored as objects inside containers called buckets. A bucket can be likened to a top-level directory in a file system. It is standard practice not to create a single bucket for the entire data lake, but rather several buckets depending on operational requirements. For instance, one bucket could be designated for raw data, another for cleaned and normalized data, and so forth.

Within a given bucket, files can be organized into subfolders by adding a prefix to their keys. A common strategy is to incorporate dates into the file structure, for example: /year/month/day/file1.json. Organizing files with prefixes brings numerous benefits, including faster queries, easier prevention of redundant file processing, and a higher maximum throughput for the S3 bucket.

Note: S3 request-rate limits apply per prefix rather than per bucket. Introducing more granularity into the file hierarchy, for instance /year/month/day/hour, therefore increases the read and write throughput available within the bucket.
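As an illustration, a minimal boto3 sketch that writes an object under such a date-based prefix could look like the following; the bucket name and key layout are hypothetical and would be adapted to the actual data lake:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Build a key with a /source/year/month/day/hour prefix so that traffic spreads
# across many prefixes (S3 request-rate limits apply per prefix, not per bucket).
now = datetime.now(timezone.utc)
key = now.strftime("google-calendar/%Y/%m/%d/%H/events.json")

s3.put_object(
    Bucket="datalake-bronze-example",  # hypothetical bucket name
    Key=key,
    Body=json.dumps({"source": "google-calendar", "ingested_at": now.isoformat()}).encode("utf-8"),
)
```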

S3 also provides robust object lifecycle management. For instance, it is possible to define rules that move an object to S3 Glacier (a low-cost, long-term storage class) after 90 days and then delete it after 365 days. These rules help ensure compliance with data-retention regulations.
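A sketch of such a rule with boto3, assuming a hypothetical bucket name, might look like this:

```python
import boto3

s3 = boto3.client("s3")

# Transition every object to Glacier after 90 days, then expire it after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="datalake-bronze-example",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: apply to all objects
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```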

Cataloging Data with Glue Data Catalog

Once the data has been stored in S3, it needs to be cataloged. This phase involves running a process that scans the newly introduced prefixes in the buckets, producing a list of the newly added files and inferring their underlying structure.

This preparatory step helps avoid what is often referred to as a “data swamp”. The objective is to establish an organized framework that provides seamless access to critical information while enforcing the necessary security controls.

The significance of this phase cannot be overstated: without effective cataloging, the data lake can turn into a cumbersome and expensive resource of limited utility.

AWS addresses this challenge with a service called AWS Glue Crawler. A crawler can be triggered manually or on a schedule; it traverses S3 buckets and infers the structure of the files it encounters. The resulting metadata is stored in the AWS Glue Data Catalog. This repository makes it possible to query the data through a familiar SQL interface, simplifies data transformation, and enables access control at the table and column levels. These capabilities are explored in detail in later articles.
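As a rough sketch of what this looks like in practice, the following boto3 calls create and start a crawler pointed at a raw-data bucket; the crawler name, IAM role, database name, and bucket path are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Crawler that scans the raw-data bucket and writes table definitions
# into a Glue Data Catalog database.
glue.create_crawler(
    Name="bronze-crawler",  # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical IAM role
    DatabaseName="datalake_bronze",
    Targets={"S3Targets": [{"Path": "s3://datalake-bronze-example/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # update tables when schemas change
        "DeleteBehavior": "LOG",
    },
)

glue.start_crawler(Name="bronze-crawler")
```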

Example

To illustrate these concepts, consider a scenario built around a three-tiered pipeline of Bronze, Silver, and Gold buckets (a short sketch of creating them follows the list below):

  • The Bronze bucket is designated to house all the unprocessed data collected using the services and processes detailed in the first article of this series.
  • The Silver bucket hosts refined and standardized data, which has been cleaned, normalized, and compressed.
  • The Gold bucket contains enriched data optimized for direct consumption by data analysts and visualization tools.
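A minimal sketch of creating the three buckets with boto3; the names and region are hypothetical, and S3 bucket names must be globally unique:

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# One bucket per tier of the pipeline.
for tier in ("bronze", "silver", "gold"):
    s3.create_bucket(
        Bucket=f"datalake-{tier}-example",  # hypothetical names
        CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    )
```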

Alongside these buckets, a crawler traverses the files stored in the Bronze bucket. Its objective is twofold: to identify newly added files and to detect any changes in their structure.

With these components in place, the final step is to wire them into a Step Function, which orchestrates the entire pipeline.

Step Function in the Visual Editor — Step 2
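The screenshot above shows the workflow assembled in the visual editor. As a rough, hypothetical equivalent in code, the following boto3 sketch creates a minimal state machine that starts the Bronze crawler and polls it until it returns to the READY state; the crawler name, state machine name, and IAM role are assumptions:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Minimal Amazon States Language definition: start the crawler, then poll
# its state every 60 seconds until it is READY again.
definition = {
    "StartAt": "StartCrawler",
    "States": {
        "StartCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "bronze-crawler"},  # hypothetical crawler name
            "Next": "WaitForCrawler",
        },
        "WaitForCrawler": {"Type": "Wait", "Seconds": 60, "Next": "GetCrawlerState"},
        "GetCrawlerState": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
            "Parameters": {"Name": "bronze-crawler"},
            "Next": "IsCrawlerDone",
        },
        "IsCrawlerDone": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.Crawler.State", "StringEquals": "READY", "Next": "Done"}
            ],
            "Default": "WaitForCrawler",
        },
        "Done": {"Type": "Succeed"},
    },
}

sfn.create_state_machine(
    name="data-lake-step-2",  # hypothetical state machine name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsGlueRole",  # hypothetical role
)
```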

Upon executing the Step Function, the crawler traverses the data and creates logical tables that reflect the structure of the files.

Note: It is possible to add a comment to each field in the catalog. This improves documentation and mitigates the risk of losing knowledge about what each field actually contains.

Schema of the Google Calendar Table in the Glue Data Catalog
Data Lake Architecture Diagram — Step 2

Conclusion and Next Article

At this juncture, data stored in the data lake has been meticulously cataloged to facilitate subsequent retrieval. The completion of this step sets the stage for the next article, which will detail the process of optimizing and enriching data to make it available for analysis.

About TrackIt

TrackIt is an Amazon Web Services Advanced Tier Services Partner specializing in cloud management, consulting, and software development solutions based in Marina del Rey, CA.

TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization with specialized expertise in Media & Entertainment workflows, High-Performance Computing environments, and data storage.

In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.
