Data Lake vs. Data Warehouse — How we Chose the Right Solution for our Data Insights Offering
In the search for the optimum architecture for our new Data Insights offering, TrackIt embarked on a rigorous evaluation process to compare the suitability of a data lake versus a data warehouse implementation. The objective was to design a data repository solution that would enable advanced analytics, provide deep insights into optimizing AWS infrastructure, and showcase TrackIt’s expertise in implementing comprehensive data workflows.
Note: Readers looking for a comprehensive guide that outlines the differences between a data lake and a data warehouse can read the recently published Data Lake vs. Data Warehouse article which compares both solutions across eight different qualifications.
Data Insights Project Requirements
- Design and implement an internal data repository to efficiently store, process, and analyze project time-tracking and sales data.
- Showcase expertise in implementing data workflows
- Develop a practical tool that helps manage engineering teams across current and future projects by leveraging insights derived from the data repository.
- Optimize resource allocation and improve productivity by analyzing engineering hours and task data.
- Integrate advanced analytics and data science techniques to generate valuable insights and facilitate data-driven decision-making.
- Streamline the management of engineering hours, off time, and project allocations for better overall team performance.
- Establish a scalable and flexible data infrastructure that can accommodate future data sources and analytics requirements.
- Demonstrate the utility of AWS data analytics services when building data platforms
Choice #1: Data Lake Implementation
Solution Architecture
Data Lake Architecture Diagram
Data Lake Pipeline
- Data Ingestion: An AWS Lambda function is triggered by an Amazon CloudWatch Event to ingest time-tracking data. This raw data is stored in an Amazon S3 bucket (landing zone). AWS Glue Crawlers catalog the data in the landing zone, making it searchable and discoverable.
- Data Cleaning: An AWS Glue job is utilized to clean and reformat the raw data, improving its quality and making it easier for subsequent analysis. The cleaned data is stored in a secondary S3 bucket (curated zone). AWS Glue Crawlers catalog the data in the curated zone as well.
- Data Transformation: An additional Glue job is employed to further process the data for business insights by aggregating and transforming it. The transformed data is then stored in a third S3 bucket (production zone). AWS Glue Crawlers catalog the data in the production zone, ensuring efficient organization and management.
- Data Visualization and Querying: The processed data can be visualized using Amazon QuickSight and queried using Amazon Athena for further analysis and decision-making.
Additional Features
- Data Archiving: The data in the first S3 bucket (landing zone) can be archived in Amazon S3 Glacier for long-term, cost-effective storage
- Access Control and Management: AWS Lake Formation is responsible for managing data transformations and permissions, ensuring a secure and organized data lake.
- Data Catalog: AWS Glue Crawlers are used to catalog the data in each bucket (landing, curated, and production), which is important for efficiently organizing and managing the data.
Data Lake Costs
The cost estimation presented below is based on the following assumptions: One data refresh per day, with 10 accounts (including 1 editor) having access to view visualizations for a maximum of 4 hours per day.
A detailed breakdown of costs for the data lake can be found on the AWS Pricing Calculator.
Note: cost estimates provided are based on currently known elements and may vary depending on factors such as changes in usage patterns, data volumes, or refresh rates.
Choice #2: Data Warehouse
Solution Architecture
Data Warehouse Architecture Diagram
Data Warehouse Pipeline
- Event Trigger: A CloudWatch Event initiates the process by triggering an AWS Step Function.
- Data Ingestion: The Lambda functions extract pertinent information from all data sources, subsequently transforming it into a columnar format.
- Data Storage: The retrieved data is stored in an Amazon Redshift database.
- Data Visualization: Amazon QuickSight queries the Redshift database and creates visualizations based on the data.
Data Warehouse Costs
The cost estimation presented below is based on the following assumptions: One data refresh per day, with 10 accounts (including 1 editor) having access to view visualizations for a maximum of 4 hours per day.
Considering the small volume of data (<100 GB) and computational needs (2 vCPU) required for this use case, a smaller Redshift instance (dc2.large) was chosen. However, it is important to note that Redshift costs rise sharply depending on the volume of data and computing needs.
A detailed breakdown of costs for the data warehouse can be found on the AWS Pricing Calculator.
Note: cost estimates provided are based on currently known elements and may vary depending on factors such as changes in usage patterns, data volumes, or refresh rates.
Solution Chosen: Data Lake
TrackIt opted for a data lake as the chosen solution for its Data Insights offering. This choice was made due to the scalability, flexibility, and cost-effectiveness offered by data lakes, particularly for larger implementations. A data lake allows for efficient storage, processing, and analysis of substantial volumes of both structured and unstructured data without a significant increase in costs. This aligned with the initial goal of building a solution that can accommodate future data sources and analytics requirements.
Conclusion
In the example outlined above, the decision to implement a data lake was driven by the need for scalability, flexibility, and cost-effectiveness. Data lakes provide the advantage of accommodating growing data sources, allowing for efficient storage, processing, and analysis of data without steep increases in cost. However, if the project objective was to ensure optimized performance, structured data insights, and simplified maintenance, data warehouses would have been a more appropriate solution.
This highlights the importance of conducting a thorough evaluation before choosing a solution. The choice between a data lake and a data warehouse depends on multiple factors such as workflow requirements, the volume of data to be stored and processed, and the degree of flexibility required. A careful examination of workflow needs, advantages, and potential trade-offs of each solution can help in making the right decision.
About TrackIt
TrackIt is an Amazon Web Services Advanced Tier Services Partner specializing in cloud management, consulting, and software development solutions based in Marina del Rey, CA.
TrackIt specializes in Modern Software Development, DevOps, Infrastructure-As-Code, Serverless, CI/CD, and Containerization with specialized expertise in Media & Entertainment workflows, High-Performance Computing environments, and data storage.
In addition to providing cloud management, consulting, and modern software development services, TrackIt also provides an open-source AWS cost management tool that allows users to optimize their costs and resources on AWS.