Which Data Architecture Should I Choose for My Workplace? — A Data Engineer’s Approach
In today’s world, data has become one of the most valuable assets for organizations, playing a crucial role in making strategic decisions, optimizing operations, and gaining competitive advantage. In this context, data engineering is the critical discipline that manages and guides the entire process, from collecting data through transforming and storing it to making it accessible.
In the age of big data, businesses must not only own data but also interpret it, make it accessible, and integrate it into decision support systems. This requires new and more flexible approaches to data processing and management. As the volume, variety, and use cases of data continue to grow, organizations are turning toward architectures that can respond to a range of needs. Data management strategies such as the Data Warehouse, Data Lake, Data Lakehouse, and Data Mesh play a significant role here, and each offers a different solution in terms of data type, access model, performance requirements, organizational structure, and governance policies. Data warehouses focus on structured data, while data lakes offer a more flexible structure for storing large volumes of data of any kind. The data lakehouse combines the advantages of both approaches, creating an environment optimized for data analytics. Data Mesh, meanwhile, decentralizes data management, drawing on ideas from domain-driven design and microservices, so that responsibility for data can be distributed efficiently across large organizations.
However, the foundations of a successful data architecture must be laid at the very beginning of the design process. This is not just about building a technical structure, but about aligning it with organizational goals and data management strategies. This article explores the theoretical details of these processes and walks through an example project that demonstrates how to build such a system.
Requirement Analysis: The Cornerstone for a Successful Data Architecture
Starting the process of building a data architecture with the right approach is critical to preventing issues that may arise in later stages. The first and most important step is therefore requirement analysis: if the needs are not clearly defined, the project will start with the wrong architecture and waste both resources and time.
Before embarking on a data architecture project, it is crucial to have a clear understanding of what exactly is being built. Not every data architecture project follows a standard template. Each organization’s data structure, business goals, expectations, and user needs are unique. Therefore, the technical team responsible for building the data architecture must work closely with the relevant business units and stakeholders to clarify the scope.
For the purposes of this article, an example project workflow has been defined. Based on its needs, an architecture was chosen and the process continued with that choice; however, the other alternatives are also discussed for context.
In this example project, the goal is to create a modern data warehouse that consolidates sales data and supports meaningful analyses of it. The platform could be any relational DBMS (such as Microsoft SQL Server or PostgreSQL). The primary objective is to strengthen data-driven decision-making, simplify reporting, and produce business insights.
What is the Purpose of Requirement Analysis?
Requirement analysis is performed to:
Understand business needs,
Identify stakeholder expectations,
Clarify the scope, and
Choose the correct technological infrastructure.
In the example project, it is assumed that data flows from two main source systems (ERP and CRM). ERP (Enterprise Resource Planning) is a software system used to integrate all business processes and resources of a company. This software manages a range of business functions such as finance, human resources, production, logistics, sales, and marketing together. The goal of ERP systems is to make processes more organized and transparent by efficiently using the company’s resources (time, labor, materials, capital, etc.). On the other hand, CRM (Customer Relationship Management) is a software system used to manage and improve a company’s relationships with both existing and potential customers. CRM software collects and analyzes customer data, helping develop more personalized and efficient strategies in sales, marketing, and support processes.
When data is provided in CSV format from ERP and CRM systems, the use of file-based data sources requires careful planning and robust data control throughout the ETL (Extract, Transform, Load) process. Raw data is often incomplete, corrupted, or inconsistent, which necessitates cleaning and resolving quality issues before analysis can take place.
Cleaning the data alone is not enough; it must also be integrated into a user-friendly and understandable structure. The data model should be simple, logical, and designed in a way that supports analysis. In the example project considered, historical data tracking was not required. This means that only the most recent records would be considered during data loading, making the system simpler and faster. This preference was made to streamline the explanation and simplify the model.
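To make this concrete, here is a minimal sketch of how such a load could look in PostgreSQL-style SQL. The staging table, file path, and column names are hypothetical, not taken from the project:

```sql
-- Hypothetical staging table for the CRM customer extract
-- (all names here are illustrative, not taken from the project).
CREATE TABLE stg_crm_customers (
    customer_id   INTEGER,
    customer_name VARCHAR(100),
    updated_at    TIMESTAMP
);

-- Bulk-load the CSV file delivered by the CRM system.
COPY stg_crm_customers (customer_id, customer_name, updated_at)
FROM '/data/crm/customers.csv'
WITH (FORMAT csv, HEADER true);

-- Keep only the most recent record per customer, as the requirements
-- specify: older versions are filtered out before the data moves on.
SELECT customer_id, customer_name, updated_at
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY updated_at DESC) AS rn
    FROM stg_crm_customers AS s
) ranked
WHERE rn = 1;
```

The `ROW_NUMBER()` window function implements the "most recent records only" requirement directly in the load query, which keeps the model simple and fast.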
Another critical requirement is providing clear and comprehensible documentation for the data model that will be generated at the end of the system. This documentation ensures that both technical teams and business users can adapt to the system more easily. This document, which explains how the data warehouse should be used, the purpose of each table, and how relationships are established, directly impacts the sustainability of the project.
In summary, in this project:
SQL will be used,
Data sources will be provided from ERP and CRM systems in CSV files,
Data will be cleaned and transformed into a user-friendly model,
Only the most recent data will be used,
Detailed documentation will be prepared as the final outcome.
Once this analysis is completed, the structure of the project will be clear, and the next phase, the design of the data architecture, will begin. A solid requirement analysis is the cornerstone of all data projects.
Designing the Data Architecture: Creating the Right Structure
Data architecture design is a crucial step that directly impacts the success of a data warehouse project. This phase defines how the data warehouse will be structured and how the data will be processed. A well-designed data architecture facilitates the flow, integration, storage, and access of data. However, there are several approaches to designing data architecture, and selecting the right method should align with the project’s requirements and objectives.
Data Architecture Options
The choice of approach in data architecture design varies depending on the project’s goals, data types, and intended use. Each approach has its advantages and challenges. Therefore, it is essential to thoroughly examine the fundamental characteristics of each approach and understand under which conditions they are most appropriate.
1. Data Warehouse
A data warehouse is a structure in which large volumes of structured data are collected and optimized for analysis and reporting. In SQL-based systems, the data is organized into a well-defined structure that business intelligence applications work with directly. Data warehouses generally have the following features:
- Structured Data Storage: A data warehouse processes only structured data. This data is usually stored in relational databases, organized into tables that are standardized and structured. This allows data to be stored within clearly defined data structures such as columns and rows.
- Reporting and Analysis Focused: Data warehouses are optimized to allow data analysts and business intelligence teams to easily generate reports. This makes it easier to run fast queries and perform extensive data analyses.
- Data Cleaning and Integration: In data warehouses, ETL (Extract, Transform, Load) processes are applied to clean and merge data from different sources. This process ensures that data is loaded into the warehouse in a consistent, clean, and usable format (a brief example follows this list).
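As an illustration of the Transform step, the sketch below standardizes a few columns while moving data from a staging table into the warehouse. All table and column names are hypothetical:

```sql
-- Hypothetical Transform step: standardize staged CRM data on its way
-- into the warehouse (all table and column names are illustrative).
INSERT INTO dw_customers (customer_id, customer_name, country, gender)
SELECT
    customer_id,
    TRIM(customer_name),                 -- strip stray whitespace
    UPPER(COALESCE(country, 'N/A')),     -- normalize case, fill missing values
    CASE WHEN gender IN ('M', 'F')
         THEN gender
         ELSE 'N/A'                      -- replace invalid codes
    END
FROM stg_crm_customers;
```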
Advantages:
- High-Performance Reporting: Data warehouses are optimized for reporting and can handle complex queries efficiently, providing quick insights and facilitating high-performance reporting.
- Data Security and Consistency: Data warehouses maintain a high level of data security and ensure data consistency, providing a reliable environment for decision-making and analysis.
- Easy Querying and Accessibility: Data is well-organized, making it easy to query and access. This allows data analysts to quickly retrieve and work with the data.
Challenges:
- Works Only with Structured Data: A data warehouse is suitable only for structured data, meaning it is not appropriate for semi-structured or unstructured data such as text files, images, or videos.
- High Cost: The processing and storage of large data sets can be expensive, especially when working with massive amounts of data. Maintaining such a system might require significant investment in both infrastructure and operational costs.
Platforms Used to Build Data Warehouse (DW) Architecture:
- Google BigQuery: Offers serverless and highly scalable architecture, making it suitable for quick deployment and handling large volumes of data without the need for managing infrastructure.
- Amazon Redshift: A fast and scalable data warehouse solution on AWS, ideal for projects that are already integrated with the AWS ecosystem.
- Snowflake: A cloud-based platform with a multi-cluster, shared-data architecture that supports multi-cloud deployments, offering great flexibility and scalability for data warehousing.
- Microsoft Azure Synapse Analytics (formerly SQL DW): Combines data warehousing with big data integration, making it a versatile option for organizations leveraging the Microsoft Azure ecosystem.
- Teradata: One of the traditional big data solutions, often used in large-scale, on-premises environments requiring complex data warehousing solutions.
- IBM Db2 Warehouse: IBM’s enterprise data warehouse solution, suitable for organizations needing a robust, on-premises solution with high security and reliability.
The platform chosen for the data architecture depends on the project’s scale, budget, technical requirements, and the team’s expertise. For small and medium-sized projects, serverless solutions like Google BigQuery or Snowflake, which offer quick setup and low maintenance, may be preferred. Amazon Redshift is advantageous for projects integrated within the AWS ecosystem. Microsoft Azure Synapse is recommended for enterprises using Azure, as it combines both Data Lake and Data Warehouse features. For real-time data needs or large datasets that require frequent updates, Snowflake’s time travel and performance capabilities stand out. On-premises solutions like Teradata or IBM Db2 Warehouse are ideal for cases that require high security and data sovereignty.
2. Data Lake
A data lake is a flexible structure where structured, semi-structured, and unstructured data are stored together. This type of architecture is typically used to store raw data and is employed for advanced analytics. Data lakes are commonly used in large-scale data projects, particularly in fields such as data science and machine learning, where the ability to process diverse types of data is crucial.
Data lakes allow organizations to store vast amounts of data in its native format, making it easier to integrate and analyze data from various sources without the need for upfront structuring. This flexibility makes data lakes particularly suitable for big data projects that require extensive storage and processing capabilities.
Storage of Different Data Types:
A data lake can store various types of data (from databases to text files, images, and more) in its raw form. The data is typically stored in a file-based format (such as CSV, JSON, Parquet). This allows for the inclusion of both structured and unstructured data, providing great flexibility in data storage.
Data Processing Flexibility:
This architecture offers extensive flexibility for data engineers and data scientists to process data in any way they choose. It is well-suited for advanced analytics and machine learning tasks, as the raw data can be processed and transformed as needed.
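As a concrete illustration, engines such as DuckDB can query raw files in the lake directly, without a prior load step. The paths and columns below are hypothetical:

```sql
-- DuckDB-style SQL: query raw Parquet files in place, no load step needed
-- (paths and columns are illustrative).
SELECT product_category, COUNT(*) AS event_count
FROM read_parquet('datalake/events/*.parquet')
GROUP BY product_category;

-- Raw CSV files can be explored the same way, with schema inference.
SELECT *
FROM read_csv_auto('datalake/raw/crm_customers.csv')
LIMIT 10;
```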
Data Updates:
A data lake is ideal for handling continuously changing and growing datasets, supporting real-time or near-real-time data processing. This makes it especially useful for projects that require up-to-date information for analysis.
Advantages:
- Storage of Both Structured and Unstructured Data: Data lakes allow for the storage of a wide variety of data types, providing flexibility in data storage.
- Data Flexibility: Data lakes offer significant flexibility, allowing for easy addition of different types of data without requiring a rigid structure.
- Suitable for Machine Learning and Advanced Analytics: Due to their ability to store raw data, data lakes are ideal for complex machine learning tasks and advanced analytical processes.
Challenges:
- Complex Data Management: Managing data in a lake can be challenging. Without proper organization, data can become difficult to process, leading to the “Data Swamp” problem, where data becomes disorganized and hard to handle.
- Data Security and Access Control: Compared to data warehouses, managing data security and access control in data lakes can be more complex due to the variety of data types and formats stored within the system.
Platforms Used to Build Data Lake Architecture:
- Amazon S3: The most commonly used infrastructure for building data lakes, providing scalable and cost-effective storage for vast amounts of data.
- Azure Data Lake Storage (ADLS Gen2): A high-performance data lake solution built on Microsoft Azure, designed for large-scale analytics.
- Google Cloud Storage (GCS): Google Cloud’s solution for data lakes, offering scalable storage and integration with other Google Cloud services.
- Apache Hadoop HDFS: Preferred for on-premises systems, providing a distributed file system to store and process large data sets.
- MinIO: An open-source, S3-compatible platform for building data lakes, providing object storage that is scalable and flexible.
3. Data Lakehouse (Data Lake + Data Warehouse Hybrid)
A data lakehouse acts as a bridge between a data lake and a data warehouse: it combines the raw-data storage and processing flexibility of a data lake with the structured data management and optimized query performance of a data warehouse. This makes it an attractive solution for organizations that need the best of both worlds, enabling easy integration of varied data types while providing efficient query performance for analytics.
Flexibility and Structure:
A data lakehouse combines the flexibility of a data lake with the structure and performance of a data warehouse. While data can be stored in a structured format, semi-structured and unstructured data can also be processed and integrated. This hybrid approach allows organizations to handle various types of data while maintaining a consistent structure for analytical purposes.
Advanced Analytics and Reporting:
In a data lakehouse, both SQL-based queries and machine learning operations can be performed to meet analytical and reporting needs. This enables organizations to leverage both traditional business intelligence and advanced data science techniques on the same platform.
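As a minimal illustration, assuming Spark SQL with Delta Lake (table and column names are hypothetical), a lakehouse table supports transactional updates and versioned reads on top of data lake storage:

```sql
-- Spark SQL with Delta Lake (illustrative names): an ACID table
-- on top of data lake storage.
CREATE TABLE sales_orders (
    order_id   INT,
    amount     DOUBLE,
    order_date DATE
) USING DELTA;

-- Updates are transactional, unlike plain files in a data lake.
UPDATE sales_orders SET amount = 0 WHERE order_id = 42;

-- Time travel: read the table as it was at an earlier version.
SELECT * FROM sales_orders VERSION AS OF 0;
```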
Advantages:
- Combines the performance of a data warehouse with the flexibility of a data lake.
- Merges the benefits of both approaches and provides a conducive environment for working with diverse data types.
Challenges:
- Complex setup and management: These types of architectures can be difficult to set up and manage due to the integration of both structured and unstructured data.
- Advanced data management and performance optimization: Requires careful management and optimization of both data processing and performance, making it more resource-intensive to maintain.
Platforms Used to Build Data Lakehouse Architecture:
- Databricks + Delta Lake: Often associated with the Lakehouse architecture, providing a unified approach to batch and streaming data processing with a strong focus on reliability and performance.
- Apache Iceberg: An open table format originally developed at Netflix, offering capabilities such as ACID transactions for large-scale data lakes.
- Apache Hudi: An open-source solution that supports data versioning and stream processing, often used for handling large volumes of incoming data with the ability to track changes over time.
- Azure Synapse Analytics: A platform that combines data warehousing and data lake capabilities, ideal for organizations using Microsoft Azure, offering seamless integration between both architectures.
- Snowflake (with recent updates): Has started to provide Lakehouse functionality, blending the performance and features of both data lakes and data warehouses.
- Google BigLake: Google Cloud’s Lakehouse solution, which integrates storage and analytics across multiple clouds, providing flexible and scalable data processing.
The platform selection for a data lakehouse architecture should be based on both the flexibility of big data and the performance expectations for analytics. If a combination of streaming and batch processing is needed, and open-source flexible solutions are desired, Databricks + Delta Lake is a strong choice. For enterprise Azure environments, Azure Synapse Analytics, which combines both lake and warehouse capabilities, is recommended. BigQuery + BigLake integration is advantageous for Google Cloud users, offering the ability to combine lake data with analytical queries. If data versioning, ACID compliance, and cost optimization are important, solutions like Apache Hudi or Apache Iceberg should be considered. Additionally, if there is a need to centrally manage data from different domains, Unity Catalog integrated with Databricks can be a good option for governance.
Here is a table comparing the three architectures described above:

| Criterion | Data Warehouse | Data Lake | Data Lakehouse |
| --- | --- | --- | --- |
| Data types | Structured only | Structured, semi-structured, and unstructured | All types |
| Primary use | Reporting and business intelligence | Advanced analytics and machine learning | Both BI and advanced analytics |
| Flexibility | Low (rigid, predefined schemas) | High (raw data in native formats) | High, with warehouse-grade structure |
| Main challenges | Cost; unsuitable for unstructured data | Data swamp risk; security and access control | Complex setup and management |
4. Data Mesh
Data Mesh proposes a distributed architecture instead of a centralized data structure. In this approach, each department creates its own data product and shares it with other departments. Data Mesh makes the data architecture modular and decentralized, which can be particularly suitable for large and complex organizations.
- Distributed Data Management: Each department creates and takes responsibility for its own data product. This avoids centralizing data in one location, providing greater flexibility.
- Prevention of Bottlenecks: By avoiding the creation of a centralized data management structure, Data Mesh prevents the bottlenecks that typically arise in traditional centralized systems.
Advantages:
- The distributed structure allows for more flexible data management and access.
- Data ownership is shared among departments, with each department being responsible for its own data product.
Challenges:
- The absence of centralized data management can create challenges in maintaining data consistency and integrity.
- Data integration and processing workflows can become more complex.
Platforms Used to Build Data Mesh Architecture:
- AWS Lake Formation + Glue + S3: Provides domain-based data access and governance.
- Databricks Unity Catalog: Supports the data governance aspect of Data Mesh.
- Starburst / Trino: Enables cross-domain data querying and federation.
- Snowflake: Facilitates data sharing between domains with Secure Data Sharing.
- Kafka / Event Streaming (Confluent, Redpanda): Used for inter-domain data streaming.
- DataHub / Amundsen / OpenMetadata: Focuses on metadata management and cataloging.
The choice of platform for a Data Mesh architecture depends on how ready the organization is for a domain-driven data ownership model that moves away from centralization. If teams are structured to develop data products independently, solutions that support governance across domains, such as Databricks Unity Catalog or Snowflake Secure Data Sharing, are recommended. If there is a need for data federation and unified querying across different sources, distributed query engines like Starburst or Trino are suitable. For metadata management and transparent data discovery, tools like DataHub, Amundsen, or OpenMetadata are ideal. In scenarios requiring event-driven data sharing, Kafka or Confluent infrastructure can enable real-time data flow between domains. Used together in organizations with clear internal data ownership and access policies, these tools can establish a successful Data Mesh infrastructure.
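As one concrete example of domain-to-domain sharing, here is a minimal Snowflake Secure Data Sharing sketch; the database, table, and account names are hypothetical:

```sql
-- Snowflake Secure Data Sharing (illustrative names): the sales domain
-- publishes a data product that other domains' accounts can query.
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db               TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.public        TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.public.orders TO SHARE sales_share;

-- The consumer account gains read access without any data being copied.
ALTER SHARE sales_share ADD ACCOUNTS = myorg.marketing_account;
```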
In data architecture selection, each approach has its own advantages and challenges. A data warehouse provides a more structured and reporting-focused framework, while a data lake offers flexibility and capabilities for big data analytics. A data lakehouse bridges the gap between both worlds, and data mesh provides a more flexible and decentralized data management model. The right approach should be determined based on the project’s needs and long-term goals.
Selection of the Right Approach
In this project, the data warehouse approach has been selected as the most appropriate choice. A data warehouse is the ideal approach for projects that require working with structured data and are focused on reporting and business intelligence. However, other approaches such as data lakes, data lakehouses, and data meshes may also be advantageous for specific projects.
Each approach has its unique advantages and challenges. For example:
- Data Warehouse provides strong reporting and analysis capabilities but can only work with structured data.
- Data Lake offers flexibility and a wide variety of data types but can result in complex data management.
- Data Lakehouse combines both worlds, offering flexibility and performance.
- Data Mesh presents a distributed architecture, but attention must be given to integration and consistency challenges.
The data warehouse approach has been adopted as an example in alignment with the project’s requirements. However, when determining the most suitable approach for any data project, factors such as data type, analytical needs, and use cases must be considered.
Detailed Explanation of Medallion Architecture: Modern Data Warehouse Design
Different approaches used in data warehouse design have a significant impact on the system’s flexibility, performance, and efficiency. In this article, we will examine popular data warehouse design approaches such as Inmon, Kimball, Data Vault, and Medallion, and then provide a more in-depth explanation of the Medallion Architecture.
Inmon Approach: Centralized Data Warehouse Design
Inmon is the oldest of the classical approaches to data warehouse design. According to Inmon, the data warehouse is designed as an enterprise data warehouse (EDW), where all data is stored in a single central location. In this approach, data is normalized and loaded into the warehouse according to a carefully designed enterprise data model.
Features:
- Data is typically stored in third normal form (3NF); a minimal sketch appears at the end of this subsection.
- Provides an enterprise-level approach, meaning a central data warehouse for the entire organization.
- The data integration process is complex, but it ensures a high level of data accuracy.
Advantages:
- Data is consistent and well-organized.
- Effective in large-scale projects and enterprise-level data integration.
Challenges:
- Slow initial development, since the enterprise-wide model must be designed up front.
- Requires complex and time-consuming data modeling.
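The sketch referenced above shows what Inmon-style 3NF storage looks like, with hypothetical table names: each fact is stored exactly once, and relationships are expressed through foreign keys.

```sql
-- Inmon-style 3NF sketch (illustrative names): each fact is stored once,
-- and relationships are expressed through foreign keys.
CREATE TABLE customer (
    customer_id INT PRIMARY KEY,
    full_name   VARCHAR(100)
);

CREATE TABLE customer_address (
    address_id  INT PRIMARY KEY,
    customer_id INT REFERENCES customer (customer_id),
    city        VARCHAR(60),
    postal_code VARCHAR(10)
);
```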
Kimball Approach: User-Friendly Data Warehouse
Kimball offers a more user-friendly and flexible approach than Inmon. In Kimball’s methodology, the data warehouse is built from smaller, subject-specific sections called data marts. Data is organized using simple models such as the star schema and snowflake schema (a minimal star-schema sketch appears at the end of this subsection).
Features:
- Data is typically denormalized and optimized for easy querying.
- Each data mart serves specific reporting and analytical purposes for a particular business area.
Advantages:
- Provides easy access and fast querying.
- Ideal for small projects or specific analytical requirements.
Challenges:
- Denormalized data may lead to data redundancy in large datasets.
- Managing data consistency becomes more challenging.
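Here is the minimal star-schema sketch referenced above, using hypothetical table and column names: one fact table surrounded by denormalized dimensions, plus the kind of simple reporting query this layout enables.

```sql
-- Minimal star schema (illustrative names): one fact table surrounded
-- by denormalized dimension tables.
CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,
    customer_name VARCHAR(100),
    country       VARCHAR(60)     -- denormalized: no separate address table
);

CREATE TABLE dim_date (
    date_key      INT PRIMARY KEY, -- e.g. 20250101
    calendar_date DATE,
    year          INT,
    month         INT
);

CREATE TABLE fact_sales (
    customer_key INT REFERENCES dim_customer (customer_key),
    date_key     INT REFERENCES dim_date (date_key),
    quantity     INT,
    sales_amount DECIMAL(12, 2)
);

-- A typical report: monthly revenue per country, one simple join away.
SELECT d.year, d.month, c.country, SUM(f.sales_amount) AS revenue
FROM fact_sales f
JOIN dim_date d     ON f.date_key = d.date_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY d.year, d.month, c.country;
```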
Data Vault: Flexible and Modular Data Model
The Data Vault approach has brought a new perspective to data warehouse design by offering flexibility and modularity. In this approach, data is stored in its raw form and then processed by adding business rules. Data Vault is typically preferred in large and complex data projects.
Features:
- Provides quick adaptation and flexibility.
- Data accuracy and business rules are handled separately at each layer.
- Data is divided into three core components: Hub, Link, and Satellite (a minimal sketch appears at the end of this subsection).
Advantages:
- Allows rapid integration with various data sources.
- Easily adapts to changing business requirements.
Challenges:
- The complex data model may make management difficult.
- May require higher processing costs.
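The sketch below illustrates the three Data Vault components with hypothetical tables: Hubs hold business keys, Links hold relationships between Hubs, and Satellites hold the descriptive attributes that change over time.

```sql
-- Minimal Data Vault sketch (illustrative names).
CREATE TABLE hub_customer (
    customer_hk   CHAR(32) PRIMARY KEY,  -- hash of the business key
    customer_bk   VARCHAR(50),           -- business key from the source
    load_date     TIMESTAMP,
    record_source VARCHAR(20)
);

CREATE TABLE hub_order (
    order_hk      CHAR(32) PRIMARY KEY,
    order_bk      VARCHAR(50),
    load_date     TIMESTAMP,
    record_source VARCHAR(20)
);

CREATE TABLE link_customer_order (
    link_hk       CHAR(32) PRIMARY KEY,
    customer_hk   CHAR(32) REFERENCES hub_customer (customer_hk),
    order_hk      CHAR(32) REFERENCES hub_order (order_hk),
    load_date     TIMESTAMP,
    record_source VARCHAR(20)
);

CREATE TABLE sat_customer_details (
    customer_hk   CHAR(32) REFERENCES hub_customer (customer_hk),
    load_date     TIMESTAMP,
    customer_name VARCHAR(100),
    country       VARCHAR(60),
    PRIMARY KEY (customer_hk, load_date)  -- history is kept per load
);
```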
Medallion Architecture: Modern and Simplified Data Warehouse Design
Medallion Architecture is one of the most recent approaches to modern data warehouse design. It organizes data processing into three layers: Bronze (raw data), Silver (cleansed data), and Gold (business rule-compliant data).
Key Layers of the Medallion Architecture
- Bronze Layer (Raw Data): The Bronze layer is where data is initially received and stored in its unprocessed form. The data is kept in its raw, natural state, and no transformations are applied at this stage; the goal is to preserve the data exactly as it was originally received. Data engineers store raw data in this layer for error debugging and traceability (tracking data sources and changes).
- Silver Layer (Cleansed Data): In the Silver layer, raw data is cleaned, normalized, and organized. This layer is dedicated to the transformations and cleaning processes that prepare the data for analysis. Missing or erroneous values are corrected, and the data is made more consistent.
- Gold Layer (Business Rule-Compliant Data): The Gold layer is where data is prepared for business intelligence, reporting, and analysis, with business rules applied. Data modeling is performed here based on the needs of business users, and the data is aligned with business intelligence tools (such as Power BI or Tableau) and optimized for reporting.
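A minimal end-to-end sketch of the three layers, in PostgreSQL-style SQL with hypothetical names, might look like this:

```sql
-- Bronze: land the raw CRM extract as-is, for traceability.
CREATE TABLE bronze_crm_customers (
    raw_customer_id TEXT,
    raw_name        TEXT,
    raw_updated_at  TEXT,
    loaded_at       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Silver: typed, cleaned, and deduplicated to the most recent record.
CREATE TABLE silver_customers AS
SELECT DISTINCT ON (raw_customer_id)
       CAST(raw_customer_id AS INTEGER)  AS customer_id,
       TRIM(raw_name)                    AS customer_name,
       CAST(raw_updated_at AS TIMESTAMP) AS updated_at
FROM bronze_crm_customers
ORDER BY raw_customer_id, CAST(raw_updated_at AS TIMESTAMP) DESC;

-- Gold: the business-facing model, ready for BI tools.
CREATE VIEW gold_dim_customers AS
SELECT customer_id, customer_name, updated_at
FROM silver_customers;
```

Each statement corresponds to one layer’s responsibility, which mirrors how the work is typically split into separate scripts.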
Requirements of Each Layer
Each layer has its own set of requirements. For example, the Bronze layer simply lands the raw data and saves it, with no further processing; the Silver-layer transformations are then applied in a separate script, refining the data further. Each layer is responsible only for the tasks within its own scope. Finally, the Gold layer is the stage at which the data is ready for modeling and business intelligence tasks.
Advantages of Medallion Architecture
- Simple and Understandable: The Medallion structure is easily understandable without complex data models. Each layer represents a distinct purpose and function.
- Traceability: Since every stage of the data is traceable, any data issue can be quickly identified and resolved.
- Flexibility and Performance: It provides both flexibility and enables fast data processing and querying. Moreover, since each stage can be processed separately, data management is highly flexible.
Applications of Medallion Architecture
- Big Data Projects: The Medallion architecture is an ideal solution for projects where large amounts of data are collected and processed.
- Advanced Analytics and Machine Learning: The raw and cleansed data in the Bronze and Silver layers support advanced analytics and machine learning, while the Gold layer serves data prepared for reporting and analysis.
- Data Warehousing and Business Intelligence Needs: The Medallion architecture is well-suited for both data warehousing and business intelligence projects.
The Medallion architecture is a highly flexible and efficient approach for modern data warehouse design. Offering significant advantages for both data engineers and business analysts, this model ensures clarity and traceability at every stage of the data processing. Medallion provides a crucial advantage, especially for projects that require advanced analytics and reporting.
Visualizing Data Warehouse Architecture
Data warehouse design and architecture often involve complex structures that are difficult to explain with text alone. Therefore, creating a visual representation is crucial for making data warehouse projects easier to understand and implement. Visual diagrams help in illustrating complex data flows and system structures, ensuring that all stakeholders grasp the design.
Key Elements in a Data Warehouse Architecture Diagram
When drawing a data warehouse architecture, the following key elements should be considered:
1. Data Sources
In diagrams, data sources are typically represented within a box, connected by arrows pointing towards the data warehouse.
Sources can come in various formats, such as:
- Databases
- CSV Files
- APIs
- Web Services
Visualizing these elements marks the first step in the project’s data flow.
2. ETL Processes (Extract, Transform, Load)
The process of extracting data from the source, transforming it, and loading it into the data warehouse is known as ETL.
The steps — Extract (Data Collection), Transform (Data Conversion), and Load (Data Loading) — are often depicted with sequential arrows in the diagram.
Each of these steps represents a different phase in the data flow and should be clearly visible in the diagram.
3. Data Warehouse
The data warehouse is usually represented as a centralized structure where all data is collected and made ready for analysis.
The processed data is stored here and sent to reporting processes.
4. Layers
If an approach like Medallion Architecture is adopted, different layers (Bronze, Silver, Gold) should be clearly indicated in the diagram.
Each layer is visually represented with a label that describes the extent to which the data has been processed and its intended purpose.
5. Business Intelligence and Reporting (BI & Reporting)
Business Intelligence (BI) tools and reporting platforms, where data is presented to end-users, should also be included in the diagram.
Reporting tools form the final step for analyzing and interpreting the data.
By organizing the data warehouse components into these visual elements, it becomes easier for all project stakeholders to understand the structure, flow, and processes involved in the data warehouse architecture.
Visualized Data Warehouse Schema
An example diagram illustrating how a data warehouse architecture can be visualized would include the following:
- Data Sources (such as ERP, CRM systems) provide the data flow towards the data warehouse.
- ETL Processes show how data moves from the source to the warehouse.
- Layers and Business Intelligence Tools indicate how data is processed and presented to the users.
Using visual diagrams in data warehouse design is an effective way to make complex data flows and system structures more understandable. Clearly visualizing data flows, layers, and business intelligence tools helps all project stakeholders understand the process more easily. Keeping these diagrams as a reference that guides each stage of the project helps ensure the successful execution of the data warehouse implementation.
Conclusion
The choice of data architecture is not just a technical decision; it is also a strategic and organizational one. In this blog post, different architectural approaches such as Data Warehouse, Data Lake, Data Lakehouse, and Data Mesh have been discussed in detail, explaining the scenarios in which each stands out, their advantages, challenges, and the platforms that can be used to implement them, with examples. Additionally, steps such as requirement analysis, identifying data sources, ETL processes, modeling, and documentation have been elaborated based on a real-world project, demonstrating how modern architectures like Medallion are applied in data engineering. Ultimately, the selection of the right data architecture should be made in alignment with the type of data, analytical needs, organizational structure, and long-term goals. This way, businesses not only work with data but also build agile, sustainable, and robust systems that generate value from data.