Data Management Basics: From Structured to Data Lakes

Jose Cerezo Egea
Published in SDG Group
10 min read · May 21, 2024

Nowadays, correct and effective data management is vitally important, as data can contain valuable information. In this article we will discuss some of the fundamentals of data management, from the difference between structured and unstructured data to the differences between OLAP and OLTP. In addition, we will cover key concepts such as ETL/ELT, data lakes and data warehouses.

Structured vs. Unstructured Data

In the context of data, it is essential to understand the difference between structured and unstructured data. The former refers to a type of data that has a specific and defined format, such as a database table, where each field in the table contains a specific type of data (string, integer, etc.).

On the other hand, unstructured data is a type of data that does not follow a previously defined schema, but is flexible and, therefore, difficult to organize within a table or database, such as emails, audio or video files, etc.

Image 1. Example of structured data vs. unstructured data.

Both types of data play a very relevant role nowadays and each has its advantages and disadvantages. Some of them are:

Table 1. Comparison of structured and unstructured data.

The following example illustrates the difference between structured and unstructured data. While structured data is organized in defined fields, unstructured data is plain text that may require additional analysis to extract relevant information.

import csv

# Loading structured data from a CSV file
with open('clients.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        first_name = row['first_name']
        last_name = row['last_name']
        age = row['age']
        email = row['email']
        print("First name:", first_name)
        print("Last name:", last_name)
        print("Age:", age)
        print("Email:", email)

# Output of structured data:
# First name: John
# Last name: Doe
# Age: 35
# Email: john.doe@example.com
# …

# Loading unstructured data (customer reviews) from a txt file
with open('customer_views.txt', mode='r') as txt_file:
    opinions = txt_file.read().split('\n\n')
    for i, opinion in enumerate(opinions, 1):
        print(f"Opinion {i}: {opinion}")

# Output of unstructured data:
# Opinion 1: Today I had an amazing experience at your store. The staff was very friendly and helpful.
# …

Data Warehouses

Data warehouses are systems that allow storing large volumes of data in a structured and centralized way. They allow data from various sources to be transformed and integrated in a single, coherent place for subsequent analysis to obtain insights and support decision making.

It is important that the data is always up to date and accessible, and to guarantee quality and integrity, data warehouses usually require cleansing and validation processes. On the negative side, data warehouses tend to be somewhat costly to maintain, precisely because they must comply with these requirements.
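
To illustrate this point, here is a minimal sketch (file name, columns and rules are hypothetical) of the kind of cleansing and validation data might undergo before being loaded into a warehouse:

import pandas as pd

# Hypothetical customer extract; column names are illustrative
customers = pd.read_csv('customers.csv')

# Cleansing: remove exact duplicates and rows missing a mandatory field
customers = customers.drop_duplicates()
customers = customers.dropna(subset=['email'])

# Validation: enforce expected types and a simple integrity rule before loading
customers['age'] = customers['age'].astype(int)
assert (customers['age'] >= 0).all(), "Negative ages found: rejecting the batch"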

There are other important components in the data architecture besides data warehouses:

Data Marts: These are subsets of a data warehouse that are designed to meet the analysis needs of a specific functional area of an organization such as sales or finance (departmental) or can be created based on geographic region or product line (distributed).

Data Vault: a data warehouse modeling approach consisting of three types of components:

  • Hubs: represent key concepts, e.g., customers, and are typically modeled in a single table with unique identifiers (primary key).
  • Links: represent the links or relationships between Hubs, typically as a separate table and foreign keys.
  • Satellites: represent the descriptive attributes of a Hub or Link, typically as a separate table and foreign keys.

This design facilitates data auditing, data lineage and new data integration, providing greater flexibility and scalability.
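
To make the three components more concrete, the following sketch defines a minimal Data Vault model for customers and orders. Table and column names are hypothetical, and the SQL is carried in Python strings in the same style used later in this article:

# Hypothetical Data Vault tables for customers and orders
hub_customer = """
CREATE TABLE hub_customer (
    customer_hk   CHAR(32) PRIMARY KEY,  -- hash of the business key
    customer_id   VARCHAR(50),           -- business key
    load_date     TIMESTAMP,
    record_source VARCHAR(50)
);
"""
link_customer_order = """
CREATE TABLE link_customer_order (
    link_hk       CHAR(32) PRIMARY KEY,
    customer_hk   CHAR(32) REFERENCES hub_customer(customer_hk),
    order_hk      CHAR(32),              -- foreign key to a hub_order table
    load_date     TIMESTAMP,
    record_source VARCHAR(50)
);
"""
sat_customer = """
CREATE TABLE sat_customer (
    customer_hk   CHAR(32) REFERENCES hub_customer(customer_hk),
    load_date     TIMESTAMP,
    name          VARCHAR(100),          -- descriptive attributes
    email         VARCHAR(100),
    PRIMARY KEY (customer_hk, load_date)
);
"""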

Data Lakes

A Data Lake is a centralized data storage repository that allows storing, managing and sharing large volumes of data in its original format, i.e. data without prior processing. That is why Data Lakes do not require a predefined schema, but can store data in different formats (structured, semi-structured and unstructured data) even without knowing how they will be used in the future, in contrast to Data Warehouses, which do require prior transformation of the data so that they conform to a defined schema.

The flexibility of Data Lakes makes them ideal for organizations that want to perform advanced analysis and discover valuable information that might have been discarded during preprocessing in a traditional system. In addition, by not imposing rigid structures, data exploration and analysis remain agile and unconstrained.

Finally, the option of integrating data from different sources and formats makes Data Lakes a centralized and highly scalable place to store raw data from the entire organization, i.e., they act as a meeting point for collaboration between different teams and areas within an organization, improving synergy between departments and decision making based on solid and comprehensive data.

Best practices for the implementation of a Data Lake

The implementation of a Data Lake is critical to the success of data management and analysis in the enterprise environment. Here are some of the best practices for its effective and efficient implementation.

Division of the Data Lake into multiple layers (or zones)

In order to improve the organization and accessibility of the data in the Data Lake, it is recommended that it be divided into multiple layers, where each layer will contain a specific type of data and will be used for different purposes. The most common layers are:

  • Raw / bronze / landing layer: As its name indicates, this layer contains the data in its original form, i.e., as received from the different data sources. No transformation has been applied to this data, so the layer acts as the entry point for data ingestion. In addition, the data in this layer is stored indefinitely (immutable) and serves as a historical record.
  • Conformed / standardized layer: In general, raw data is ingested in different formats such as CSV, JSON, etc. This layer converts all these formats to a standard one, usually Parquet, a columnar storage format optimized for large volumes of data, so that we have consistent and optimized data for further analysis (see the sketch after this list).
  • Cleansed / silver / processed layer: This layer is the one that acts as a filter, because here the data is transformed (cleaned, integrated and consolidated) in order to obtain a uniform layer in terms of schema and format. In addition, data can be normalized and metadata can be added to facilitate traceability and efficient search.
  • Presentation / gold / application layer: Using the cleansed data from the previous layer, business rules are applied to the data so that it is ready to be consumed by end users, such as data scientists or analytics applications. Transformations can also be applied with the goal of enhancing the data and enriching it, for example, by applying aggregations or summaries.
  • Sandbox / exploration layer: This layer can be developed optionally, as it is usually a copy of the raw layer where the data can be read and modified. It is a useful layer from the point of view of data scientists, since they can explore the data without having lost any information (no transformations) and with a lot of flexibility. Besides being an optional layer, multiple layers of this type can be built if necessary.

Image 2. Data Lake layers.
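
As a minimal sketch of the raw-to-conformed step referenced above, the snippet below converts a CSV file from the raw layer into Parquet in the conformed layer. The paths are hypothetical, and writing Parquet with pandas assumes pyarrow or fastparquet is installed:

import pandas as pd

# Read a file from the raw layer in its original format (CSV in this example)
raw_sales = pd.read_csv('datalake/raw/sales/2024/05/sales.csv')

# Write it to the conformed layer as Parquet, an optimized columnar format
raw_sales.to_parquet('datalake/conformed/sales/2024/05/sales.parquet', index=False)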

Creation of a folder structure

Another good practice is to create a folder structure within the Data Lake, so that data can be accessed efficiently. Depending on the interests of the organization, this folder structure can be done in several ways:

  • Data segregation: Consists of separating data depending on its origin, data type or business unit. It facilitates data access for data scientists so that they are able to retrieve data relevant to their use case.
  • Access control: Used if data must have different levels of permissions, so that not all users in the organization can access all data. In this case access policies must be defined.
  • Compliance requirements: It may be interesting to store sensitive data in different folders, so that they have stricter privacy and security policies.
  • Backup and disaster recovery: It is possible to have different types of folders depending on the frequency with which backups are made. For example, having a folder with frequent backups that contains data that should never be lost or, on the contrary, having a folder that is never backed up, because it contains data that is not necessary to store because it can be regenerated.
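
Purely as an illustration, a folder layout that combines several of these criteria might look like the following (layer and path names are hypothetical):

# Hypothetical path convention: layer / source system / ingestion date / file
layer = 'raw'                  # raw, conformed, cleansed, presentation, sandbox
source = 'crm'                 # data segregation by origin
ingestion_date = '2024/05/21'  # partitioning by date eases backups and retention
file_name = 'customers.csv'

path = f"{layer}/{source}/{ingestion_date}/{file_name}"
print(path)  # raw/crm/2024/05/21/customers.csv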

Implementation of audit and integrity controls

For any Data Lake it is essential to design a series of mechanisms that guarantee that the data does not suffer alterations when passing from one layer to another. These mechanisms are:

  • Data audit: to store all the modifications that have been made to the data in order to guarantee traceability.
  • Integrity checks: having a series of automated checks that verify the consistency of the data throughout the transformation process. For example, if a format transformation is performed, it must be verified that the same number of files is present in the source as in the destination and if this is not the case, the transformation process must be restarted.
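
A minimal sketch of such a check, assuming one output file per input file and hypothetical folder names, could be:

from pathlib import Path

def same_file_count(source_dir, dest_dir, src_ext, dst_ext):
    """Verify that a format conversion produced one output file per input file."""
    n_src = len(list(Path(source_dir).glob(f"*{src_ext}")))
    n_dst = len(list(Path(dest_dir).glob(f"*{dst_ext}")))
    return n_src == n_dst

if not same_file_count('datalake/raw/sales', 'datalake/conformed/sales', '.csv', '.parquet'):
    # In a real pipeline this would trigger an alert or restart the transformation
    print("File counts do not match: restarting the transformation process")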

OLAP and OLTP

OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing) are two data processing systems that allow efficient interaction with data, although each was designed to meet specific and complementary needs, as we will see below.

On the one hand, OLAP is a data processing system focused on the fast and efficient analysis of large volumes of data. It is a technology designed to perform complex queries on multidimensional data sets, which may have been consolidated and/or aggregated from different sources, using optimized storage and processing techniques to ensure fast response times.

It also offers the ability to perform comparative analysis, forecasting, trend and segmentation analysis, as well as providing interactive tools to explore the data quickly and at different levels of granularity, thus being able to perform very deep analysis (data drilling) or more in the form of a summary (roll-up), depending on the needs.
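
For example, the same sales data can be queried at different levels of granularity. The queries below are illustrative and reuse the sales_table that appears later in this article, with year and product_id as assumed columns:

# Roll-up: summary at a coarse level of granularity
rollup_query = """
SELECT year, SUM(total_sale)
FROM sales_table
GROUP BY year;
"""
# Drill-down: the same data at a much finer level of detail
drilldown_query = """
SELECT year, month, product_id, SUM(total_sale)
FROM sales_table
GROUP BY year, month, product_id;
"""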

OLAP is therefore useful for extracting valuable information about data stored in data warehouses, which can be used to support strategic decision making.

On the other hand, OLTP is the data processing system used for real-time transaction processing. It is a technology mainly used in operational database systems in which all kinds of daily transactions are stored, such as sales, orders, payments, etc.

These systems are optimized to support a high volume of transactions in real time with almost no delay, in addition to guaranteeing their integrity and concurrency, i.e., they guarantee that all transactions will be executed, that the database will always remain consistent, and that multiple transactions can run in parallel.

OLTP is therefore useful in providing a reliable environment for real-time transactions, ensuring that data insertion, modification and deletion operations are performed quickly, securely and consistently.
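
As a minimal sketch of these guarantees, the example below uses SQLite (illustrative only; real OLTP workloads run on dedicated relational databases) to execute two related operations as a single atomic transaction:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE inventory (product_id TEXT PRIMARY KEY, quantity INTEGER)")
conn.execute("CREATE TABLE sales (product_id TEXT, quantity INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('product_to_sell', 10)")

try:
    with conn:  # both statements commit together, or neither does
        conn.execute("UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 'product_to_sell'")
        conn.execute("INSERT INTO sales VALUES ('product_to_sell', 1)")
except sqlite3.Error:
    # On failure the transaction is rolled back and the database remains consistent
    print("Transaction failed and was rolled back")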

Image 3. OLAP and OLTP summaries.

In the following example we can see the difference between an OLAP and an OLTP query. The OLAP query is typically used for analytical purposes, such as obtaining the monthly sales total, while the OLTP query focuses on individual records and real-time updates, such as decrementing inventory after a sale.

# OLAP Query: Analysis of sales by month
olap_query = """
SELECT month, SUM(total_sale)
FROM sales_table
GROUP BY month;
"""
# OLTP Query: Inventory update
oltp_query = """
UPDATE inventory_table
SET quantity = quantity - 1
WHERE product_id = 'product_to_sell';
"""

ETL/ELT

In order to derive value from data, it is imperative to manipulate and adapt it as required. This often involves implementing robust processes to integrate and process data so that organizations are able to convert raw data into actionable and valuable information that can be useful to support decision making.

Both ETL and ELT are methodologies that allow us to process data; the key difference between the two approaches is the order in which the operations that manipulate the data are executed. The ETL methodology executes the following operations, in order, on the data:

1. Extract (E): the extraction phase consists of obtaining data from different sources, which can be databases, applications, file systems, etc. In this phase the data is raw, i.e., it has not undergone any kind of processing; it has only been stored.

2. Transform (T): the transformation phase consists of performing all the necessary operations on the previously extracted data in order to obtain a quality, coherent data layer. Operations may include cleaning of erroneous or duplicated data, format conversion, aggregation, etc.

3. Load (L): the loading phase consists of loading all the transformed data into database tables or other types of data storage, such as data warehouses. This is the phase where the data is structured and optimized for analysis and subsequent querying.

In the case of ELT, the order of operations is different:

1. Extract (E): data is extracted from various sources, as in ETL.

2. Load (L): the data is stored without being transformed.

3. Transform (T): transformations are performed on the stored data, which may involve the use of data analysis tools, complex SQL queries, or advanced analysis processes.

Image 4. ETL and ELT workflows.

As we have seen, ETL and ELT are similar methodologies, but they differ in the timing and manner in which transformations are performed. Let’s look at some of their advantages and disadvantages:

Table 2. ETL vs ELT.

By the nature of these approaches, we can say that ETL is used in Data Warehouses, as data is transformed before loading to maintain consistency and integrity, while ELT makes sense to use in a Data Lake, as data is stored raw and transformed as needed.

Now that we understand both methodologies, let’s look at a Python example that illustrates the fundamental differences.

import pandas as pd

## ETL
# Step 1: Extract data
raw_data = pd.read_csv('sales_data.csv')

# Step 2: Transform data
raw_data['Date'] = pd.to_datetime(raw_data['Date'])
transformed_data = raw_data.groupby(raw_data['Date'].dt.month).sum(numeric_only=True)  # sum numeric columns only

# Step 3: Load data
transformed_data.to_csv('sales_per_month.csv')

import pandas as pd

## ELT
# Step 1: Extract data
raw_data = pd.read_csv('sales_data.csv')

# Step 2: Load data
data_lake = raw_data.copy() # Simulated data lake, storing the raw data

# Step 3: Transform data (on the data stored in the lake, as needed)
data_lake['Date'] = pd.to_datetime(data_lake['Date'])
transformed_data = data_lake.groupby(data_lake['Date'].dt.month).sum(numeric_only=True)  # sum numeric columns only

Conclusions

In this article, we have reviewed the fundamentals of data management, from the distinction between structured and unstructured data to technologies such as data warehouses and data lakes and their best practices. In addition, we have highlighted the difference between OLAP and OLTP, as well as the relevance of ETL/ELT processes, accompanied by code examples and use cases where it is interesting to apply this knowledge.
