An Overview of Databases — Part 1: DBMS Flow

6 min readJul 19, 2024

Databases are a fundamental component of modern applications, enabling efficient data storage, retrieval, and manipulation. A Database Management System (DBMS) is software that interacts with users, applications, and the database to capture and analyze data. Here, we delve into the core flows and components of a DBMS, highlighting their importance and functionality.

There are 8 different parts, each part focuses on one specific area of the Database to help you have a better understanding of the Database and its ecosystem. We will learn how to choose the right DB for our use case.

We are not going to cover a lot of deep details about each part, these blog posts aim to help you get the gist and some important parts of designing a DB-heavy application.

Part 1: DBMS Flow
Part 2: Non-Relational DB vs Relational
Part 3: CAP and BASE Theorem
Part 4: How to choose a Database?
Part 5: Different Solutions for Different Problems
Part 6: Concurrency Control
Part 7: Distributed DBMS
>> Part 7.1: Distributed DBMS (Apache Spark, Parquet + Pyspark + Node.js)
Part 8: Clocks
>> Part 8.1: Clocks (Causal Consistency With MongoDB)
>> Part 8.2: Clocks (MongoDB Replica and Causal Consistency)
Part 9: DB Design Mastery
Part 10: Vector DB
Part 11: An interesting case, coming soon!

DBMS Flow:

The DBMS can be understood through four key flows:

Query Planning
A query plan (or query execution plan) is a sequence of steps used to access data in a SQL relational database management system.
Operator Execution
An operator is a keyword in SQL that helps us access the data and returns the result based on the operator’s functionality.
Access Methods
Access methods define how data is stored and retrieved in a DBMS.
Disk Manager
The disk manager is responsible for managing the physical storage of data on disk.

Flow #4: Disk Manager

The disk manager is a critical component of a Database Management System (DBMS) responsible for managing the physical storage of data on disk. Its primary functions include efficiently storing, retrieving, and managing data, ensuring optimal performance, and maintaining data integrity and durability.

It must efficiently handle different types of workloads and storage models to ensure optimal performance and reliability.

DB Workloads:

The Database Disk Manager plays a crucial role in optimizing the performance of different types of database workloads, specifically OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing). Understanding how these workloads interact with the disk manager helps design efficient database systems.

OLTP (Online Transaction Processing)

Designed for transaction-oriented tasks, such as inserting, updating, and deleting small amounts of data frequently.
Databases: RDS, PostgreSQL, MongoDB

2. OLAP (Online Analytical Processing)

Designed for analytical queries that involve reading large amounts of data, often for reporting and data analysis purposes.
Databases: Redshift, ClickHouse

3. Hybrid (Combines OLTP and OLAP)

Supports both transaction and analytical processing on the same database instance.
Databases: AWS RDS, Aurora

Storage Models:

The database storage mode whether row-oriented or column-oriented significantly impacts the role and operations of the Disk Manager in a DBMS. Understanding this connection helps in optimizing storage, retrieval, and overall database performance for various types of workloads.

1. N-ary Storage Model (Row Store)

Ideal for OLTP workloads where queries are likely to access individual entities and perform write-heavy operations.
Advantages: Fast inserts, updates, and deletes.
Disadvantages: Poor performance for scanning large portions of the table and not ideal for compression.

2. Decomposition Storage Model (Column Store)

Stores a single attribute for all tuples contiguously, which is beneficial for analytical queries.
Advantages: Reduced I/O, faster query processing, better data compression.
Disadvantages: Slow for point queries, inserts, updates, and deletes.

3. Hybrid Storage Model

Combines the advantages of both row and column storage to achieve faster processing and spatial locality.

Flow #3: Access Methods

Access methods define how data is stored and retrieved in a DBMS. These methods use specific data structures to organize and access data efficiently. The choice of access method can greatly affect the performance of database operations. Access methods have their own data set structures to organize data.

Two common data structures used in access methods are:

Hash Tables (Unordered)

Used for quick lookups, insertions, and deletions.
Example: DynamoDB uses hash tables and B-trees for fast data retrieval.

Trees (Ordered)

Used for range queries and ordered data retrieval.
A B+Tree is a self-balancing, ordered tree data structure that allows searches, sequential access, insertions, and deletions in O(log n)
Example: PostgreSQL provides B-tree, R-tree, and hash table access methods for indices.

Note: A DBMS uses various data structures for many different parts of the system internals:

Internal Meta-Data: This is data that keeps track of information about the database and the system state.
Ex: Page tables, page directories
Core Data Storage: Data structures are used as the base storage for tuples in the database.
Temporary Data Structures: DBMS can build DSs on the fly while processing a query to speed up execution.
Ex: hash tables for joins
Table Indices: Auxiliary DSs can be used to make it easier to find specific tuples.

Flow #2: Operator Execution

Operators are fundamental components in SQL that facilitate data access and manipulation. Each operator carries out a specific task, such as selection, projection, or join, and the result of an operator's execution depends on its functionality. The execution of these operators follows a specific precedence, which can significantly impact the performance and result of the query.

Selection: Retrieves rows that satisfy a given condition.
Projection: Select certain columns from the table.
Join: Combines rows from two or more tables based on a related column.

You can find more information about Operators here, and about Execution Precedence in SQL here.

Flow #1: Query Planning

Query planning is the process by which the DBMS translates a high-level SQL query into an executable plan. This involves parsing the query, validating it, and determining the most efficient way to execute it.

Parsing: The DBMS parses the SQL query to check for syntax errors.
Validation: Ensures that the query refers to valid database objects and permissions.
Optimization: The optimizer evaluates various possible plans and selects the most efficient one based on cost estimation, which includes factors like I/O operations, CPU usage, and network latency.

The DBMS converts a SQL statement into a query plan, operators in the query plan are arranged in a tree, and data flows from the leaves of this tree towards the root. The output of the root node in the tree is the result of the query, the same query plan can be executed in multiple ways.

We can use the `EXPLAIN` command to see what query plan the planner creates for any query. Plan reading is an art that requires some experience to master.

Sample on Query planning: Read more here

db=> EXPLAIN SELECT * FROM my_partitioned_table WHERE ordered_at > '2024-01-01' AND ordered_at < '2024-12-01' limit 1;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.19 rows=1 width=1231)
   ->  Seq Scan on my_partitioned_table_partitioned_2024 my_partitioned_table  (cost=0.00..1553.04 rows=8336 width=1231)
         Filter: ((ordered_at > '2024-01-01'::timestamp) AND (ordered_at < '2024-12-01'::timestamp))
(3 rows)