Database Engineering Part 4: Understanding Data Pages and Indexing

Augustine Umeagudosi
Dec 9, 2023

Welcome to Part 4 of our database engineering series! In this instalment, we look at two fundamental building blocks of database architecture: data pages and indexing. Understanding them is key to optimising query execution and overall performance within database systems.

Data Pages

In the intricate world of databases, data pages stand as the foundational units, akin to the pages of a book in a library. They serve as the bedrock upon which the entirety of database operations and information retrieval relies.

Much like a typical book page has its structure (headers, paragraphs, page numbers, and footers), a data page in a database follows a specific structure. At its core, a data page contains a header that holds metadata about the page itself, including information about the page number, size, and possibly pointers to other pages. The body of the data page consists of rows or records, where actual data is stored. These rows can vary in size but are organized and structured according to the database schema. Pointers or links may also exist within the data page, establishing relationships between different pages or data entries.
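
To make the header-plus-rows layout concrete, here is a minimal sketch of a page packed into a fixed-size block of bytes. The 8 KB page size, the three header fields, and the length-prefixed row format are assumptions chosen for illustration; real database engines use richer, engine-specific layouts.

```python
import struct

PAGE_SIZE = 8192        # assumed page size in bytes
HEADER_FORMAT = "<IHH"  # hypothetical header: page number, row count, free-space offset
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def build_page(page_number, rows):
    """Pack rows (bytes objects) into one fixed-size page."""
    body = bytearray()
    for row in rows:
        # Prefix each row with its length so variable-size rows can be read back.
        body += struct.pack("<H", len(row)) + row
    if HEADER_SIZE + len(body) > PAGE_SIZE:
        raise ValueError("rows do not fit in one page")
    header = struct.pack(HEADER_FORMAT, page_number, len(rows), HEADER_SIZE + len(body))
    # Pad with zero bytes so every page has the same size.
    return bytes(header + body).ljust(PAGE_SIZE, b"\x00")


def read_page(page):
    """Return (page_number, rows) decoded from a page built by build_page."""
    page_number, row_count, _free_offset = struct.unpack_from(HEADER_FORMAT, page, 0)
    offset = HEADER_SIZE
    rows = []
    for _ in range(row_count):
        (length,) = struct.unpack_from("<H", page, offset)
        offset += 2
        rows.append(page[offset:offset + length])
        offset += length
    return page_number, rows


page = build_page(7, [b"alice,30", b"bob,41"])
print(len(page), read_page(page))
```

Because every page has the same size, the engine can locate page N in a file by simple arithmetic (N times the page size), which is one reason fixed-size pages are so common.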

Data pages are persisted on disk and cached in memory, similar to how books are stored on a shelf and temporarily opened for reading. When data is written to a database, it is stored in data pages, which are allocated space on storage media such as hard drives or solid-state drives. Pages that are in use are cached in memory (an area often called the buffer pool or buffer cache) for faster access, much like keeping a book open at a certain page for quicker reference. The database management system handles the intricate tasks of managing this storage, optimizing retrieval, and ensuring data integrity through the ACID guarantees we discussed in part 2 of this series.

In some database management systems like Oracle or SQL Server, an extent is a contiguous set of data pages (Oracle calls them data blocks) that serves as the unit of space allocation within the managed storage. The size of an extent depends on the database system and its configuration: in SQL Server an extent is always eight contiguous 8 KB pages (64 KB), while in Oracle it is a configurable number of contiguous data blocks. When data needs storage, the database allocates whole extents rather than individual pages, which reduces allocation overhead. As the database requires additional space, it assigns new extents, accommodating the growing volume of data.
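
As a quick worked example of extent arithmetic, the sketch below uses SQL Server's documented figures (8 KB pages, eight pages per extent). The helper names are hypothetical, and the calculation deliberately ignores page header overhead, so treat the results as rough lower bounds rather than exact allocations.

```python
PAGE_SIZE_KB = 8       # SQL Server page size
PAGES_PER_EXTENT = 8   # SQL Server pages per extent
EXTENT_SIZE_KB = PAGE_SIZE_KB * PAGES_PER_EXTENT  # 64 KB


def pages_needed(data_kb):
    """Whole pages required for data_kb kilobytes (ceiling division)."""
    return -(-data_kb // PAGE_SIZE_KB)


def extents_needed(data_kb):
    """Whole extents required to hold those pages (ceiling division)."""
    return -(-pages_needed(data_kb) // PAGES_PER_EXTENT)


print(EXTENT_SIZE_KB)        # 64
print(extents_needed(1024))  # 16 extents for 1 MB of data
print(extents_needed(100))   # 2 extents: 13 pages do not fit in one extent
```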

Indexing

Indexing can be likened to the table of contents in a book. It complements data pages by providing a structured way to access information stored in the data pages on demand. Indexes serve as pointers or guides to efficiently locate specific data within data pages. They organize and sort data based on certain columns or fields, facilitating quicker search and retrieval operations. By using indexes, the database system can swiftly navigate through data pages, similar to quickly finding a chapter in a book by referring to its table of contents.

Types of Indexes

Database management systems employ various types of indexes tailored to different scenarios and data structures.

I. B-Tree Indexes

B-tree indexes organize keys in a balanced, sorted tree, so a lookup descends from the root to a leaf in logarithmic time. Because the keys are kept in order, the same structure serves exact-match lookups, range scans, and sorted retrieval. For instance, in a contact list table, a B-tree index on the “last name” field lets the database find one surname quickly or return contacts in alphabetical order without sorting the whole table, even in large datasets. This makes B-trees a good default for structured data with diverse access patterns.
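
The sketch below is not a real B-tree. It uses a sorted list and binary search (Python's `bisect` module) as a stand-in for the ordered leaf level of one, which is enough to show why ordered keys give logarithmic lookups and cheap range scans. The contact data and function names are made up for illustration.

```python
import bisect

# (row_id, last_name) pairs standing in for rows of a contacts table.
contacts = [(0, "Okafor"), (1, "Adams"), (2, "Nwosu"), (3, "Baker"), (4, "Adams")]

# Sorted (last_name, row_id) entries play the role of the ordered index.
index = sorted((last_name, row_id) for row_id, last_name in contacts)
keys = [last_name for last_name, _ in index]


def find_equal(last_name):
    """Row ids whose last name matches exactly, found by binary search."""
    lo = bisect.bisect_left(keys, last_name)
    hi = bisect.bisect_right(keys, last_name)
    return [row_id for _, row_id in index[lo:hi]]


def find_range(low, high):
    """Row ids with low <= last_name <= high, returned in key order."""
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return [row_id for _, row_id in index[lo:hi]]


print(find_equal("Adams"))   # [1, 4]
print(find_range("B", "O"))  # [3, 2] -> Baker, Nwosu
```

A real B-tree spreads these sorted entries over many pages and adds upper levels to route each search, but the lookup logic at each level is the same kind of ordered comparison.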

II. Hash Indexes

Hash indexes use a hash function to map each key to a storage location (a bucket), enabling rapid exact-match lookups. The idea is the same one behind hash tables in programming languages: Python dictionaries, for example, are hash tables that retrieve a value by its key in roughly constant time. However, because hashing does not preserve key order, hash indexes are unsuitable for range queries or sorting; they are best reserved for direct, exact-match retrieval.
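
Since a Python dictionary is itself a hash table, it makes a convenient stand-in for a hash index. In this minimal sketch the table, the `email` column, and the helper name are assumptions for illustration.

```python
from collections import defaultdict

# A list of dicts standing in for table rows.
rows = [
    {"id": 101, "email": "ada@example.com"},
    {"id": 102, "email": "bob@example.com"},
    {"id": 103, "email": "cy@example.com"},
]

# Hash index on "email": key -> list of row positions.
hash_index = defaultdict(list)
for position, row in enumerate(rows):
    hash_index[row["email"]].append(position)


def lookup(email):
    """Exact-match lookup: one hash probe instead of scanning every row."""
    return [rows[pos] for pos in hash_index.get(email, [])]


print(lookup("bob@example.com"))

# A range query gets no help from the hash: every key must be visited.
in_range = sorted(k for k in hash_index if "a" <= k < "c")
print(in_range)
```

The last two lines show the limitation described above: answering "all emails between a and c" requires touching every key, because the hash function scatters keys without regard to their order.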

III. Bitmap Indexes

Bitmap indexes are a special way of organizing information about a database column. They keep something like a big checklist (a bitmap) for each distinct value in that column. Each bit in the checklist represents a row in the table and shows whether that row holds the value. For example, in a survey database, a bitmap index on the “gender” column (with values like ‘Male’ and ‘Female’) would quickly show which respondents are male or female. Bitmaps for different conditions can be combined with fast bitwise AND and OR operations, which makes it easy to group and study responses, as in surveys about people’s preferences or market research on shopping habits. Bitmap indexes work best on columns with few distinct values.
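
The checklist idea can be sketched with Python integers used as bitmaps, where bit i is set when row i holds the value. The survey columns and helper names below are invented for the example; production bitmap indexes also compress their bitmaps, which this sketch does not attempt.

```python
# Two columns of a hypothetical survey table, one entry per row.
genders = ["Male", "Female", "Female", "Male", "Female"]
likes_product = [True, True, False, False, True]


def build_bitmaps(values):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    bitmaps = {}
    for row_number, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)
    return bitmaps


def rows_set(bitmap):
    """Row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


gender_bitmaps = build_bitmaps(genders)
likes_bitmaps = build_bitmaps(likes_product)

# "Female AND likes the product" is a single bitwise AND of two bitmaps.
female_and_likes = gender_bitmaps["Female"] & likes_bitmaps[True]
print(rows_set(female_and_likes))  # [1, 4]
```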

IV. Other Structures

R-trees are spatial index structures used in mapping applications and GPS systems; they organize spatial data hierarchically to support range and nearest-neighbour queries. For instance, a navigation app in the style of Google Maps can use a spatial index of this kind to answer location-based searches quickly. Full-text indexes, on the other hand, are essential in search engines: they index the words or phrases in documents or web pages so that textual searches can be answered quickly and accurately. These index structures cater to specialised data types and retrieval needs, optimizing performance in their respective domains.
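
Full-text indexes are commonly built as inverted indexes, which map each word to the set of documents containing it. The sketch below shows that idea only; the documents and the `search` helper are hypothetical, and real engines add tokenization rules, stemming, and relevance ranking on top.

```python
from collections import defaultdict

documents = {
    1: "database pages store rows",
    2: "indexes speed up database queries",
    3: "pages are grouped into extents",
}

# Inverted index: word -> set of ids of the documents that contain it.
inverted = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        inverted[word].add(doc_id)


def search(query):
    """Ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(inverted.get(w, set()) for w in words))
    return sorted(matches)


print(search("pages"))           # [1, 3]
print(search("database pages"))  # [1]
```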

While indexing significantly enhances query performance and data retrieval speed in databases, it also comes with certain drawbacks:

1. Overhead on Write Operations: Each index added to a database table incurs an overhead during write operations (inserts, updates, and deletes). When data changes, indexes must be updated accordingly, potentially impacting the speed of these operations. Maintaining multiple indexes can significantly increase write operation times, impacting overall system performance.

2. Increased Storage Requirements: Indexes consume additional storage space. For large databases or tables with numerous indexes, this extra storage can be substantial. Consequently, it might lead to increased storage costs and utilization.

3. Index Maintenance Overhead: Regular index maintenance is crucial to ensure optimal performance. This involves tasks like rebuilding or reorganizing indexes to prevent fragmentation, updating statistics, and reviewing indexing strategies. Mismanaged or neglected maintenance can result in degraded performance over time.

4. Risk of Over-Indexing: Over-indexing, or creating indexes on too many columns, can lead to diminishing returns. While indexes speed up read operations, they may slow down write operations and consume excessive storage. Moreover, queries might use unnecessary indexes, leading to increased memory usage and potentially impacting overall performance.

5. Query Plan Complexity: In some cases, a database query optimizer might struggle to choose the most efficient query plan when multiple indexes exist. This situation can result in suboptimal query performance or longer optimization times.

Understanding these drawbacks is crucial when designing a database schema and choosing which columns to index. Effective indexing strategies involve balancing the benefits of improved query performance against the potential downsides, aiming to strike an optimal trade-off for a specific database workload and usage patterns.

Indexing Best Practices

  1. Selective Indexing: This involves strategically indexing columns based on query patterns to enhance performance without overburdening the system. For instance, in an e-commerce platform, indexing high-traffic search fields like “product name” or “category” can expedite search results, while leaving rarely queried fields (like product_image) unindexed avoids unnecessary resource consumption.
  2. Query Optimization: Indexing strongly influences query optimization because it gives the optimizer faster ways to reach the data. For instance, in a customer database, an index on the “customer ID” column accelerates searches for specific customers, reducing query execution time. Poor indexing choices, however, can lead to suboptimal performance. Effective indexing aligns with query patterns, ensuring quicker access to data and enhancing overall database performance.
  3. Monitoring and Tuning: Regularly monitoring index usage and performance is vital. Database performance monitoring tools can track index utilization and identify unused or rarely used indexes. Index tuning involves periodically reassessing and adjusting indexes as query patterns or data volumes change. For example, in a social media platform, tuning indexes for user profiles or trending topics based on usage trends improves query response times and overall system efficiency. Regular maintenance and adjustments keep indexes aligned with evolving database needs.

Index Fragmentation

Index fragmentation refers to the phenomenon where the logical order of an index's pages no longer matches their physical order on the storage medium. This degradation builds up over time through insertions, updates, and deletions, which split pages and leave them unevenly filled. As a result, logically adjacent pages are no longer contiguous on disk, and queries that scan the index must do more I/O.

Fragmentation is most visible in write-heavy workloads. For instance, in a financial database, frequent inserts and deletes on indexed transaction records can fragment the indexes over time and slow down data retrieval. Balancing read and write operations is therefore critical: in a customer relationship management (CRM) system with frequent updates, every additional index adds work to each write, which can impede overall system responsiveness.

Mitigation Strategies

  1. Regular Maintenance: Periodically reorganizing or rebuilding indexes to defragment the data and optimize storage layout.
  2. Monitoring Tools: Utilizing database monitoring tools to identify fragmentation levels and take corrective actions when necessary.
  3. Proper Indexing Practices: Implementing effective indexing strategies to minimize fragmentation by considering data insertion, update patterns, and workload characteristics.
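
The page-split mechanism behind fragmentation, and the effect of a rebuild, can be sketched with a toy model. Everything here is an assumption for illustration: pages hold only four keys, a split always places the new page at the end of the file, and fragmentation is measured as the share of logically adjacent pages that are not physically adjacent.

```python
import bisect

PAGE_CAPACITY = 4  # assumed keys per page, kept tiny for illustration


def build(keys):
    """Insert keys one by one; return the pages in logical (key) order.

    Each page is a dict with a physical id and a sorted list of keys.
    """
    pages = [{"id": 0, "keys": []}]
    for key in keys:
        # Find the page whose key range should hold the new key.
        firsts = [p["keys"][0] for p in pages if p["keys"]]
        i = max(bisect.bisect_right(firsts, key) - 1, 0)
        page = pages[i]
        bisect.insort(page["keys"], key)
        if len(page["keys"]) > PAGE_CAPACITY:
            # Page split: the upper half moves to a new page that is
            # physically appended to the file (next free id) but logically
            # sits right after the page that overflowed.
            mid = len(page["keys"]) // 2
            new_page = {"id": len(pages), "keys": page["keys"][mid:]}
            page["keys"] = page["keys"][:mid]
            pages.insert(i + 1, new_page)
    return pages


def fragmentation(pages):
    """Fraction of logical neighbours that are not physical neighbours."""
    ids = [p["id"] for p in pages]
    pairs = list(zip(ids, ids[1:]))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if b != a + 1) / len(pairs)


def rebuild(pages):
    """Rewrite the index from its sorted keys, as an index rebuild would."""
    return build(sorted(k for p in pages for k in p["keys"]))


in_order = build(range(20))             # keys arrive in ascending order
out_of_order = build(range(19, -1, -1)) # keys arrive in descending order

print(fragmentation(in_order))               # 0.0
print(fragmentation(out_of_order))           # 1.0
print(fragmentation(rebuild(out_of_order)))  # 0.0
```

In this model, keys that arrive in ascending order always split the last page, so logical and physical order stay aligned, while keys that arrive out of order keep splitting earlier pages and scatter the index. Rebuilding from sorted keys restores the alignment, which is the effect the regular maintenance step above aims for.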

Considerations for Choosing Indexes

Choosing indexes starts with understanding how the data is queried. In an e-commerce app, for example, knowing that shoppers browse products by category (clothes, electronics, and so on) suggests organizing and indexing products by category. Similarly, in a map application, indexing places by location helps with rendering maps and finding the best routes. These considerations illustrate the significance of aligning indexing strategies with workload patterns and data distribution for optimal database performance in real-world scenarios.

References

  1. https://learn.microsoft.com/en-us/sql/relational-databases/pages-and-extents-architecture-guide?view=sql-server-ver16
  2. https://www.geeksforgeeks.org/indexing-in-databases-set-1/
  3. https://www.pragimtech.com/blog/sql-optimization/how-do-sql-indexes-work/
  4. https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-index-design-guide?view=sql-server-ver16

What’s Next?

In the next part of this Database Engineering blog series, we will uncover the secrets behind accelerating data retrieval, fine-tuning performance, and maximizing efficiency within databases while exploring various techniques, from fundamental principles to advanced methodologies, guiding you through a journey of unlocking the true potential of organized data.
