Understanding Database Partitioning in Distributed Systems : Secondary Index

Priya Patidar
6 min readJan 19, 2024

--

Introduction

In our previous discussions, we delved into the intricacies of partitioning key-value stores and the role of primary indexes in managing and accessing data efficiently. Building upon that foundation, we now turn our attention to an equally important aspect of database management: partitioning of secondary indexes.

To begin, let’s clarify what a secondary index is. In simple terms, a secondary index is an additional way to organize and access data in a database, separate from the primary key. While the primary key is unique for each record and is used for primary data retrieval, a secondary index is based on non-primary key attributes. It’s like having an extra set of labels on your files, not just the main label, which allows you to find data based on different criteria.

For example, consider a database of library books. The primary key might be a unique book ID, but you could have a secondary index based on the book’s genre or author. This way, if you want to find all books by a specific author, you don’t have to search through the entire database; you can quickly access them through the author-based secondary index.

Now, the challenge arises when we consider how to partition these secondary indexes. Unlike primary keys, which are often uniformly distributed, secondary indexes can be skewed due to the non-uniqueness and variability of the attributes they’re based on. Efficiently partitioning secondary indexes is crucial for maintaining balance and performance in a distributed database system.

In this article, we will discuss two main approaches to tackle this challenge: ‘Partitioning Secondary Index by Document’ and ‘Partitioning Secondary Index by Term’. Let’s delve into each method to understand their implications and applications

Partitioning Secondary Index by Document

Example Setup:

Our online bookstore database is partitioned by genre, and within each genre, we have a local secondary index based on author names.

Partitions by Genre:

  • Novel
  • Science Fiction
  • Biography

Local Secondary Index in Each Partition:

  • Novel: Index by Author
  • Science Fiction: Index by Author
  • Biography: Index by Author

Scenario: Searching by Author Name

Let’s say a user wants to find all books by “Harper Lee.” The query process involves:

  1. Search Each Partition: The system searches the local secondary index for “Harper Lee” in each genre partition.

2. Aggregate Results: If “Harper Lee” books are found in multiple partitions, the system aggregates these results.

Reading Data

Read Operation in Novel Partition:

  • The system checks the local secondary index for “Harper Lee” in the Novel partition and quickly retrieves the relevant book IDs.
  • The system then fetches the details of those books from the Novel partition.

Read Operation in Other Partitions:

  • The system repeats the above process for the Science Fiction and Biography partitions. This can be done parallel to all partitions.

Updating Data

Adding a New Book:

  • When a new book by “Harper Lee” is added to the Biography partition, the book record is inserted into the Biography partition.
  • The local secondary index for authors in the Biography partition is updated to include this new book under “Harper Lee.”

Modifying an Existing Book:

  • If a book’s details are modified, say a book by “Harper Lee” in the Novel partition, the record in the Novel partition is updated.
  • If the update affects the index (e.g., changing the author name), the local secondary index in the Novel partition is also updated accordingly.

The querying process across multiple partitions is known as “scatter and gather.” This involves parallel queries to each partition to collect the needed data. Although parallelization can improve speed, scatter and gather can be resource-intensive, especially in large databases. The main costs are associated with querying each partition separately and then consolidating the results, which often requires significant computational and network resources. Consequently, while local secondary indexes enhance partition-specific query efficiency, they also necessitate a careful consideration of the overhead involved in scatter and gather operations in distributed database environments.

Partitioning Secondary Index by Term

In the “Partitioning Secondary Index by Term” approach, rather than keeping a local secondary index within each partition, we create a global secondary index based on a particular term or attribute, such as the author’s name. This index spans across all partitions and allows for direct access to records based on that term, regardless of the partition they reside in.

Using the Bookstore Example

Partitions by Genre

  • Novel
  • Science Fiction
  • Biography

Global Secondary Index by Author:

  • Instead of having separate author indexes in each genre partition, we have one comprehensive index for authors that covers all genres.

Scenario: Searching by Author Name

When a user searches for books by “Harper Lee,” the process is more straightforward:

  1. Query the Global Index: The system checks the global secondary index for “Harper Lee.”
  2. Retrieve Results Directly: The index provides direct references to all “Harper Lee” books across all genre partitions.

Reading Data:

  • The query directly accesses the global secondary index, retrieves the book IDs associated with “Harper Lee,” and then fetches the book details from their respective partitions.

Updating Data

Adding a New Book:

  • When a new book by “Harper Lee” is added, regardless of its genre, the book record is inserted into the appropriate genre partition.
  • The global secondary index for authors is updated to include this new book under “Harper Lee.”

Modifying an Existing Book:

  • If details of a book by “Harper Lee” are modified, the record in its respective genre partition is updated.
  • The global secondary index is also updated if the modification affects the term used in the index (e.g., changing the author name).

The “Partitioning Secondary Index by Term” approach offers a more efficient solution for queries that span multiple partitions, like searching by author across all genres. It reduces the overhead of scatter and gather operations since the index is global and not confined to individual partitions. However, this method might introduce challenges in maintaining the global index, especially in a rapidly changing data environment. It’s a trade-off between the ease of cross-partition queries and the complexity of maintaining a global index.

Real-World Applications: Choosing Between Local and Global Secondary Indexing

When handling read queries, Partitioning by Document (Local Secondary Index) focuses on efficient data retrieval within individual partitions. This means when a read query is executed, it’s directed to a specific partition’s local index, leading to faster access but only within the confines of that partition. On the other hand, Partitioning by Term (Global Secondary Index) streamlines read queries across the entire database. This approach uses a single, comprehensive index that spans all partitions, allowing queries to retrieve data from multiple partitions simultaneously, offering a broader and more integrated view of the database.

Example of Local Secondary Index Use:

Imagine an e-commerce platform that has a database partitioned by product categories like Electronics, Clothing, and Home Goods. Each category partition has its own local secondary index based on product brand.

Use Case for Local Index:

  • A customer wants to view all Sony products in the Electronics category.
  • The query accesses the local secondary index in the Electronics partition, quickly retrieving all Sony products in this category.
  • This is efficient because the query is specific to the Electronics category and doesn’t require data from other partitions.

Example of Global Secondary Index Use:

Consider a multinational corporation’s employee database, partitioned by country — USA, UK, Japan, etc. There’s a global secondary index based on employee roles like Manager, Engineer, Salesperson.

Use Case for Global Index:

  • The HR department wants to identify all Managers across the corporation, regardless of the country.
  • The query utilizes the global secondary index for the ‘Manager’ role, swiftly gathering data across all country partitions.
  • This approach is effective as it needs to aggregate data from multiple partitions, which would be cumbersome with local indexes.

Conclusion

In this article, we explored two pivotal approaches to partitioning secondary indexes in database systems: Partitioning by Document (Local Secondary Index) and Partitioning by Term (Global Secondary Index). Each method has its strengths and trade-offs, with the former enhancing efficiency within individual partitions and the latter streamlining cross-partition queries.

Understanding these techniques is crucial for database architects and developers, especially when dealing with large and distributed databases. The choice between local and global secondary indexing should be informed by the specific needs of the application, considering factors like query patterns, data distribution, and system scalability.

As we continue to delve into the complexities of database management, our next discussion will focus on how to choose the right partitioning strategy for our keys and the intricacies of rebalancing partitions. This upcoming topic is vital for maintaining optimal performance and scalability in a constantly evolving data environment. Stay tuned for more insights into making informed decisions in the dynamic world of database architecture.

--

--