Optimizing Backfill Scripts: How We Transformed Our Data Update Process

Yunus Emre Sacma
Picus Security Engineering
5 min read · Feb 8, 2024

Introduction

In the world of data management, updating large datasets efficiently is a common challenge. This post recounts our journey of updating millions of rows in a vast database. Initially, we used standard UPDATE statements to modify new columns with values from other tables. However, as our dataset grew, this approach became inefficient. Through experimentation, we found innovative ways to improve our backfill scripts and pagination techniques, resulting in a significant performance boost. Join me as I explore our challenges and solutions.

Our table, with more than 10 million records, contains diverse data types. Slow read response times due to data volume and complex queries prompted us to pre-compute and store certain fields in new columns for faster retrieval.

The Challenge

As expected, the following statement proved excessively time-consuming, often leading to SQL timeouts and script failures. Moreover, it significantly increased the load on our database.

UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;

Journey to the Solution

In our pursuit of optimizing data updates, we implemented a paginated approach, handling the updates in manageable batches within a loop. To reduce the strain on our database, we also introduced a sleep interval after each update cycle. The revised SQL script was structured as follows:

UPDATE table_name
SET column1 = value1, column2 = value2
FROM (SELECT t.id
      FROM table_name t
      ORDER BY t.id ASC
      LIMIT ? OFFSET ?) AS subquery
WHERE subquery.id = table_name.id
  AND table_name.condition ....;

In Go, we iterate through the table using the snippet below.

// i and page are initialized before the loop
for ; i < countOfRows; i += pageSize {
    log.Info("page number: ", page)

    // update one page of rows: i is the current offset, pageSize the batch size
    err := update(i, i+pageSize)
    if err != nil {
        log.WithError(err).Error("error while updating table, failed page: ", page)
        return
    }

    // give the database room to breathe between batches
    time.Sleep(time.Duration(sleepInMS) * time.Millisecond)
    page++
}
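
For reference, here is a minimal sketch of what the update helper used above could look like, assuming a package-level *sql.DB opened against PostgreSQL and the $n placeholder style of the Postgres driver; the table name, column assignments, and the helper itself are illustrative rather than our exact production code.

package backfill

import "database/sql"

var db *sql.DB // opened elsewhere, e.g. sql.Open("postgres", dsn)

// update backfills the rows between from (inclusive) and to (exclusive)
// of the ID-ordered result set, using LIMIT/OFFSET pagination.
func update(from, to int) error {
    const query = `
        UPDATE table_name
        SET column1 = value1, column2 = value2
        FROM (SELECT t.id
              FROM table_name t
              ORDER BY t.id ASC
              LIMIT $1 OFFSET $2) AS subquery
        WHERE subquery.id = table_name.id`

    _, err := db.Exec(query, to-from, from)
    return err
}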

Later on, we realized the subquery needed to return a smaller subset so the script would run faster; to achieve this, we moved the filter conditions into the subquery.

UPDATE table_name
SET column1 = value1, column2 = value2
FROM (SELECT t.id
      FROM table_name t
      WHERE t.condition ....
      ORDER BY t.id ASC
      LIMIT ? OFFSET ?) AS subquery
WHERE subquery.id = table_name.id;

Initially, a batch-processing strategy that used an OFFSET clause for pagination showed promising results by keeping database load manageable and speeding up updates. However, as the dataset grew, the OFFSET’s increasing values led to slower execution times due to the need to traverse larger datasets for new batch starts, turning a 1.5-second process into over 15 seconds per iteration. Clearly, a new strategy was needed to maintain efficiency 🙂

To address this, a revised strategy was adopted that eliminated the OFFSET clause and instead utilized the ID of the last row updated in each batch. This approach leveraged the indexed primary key for more efficient querying, ensuring consistent execution times across batches by only considering rows with IDs greater than the last updated ID. This method not only improved the efficiency of the update process but also maintained stable performance throughout, effectively resolving the inefficiencies encountered with the original pagination strategy.

WITH updated AS (
    UPDATE table_name
    SET column1 = value1, column2 = value2
    FROM (SELECT t.id
          FROM table_name t
          WHERE t.id > ?
          ORDER BY t.id ASC
          LIMIT ?) AS subquery
    WHERE subquery.id = table_name.id
    RETURNING table_name.id
)
SELECT MAX(id) FROM updated;

In Go, we updated the loop accordingly:

for {
    log.Info("page number: ", page)

    // update returns the highest ID touched in this batch; when nothing
    // remains to update, it returns beginFrom unchanged
    lastId, err := update(beginFrom, pageSize)
    if err != nil {
        log.WithError(err).Error("error while updating table, failed page: ", page)
        return
    }

    if beginFrom == lastId {
        // successfully completed
        return
    }

    beginFrom = lastId
    time.Sleep(time.Duration(sleepInMS) * time.Millisecond)
    page++
}
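
And here is a minimal sketch of the reworked update helper this loop assumes, under the same assumptions as before (database/sql against PostgreSQL, illustrative names): it runs the CTE shown above, and when MAX(id) comes back NULL it returns beginFrom unchanged, which is what makes the beginFrom == lastId check terminate the loop.

package backfill

import "database/sql"

var db *sql.DB // opened elsewhere

// update backfills one batch of pageSize rows whose IDs are greater than
// beginFrom and returns the highest ID it touched. When no rows remain,
// MAX(id) is NULL and beginFrom is returned unchanged.
func update(beginFrom int64, pageSize int) (int64, error) {
    const query = `
        WITH updated AS (
            UPDATE table_name
            SET column1 = value1, column2 = value2
            FROM (SELECT t.id
                  FROM table_name t
                  WHERE t.id > $1
                  ORDER BY t.id ASC
                  LIMIT $2) AS subquery
            WHERE subquery.id = table_name.id
            RETURNING table_name.id
        )
        SELECT MAX(id) FROM updated`

    var lastID sql.NullInt64
    if err := db.QueryRow(query, beginFrom, pageSize).Scan(&lastID); err != nil {
        return beginFrom, err
    }
    if !lastID.Valid {
        // nothing left to update, signal completion to the caller
        return beginFrom, nil
    }
    return lastID.Int64, nil
}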

Comparing Performance

Upon running the OFFSET-based query, the database load soared to 80%, making it clear that the process was too time-consuming and impractical: it would have taken over 8 hours to complete, and every time I started it I had to stop it because of the load. With the new, last-ID-based approach, however, we reduced the total run time to about 40 minutes and significantly eased the database load.

A Hybrid Alternative to Traditional Pagination

In traditional pagination methods, especially for database queries, the use of OFFSET can lead to performance issues, particularly as data sets grow larger. A more efficient approach combines UI-driven pagination with backend optimizations. The key here is to use the ‘last seen’ ID, known to the UI, to guide the backend query.

How It Works:

  1. Standard Pagination for Initial Pages: For initial pages, or when a user navigates sequentially (like from page 1 to 2), the standard pagination approach works well. This avoids the need for OFFSET in most cases.
  2. Jumping Between Pages: When a user jumps across multiple pages (for example, from page 1 to 10), OFFSET can be used. However, this is less frequent in typical user interactions.
  3. Leveraging the ‘Last Seen’ ID: When the UI knows the ID of the last item on the current page (the ‘last seen’ ID), it can send this ID to the backend for the next query. For instance, if a user is on page 2 and requests page 3, the UI sends the last ID of page 2 to the backend.
  4. Backend Query Optimization: The backend then queries for the next set of results where the IDs are greater than the ‘last seen’ ID. This method avoids the need for OFFSET and allows the database to quickly locate the starting point for the next page, significantly improving query performance.
  5. Handling Large Jumps: For large jumps, where OFFSET might be necessary, a hybrid approach can be used: the backend combines the ‘last seen’ ID method for smaller ranges and reverts to OFFSET for larger, less frequent jumps, as sketched in the code after this list.
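
To make the idea concrete, below is a minimal sketch of how a backend could choose between the two strategies. The function shape, the column list, and the rule for deciding what counts as a jump are assumptions for illustration, not a prescribed design.

package pagination

import "database/sql"

// fetchPage returns one page of rows. When the UI requests the page
// immediately after the one it is on and supplies the last seen ID,
// keyset pagination is used; otherwise we fall back to OFFSET.
func fetchPage(db *sql.DB, lastSeenID int64, currentPage, targetPage, pageSize int) (*sql.Rows, error) {
    if targetPage == currentPage+1 && lastSeenID > 0 {
        // Keyset pagination: start right after the last row the UI has seen.
        return db.Query(`
            SELECT id, column1, column2
            FROM table_name
            WHERE id > $1
            ORDER BY id ASC
            LIMIT $2`, lastSeenID, pageSize)
    }

    // Larger or backward jumps: fall back to OFFSET, which is acceptable
    // because such jumps are comparatively rare.
    offset := (targetPage - 1) * pageSize
    return db.Query(`
        SELECT id, column1, column2
        FROM table_name
        ORDER BY id ASC
        LIMIT $1 OFFSET $2`, pageSize, offset)
}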

Advantages:

  • Improved Performance: This method greatly reduces the time and resources needed to query large datasets.
  • Scalability: It scales better for large datasets, where traditional OFFSET pagination becomes inefficient.
  • User Experience: Maintains a smooth and responsive user experience by optimizing backend data retrieval.

Disadvantages:

  • Harder to implement than the traditional approach.
  • Unnecessary for smaller datasets, where queries already respond within the desired time.
  • An additional index is required if you want to order by a different column, such as creation date.
