BigQuery — Update? Delete & Insert? Simply Insert!

Michal Nitzani
Tech at Trax
Nov 23, 2022

This post presents a recommended approach for dealing with data updates in Google Cloud BigQuery managed storage. I will describe the learning process that led us to choose this approach, as it highlights some of the main concepts BigQuery is designed around. If you prefer to skip the detailed description, you can jump directly to the summary section.

Background

My team recently developed a service that receives image recognition results of supermarket shelves and writes them into BigQuery in real time. The results are queried through an API that provides up-to-date information about the history of the shelves with respect to product locations. The volume of data this service writes every day is large (~200MB), and the volume of data queried in each API call is much larger (~2GB). Because of the large volumes our service and API deal with, we chose BigQuery as our data storage tool.

Photo by Franki Chamaki on Unsplash

The table below represents a simplified version of the information our application tracks. Each row refers to an image and captures the locations of all the products that were found in it. The column ‘recognition_data’ is a repeated struct holding each product’s id and its location coordinates.

Image recognition data — image by author
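To make the structure concrete, here is a hedged DDL sketch of such a table. The column names image_id, image_date, record_creation_timestamp and recognition_data are the ones used throughout this post; the struct fields and types inside recognition_data are assumptions for illustration, not our actual production schema.

-- A hypothetical DDL sketch of the simplified table above.
-- The struct fields and types inside recognition_data are assumptions.
-- Assumes a default dataset is configured; otherwise qualify the name
-- as dataset_name.image_recognition_table.
CREATE TABLE IF NOT EXISTS image_recognition_table (
  image_id INT64,
  image_date DATE,
  record_creation_timestamp TIMESTAMP,
  recognition_data ARRAY<STRUCT<
    product_id INT64,
    x_min FLOAT64,
    y_min FLOAT64,
    x_max FLOAT64,
    y_max FLOAT64
  >>
);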

It sometimes happens that an image is processed twice, occasionally within a very short timeframe, and its information therefore needs to be updated. In the next section I will describe the approaches we tested for handling this kind of real-time update, and in particular the approach we found most appropriate for our use case.

Real Time Data Updates

Let’s assume that you develop an application that needs to load real-time data into BigQuery’s managed storage. The streaming API allows you to stream records (represented as JSON) in an efficient, near real-time manner. This was the recommended approach for frequent single-row inserts (a thorough overview) before the Storage Write API was introduced. The data loaded into BigQuery’s managed storage needs to be updated, sometimes within minutes of the time it was first loaded.

1. UPDATE/DELETE statements

The first approach I tested for handling data updates was the most obvious one: standard SQL UPDATE statements. It didn’t take long to understand that this approach is not well suited to BigQuery’s design, particularly when working with streaming inserts.

Let’s get back to our recognition example. Suppose that there was an issue with the recognition engine, and image #101 was reanalyzed five minutes after the previous rows were inserted into the table (using the streaming API). Let’s assume that this is the data we wish to see for image #101:

Image #101 most updated data - image by author

If we try to update the previous row (using an SQL UPDATE statement), or to delete the previous row and insert the new one (using SQL DELETE and INSERT statements), we might get the following error:

“BigQuery Error : UPDATE or DELETE statement over table would affect rows in the streaming buffer, which is not supported”

Data availability limitation: This error indicates an attempt to modify rows that are still in the streaming buffer, which BigQuery does not allow in order to avoid consistency issues. In the worst case, rows can stay in the streaming buffer for up to 90 minutes. If updates can happen within a shorter window, as they do in our application, this approach is therefore not appropriate.
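For reference, the kind of statement that runs into this limitation looks roughly like the sketch below. The column names come from the simplified example table; the new recognition values and the struct fields are illustrative assumptions, and a DELETE followed by an INSERT would hit the same restriction.

-- A hypothetical single-row update for image #101 (values are illustrative).
-- If the targeted row is still in the streaming buffer, BigQuery rejects
-- the statement with the error quoted above.
-- A DELETE followed by an INSERT hits the same streaming-buffer limitation.
UPDATE image_recognition_table
SET recognition_data = [
  STRUCT(7 AS product_id,
         0.12 AS x_min, 0.40 AS y_min,
         0.21 AS x_max, 0.55 AS y_max)
]
WHERE image_id = 101;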

Even if data updates occur within a longer duration, you should reconsider the UPDATE/DELETE approach.

High cost of single row manipulations: BigQuery is designed for Online Analytical Processing (OLAP) and relies on table scans rather than the single-row lookups used by Online Transaction Processing (OLTP) systems (this is why BigQuery doesn’t use indexes). For this reason, UPDATE/DELETE statements that affect a single row are not cost efficient.

At this point I understood that the use of UPDATE statements is not suitable for my application, nor is the use of DELETE statements followed by new streaming inserts. What is the solution then?

2. Don’t throw away history - simply insert

When you update rows you lose historical information. If you instead just insert new rows that carry a timestamp or some sequential id field, you can access the latest information while keeping the previous information available for other analysis purposes.

Insert only — both historic information and updated information are kept — image by author
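Our service keeps appending rows through the streaming API rather than running DML statements, but the idea can be sketched in plain SQL. The values below are illustrative, and the struct fields follow the assumed schema sketched earlier:

-- Instead of updating the existing row for image #101, append a new row
-- with the latest recognition results and a fresh creation timestamp.
INSERT INTO image_recognition_table
  (image_id, image_date, record_creation_timestamp, recognition_data)
VALUES (
  101,
  DATE '2022-11-20',    -- illustrative image date
  CURRENT_TIMESTAMP(),  -- marks this row as the most recent copy
  [STRUCT(7 AS product_id,
          0.12 AS x_min, 0.40 AS y_min,
          0.21 AS x_max, 0.55 AS y_max)]
);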

Now, all you need is to make sure that you access the latest copy of the rows in your production environment.

Deduplication:

To access the most updated information, you can run a query that removes duplicate rows. A typical deduplication query includes a numbering function like ROW_NUMBER() or RANK(), and a PARTITION BY clause.

The query below returns the most up-to-date image recognitions. In the subquery, sequential row numbers are assigned within each group of rows that share the same partition (the same image_id and the same image_date). The numbering is done in descending order of record_creation_timestamp, so that the row with the latest timestamp in each partition gets the number 1. The outer query selects the rows numbered 1 and removes the row_number column (with SELECT * EXCEPT).

SELECT * EXCEPT(row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY image_id, image_date
      ORDER BY record_creation_timestamp DESC
    ) AS row_number
  FROM image_recognition_table
)
WHERE row_number = 1

The result of running the above query:

Deduplication results — image by author

Creating a view* over the above query allows one to run queries on a deduplicated table.

CREATE OR REPLACE VIEW image_recognition_view AS (
  SELECT * EXCEPT(row_number)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (
        PARTITION BY image_id, image_date
        ORDER BY record_creation_timestamp DESC
      ) AS row_number
    FROM image_recognition_table
  )
  WHERE row_number = 1
)
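A consumer can then query the view like any regular table, for example (the selected columns and the filter value are illustrative):

-- Example query against the deduplicated view; filter values are illustrative.
SELECT image_id, recognition_data
FROM image_recognition_view
WHERE image_date = DATE '2022-11-20';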

Note that any query that references the above view triggers the deduplication query, so all the rows and columns accessed in the deduplication query are scanned. This can have a significant impact on query costs.

* We define a simple view and not a materialized view, as a materialized view wouldn’t add any performance advantage in this case (more on BigQuery’s materialized views).
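As a side note, BigQuery’s QUALIFY clause can express the same deduplication without a subquery. The sketch below should return the same rows as the query behind the view, and since ROW_NUMBER() only appears in the QUALIFY clause there is no helper column to strip with EXCEPT:

-- An equivalent deduplication query using QUALIFY instead of a subquery.
SELECT *
FROM image_recognition_table
WHERE TRUE  -- BigQuery expects a WHERE, GROUP BY, or HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY image_id, image_date
  ORDER BY record_creation_timestamp DESC
) = 1;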

To Conclude

  • A best practice for dealing with frequent data updates:
    1. Do not use UPDATE/DELETE statements.
    2. Insert a new row with up-to-date information, including a timestamp column or a sequential id column that indicates the order of insertion.
    3. To access the most up-to-date rows, run a deduplication query.
    4. To run queries over the most up-to-date rows, create a view over the deduplication query and query the view.
  • Note that at the time I worked on this project, legacy streaming was the recommended approach for inserting real-time data into BigQuery’s managed storage. As of today, the recommended approach is streaming with the Storage Write API.

For more information, see BigQuery’s best practices documentation.
