The Page Token Method in BigQuery: An Efficient Approach to Pagination

Effortlessly navigate through vast amounts of data with the Page Token Method in BigQuery.

Siddharth Gangwar
BloggingTimes


Say goodbye to slow and clunky pagination with this cursor-based technique that allows you to retrieve specific pages of results with ease, all while reducing the amount of data that needs to be processed. Experience lightning-fast results, even with the largest of datasets.

In this blog, we will discuss the page token method in BigQuery and why it is a more efficient approach to pagination compared to the traditional offset and limit method.

What is the Page Token Method?

The page token method is a cursor-based mechanism for pagination in BigQuery that allows you to efficiently retrieve the results of a query one page at a time. The page token is not part of the SQL itself: the BigQuery API returns it together with a page of results, and you pass it back in the next request to mark the starting point for the next page.

Why is the Page Token Method Efficient?

The page token method is efficient because it eliminates the need for BigQuery to scan through all the rows before the offset each time a new page of results is requested. When you use the offset and limit method, the query engine scans through all the rows before the offset and then returns the specified number of rows, which can be slow for large datasets.

However, when you use the page token method, the query engine uses the page token to determine the starting point for the next page of results and only returns the next set of rows, rather than scanning through all the rows before the offset. This reduces the amount of data that needs to be processed and makes it possible to efficiently retrieve large result sets.
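
To put rough numbers on the difference, here is a toy model in Python. It is an illustration of the reasoning above, not a measurement of BigQuery itself. It assumes, as described above, that an offset read has to pass over every row before the offset, while a token-based read starts where the previous page ended. The function names are made up for this sketch.

```python
def rows_touched_with_offset(total_rows: int, page_size: int) -> int:
    """Rows passed over when every page is read with OFFSET + LIMIT (toy model)."""
    touched = 0
    for offset in range(0, total_rows, page_size):
        page_rows = min(page_size, total_rows - offset)
        # Assumption: the rows before the offset are scanned again for each page.
        touched += offset + page_rows
    return touched


def rows_touched_with_cursor(total_rows: int, page_size: int) -> int:
    """Rows passed over when every page starts where the last one ended (toy model)."""
    touched = 0
    for start in range(0, total_rows, page_size):
        # Only the rows of the page itself are read.
        touched += min(page_size, total_rows - start)
    return touched


if __name__ == "__main__":
    print(rows_touched_with_offset(1000, 100))
    print(rows_touched_with_cursor(1000, 100))
```

For 1,000 rows and pages of 100, the model counts 5,500 rows touched with offsets against 1,000 with a cursor, and the gap widens as the number of pages grows.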

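In addition, the page token method is more flexible than the offset and limit method in one respect: if you keep the token for a page, you can resume from that page later without having to scan through all the previous pages. Note that tokens are handed out page by page, so you reach a page by walking forward to it rather than by jumping to an arbitrary page number.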

How to Use the Page Token Method in BigQuery

First query

To use the page token method in BigQuery, you must first run a query that generates a result set. The query runs once, and the API returns the first page of rows together with a page token that records the position of the last row returned.

#standardSQL
-- Run this once. The page size is set on the API request (maxResults), not with LIMIT.
SELECT column1, column2, column3
FROM `your_dataset.your_table`

In this first query, we’re asking BigQuery to return the values of column1, column2, and column3. There is no LIMIT 100 in the SQL, because a LIMIT would cap the whole result set at 100 rows and leave nothing to page through. Instead, we request 100 rows per page by setting maxResults to 100 on the API call. Along with those rows, BigQuery returns a page token that records the position of the last row of the page.

Subsequent query

In subsequent requests, the page token is passed as a parameter of the API call, not of the SQL, and BigQuery uses it to determine the starting point for the next page of results. The query itself is not run again.

GET https://bigquery.googleapis.com/bigquery/v2/projects/your_project/queries/your_job_id?maxResults=100&pageToken=your_page_token_value

Page tokens have no SQL syntax of their own. In this request to the jobs.getQueryResults method, we’re passing the token as the pageToken parameter and the page size as maxResults; your_project and your_job_id are placeholders for the project and the job that ran the first query. BigQuery uses the token to find the starting point and returns only the next 100 rows, rather than scanning through all the rows before the offset. When a response comes back without a page token, the last page has been reached.

This process can be repeated until all the pages of results have been retrieved.
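
In practice the token is usually handled by a client library rather than by hand. Below is a minimal Python sketch, assuming the google-cloud-bigquery package, where `client` is a `bigquery.Client()` created after `from google.cloud import bigquery`. The helper name iter_query_pages is made up for this example. It relies on `QueryJob.result(page_size=...)` returning an iterator whose `pages` property fetches one page per request and carries the page token along internally.

```python
from typing import Any, Dict, Iterator, List


def iter_query_pages(
    client: Any, sql: str, page_size: int = 100
) -> Iterator[List[Dict[str, Any]]]:
    """Yield one list of row dicts per page of a query's results.

    `client` is assumed to be a google.cloud.bigquery.Client.
    """
    # The query is submitted once; paging happens over its stored results.
    query_job = client.query(sql)
    row_iterator = query_job.result(page_size=page_size)
    # The iterator keeps the page token returned with each page and sends it
    # with the next request, so we never build the token ourselves.
    for page in row_iterator.pages:
        yield [dict(row.items()) for row in page]
```

With the real library, a call would look like `for page in iter_query_pages(bigquery.Client(), sql): ...`, where each `page` holds at most `page_size` rows.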

SQL Views example for BigQuery

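Suppose you have a table called my_table with columns id and value, and you want to paginate the data sorted by id in ascending order. The same cursor idea can be written directly in SQL, with the id of a row serving as the cursor instead of a token issued by the API. Here are the steps you can follow to implement cursor-based pagination using views: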

Create a View:

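CREATE VIEW my_view AS
SELECT id, value
FROM my_table

This creates a view called my_view that selects the id and value columns from my_table. The view itself has no ORDER BY: the order of rows read from a view is only guaranteed when the query that reads it has its own ORDER BY, so the sort by id belongs in the queries that read the view.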

Retrieve the First Page:

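SELECT id, value
FROM my_view
ORDER BY id ASC
LIMIT 100

This retrieves the first 100 rows of my_view in ascending id order. The ORDER BY makes the page deterministic, so the id of its last row can serve as the cursor for the next page. Since we have not defined a cursor yet, this is the first page of the result set.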

Retrieve the Next Page:

SELECT id, value
FROM my_view
WHERE id > [last_cursor_value]
ORDER BY id ASC
LIMIT 100

Assuming we retrieved the first 100 rows and want to retrieve the next 100 rows, we can use the WHERE clause to filter the rows with an id value greater than the last_cursor_value of the previous page. We can then order the result set by id in ascending order and limit the output to 100 rows.

Retrieve the Previous Page:

SELECT id, value
FROM my_view
WHERE id < [first_cursor_value]
ORDER BY id DESC
LIMIT 100

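Assuming we have already retrieved the second page and want to go back to the first page, we can use the WHERE clause to filter the rows with an id value less than the first_cursor_value, which is the id of the first row of the current page. We then order the result set by id in descending order, so that the 100 rows closest to the cursor are kept, and limit the output to 100 rows. The rows come back in descending order, so reverse them before display to restore ascending order.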

Note that the [last_cursor_value] and [first_cursor_value] parameters in the queries above should be replaced with the actual values of the id column from the last row of the previous page and the first row of the current page, respectively.
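
The next-page and previous-page queries can be tried end to end with a small Python sketch. It uses an in-memory SQLite table as a stand-in so that it runs locally without a BigQuery project; that substitution, the function names, and the sample data are assumptions of this example. The WHERE, ORDER BY, and LIMIT pattern is the same as in the queries above, and in BigQuery the `?` placeholders would typically become named query parameters such as `@last_id`.

```python
import sqlite3
from typing import List, Optional, Tuple

Row = Tuple[int, str]


def next_page(conn: sqlite3.Connection, after_id: Optional[int], page_size: int) -> List[Row]:
    """Return the rows whose id is greater than the cursor, in ascending order."""
    if after_id is None:
        # First page: there is no cursor yet.
        sql = "SELECT id, value FROM my_table ORDER BY id ASC LIMIT ?"
        params: tuple = (page_size,)
    else:
        sql = "SELECT id, value FROM my_table WHERE id > ? ORDER BY id ASC LIMIT ?"
        params = (after_id, page_size)
    return conn.execute(sql, params).fetchall()


def previous_page(conn: sqlite3.Connection, before_id: int, page_size: int) -> List[Row]:
    """Return the rows whose id is less than the cursor, in ascending order."""
    sql = "SELECT id, value FROM my_table WHERE id < ? ORDER BY id DESC LIMIT ?"
    rows = conn.execute(sql, (before_id, page_size)).fetchall()
    # The query returns the rows nearest the cursor first; flip them back.
    rows.reverse()
    return rows


def demo() -> Tuple[List[Row], List[Row], List[Row]]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT INTO my_table VALUES (?, ?)", [(i, f"v{i}") for i in range(1, 11)]
    )
    page1 = next_page(conn, None, 3)
    page2 = next_page(conn, page1[-1][0], 3)   # cursor = last id of page 1
    back = previous_page(conn, page2[0][0], 3)  # cursor = first id of page 2
    conn.close()
    return page1, page2, back


if __name__ == "__main__":
    for page in demo():
        print([row[0] for row in page])
```

Going forward from page 1 and then back from page 2 lands on the same rows as page 1, which is a quick sanity check worth keeping in a unit test.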

Advantages

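  • Fast. We filter on the cursor column and avoid the slow offset clause. BigQuery does not rely on conventional row indexes, so clustering or partitioning the table on that column is what helps keep the filter cheap. Besides, we don’t execute an additional and expensive count query.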
  • Reliable. No elements are missed even if elements are changed during a pagination run, and we can’t run into endless loops, even with small page sizes and timestamp columns with second precision.
  • Good Efficiency. Due to the skipping (the offset), we reduce the number of elements that are delivered more than one time. It’s way better than using only a timestamp (simple keyset-pagination). However, in case of fallbacks, elements can be delivered multiple times.
  • Stateless. No state on the server-side is required. So every server instance can handle the requests. This eases horizontal scaling.
  • Easy client implementation and evolvability. The client doesn’t have to fiddle around with limit and offset calculations. It just passes the received token back to the server as a black box. It doesn’t have to analyze or understand the token. This, in turn, makes later changes in the token structure easy.
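  • No Expiration. A token built from column values, such as the id cursor in the views example, is always valid (a page token issued by the BigQuery API, by contrast, is tied to the stored results of one query job). We can stop a pagination run at any time or position and resume later. If our client or the service crashes during a pagination run, we can easily resume at the point we stopped. We won’t miss any element. We only have to persist the current token after each request for a page.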
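
The "black box" idea in the list above can be sketched in a few lines of Python. This is only one possible encoding, assuming the cursor is a small dictionary such as the last id seen; encode_token and decode_token are hypothetical helper names. A real service might also sign or encrypt the token so that clients cannot tamper with it.

```python
import base64
import json
from typing import Any, Dict


def encode_token(cursor: Dict[str, Any]) -> str:
    """Pack the cursor into an opaque, URL-safe string for the client."""
    raw = json.dumps(cursor, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> Dict[str, Any]:
    """Recover the cursor on the server from the token the client sent back."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


if __name__ == "__main__":
    token = encode_token({"last_id": 100})
    print(token)
    print(decode_token(token))
```

Because the client only stores and returns the string, the server can later add fields to the cursor without any change on the client side.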

Drawbacks

  • The correct implementation is non-trivial. There are many corner cases that have to be taken into account. Proper unit testing is absolutely required.
  • Performance issues when a huge number of elements (let’s say > 15 000) have the same timestamp. In this case, the offset part of the token becomes very large, and so does the limit clause of the SQL query, because it is calculated from the offset (offset + pageSize, in order to calculate the checksum). Unfortunately, limit doesn’t scale well. Query execution and data transfer take longer and longer because more and more data has to be transferred to the algorithm in the application layer, until responses eventually time out. Keep in mind that this many identical timestamps is not unrealistic if you run bulk updates. Bottom line: if you are facing more than, let’s say, 15 000 elements with the same timestamp, you should check out the TI token approach instead. Besides, it’s much easier to implement.
  • In case of a fallback due to a checksum inequality, we may deliver multiple elements multiple times to the client (depending on the number of elements with the same timestamp). So not only the element that was updated is delivered twice.

Conclusion

In conclusion, the page token method in BigQuery is a more efficient approach to pagination compared to the traditional offset and limit method. It eliminates the need for BigQuery to scan through all the rows before the offset each time a new page of results is requested and reduces the amount of data that needs to be processed, making it possible to efficiently retrieve large result sets.

In addition, the page token method is also more flexible, as it lets you resume from any page whose token you have kept without having to scan through all the previous pages. If you are working with large datasets in BigQuery, the page token method is definitely worth considering as a more efficient approach to pagination.

Note that page tokens are a feature of the BigQuery API rather than of the SQL dialect; the SQL examples in this post are written in standard SQL, not legacy SQL.

Hey, this blog series is a journey of continual learning and growth as a software engineer: BloggingTimes. By focusing on the core concepts related to software engineering, we will be able to deepen our understanding and improve our skills every day.

Let’s work together to constantly advance our knowledge and expertise in the field of software engineering.

Join me in this exciting journey by clapping, following, and subscribing to this blog.
