Sitemap
Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Introducing partial ordering mode for BigQuery DataFrames (bigframes)

6 min readMar 12, 2025

--

BigQuery DataFrames aka BigFrames is an open source Python library offered by Google. BigFrames scales Python data processing by transpiling common Python data science APIs to BigQuery SQL. You can read more about BigFrames in the official introduction to BigQuery DataFrames aka BigFrames and can refer to the public git repository for BigQuery DataFrames aka BigFrame.

Set the bigframes.pandas.options.bigquery.ordering_mode = "partial" option to prevent BigQuery DataFrames from creating a deterministic ordering and sequential index. Later in this post, I show an example of where this has a 4,000,000x speed up in terms of bytes scanned.

Press enter or click to view image in full size
This feature makes the pandas-like BigQuery DataFrames faster and more efficient. Image generated by Gemini's Imagen feature.

Without this option set, BigQuery DataFrames creates a sequential index over all rows to mimic pandas behavior. On large tables this can get expensive, as it forces a full table scan, unless row and column filters are provided as arguments to read_gbq. Partial ordering mode allows BigQuery DataFrames to push down many more row and column filters. On large clustered and partitioned tables, this can greatly reduce the number of bytes scanned and computation slots used.

A brief primer on indexes in pandas and BigQuery DataFrames

In pandas_gbq, when you download a BigQuery table as a DataFrame, pandas assigns a default index to the rows base on the order they were downloaded.

import pandas_gbq

df = pandas_gbq.read_gbq(
"bigquery-public-data.ml_datasets.penguins",
project_id=my_project)
df
                                       species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0 Adelie Penguin (Pygoscelis adeliae) Dream 36.6 18.4 184.0 3475.0 FEMALE
1 Adelie Penguin (Pygoscelis adeliae) Dream 39.8 19.1 184.0 4650.0 MALE
2 Adelie Penguin (Pygoscelis adeliae) Dream 40.9 18.9 184.0 3900.0 MALE
3 Chinstrap penguin (Pygoscelis antarctica) Dream 46.5 17.9 192.0 3500.0 FEMALE
4 Adelie Penguin (Pygoscelis adeliae) Dream 37.3 16.8 192.0 3000.0 FEMALE
.. ... ... ... ... ... ... ...
339 Adelie Penguin (Pygoscelis adeliae) Torgersen 38.8 17.6 191.0 3275.0 FEMALE
340 Adelie Penguin (Pygoscelis adeliae) Torgersen 40.9 16.8 191.0 3700.0 FEMALE
341 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.0 17.1 191.0 3050.0 FEMALE
342 Adelie Penguin (Pygoscelis adeliae) Torgersen 40.6 19.0 199.0 4000.0 MALE
343 Adelie Penguin (Pygoscelis adeliae) Torgersen 37.3 20.5 199.0 3775.0 MALE

[344 rows x 7 columns]

The index for a particular row doesn’t change, even if you apply a row filter to that DataFrame.

df[df["species"].str.contains("Chinstrap")]
                                       species island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
3 Chinstrap penguin (Pygoscelis antarctica) Dream 46.5 17.9 192.0 3500.0 FEMALE
6 Chinstrap penguin (Pygoscelis antarctica) Dream 46.9 16.6 192.0 2700.0 FEMALE
7 Chinstrap penguin (Pygoscelis antarctica) Dream 50.5 18.4 200.0 3400.0 FEMALE
8 Chinstrap penguin (Pygoscelis antarctica) Dream 49.5 19.0 200.0 3800.0 MALE
13 Chinstrap penguin (Pygoscelis antarctica) Dream 47.0 17.3 185.0 3700.0 FEMALE
.. ... ... ... ... ... ... ...
115 Chinstrap penguin (Pygoscelis antarctica) Dream 48.5 17.5 191.0 3400.0 MALE
119 Chinstrap penguin (Pygoscelis antarctica) Dream 46.4 17.8 191.0 3700.0 FEMALE
120 Chinstrap penguin (Pygoscelis antarctica) Dream 47.5 16.8 199.0 3900.0 FEMALE
121 Chinstrap penguin (Pygoscelis antarctica) Dream 48.1 16.4 199.0 3325.0 FEMALE
123 Chinstrap penguin (Pygoscelis antarctica) Dream 55.8 19.8 207.0 4000.0 MALE

[68 rows x 7 columns]

Why does pandas do this? A big reason is that in pandas, when you combine two objects, you are actually joining on an index.

s1 = pd.Series([1, 2, 3], index=[0, 2, 4], dtype="Int64")
s2 = pd.Series([4, 5, 6], index=[0, 1, 2], dtype="Int64")

# Align the two Series by their respective indexes and then add.
s1 + s2
# 0 5
# 1 <NA>
# 2 8
# 4 <NA>
# dtype: Int64

By preserving the index after the filter, pandas can still join the results back to the original DataFrame. In a SQL system like BigQuery, to do the same one needs a unique key to join by. pandas provides such a key implicitly for all objects. (Aside: it is possible to create a pandas object where the index is not unique. The behavior when joining on the index can get a bit unpredictable in this case.)

As a pandas-like API, BigQuery DataFrames (bigframes) also creates a sequential index based on the row order if no other index, such as a primary key, is available. Just like pandas, BigQuery DataFrames uses this index to implicitly join DataFrame and Series objects.

To get determinism, BigQuery DataFrames hashes all columns to provide a total ordering of the rows1. To get a sequential index, BigQuery DataFrames uses the ROW_NUMBER() analytic operation. Notably, in BigQuery DataFrames and pandas, the index values remain stable, even after row filters are applied.

s = pd.Series([1, 2, 3, 4, 5, 6], dtype="Int64")
s[s % 2 == 0]
# 1 2
# 3 4
# 5 6
# dtype: Int64

This allows two objects with different filters to be implicitly joined together, matching on the same rows as if the filter hadn’t been applied.

s = pd.Series([1, 2, 3, 4, 5, 6], dtype="Int64")
s[(1 < s) & (s < 6)]
# 1 2
# 3 4
# 5 6
# dtype: Int64

s[s % 2 == 0] + s[(1 < s) & (s < 6)]
# 1 4
# 2 <NA>
# 3 8
# 4 <NA>
# 5 <NA>
# dtype: Int64

While useful in some contexts, creating a sequential index is an expensive operation, especially on clustered and partitioned tables. The hash for deterministic ordering means that column filters don’t work as expected and the analytical ROW_NUMBER() operation over the whole table means that row filters don’t work as expected.

Partial ordering mode and NULL indexes

To prevent unexpected full table scans, BigQuery DataFrames warns with a bigframes.exceptions.DefaultIndexWarning when reading from partitioned and clustered tables with no index_col or filters defined. For tables without a suitable set of columns to use as the index_col, as of BigQuery DataFrames (bigframes) version 1.7.0, you can set index_col=bigframes.enums.DefualtIndexKind.NULL in read_gbq().

When setting this option, the analytic ROW_NUMBER operation is removed, allowing row filters to be pushed down. This has the potential to save you a lot of bytes scanned in workloads on clustered and partitioned tables. The tradeoff is that now unrelated DataFrame and Series objects can no longer be implicitly joined. In some cases, BigQuery DataFrames can determine that two objects are derived from the same table expression, but this isn’t always possible.

As of BigQuery DataFrames (bigframes) version 1.12.0, you can also disable a default ordering by setting the bigframes.pandas.options.bigquery.ordering_mode = “partial” option. This turns off default sequential indexes and also sets the default index kind to NULL. This allows both row filters and column filters to work as expected.

Trying it out

Query the last few days of data from the bigquery-public-data.pypi.file_downloads table, which is partitioned by the timestamp column and clustered by the project column with ordering_mode = “partial”. Do not try this with the default sequential index, as the table is 375+ TB total logical bytes large and growing.

import datetime

import bigframes.exceptions
import bigframes.pandas as bpd

# IMPORTANT: use partial ordering mode to allow filters to work as expected.
bpd.options.bigquery.ordering_mode = "partial"

# Show a preview of the previous day's downloads.
pypi = bpd.read_gbq("bigquery-public-data.pypi.file_downloads")
last_1_days = (
datetime.datetime.now(datetime.timezone.utc)
- datetime.timedelta(days=1)
)
bigframes_downloads = pypi[
(pypi["timestamp"] > last_1_days)
& (pypi["project"] == "bigframes")
]
bigframes_downloads[["timestamp", "project", "file"]].peek()
                   timestamp    project                                               file
0 2025-03-12 10:45:41+00:00 bigframes {'filename': 'bigframes-0.22.0-py2.py3-none-an...
1 2025-03-12 10:34:06+00:00 bigframes {'filename': 'bigframes-0.22.0-py2.py3-none-an...
2 2025-03-12 10:39:20+00:00 bigframes {'filename': 'bigframes-0.22.0-py2.py3-none-an...
3 2025-03-12 10:38:37+00:00 bigframes {'filename': 'bigframes-1.40.0-py2.py3-none-an...
4 2025-03-12 10:42:25+00:00 bigframes {'filename': 'bigframes-1.40.0-py2.py3-none-an...

If you remove bpd.options.bigquery.ordering_mode = "partial" (please don’t), then the query BigQuery DataFrames generates processes 376.27 TB. But with this option set, the query BigQuery DataFrames generated only processes 84.8 MB processed. This is 4,000,000+ times speedup in terms of cost for on-demand queries!

Check out the Analyzing package downloads from PyPI with BigQuery DataFrames notebook for a more detailed look at these data using ordering_mode = “partial”. Also, learn more about BigQuery DataFrames and its features at the Use BigQuery DataFrames guide.

Share your feedback

Partial ordering mode and NULL indexes are currently in Preview. Is this feature helpful to you? Is there some functionality you need that is missing? The BigFrames team would love to hear from you. If you would like to reach out please send an email to: bigframes-feedback@google.com or by filing an issue at the open source BigQuery DataFrames aka BigFrames repository. To receive updates about BigFrames, subscribe to the BigQuery DataFrames aka BigFrames email list.

  1. If the table contains duplicate rows, the order of duplicate rows isn’t deterministic, but since the data is identical, the order of the identical rows won’t affect the results. BigQuery DataFrames performs a double hash to prevent collisions.

Copyright 2024. Released under Creative Commons Attribution License. Originally posted September 4, 2024 to https://friendliness.dev/2024/09/04/bigframes-ordering-mode-partial/.

--

--

Google Cloud - Community
Google Cloud - Community

Published in Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Tim Swena (Swast)
Tim Swena (Swast)

Written by Tim Swena (Swast)

I code and draw pixel art. Software developer @googlecloud Views expressed here are my own. Homepage: https://www.timswast.com/

No responses yet