Surrogate Keys in Google BigQuery

Shailendra Mathur
3 min readAug 21, 2020

--

Google BigQuery is undoubtedly one of the most popular cloud data warehousing platforms available today. Since its launch in 2011, the product has evolved to new heights while still maintaining its simplicity and ease of use.

One of the key steps in setting up a data warehouse is to establish the data pipelines for ingesting data from external sources. Many times, data being ingested requires generation of a unique identifier for each record in a table, also known as a Surrogate Key.

In contrast to a Natural Key which is a column or a set of columns that uniquely identify a record in a table and have a business meaning, a Surrogate Key is a system-generated value (could be GUID, Sequential Integer etc.) with no business meaning.

For example:

  • Auto-generating unique Event IDs while ingesting streaming events.
  • Auto-generating unique IDs for data without an existing primary key.

Generating Surrogate Keys

Google BigQuery does not offer field attributes like IDENTITY (as in MS SQL) and AUTO_INCREMENT (as in MySQL) that can be associated with a field at the time of table definition. Neither does it offer object like Sequence (as in T-SQL) to auto-generate surrogate keys.

Following are some ways to auto-generate surrogate keys in Google BigQuery depending on their format and use case:

Sequential Integers

BigQuery provides an analytic function ROW_NUMBER() that can be used over a window of rows to generate an incremental integer for each row.

SELECT ROW_NUMBER() OVER() AS ID, *
FROM `bigquery-public-data.usa_names.usa_1910_current`
Auto-generated Sequential Integer Surrogate Keys using ROW_NUMBER function in Google BigQuery
Auto-generated sequential Integer values for ID

The above technique is useful when there is a need to populate surrogate keys for data already in a BigQuery table.

Hash

If the records being ingested contain a field or a combination of fields that guarantee uniqueness (Primary Key / Composite Primary Key), this technique can be applied to generate a Hash of those fields to populate the Surrogate key.

In our example, a combination of state, gender, year and name fields are guaranteed to be unique across all records. So, in order to generate a surrogate key, we concatenate these fields and compute a SHA256 digest (returns 32 bytes).

SELECT SHA256(CONCAT(state, gender, year, name)) as ID, *
FROM `bigquery-public-data.usa_names.usa_1910_current`
Auto-generated Hash Values as Surrogate Keys using SHA256 function in Google BigQuery
Auto-generated hash values for ID

The above technique can also be used for streaming ingestion use cases as it is only dependent on the values received in the record being ingested.

Apart from SHA256 function that we have used in the example above, Google BigQuery also offers other hash functions like MD5 (returns 16 bytes hash), SHA1 (returns 20 bytes hash) and SHA512 (returns 64 bytes hash) that can also be used. Each of these hash functions return values in BYTES datatype.

UUID

One of the recent additions to Google BigQuery is the GENERATE_UUID function. This function generates a universally unique identifier that consists of 32 hexadecimal digits separated by hyphens in the form 8–4–4–4–12. The return value is of String datatype.

SELECT GENERATE_UUID() as ID, *
FROM `bigquery-public-data.usa_names.usa_1910_current
Auto-generated UUID Values as Surrogate Keys using GENERATE_UUID function in Google BigQuery
Auto-generated UUID values for ID

Conclusion

Considering that Surrogate keys do not have any contextual or business meaning, they allow for unlimited values and stand the test of time as they do not get impacted by changing business environment.

Google BigQuery provides multiples ways to auto-generate surrogate keys depending on the format and use case being addressed.

--

--