Using Collation in Google BigQuery

How to Compare and Sort Strings easily with SQL

Christianlauer
CodeX

--

After Google has made headlines with new tools in the area of Data Lakehouse (click here for more info), there are now again novelties for Data Analysts working with BigQuery SQL. Google now offers the functionality of case-insensitive collation [1].

What is Collation?

Collation determines how your data is sorted and compared. It’s for example important when working international texts and chars. Collations provide sorting rules, case, and accent sensitivity properties for your data. Collations that are used with character data types, such as char and varchar, dictate the code page and corresponding characters that can be represented for that data type [2].

Example of using Collate in SQL with international letters — Source Github[3]

Example

Here is a small example of how to use it within BigQuery with distinctive representation of the letter ‘c’. One time lowercase, one time uppercase and one time the Czech letter Č.

SELECT * FROM UNNEST([
COLLATE(‘Č’, ‘und:ci’),
‘c’,
‘C’
]) AS character
ORDER BY character

--

--

Christianlauer
CodeX

Big Data Enthusiast based in Hamburg and Kiel. Thankful if you would support my writing via: https://christianlauer90.medium.com/membership