Special Character Handling in BigQuery

Murli Krishnan
Google Cloud - Community
6 min read · Aug 26, 2023

This blog focuses on understanding BigQuery's encoding support, how to detect special characters in BigQuery, and how to handle them.

Let's start off with an example of loading a sample file, shown below, into BigQuery.
The file is comma-separated, with quotes around the column values.

Sample Delimited File with 2 columns — sample_file.txt
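
The exact file is not reproduced here, but a file with the same characteristics can be recreated with a small script like the one below. The wording of each row is an assumption; what matters are the embedded special characters (NUL, non-breaking space, carriage return, Danish and Chinese text) that show up later in this post.

# Recreate a comparable sample_file.txt (row texts are illustrative)
printf '"1","This is control character (nul) - \x00"\n'     > sample_file.txt
printf '"2","This is non-breaking space -\xc2\xa0-"\n'     >> sample_file.txt
printf '"3","This is example of control character -\r-"\n' >> sample_file.txt
printf '"4","This is danish character - æ"\n'              >> sample_file.txt
printf '"5","This is chinese character - 样本"\n'           >> sample_file.txt

# Stage it in Cloud Storage for the load
gsutil cp sample_file.txt gs://$PROJECT/special_character/sample_file.txt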

Let's run the below bq load command to load the above file into a BigQuery table.

BQ Load command
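
A load command along the following lines does the job; the dataset, table, and column names are taken from the queries later in this post, and the remaining details are assumptions.

bq load \
  --source_format=CSV \
  $PROJECT:special_character_demo.special_character_table \
  gs://$PROJECT/special_character/sample_file.txt \
  id:STRING,field:STRING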

When the above command is executed, we encounter the first error: the load has failed due to ASCII 0.

Welcome to the world of special characters and encoding!

Let's try to understand what ASCII and ASCII 0 are.

ASCII stands for American Standard Code for Information Interchange.
ASCII is a character encoding that represents each character as a unique 7-bit binary value.
Encoding is the process of telling the machine how a real-world character is represented in binary form.

Example: “A” in ASCII is represented as 01000001 in binary, 65 in decimal, and 41 in hex.
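
You can verify this in any Unix-like shell with xxd:

printf 'A' | xxd -b   # binary: 01000001
printf 'A' | xxd      # hex:    41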

The list of all 128 ASCII characters is available here.

Now, let's try to understand what ASCII 0 is. If we go through the ASCII chart, we see that ASCII 0 is the NUL character.

ASCII 0

From the above, we see that the hexadecimal code for NUL is \x00 (the \x prefix stands for hex).

Let's check the file to understand how to detect ASCII 0.
The above error message mentioned the issue is at position 0, so let's use gsutil's ability to read a byte range.

gsutil cat -r 0-40 gs://$PROJECT/special_character/sample_file.txt

Output
"1","This is control character (nul) - "

The above command shows the first row, but we do not see any NUL character. The reason is that NUL belongs to the control character group (ASCII 0–31), which are non-printable characters.

There are a couple of ways to see the NUL character: opening the file in the vim editor, or using the cat -v command.

An example of the cat -v command is below.

gsutil cat -r 0-40 gs://$PROJECT/special_character/sample_file.txt | cat -v

Output
"1","This is control character (nul) - ^@" --> ^@ is representation for NUL character

Now we have established that there is a NUL character and detected its presence in the file. But why does the BigQuery load fail with ASCII 0?

This is the default behaviour of BigQuery: a load fails if it encounters the ASCII 0 character.

Let's add preserve_ascii_control_characters = true to the load and check further.
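
A sketch of the load command with the flag added (same assumed table and schema as before):

bq load \
  --source_format=CSV \
  --preserve_ascii_control_characters=true \
  $PROJECT:special_character_demo.special_character_table \
  gs://$PROJECT/special_character/sample_file.txt \
  id:STRING,field:STRING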

Now we see the error is no longer due to the NUL character.

Let's dig into this issue. It says: Missing close quote character (") at position 90.

Let's investigate by checking the row starting at byte 90.

gsutil cat -r 90-140 gs://$PROJECT/special_character/sample_file.txt | cat -v

Output
"3","This is example of control character -^M-"

Now we see another character, ^M (Control-M). This is another control character: the carriage return (\r).

Carriage Return

The characteristic of this character is that, when the file is viewed in an editor, it introduces a line break, which is why row 3 spills across lines 3 and 4 of the file.

The way to handle this in BigQuery is to provide another property (allow_quoted_newlines).
In this case the columns are enclosed in quotes, so the allow_quoted_newlines property can be used.
In cases where the columns are not enclosed in quotes, the carriage return character needs to be removed before processing the file, as below.

sed 's/\r//g' sample_file.txt | cat -v

Now, after adding the property to the bq load command, let's see the output.
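
With both properties set, the load command sketch becomes:

bq load \
  --source_format=CSV \
  --preserve_ascii_control_characters=true \
  --allow_quoted_newlines=true \
  $PROJECT:special_character_demo.special_character_table \
  gs://$PROJECT/special_character/sample_file.txt \
  id:STRING,field:STRING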

Finally, we have the data loaded into the BigQuery table.

So far, we know from the ASCII table that codes 0–31 (\x00-\x1f) are non-printable characters, of which we have seen two examples above.

The character range (32–126) (\x20-\x7e) covers the printable characters used in English (127, \x7f, is the DEL control character).

So what about symbols like € and 样本? How are these represented?

The 7-bit ASCII range (128 characters) was clearly not enough to accommodate languages other than English, emojis, etc.

This resulted in multiple 8-bit encodings like ISO-8859-1 to accommodate Latin characters.

Enter the world of UNICODE.

The Unicode consortium defines code points for characters from a wide range of languages as well as emojis, with Unicode 15.0 supporting 149,186 characters.
The Unicode code space extends up to U+10FFFF, so at least 21 bits are needed to represent every possible code point.

A single code point can therefore take up to 4 bytes (32 bits) to store.
This resulted in the emergence of the UTF-8 (the most popular), UTF-16, and UTF-32 encodings.

We will focus on UTF-8 encoding, as BigQuery supports UTF-8 by default.

Note: for files which are not UTF-8, it is recommended to convert them to UTF-8 using utilities like iconv.
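
A minimal sketch, assuming the source file turns out to be ISO-8859-1:

file --mime-encoding sample_file.txt                             # detect the current encoding
iconv -f ISO-8859-1 -t UTF-8 sample_file.txt > sample_utf8.txt   # convert to UTF-8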

For example, the character “A” has the Unicode code point U+0041 and is encoded in UTF-8 as the single byte \x41, identical to its ASCII representation.
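
The difference between single-byte and multi-byte UTF-8 characters is easy to see with xxd:

printf 'A' | xxd   # 41       (1 byte, same as ASCII)
printf '€' | xxd   # e2 82 ac (3 bytes)
printf '样' | xxd  # e6 a0 b7 (3 bytes)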

Let's have a look at the data loaded into BigQuery.

As you can see, row 1 has a NUL character, which is not displayed.
Row 2 has a non-breaking space character ( ), which is also not displayed.
Row 3 has a carriage return character, which shows up as a line break.
Rows 4 and 5 show Danish and Chinese characters, which are UTF-8 compliant and appear correctly in the visual representation.

Now, the question is: we have loaded all the characters into BigQuery, but do we need everything? The answer is no; the control characters 0–31 can be cleansed from the data.

But how do we detect them if they are not visible?

Let's write a simple regex that checks whether any control character is present in the column.

The regex checks for the ASCII control characters 0 to 31, and for \xa0, the non-breaking space code point (encoded in UTF-8 as the bytes C2 A0).

SELECT field,
regexp_contains(field, r'[\x00-\x1f\xa0]') as detected
from `$PROJECT.special_character_demo.special_character_table`;

BigQuery additionally supports functions like to_hex and to_code_points to investigate the characters in the data.

SELECT field,to_hex(cast(field as bytes))
from `$PROJECT.special_character_demo.special_character_table`;
Hex Representation of characters.

Every byte is represented by 2 hex characters; for example, “T” in row 1 is represented by “54”. A multi-byte UTF-8 character shows up as more than one hex pair.

Another, easier way to break this down is as below.

SELECT 
id,
field,
character,
to_hex(cast(character as bytes)) hex_character,
unicode_decimal
from `$PROJECT.special_character_demo.special_character_table`,
UNNEST(split(field,'')) character WITH OFFSET character_offset LEFT JOIN
UNNEST(to_code_points(field)) unicode_decimal WITH OFFSET unicode_decimal_offset
ON character_offset = unicode_decimal_offset;

The above result is broken down character by character, with each character's hex code and Unicode decimal code point.

Now we run the above with the regular expression to detect the special characters.

SELECT 
id,
field,
character,
character_offset,
to_hex(cast(character as bytes)) hex_character,
unicode_decimal
from `$PROJECT.special_character_demo.special_character_table`,
UNNEST(split(field,'')) character WITH OFFSET character_offset LEFT JOIN
UNNEST(to_code_points(field)) unicode_decimal WITH OFFSET unicode_decimal_offset
ON character_offset = unicode_decimal_offset
where regexp_contains(character, r'[\x00-\x1f\xa0]');

Now we can clearly see, for each row, the exact character offset at which a special character was detected.

Now that we know what we want to standardize, we can use a regular expression with regexp_replace to replace the unwanted characters.

SELECT field,
regexp_replace(regexp_replace(field, r'[\x00-\x1f]',''), r'\xa0',' ') as cleansed
from `$PROJECT.special_character_demo.special_character_table`;
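
To persist the cleansed data, the same expression can be wrapped in a CREATE TABLE AS SELECT, for example via the bq CLI. The target table name here is hypothetical.

bq query --use_legacy_sql=false \
"CREATE OR REPLACE TABLE \`$PROJECT.special_character_demo.special_character_table_cleansed\` AS
SELECT id,
       regexp_replace(regexp_replace(field, r'[\x00-\x1f]', ''), r'\xa0', ' ') AS field
FROM \`$PROJECT.special_character_demo.special_character_table\`"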

I have exported the results as JSON to show the difference.

Hope this article provides enough guidance on getting started with detecting special characters, validating the data, handling the loads, and cleansing the data post load.

Happy Learning as always!

Please connect with me on https://www.linkedin.com/in/murli-krishnan-a1319842/ for any queries.
