Byte Stream With UTF-8 Encoding

Paul Kenjora · Published in StarThinker · Aug 23, 2021 · 2 min read

If you plan to stream large pieces of data in chunks and accommodate any character set other than ASCII, this is a must read…

We recently ran into an interesting problem with regard to UTF-8 byte boundaries when passing byte streams between different layers of an application. Specifically, we were downloading a 96GB file into BigQuery 5GB at a time. Every once in a while we would get the error:

'utf-8' codec can't decode byte 0xef in position 0: invalid continuation byte

The error would happen seemingly at random within a download, and our first guess was a malformed character in the source. It turns out this is a highly predictable error that will affect any byte stream handling UTF-8 data. To understand both the error and the solution, let's go over an excerpt from our test function.

# Works OK: ASCII characters are 1 byte each, so any split is safe.
string_ascii = bytes('"#$%&()*+,-./0123456789:;ABC', 'utf-8')
string_ascii[:17].decode("utf-8")
# Works OK: 3 bytes is exactly one complete CJK character.
string_cjk = bytes('豈更車勞擄櫓爐盧老蘆虜路露魯鷺碌祿綠', 'utf-8')
string_cjk[:3].decode("utf-8")
# Raises UnicodeDecodeError: byte 17 lands in the middle of a character.
string_cjk = bytes('豈更車勞擄櫓爐盧老蘆虜路露魯鷺碌祿綠', 'utf-8')
string_cjk[:17].decode("utf-8")
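A quick look at the raw bytes shows what is going on: each of these CJK characters encodes to three bytes in UTF-8, and the first of them is 0xEF, the very byte from the error message above.

# Each CJK character above occupies 3 bytes in UTF-8.
'豈'.encode('utf-8')           # b'\xef\xa4\x80' - note the 0xEF lead byte
len('豈更車'.encode('utf-8'))  # 9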

The exception occurs because UTF-8 encodes a Unicode character as anywhere from 1 to 4 bytes, and the decoder raises an error if a character is incomplete. So when a generator passes along pieces of a binary file with UTF-8 content and a chunk happens to split a UTF-8 character, a UnicodeDecodeError occurs. For example, this code will intermittently error:

while True:
    chunk = some_binary_file.read(17)
    if not chunk:
        break
    yield chunk.decode('utf-8')

The solution is to adjust the chunk boundaries to align with UTF-8 characters, which is tricky because they vary in size from 1 to 4 bytes. Luckily, the UTF-8 standard documents the lead-byte patterns:

1 byte:  0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Which, after flipping back to college binary arithmetic, means the lead byte of each UTF-8 character can be detected with a bitwise AND:

1 byte:  byte & 0x80 == 0x00
2 bytes: byte & 0xE0 == 0xC0
3 bytes: byte & 0xF0 == 0xE0
4 bytes: byte & 0xF8 == 0xF0
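Putting those masks to work, here is a minimal sketch of a find_utf8_split style helper (an illustration of the approach, not necessarily the exact StarThinker implementation). It scans backwards from the end of a chunk, skipping continuation bytes (which match byte & 0xC0 == 0x80) until it finds a lead byte, then checks whether that character fits completely in the chunk:

def find_utf8_split(chunk):
    """Return the byte offset of the last complete UTF-8 character."""
    end = len(chunk)
    # A UTF-8 character is at most 4 bytes, so scan back at most 4.
    for index in range(end - 1, max(end - 4, 0) - 1, -1):
        byte = chunk[index]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte (10xxxxxx), keep scanning back
        if byte & 0x80 == 0x00:
            length = 1  # 0xxxxxxx
        elif byte & 0xE0 == 0xC0:
            length = 2  # 110xxxxx
        elif byte & 0xF0 == 0xE0:
            length = 3  # 1110xxxx
        elif byte & 0xF8 == 0xF0:
            length = 4  # 11110xxx
        else:
            break  # invalid lead byte, let decode() raise later
        # If the whole character fits in the chunk, split at the very end.
        return end if index + length <= end else index
    return end

position = find_utf8_split(string_cjk[:17])  # 15
string_cjk[:position].decode('utf-8')        # '豈更車勞擄' decodes cleanly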

The resulting helper functions are a combination of find_utf8_split, which works backwards from the end of a chunk to return the position of the last complete UTF-8 character, and response_utf8_stream, which buffers the incomplete UTF-8 bytes and prepends them to the next chunk. Together the functions can be used to make any byte stream UTF-8 safe.
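In the same spirit, a buffering generator along the lines of response_utf8_stream might look like this (a sketch of the idea; the shipped StarThinker helper may differ in its details):

def response_utf8_stream(response, chunksize):
    """Yield UTF-8 safe strings from a binary stream, chunksize bytes at a time."""
    leftovers = b''
    while True:
        data = response.read(chunksize)
        if not data:
            if leftovers:
                # Stream ended mid-character: decode raises, as it should.
                yield leftovers.decode('utf-8')
            break
        chunk = leftovers + data
        position = find_utf8_split(chunk)
        leftovers = chunk[position:]  # carry the partial character forward
        yield chunk[:position].decode('utf-8')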

# The manual chunking loop above is replaced by a single safe call:
from starthinker.util.csv import response_utf8_stream
response_utf8_stream(some_binary_file, 17)
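As a quick self-contained check of the sketched functions above, io.BytesIO can stand in for the binary file:

import io

# Simulate a binary stream whose 17-byte reads split 3-byte characters.
stream = io.BytesIO('豈更車勞擄櫓爐盧老蘆虜路露魯鷺碌祿綠'.encode('utf-8'))
for piece in response_utf8_stream(stream, 17):
    print(piece)  # every piece decodes cleanly; no character is split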

We ❤️ internationalization! As always, Apache Licensed, so you're completely free to use, modify, and improve.
