Byte Stream With UTF-8 Encoding

If you plan to stream large pieces of data in chunks and accommodate any character set other than ASCII, this is a must-read…

We recently ran into an interesting problem with UTF-8 byte boundaries when passing byte streams between different layers of an application. Specifically, we were downloading a 96 GB file into BigQuery 5 GB at a time. Every once in a while we would get a UnicodeDecodeError.

The error would happen seemingly at random within a download, and our first guess was a malformed character in the source. It turns out this is a highly predictable error that will affect any byte stream handling UTF-8 data. To understand both the error and the solution, let’s go over an excerpt from our test function.

The exception occurs because UTF-8 encodes each Unicode character in anywhere from 1 to 4 bytes, and the decoder raises an error if a character is incomplete. So when generators pass along pieces of a binary file with UTF-8 content and a chunk boundary happens to split a character, a UnicodeDecodeError occurs. Code that decodes each fixed-size chunk on its own will therefore fail intermittently, whenever a boundary lands inside a multi-byte character.
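A minimal sketch of the failure follows. It is a hypothetical reproduction, not the original test function: the sample string and the helper names `fixed_chunks` and `decode_naively` are assumptions made for illustration.

```python
# Hypothetical reproduction; the sample text and helper names are illustrative.
data = "héllo".encode("utf-8")  # 'é' takes two bytes, so the payload is 6 bytes long


def fixed_chunks(payload: bytes, size: int):
    """Yield consecutive slices of `payload`, ignoring character boundaries."""
    for start in range(0, len(payload), size):
        yield payload[start:start + size]


def decode_naively(payload: bytes, size: int) -> str:
    """Decode every chunk on its own; raises if a chunk splits a character."""
    return "".join(chunk.decode("utf-8") for chunk in fixed_chunks(payload, size))


# Some chunk sizes happen to align with character boundaries, others do not.
failing_sizes = []
for size in range(1, len(data) + 1):
    try:
        decode_naively(data, size)
    except UnicodeDecodeError:
        failing_sizes.append(size)

print(failing_sizes)
```

Whether a given chunk size fails depends entirely on where the multi-byte characters sit in the data, which is why the error looks random in a large download.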

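The solution is to adjust the chunk boundaries to align with UTF-8 characters, which is more difficult because they vary in size from 1 to 4 bytes. Luckily, the UTF-8 standard documents the byte layout precisely: the leading byte of a character starts with the bits 0, 110, 1110, or 11110 for a 1-, 2-, 3-, or 4-byte character, and every continuation byte starts with 10.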
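To make the layout concrete, the sketch below reads a character's length off its leading byte. The name `utf8_char_length` is hypothetical and does not come from the original code; the bit patterns in the comments are those of the UTF-8 standard.

```python
def utf8_char_length(lead_byte: int) -> int:
    """Return how many bytes the character starting with `lead_byte` occupies.

    Raises ValueError for a continuation byte (10xxxxxx) or an invalid leading byte.
    """
    if (lead_byte & 0b1000_0000) == 0b0000_0000:  # 0xxxxxxx: 1 byte (ASCII)
        return 1
    if (lead_byte & 0b1110_0000) == 0b1100_0000:  # 110xxxxx: 2 bytes
        return 2
    if (lead_byte & 0b1111_0000) == 0b1110_0000:  # 1110xxxx: 3 bytes
        return 3
    if (lead_byte & 0b1111_1000) == 0b1111_0000:  # 11110xxx: 4 bytes
        return 4
    raise ValueError(f"not a UTF-8 leading byte: {lead_byte:#04x}")


# The length read from the first byte matches the real encoded length.
for ch in ("a", "é", "€", "😀"):
    encoded = ch.encode("utf-8")
    print(ch, f"{encoded[0]:08b}", utf8_char_length(encoded[0]), len(encoded))
```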

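After flipping back to college binary arithmetic, this means the start of each UTF-8 character can be detected with a bit mask: a byte begins a character unless its top two bits are 10, for example by checking that (byte & 0xC0) != 0x80.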
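A minimal sketch of such a check, assuming valid UTF-8 input. The name `is_utf8_start` is hypothetical, and the original code may have expressed the same test differently.

```python
def is_utf8_start(byte: int) -> bool:
    """True if `byte` begins a character, i.e. it is not a 10xxxxxx continuation byte."""
    return (byte & 0b1100_0000) != 0b1000_0000


encoded = "a€".encode("utf-8")  # 'a' is 1 byte, '€' is 3 bytes
starts = [index for index, byte in enumerate(encoded) if is_utf8_start(byte)]
print(starts)  # one start position per character
```

Because every character has exactly one leading byte, the number of start positions equals the number of characters in the decoded text.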

The resulting helper functions are find_utf8_split, which works backwards from the end of a chunk to find the position of the last complete UTF-8 character, and response_utf_8_stream, which buffers the incomplete trailing bytes and prepends them to the next chunk. Together, the functions can be used to make any byte stream UTF-8 safe.
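The two helpers could look roughly like the sketch below. Only the function names come from the text above; the signatures, the choice to accept any iterable of byte chunks, and the handling of malformed input are assumptions, so treat this as an illustration rather than the published implementation.

```python
from typing import Iterable, Iterator


def find_utf8_split(chunk: bytes) -> int:
    """Return the index up to which `chunk` contains only complete characters."""
    # Walk back at most 4 bytes (the longest UTF-8 sequence) to the last leading byte.
    for back in range(1, min(4, len(chunk)) + 1):
        byte = chunk[-back]
        if (byte & 0xC0) != 0x80:  # not a continuation byte, so a leading byte
            if byte < 0x80:
                needed = 1
            elif byte >= 0xF0:
                needed = 4
            elif byte >= 0xE0:
                needed = 3
            else:
                needed = 2
            # Keep everything if the last character is complete, else cut before it.
            return len(chunk) if back >= needed else len(chunk) - back
    return len(chunk)  # no leading byte found: leave malformed input to the decoder


def response_utf_8_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Re-chunk `chunks` so every yielded piece ends on a character boundary."""
    leftover = b""
    for chunk in chunks:
        chunk = leftover + chunk            # prepend bytes held back last time
        split = find_utf8_split(chunk)
        leftover = chunk[split:]            # incomplete trailing character, if any
        if split:
            yield chunk[:split]
    if leftover:
        yield leftover  # truncated final character; decoding it will raise


# Usage: 2-byte chunks would normally split the multi-byte characters.
text = "héllo 数据 😀"
payload = text.encode("utf-8")
raw_chunks = (payload[i:i + 2] for i in range(0, len(payload), 2))
decoded = "".join(piece.decode("utf-8") for piece in response_utf_8_stream(raw_chunks))
print(decoded == text)
```

Yielded pieces may be slightly shorter or longer than the incoming chunks, since up to three bytes can move from one chunk to the next.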

We ❤️ internationalization! As always, Apache Licensed, so you’re completely free to use, modify, and improve.



At gTech, we believe every ad operations team should be faster, nimbler, and able to use all their data sources to drive client impact. To that end, we’ve created StarThinker, a simple and intuitive web UI that allows users to create, edit, run, and schedule data pipelines consis
