Published in StarThinker

Byte Stream With UTF-8 Encoding

If you plan to stream large pieces of data in chunks and accommodate any character set other than ASCII, this is a must read…

We recently ran into an interesting problem with UTF-8 byte boundaries when passing byte streams between different layers of an application. Specifically, we were downloading a 96 GB file into BigQuery 5 GB at a time. Every once in a while we would get a UnicodeDecodeError.

The error would happen seemingly at random within a download, and our first guess was a malformed character in the source. It turns out this is a highly predictable error that can affect any code that decodes a UTF-8 byte stream chunk by chunk. To understand both the error and the solution, let’s go over an excerpt from our test function.

The exception occurs because UTF-8 encodes each Unicode character in anywhere from 1 to 4 bytes, and the decoder raises an error if the final character in its input is incomplete. So when a generator passes along pieces of a binary file with UTF-8 content and a chunk boundary happens to fall in the middle of a multi-byte character, a UnicodeDecodeError occurs. Whether a given chunk fails depends only on where its boundary lands, which is why the error looks random but is entirely predictable.
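A minimal sketch of the failure, assuming a hypothetical `naive_decode` generator that decodes each chunk on its own (this is an illustration, not the original test function). The same bytes decode cleanly with one chunk size and fail with another:

```python
def naive_decode(data: bytes, chunk_size: int):
    """Yield text by decoding each fixed-size chunk independently."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size].decode("utf-8")


# "ï" and "é" each take 2 bytes in UTF-8, so this string is 12 bytes long.
payload = "naïve café".encode("utf-8")

# With 2-byte chunks every boundary happens to fall between characters.
aligned = "".join(naive_decode(payload, 2))

# With 3-byte chunks the first chunk ends on the first byte of "ï".
try:
    list(naive_decode(payload, 3))
    split_failed = False
except UnicodeDecodeError:
    split_failed = True
```

With real data the chunk size is fixed and the character positions vary, so the failure shows up only on the chunks whose last byte lands inside a multi-byte character.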

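The solution is to adjust the chunk boundaries to align with UTF-8 characters, which is harder than it sounds because characters vary in size from 1 to 4 bytes. Luckily, the UTF-8 standard documents the byte layout precisely: the leading bits of the first byte give the length of the character, and every following (continuation) byte starts with the bits 10.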
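The layout can be written down as a small classifier. This is a sketch for illustration; the function name `utf8_sequence_length` is an assumption, and it only inspects bit patterns rather than fully validating UTF-8 (for example, it does not reject overlong encodings):

```python
# UTF-8 byte layout:
#   1 byte:  0xxxxxxx
#   2 bytes: 110xxxxx 10xxxxxx
#   3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
#   4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


def utf8_sequence_length(lead_byte: int) -> int:
    """Return the length in bytes of the character that starts with lead_byte.

    Returns 0 for a continuation byte (10xxxxxx) or any byte that matches
    none of the four lead-byte patterns.
    """
    if (lead_byte & 0b10000000) == 0b00000000:  # 0xxxxxxx
        return 1
    if (lead_byte & 0b11100000) == 0b11000000:  # 110xxxxx
        return 2
    if (lead_byte & 0b11110000) == 0b11100000:  # 1110xxxx
        return 3
    if (lead_byte & 0b11111000) == 0b11110000:  # 11110xxx
        return 4
    return 0
```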

After dusting off some college binary arithmetic, this means the start of each UTF-8 character can be detected with a bit mask: a byte begins a character unless its top two bits are 10.
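As a sketch, the check is a single mask and compare. The helper name `is_utf8_char_start` is hypothetical and chosen for this example:

```python
def is_utf8_char_start(byte: int) -> bool:
    """True unless the byte is a continuation byte of the form 10xxxxxx."""
    return (byte & 0b11000000) != 0b10000000


# "a" is 1 byte and "€" is 3 bytes: b'a\xe2\x82\xac'
data = "a€".encode("utf-8")

# Indexes at which a character begins.
starts = [i for i, b in enumerate(data) if is_utf8_char_start(b)]
```

Here `starts` holds the offsets of `a` and of the first byte of `€`; the two trailing bytes of `€` are continuation bytes and are skipped.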

The resulting helper functions work as a pair. find_utf8_split works backwards from the end of a chunk and returns the position just after the last complete UTF-8 character. response_utf_8_stream buffers the incomplete trailing bytes and prepends them to the next chunk. Together the functions can be used to make any byte stream UTF-8 safe.
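The following is an illustrative reconstruction of the two helpers from the description above, not the published StarThinker code. The function names come from the article, but the signatures are assumptions: in particular, this `response_utf_8_stream` takes any iterable of byte chunks rather than a response object.

```python
from typing import Iterable, Iterator


def find_utf8_split(data: bytes) -> int:
    """Return an index such that data[:index] ends on a complete character."""
    # A character is at most 4 bytes, so look back at most 4 bytes.
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if (byte & 0b11000000) == 0b10000000:
            continue  # continuation byte, keep walking backwards
        # Lead byte found: work out how many bytes its character needs.
        if (byte & 0b10000000) == 0:
            need = 1
        elif (byte & 0b11100000) == 0b11000000:
            need = 2
        elif (byte & 0b11110000) == 0b11100000:
            need = 3
        else:
            need = 4
        # Keep everything if the character is complete, else cut before it.
        return len(data) if back >= need else len(data) - back
    # Empty input, or no lead byte in the last 4 bytes (malformed data):
    # return everything and let the decoder report any problem.
    return len(data)


def response_utf_8_stream(chunks: Iterable[bytes]) -> Iterator[str]:
    """Decode byte chunks to text, carrying split characters forward."""
    leftover = b""
    for chunk in chunks:
        data = leftover + chunk
        split = find_utf8_split(data)
        leftover = data[split:]  # incomplete tail, prepended to next chunk
        if split:
            yield data[:split].decode("utf-8")
    if leftover:
        # The stream ended mid-character; decoding raises UnicodeDecodeError,
        # which surfaces the truncated input instead of hiding it.
        yield leftover.decode("utf-8")
```

Wrapping a chunk generator this way means downstream code receives only whole characters, whatever chunk size the transport layer uses. At most 3 bytes are ever held back between chunks.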

We ❤️ internationalization! As always, the code is Apache licensed, so you’re completely free to use, modify, and improve it.
