Avoiding UnicodeDecodeError with Python CLI tools
It seems that the four years I spent doing i18n work while at Adobe left me with a “passion” for dealing with Unicode issues, usually in Python.
Most of the problems are specific to Python 2.x, which uses “ascii” as the default encoding for the standard streams (stdin/stdout/stderr), even though Python itself can deal with Unicode strings decently. This default applies even if your terminal is Unicode enabled and uses UTF-8 encoding.
This means that print()-ing or logging any message that may contain Unicode characters can end up with a UnicodeDecodeError. Even worse is an Exception carrying a Unicode message. The Python requests library is likely to raise HTTP exceptions that contain such characters, as the internet is Unicode and you cannot control that. When this happens you will miss the real exception and end up with a frustrating UnicodeDecodeError that doesn’t help you at all.
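For instance, on stock Python 2 it is enough to combine a UTF-8 byte string with a unicode string: the byte string gets implicitly decoded with the “ascii” codec. A minimal, hypothetical reproduction:

# -*- coding: utf-8 -*-
# Python 2: the byte string is implicitly decoded with the 'ascii' default codec
msg = 'café'  # byte string holding UTF-8 bytes
print(u'request failed: ' + msg)  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...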
The easy way to avoid it on your own machine is to tell Python to use utf-8 by default, something that was already sorted in Python 3:
export PYTHONIOENCODING=utf-8

But this hack is clearly not going to prevent your users from hitting it, because it is not realistically possible to deploy it in the wild. Even you, as a developer, will soon find another machine or account which does not have it, or you will encounter this on CI. You probably do *not* want to use that hack on CI, because it would allow your broken code to pass.
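Whether or not you rely on that variable, it helps to know what your interpreter actually picked for the streams and for the default encoding; a quick, purely illustrative check:

import sys

print(sys.stdout.encoding)       # e.g. 'UTF-8' on a terminal, None when output is piped
print(sys.getdefaultencoding())  # 'ascii' on stock Python 2, 'utf-8' on Python 3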
Luckily, after several years of using flaky hacks, I was able to find what looks like a reliable workaround that can safely be used in Python CLI scripts. I would advise against using it in libraries, because it involves reloading the sys module and changing the encoding at runtime, something you never want to happen when you import a library.
This means that you should use the recipe only in the __main__ section of (all) your Python scripts:
import os
import sys

if __name__ == '__main__':
    if sys.version_info[0] < 3 and sys.getdefaultencoding() != 'utf-8':
        # keep references to the original streams so tools that capture them
        # (like pytest) are not broken by reload(sys)
        stdin, stdout, stderr = sys.stdin, sys.stdout, sys.stderr
        reload(sys)  # brings back the hidden sys.setdefaultencoding()
        sys.stdin, sys.stdout, sys.stderr = stdin, stdout, stderr
        sys.setdefaultencoding(os.environ.get('PYTHONIOENCODING', 'utf-8'))
It may look complex and weird but not all working solutions are nice, so let me explain it:
- don’t do anything on Python 3 or newer
- don’t do anything if the default encoding is already utf-8
- save and restore standard streams to avoid undesired behaviour with testing tools that capture them, like pytest.
- reload(sys) is the old magic trick that enables us to call setdefaultencoding().
- force utf-8 only if the PYTHONIOENCODING environment variable was not defined, so a user-defined value is respected
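Putting it all together, a minimal (and purely illustrative) CLI script using the recipe could look like the sketch below; the main() function is just a placeholder:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys


def main():
    # with the recipe applied, mixing byte strings and unicode
    # no longer blows up with the 'ascii' codec on Python 2
    print(u'status: ' + 'café')


if __name__ == '__main__':
    if sys.version_info[0] < 3 and sys.getdefaultencoding() != 'utf-8':
        stdin, stdout, stderr = sys.stdin, sys.stdout, sys.stderr
        reload(sys)
        sys.stdin, sys.stdout, sys.stderr = stdin, stdout, stderr
        sys.setdefaultencoding(os.environ.get('PYTHONIOENCODING', 'utf-8'))
    main()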
And instead of waiting for users to encounter this bug themselves, sometimes in production, we should also test for it. This is why I wrote a unicode console output unittest for the python jira library.
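The idea behind such a test is simple; a hedged sketch (not the actual python-jira test) just prints a non-ASCII message and lets the test run fail if any Unicode error escapes:

import unittest


class UnicodeConsoleOutputTest(unittest.TestCase):

    def test_unicode_console_output(self):
        # fails on a misconfigured Python 2 setup with a UnicodeEncodeError/UnicodeDecodeError
        print(u'Testing unicode console output: café, naïve, Grüße')


if __name__ == '__main__':
    unittest.main()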
Feel free to test this solution yourself and please ping me if you adopt it or if you find any bug in it, especially in the former case ;)
