Data Cleaning: Character Encoding

nishant sethi
Sep 2, 2018 · 7 min read
“white printing paper with numbers” by Mika Baumeister on Unsplash

In my previous post on Data Cleaning, we discussed how to parse dates. In this post, we’re going to be working with different character encodings. If you’re already comfortable with this technique, you can skip ahead to A Machine Learning Classification Model in Python — Part I. To get started, download the dataset I’ll be using in this post; you can find it on my GitHub.

Here’s what we’re going to do today:

Get our environment set up
What are encodings?
Reading in files with encoding problems
Saving your files with UTF-8 encoding

Let’s get started!

Get our environment set up

The first thing we’ll need to do is load in the libraries we’ll be using. Not our datasets, though: we’ll get to those later!
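Here’s a minimal sketch of that setup: pandas for reading the data, plus chardet, which we’ll use later on to guess encodings.

# modules we'll use
import pandas as pd

# helpful module for guessing character encodings
import chardet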

Now we’re ready to work with some character encodings! (If you like, you can add a code cell here and take this opportunity to take a look at some of the data.)

What are encodings?

Character encodings are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to the characters that make up human-readable text (like “hi”). There are many different encodings, and if you try to read in text with a different encoding than the one it was originally written in, you end up with scrambled text called “mojibake” (pronounced mo-gee-bah-kay). Here’s an example of mojibake:

æ–‡å — 化ã??

You might also end up with “unknown” characters. These are what gets printed when there’s no mapping between a particular byte and a character in the encoding you’re using to read your byte string in, and they look like this:

����������

Character encoding mismatches are less common today than they used to be, but they’re definitely still a problem. There are lots of different character encodings, but the main one you need to know is UTF-8.

It was pretty hard to deal with encodings in Python 2, but thankfully in Python 3, it’s a lot simpler. (Kaggle Kernels only use Python 3.) There are two main data types you’ll encounter when working with text in Python 3. One is the string, which is what text is by default.

The other is the bytes data type, which is a sequence of integers. You can convert a string into bytes by specifying which encoding it’s in:
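Here’s a quick sketch with a sample string of my own (I’m using the euro symbol because ASCII will choke on it later):

# start with a string
before = "This is the euro symbol: €"

# check to make sure it's a string
print(type(before))  # <class 'str'>

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("utf-8", errors="replace")

# check the type
print(type(after))  # <class 'bytes'>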

If you look at a bytes object, you’ll see that it has a b in front of it, and then maybe some text after. That’s because bytes are printed out as if they were characters encoded in ASCII. (ASCII is an older character encoding that doesn’t really work for writing any language other than English.) Here you can see that our euro symbol has been replaced with some mojibake that looks like “\xe2\x82\xac” when it’s printed as if it were an ASCII string.
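Printing the bytes object from the sketch above shows exactly that:

# take a look at the bytes
print(after)  # b'This is the euro symbol: \xe2\x82\xac'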

When we convert our bytes back to a string with the correct encoding, we can see that our text is all there correctly, which is great!
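Decoding with the same encoding we used to encode gets the original text back:

# convert it back to utf-8
print(after.decode("utf-8"))  # This is the euro symbol: €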

However, when we try to use a different encoding to map our bytes into a string, we get an error. This is because the encoding we’re trying to use doesn’t know what to do with the bytes we’re trying to pass it. You need to tell Python the encoding that the byte string is actually supposed to be in.
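For example, trying to decode our UTF-8 bytes as if they were ASCII fails:

# try to decode our bytes with the ascii encoding
print(after.decode("ascii"))  # raises UnicodeDecodeError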

We can also run into trouble if we try to use the wrong encoding to map from a string to bytes. Like I said earlier, strings are UTF-8 by default in Python 3, so if we try to treat them like they were in another encoding we’ll create problems.

For example, if we use encode() to convert a string to bytes with the ASCII encoding, we’re asking for the bytes as they would be if the text were in ASCII. Since our text isn’t in ASCII, though, there will be some characters it can’t handle. We can tell encode() to automatically replace those characters, but then anything not in ASCII is swapped for the unknown character, and it stays that way when we convert the bytes back to a string. The dangerous part is that there’s no way to tell which character it should have been, which means we may have just made our data unusable!
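Here’s the same sketch with ASCII in place of UTF-8:

# start with the same string as before
before = "This is the euro symbol: €"

# encode it to ASCII, replacing characters ASCII can't handle
after = before.encode("ascii", errors="replace")

# convert it back to a string
print(after.decode("ascii"))  # This is the euro symbol: ?
# there's no way to recover the original euro symbol from that '?'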

This is bad and we want to avoid doing it! It’s far better to convert all our text to UTF-8 as soon as we can and keep it in that encoding. The best time to convert non-UTF-8 input into UTF-8 is when you read in files, which we’ll talk about next.

First, however, try converting between bytes and strings with different encodings and see what happens. Notice what this does to your text. Would you want this to happen to the data you were trying to analyze?

Reading in files with encoding problems

Most files you’ll encounter will probably be encoded with UTF-8. This is what Python expects by default, so most of the time you won’t run into problems. However, sometimes you’ll get an error like this:
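The exact file name depends on the dataset you downloaded; with a placeholder name, the failing read looks something like this:

# try to read in the file with the default (UTF-8) encoding
# "my_dataset.csv" is a placeholder for whichever file you downloaded
df = pd.read_csv("my_dataset.csv")
# raises something like:
# UnicodeDecodeError: 'utf-8' codec can't decode byte ... in position 11: invalid start byte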

Notice that we get the same UnicodeDecodeError we got when we tried to decode UTF-8 bytes as if they were ASCII! This tells us that this file isn't actually UTF-8. We don't know what encoding it actually is, though. One way to figure it out is to try a bunch of different character encodings and see if any of them work. A better way, though, is to use the chardet module to automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.

I’m going to just look at the first ten thousand bytes of this file. This is usually enough for a good guess about what the encoding is and is much faster than trying to look at the whole file. (Especially with a large file this can be very slow.) Another reason to just look at the first part of the file is that we can see by looking at the error message that the first problem is the 11th character. So we probably only need to look at the first little bit of the file to figure out what’s going on.
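A sketch of that, again with the placeholder file name:

# look at the first ten thousand bytes to guess the character encoding
with open("my_dataset.csv", "rb") as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)
# something like: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}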

So chardet is 73% confident that the right encoding is “Windows-1252”. Let’s see if that’s correct:
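Passing that encoding to read_csv (placeholder file name again):

# read in the file with the encoding detected by chardet
df = pd.read_csv("my_dataset.csv", encoding="Windows-1252")

# look at the first few rows
df.head()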

Looks like chardet was right! The file reads in with no problem (although we do get a warning about datatypes) and when we look at the first few rows it seems to be fine.

Saving your files with UTF-8 encoding

Finally, once you’ve gone through all the trouble of getting your file into UTF-8, you’ll probably want to keep it that way. The easiest way to do that is to save your files with UTF-8 encoding. The good news is, since UTF-8 is the standard encoding in Python, when you save a file it will be saved as UTF-8 by default:
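For a DataFrame, that’s just a plain to_csv call with no encoding argument (the output file name here is my own):

# save our file (it will be saved as UTF-8 by default!)
df.to_csv("my_dataset-utf8.csv")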

And that’s it! In my next post on Data Cleaning, I'll discuss inconsistent values in datasets. If you have any questions, be sure to post them in the comments below.

I welcome feedback and constructive criticism and can be reached on Facebook.

Happy coding!!
