Python Character Detection — chardet

Be sure your sentiment and text analytics are actually processing characters in your target language

Dawn Moyer
Dec 3, 2020
Image by Free-Photos from Pixabay

One of the most important parts of working with technology and data is to keep learning. Not just courses, but also randomly challenging yourself to learn something completely new.

I decided to use a random number generator to pick from the 20 most-downloaded Python packages listed on Python Wheels. I will try out each package I am ‘assigned.’ My goal is to learn new things, and the bonus is to discover a hidden gem.

Random Pick: #10 chardet 3.0.4 The Universal Character Encoding Detector

What is it?

This package evaluates a block of bytes and tries to determine which character encoding it uses. It can be applied to files or web pages.

This is the list of encodings it can detect (a quick example follows the list):

  • ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
  • EUC-KR, ISO-2022-KR (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  • ISO-8859-5, windows-1251 (Bulgarian)
  • ISO-8859-1, windows-1252 (Western European languages)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)
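
To see how this works in practice, here is a minimal sketch of my own (not from the chardet docs): the same short Japanese string encoded two different ways. Note that chardet works on bytes, not on Python strings, and its guesses on very short samples can be unreliable.

# My own quick sketch (not from the chardet docs): detect() works on bytes,
# so the same text gives different results depending on how it is encoded.
import chardet

text = 'テスト'  # "test" in Japanese
for encoding in ('utf-8', 'shift_jis'):
    raw = text.encode(encoding)
    print(encoding, '->', chardet.detect(raw))
# Very short samples like this give chardet little to work with, so the
# reported encoding and confidence may vary; longer text is more reliable.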

Why would I want to use it?

Is the object your code is analyzing actually decodable text, and in what encoding? If you perform sentiment analysis or readability analysis, you want to be sure you aren’t wasting resources on bytes you can’t decode. The package also attempts to identify the language.
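
For example, a hypothetical pre-processing step (my own sketch, not part of chardet) could guess the encoding first and only pass confidently decodable text downstream. The min_confidence threshold below is an arbitrary choice for illustration.

# Hypothetical helper (my own sketch): decode raw bytes only when chardet is
# reasonably confident, and skip the document otherwise.
import chardet

def bytes_to_text(raw, min_confidence=0.5):
    result = chardet.detect(raw)
    if result['encoding'] is None or result['confidence'] < min_confidence:
        return None  # skip rather than feed undecodable bytes to sentiment analysis
    return raw.decode(result['encoding'], errors='replace')

print(bytes_to_text('Once upon a midnight dreary'.encode('utf-8')))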

My code exploring the package:

When I ran pip install chardet, I was notified that I already had this package installed. It must be a dependency of another, larger package, I imagine.

I tried a variety of websites, including non-English ones. My output did not vary much. I even restarted my kernel thinking it was some sort of memory issue. It wasn’t. The package also did not identify the language as English on any webpage. Given how much code is on webpages, this seems reasonable at first impression.

Detect

# Simple example of detecting the character encoding of a webpage with chardet.detect
import urllib.request
import chardet

data = urllib.request.urlopen('https://trends.google.com/trends/?geo=US').read()
print(chardet.detect(data))
# output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
# meaning: the package detects utf-8 on the Google Trends website, with a confidence level of 99%
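
As an aside, if you already fetch pages with the requests library, it exposes the same kind of guess through Response.apparent_encoding (requests has historically used a chardet-style detector under the hood), so you may not need to call chardet directly. A small sketch, assuming requests is installed:

# Related approach, assuming the requests library is installed.
import requests

response = requests.get('https://trends.google.com/trends/?geo=US')
print(response.encoding)           # encoding declared in the HTTP headers
print(response.apparent_encoding)  # encoding guessed from the body bytes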

Universal Detector

# Incrementally feeding data and stopping once the detector is confident enough
import urllib.request
from chardet.universaldetector import UniversalDetector

data = urllib.request.urlopen('https://trends.google.com/trends/?geo=US')
detector = UniversalDetector()
for line in data.readlines():
    print(line)               # show each raw line as it is fed in
    detector.feed(line)
    if detector.done:
        break
detector.close()
data.close()
print(detector.result)
# output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Not being impressed with the webpage analysis, I tried some text.

# Let's try some straight-up text, compliments of Mr. Poe.
import chardet

data = 'Once upon a midnight dreary, while I pondered, weak and weary, \
Over many a quaint and curious volume of forgotten lore — \
While I nodded, nearly napping, suddenly there came a tapping, \
As of some one gently rapping, rapping at my chamber door.'
print(chardet.detect(data.encode()))
# output: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
# Windows-1252 = Western European languages
# I get the same result if I don't use the backslash line continuations.

If I only use the first line of the poem, I get a different result.

data = 'Once upon a midnight dreary, while I pondered, weak and weary'
print(chardet.detect(data.encode()))
# output {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

Let’s try a different text.

data = 'This is my rifle. There are many like it, but this one is mine.'
print(chardet.detect(data.encode()))
# output {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

Again, no language. I researched online and found other examples.

print(chardet.detect('test'.encode()))
print(chardet.detect('בדיקה'.encode()))
print(chardet.detect('тест'.encode()))
print(chardet.detect('テスト'.encode()))
# output:
# {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
# {'encoding': 'utf-8', 'confidence': 0.9690625, 'language': ''}
# {'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
# {'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}
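
One pattern I noticed in my own experiments (a hedged observation, not something the docs promise): the language field seems to be filled in mainly when a legacy single-byte encoding is detected, while ASCII and UTF-8 results leave it empty.

# My own experiment: encoding Russian text in a legacy codepage appears to
# make chardet report a language, while the UTF-8 version does not.
import chardet

russian = 'Это проверка определения кодировки и языка текста. ' * 3
print(chardet.detect(russian.encode('windows-1251')))
# e.g. {'encoding': 'windows-1251', 'confidence': ..., 'language': 'Russian'}
print(chardet.detect(russian.encode('utf-8')))
# e.g. {'encoding': 'utf-8', 'confidence': ..., 'language': ''}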

Alternatives

While researching why my code wasn’t returning languages and gave inconsistent results, I found this is a common issue. Languages are hard; that much is obvious. I tried some of the proposed alternatives.

textblob

"""
textblob
https://textblob.readthedocs.io/en/dev/api_reference.html
"""
from textblob import TextBlob
mixed_text = u"""
中国 中國
"""
data = TextBlob(mixed_text)
data.detect_language()
# Be forewarned that this calls Google Translate. I found it quickly errors out with HTTP Error 429: Too Many Requests.

polyglot

""" polyglot
https://pypi.org/project/polyglot/
A recommended alternative is polyglot if you are able to install it. I'm using Windows so it was giving me fits.
"""
from polyglot.detect import Detector

mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
    print(language)
# name: English code: en confidence: 87.0 read bytes: 1154
# name: Chinese code: zh_Hant confidence: 5.0 read bytes: 1755
# name: un code: un confidence: 0.0 read bytes: 0

langdetect

"""
langdetect
https://pypi.org/project/langdetect/
"""
from langdetect import detect_langs
mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
detect_langs(mixed_text)
# output: [en:0.999996237910494]
# ==================================================================
mixed_text = u"""
中国
"""
detect_langs(mixed_text)
# output: [zh-cn:0.9999989751891187]
#==================================================================
mixed_text = u"""
тест
"""
detect_langs(mixed_text)
# output [bg:0.9999927872360368]
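
One practical note on langdetect from its own documentation: the algorithm is non-deterministic, so short or ambiguous text can come back with different guesses across runs unless you fix the seed.

# From the langdetect documentation: fix the seed to get reproducible results
# on short or ambiguous input.
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0
print(detect_langs('тест'))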

Conclusion

As a concept, this package, chardet, is definitely useful for text analytics/NLP tasks in Python. When I evaluated non-English websites, though, utf-8 was always the encoding returned.

Further research online suggests that chardet should be your last resort due to its known issues. Given the limitations, I assume this package's popularity results from its inclusion in larger packages such as textract and text2math.

So — all in all, it’s interesting to learn what is happening behind the scenes in other larger packages that use chardet and to have an intuition as to the accuracy and limitations to consider.

chardet Report Card:

Is the package worth a deep dive? Yes

Is it a Hidden Gem? No

Dawn Moyer

Written by

Data Enthusiast, fallible human. A data scientist with a background in both psychology and IT, public speaking in areas of data, career, and ethics.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
