Python Character Detection — chardet
Be sure your sentiment and text analytics are actually processing characters in your target language

One of the most important parts of working with technology and data is to keep learning. Not just courses, but also randomly challenging yourself to learn something completely new.
I decided to use a random number generator to pick from the top 20 Python packages by downloads, according to Python Wheels, and to try out whichever package I am ‘assigned.’ My goal is to learn new things, and the bonus is to discover a hidden gem.
Random Pick: #10 chardet 3.0.4 The Universal Character Encoding Detector
What is it?
This package evaluates a block of text or raw bytes and tries to determine which character encoding it uses. It can be used on files or webpages. A quick sketch after the encoding list below shows the idea in action.
These are the encodings it can detect:
- ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
- Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
- EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
- EUC-KR, ISO-2022-KR (Korean)
- KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
- ISO-8859-5, windows-1251 (Bulgarian)
- ISO-8859-1, windows-1252 (Western European languages)
- ISO-8859-7, windows-1253 (Greek)
- ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
- TIS-620 (Thai)
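Here is that sketch: it encodes the same Japanese sentence two ways (my own example, not from the chardet docs) and asks chardet to identify each. The exact confidences, and even the codec chosen for very short samples, can vary.
import chardet

# A Japanese sentence, encoded two different ways (illustrative sample text)
# (Japanese: 'This is Japanese text. I am testing character-encoding detection.')
text = 'これは日本語のテキストです。文字コード判定のテストをしています。'

print(chardet.detect(text.encode('utf-8')))
# typically something like {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

print(chardet.detect(text.encode('shift_jis')))
# typically SHIFT_JIS (or another Japanese codec), with the language field filled in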
Why would I want to use it?
Does the object your code is analyzing actually contain readable text, and if so, in which encoding? If you perform sentiment analysis or readability analysis, you want to be sure you aren’t wasting resources on invalid text. The package also attempts to identify the language.
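As a rough sketch of that guard (my own helper, not part of chardet; the raw_bytes argument and the 0.5 confidence cutoff are arbitrary assumptions), you might decode with the detected encoding before handing text to an NLP pipeline:
import chardet

def to_text(raw_bytes, min_confidence=0.5):
    """Decode raw bytes using chardet's guess; return None if the guess is too weak."""
    result = chardet.detect(raw_bytes)
    if result['encoding'] is None or result['confidence'] < min_confidence:
        return None  # skip this document instead of feeding garbage to the analysis
    return raw_bytes.decode(result['encoding'], errors='replace')

print(to_text('Nevermore.'.encode('utf-8')))  # 'Nevermore.'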
My code exploring the package:
When I ran pip install chardet, I was notified that I already have this package installed. It must be a dependency of another, larger NLP package, I imagine.
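If you want to check which installed packages pull chardet in, one possible sketch (assuming Python 3.8+ for importlib.metadata; the names it prints depend entirely on your environment) is:
import importlib.metadata as metadata

# List every installed distribution that declares chardet as a requirement
for dist in metadata.distributions():
    for req in (dist.requires or []):
        if req.split(';')[0].strip().lower().startswith('chardet'):
            print(dist.metadata['Name'])
            break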
I tried a variety of websites, including non-English websites. My output did not vary much. I even restarted my kernel, thinking it was some sort of memory issue. It wasn’t. The package also did not identify the language as English on an English-language webpage. Given how much code is on webpages, this seems reasonable at first glance.
Detect
# Simple example of detecting the character encoding of a webpage with chardet.detect
import urllib.request
import chardet

data = urllib.request.urlopen('https://trends.google.com/trends/?geo=US').read()
print(chardet.detect(data))
# output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
# meaning: chardet detects utf-8 on the Google Trends website, with a confidence level of 99%
Universal Detector
# Incrementally feed data and stop once the detector has seen enough to be confident
import urllib.request
from chardet.universaldetector import UniversalDetector

data = urllib.request.urlopen('https://trends.google.com/trends/?geo=US')
detector = UniversalDetector()
for line in data.readlines():
    print(line)          # show each raw line as it is fed to the detector
    detector.feed(line)
    if detector.done:
        break
detector.close()
data.close()
print(detector.result)
# output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
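A related pattern from the chardet documentation is reusing one detector across several files by calling reset() between them; a small sketch, with the file names purely hypothetical:
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for path in ['review_1.txt', 'review_2.txt']:   # hypothetical file names
    detector.reset()                            # clear state left over from the previous file
    with open(path, 'rb') as handle:
        for line in handle:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(path, detector.result)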
Not being impressed with the webpage analysis, I tried some plain text.
## Let's try some straight-up text, compliments of Mr Poe.
import chardet

data = 'Once upon a midnight dreary, while I pondered, weak and weary, \
Over many a quaint and curious volume of forgotten lore — \
While I nodded, nearly napping, suddenly there came a tapping, \
As of some one gently rapping, rapping at my chamber door.'
print(chardet.detect(data.encode()))
# output: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
# Windows-1252 = Western European languages
# I get the same result if I don't use the line continuations.
If I only use the first line of the poem, I get a different result.
data = 'Once upon a midnight dreary, while I pondered, weak and weary'
print(chardet.detect(data.encode()))
# output: {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
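My guess as to why (an inference on my part, not something the chardet docs state): the full stanza contains a non-ASCII em dash, while the first line is pure ASCII, so there is nothing to push the detector beyond ascii.
line_with_dash = 'Over many a quaint and curious volume of forgotten lore — '
print([ch for ch in line_with_dash if ord(ch) > 127])
# ['—']  the em dash is the only character outside the ASCII range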
Let’s try a different text.
data = 'This is my rifle. There are many like it, but this one is mine.'
print(chardet.detect(data.encode()))
# output: {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
Again, no language. I researched online and found other examples.
print(chardet.detect('test'.encode()))
print(chardet.detect('בדיקה'.encode()))
print(chardet.detect('тест'.encode()))
print(chardet.detect('テスト'.encode()))
# output:
# {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
# {'encoding': 'utf-8', 'confidence': 0.9690625, 'language': ''}
# {'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
# {'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}
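As far as I can tell, chardet only fills in the language field when one of its legacy, language-specific probers wins (encodings from the list above such as windows-1251 or SHIFT_JIS); ascii and utf-8 carry no language hint. A hedged sketch of that idea, where the exact codec and confidence chardet reports may vary with the sample:
import chardet

# The same Russian sentence in utf-8 and in the legacy windows-1251 codec
# (Russian: 'test for detecting the encoding of Russian text')
text = 'тест на определение кодировки русского текста'

print(chardet.detect(text.encode('utf-8')))
# utf-8 detected; the language field stays ''

print(chardet.detect(text.encode('windows-1251')))
# a Cyrillic codec such as windows-1251 is detected, and for a sample like
# this the language field is typically reported as 'Russian'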
Alternatives
While researching why my code wasn’t returning languages and was giving inconsistent results, I found this is a common issue. Languages are hard, that much is obvious. I looked into some of the proposed alternatives.
textblob
"""
textblob
https://textblob.readthedocs.io/en/dev/api_reference.html
"""from textblob import TextBlob
mixed_text = u"""
中国 中國
"""
data = TextBlob(mixed_text)
data.detect_language()# Be forewarned that this calls Google Translate. I found it quickly errors out HTTP Error 429: Too Many Requests
polyglot
""" polyglot
https://pypi.org/project/polyglot/
A recommended alternative is polyglot if you are able to install it. I'm using Windows so it was giving me fits.
"""from polyglot.detect import Detectormixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
print(language)# name: English code: en confidence: 87.0 read bytes: 1154
# name: Chinese code: zh_Hant confidence: 5.0 read bytes: 1755
# name: un code: un confidence: 0.0 read bytes: 0
langdetect
"""
langdetect
https://pypi.org/project/langdetect/
"""
from langdetect import detect_langsmixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""detect_langs(mixed_text)
# output: [en:0.999996237910494]#===================================================================mixed_text = u"""
中国
"""detect_langs(mixed_text)
# output[zh-cn:0.9999989751891187]#==================================================================
mixed_text = u"""
тест
"""detect_langs(mixed_text)
# output [bg:0.9999927872360368]
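One caveat from the langdetect documentation that relates to the inconsistent results above: its algorithm is non-deterministic, so short or ambiguous text (like тест, detected as Bulgarian here) can come back differently between runs. Seeding the factory makes the output reproducible:
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0       # make langdetect deterministic across runs
print(detect_langs('тест'))    # the same candidate list on every run now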
Conclusion
As a concept, this package, chardet, is definitely useful for text analytics/NLP tasks in Python. When I attempted to evaluate non-English websites, utf-8 was always the encoding returned.
Further research on the web suggests that chardet should be your last resort due to its known issues. Given the limitations, I would assume this package's popularity results from its inclusion in larger packages such as textract and text2math.
So, all in all, it’s interesting to learn what is happening behind the scenes in the larger packages that use chardet, and to build an intuition for the accuracy and limitations to keep in mind.
chardet Report Card:
Is the package worth a deep dive? Yes
Is it a Hidden Gem? No