Besides Word Embedding, Why You Need to Know Character Embedding
In 2013, Tomas Mikolov [1] introduced word embedding to learn better-quality word representations. At that time, word embedding was the state of the art for dealing with text, and doc2vec was introduced later on. What if we look at it from another angle? Instead of aggregating from words to documents, is it possible to aggregate from characters to words?
In this article, you will go through the what, why, when, and how of Character Embedding.
What?
Xiang Zhang and Yann LeCun [2] introduced the character CNN. They found that characters carry key signals that can improve model performance. In the paper, an alphabet of 70 characters is defined, consisting of 26 English letters, 10 digits, 33 special characters, and the new-line character, as shown below.
# Copied from the Char CNN paper
abcdefghijklmnopqrstuvwxyz0123456789
-,;.!?:'"/\|_@#$%^&*~`+-=<>()[]{}
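To make the mapping concrete, here is a minimal Python sketch (not taken from the paper's released code) that builds this alphabet and quantizes text into character indices. The input length of 1014 follows the paper's setup; the function and variable names are illustrative.

# Minimal sketch of Char CNN-style character quantization.
# The 69 printable characters below plus the new-line character
# make up the paper's 70-character alphabet.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_to_index = {ch: i for i, ch in enumerate(alphabet)}

def quantize(text, max_len=1014):
    """Map each character to its alphabet index; characters outside the
    alphabet get -1, which the paper encodes as an all-zero vector."""
    text = text.lower()[:max_len]
    return [char_to_index.get(ch, -1) for ch in text]

print(quantize("Hello, world!"))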
On the other hand, the Google Brain team introduced "Exploring the Limits of Language Modeling" and released the lm_1b model, which includes 256 character vectors (covering 52 upper- and lowercase letters plus digits and special characters), each with a dimension of just 16. By comparison, word embeddings often have dimensions up to 300, and the number of word vectors is huge.
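The following rough sketch shows why a character embedding table is so compact. The 256 vectors of dimension 16 come from lm_1b as described above; the 100,000-word vocabulary and dimension of 300 are illustrative assumptions for a typical word-embedding table, not figures from the paper.

import numpy as np

# Character table sizes from lm_1b; word table sizes are assumed for comparison.
char_vocab, char_dim = 256, 16          # lm_1b: 256 character vectors, dim 16
word_vocab, word_dim = 100_000, 300     # a typical word-embedding setup

char_table = np.random.randn(char_vocab, char_dim).astype(np.float32)
word_table = np.random.randn(word_vocab, word_dim).astype(np.float32)

print(f"char table: {char_table.nbytes / 1e6:.2f} MB")   # ~0.02 MB
print(f"word table: {word_table.nbytes / 1e6:.2f} MB")   # ~120 MB

The character table fits in a few kilobytes, while the word table costs tens of megabytes, which is one practical reason to aggregate from characters to words.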