Text Normalization in Natural Language Processing (NLP): An Introduction [Part 1]

Ranjan Satapathy
Published in Lingvo Masino
Jan 9, 2018

Social media texts have been growing exponentially since 2003. With so much free data floating around, opinion mining [1] is becoming increasingly popular among government agencies, businesses, and academics. The extensive use of social media has given rise to a new form of written text called microtext, which poses new challenges for natural language processing tools that are usually designed for well-written text.

The influence of Twitter and WhatsApp usage on our communication

Microtext has become one of the most widespread forms of communication among users thanks to its casual writing style and colloquial tone [2], and its exponential growth is plain to see. For instance, according to CTIA, Americans sent 196.9 billion text messages in 2011, up from 12.5 billion in 2006. Another statistic shows that, as of May 2016, nearly 500 million Tweets were sent each day, or about 6,000 Tweets every second (source: [3]). Microtext, in short, has “gone viral”.

Capturing users’ opinions and sentiments is an extremely difficult task in its own right, and on microtext it also demands a deep understanding of the medium. Key features of microtext include highly relaxed spelling and a heavy reliance on emoticons and out-of-vocabulary (OOV) words, involving phonetic spelling (e.g., b4 for before) [4], emotional emphasis (e.g., cooooool for cool), and popular acronyms (e.g., otw for on the way).
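These surface phenomena can be handled with a tiny rule-based normalizer. The lookup table and the repeated-letter rule below are illustrative assumptions, not the method from [4]:

```python
import re

# Toy lookup table for acronyms and phonetic spellings (illustrative only)
ACRONYMS = {"otw": "on the way", "b4": "before"}

def normalize(token: str) -> str:
    t = token.lower()
    if t in ACRONYMS:
        return ACRONYMS[t]
    # Emotional emphasis: squeeze runs of 3+ repeated letters down to two,
    # so "cooooool" becomes "cool" (imperfect for words like "soooo").
    return re.sub(r"(.)\1{2,}", r"\1\1", t)

print(normalize("b4"))        # before
print(normalize("cooooool"))  # cool
print(normalize("otw"))       # on the way
```

Real systems replace the hand-written table with dictionaries mined from data, but the two rule families — lookup for acronyms, pattern rewriting for emphasis — reappear in most normalizers.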

Three metaphors for microtext normalization

  1. Spelling correction: Under the first metaphor, normalization is treated as a spell-checking task and correction is carried out on a word-per-word basis. This view received extensive attention in the past, and a variety of correction techniques have been proposed [5, 6]. Alternatively, [7] and [8] both proposed categorizations of microtext (e.g., abbreviation, stylistic variation, prefix-clipping), which were then used to estimate each variant’s probability of occurrence. The spelling-corrector view became especially popular for SMS messages, where [9] advanced a hidden Markov model (HMM) whose topology accounts for both “graphemic” variants (e.g., typos, omissions of repeated letters) and “phonemic” variants (e.g., spellings that resemble the word’s pronunciation). However, all of the above work normalizes each word in isolation, without considering its context.
  2. Statistical machine translation: Going in a different direction, the second metaphor views microtext as a foreign language to be translated, so normalization is framed as a statistical machine translation (SMT) task. Compared to the previous metaphor, this approach is more powerful, since it can model (context-dependent) one-to-many relationships that word-per-word correction cannot reach. However, SMT still overlooks some aspects of the task, in particular the fact that the lexical creativity found in social media messages is hard to capture in a static phrase table.
  3. Automatic speech recognition: Lastly, the third metaphor builds on the observation that microtext often sits closer to the phonemic representation of a word than to its standard spelling, since text messages contain many phonetic spellings. Microtext normalization then becomes very similar to speech recognition: decoding a word sequence over a (weighted) phonetic lattice. Although computing a phonemic representation of the message is extremely valuable, it does not solve all of the normalization challenges (e.g., acronyms and many misspellings do not resemble the phonemic representation of their in-vocabulary (IV) counterparts).
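The word-per-word view of the first metaphor can be sketched as a minimal Norvig-style edit-distance-1 corrector. The vocabulary and its frequency counts are toy assumptions standing in for a real language model:

```python
# Toy unigram counts standing in for corpus-derived word frequencies
VOCAB = {"before": 100, "cool": 80, "the": 500, "way": 60}

def edits1(word: str) -> set:
    """All strings one edit (delete/transpose/replace/insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word: str) -> str:
    if word in VOCAB:            # in-vocabulary: leave untouched
        return word
    # Among candidates one edit away, pick the most frequent vocabulary word
    candidates = [w for w in edits1(word) if w in VOCAB] or [word]
    return max(candidates, key=lambda w: VOCAB.get(w, 0))

print(correct("befor"))  # before
```

Note that the corrector has no notion of context: exactly as the section says, each token is normalized in isolation.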
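The second metaphor can be illustrated with a toy phrase table and a greedy longest-match decoder. The phrases and probabilities below are made up for illustration; a real SMT system would learn them from parallel microtext/standard-English data:

```python
# Toy phrase table: microtext phrase -> (standard phrase, probability)
PHRASE_TABLE = {
    "otw": [("on the way", 0.9), ("over the wall", 0.1)],
    "c u": [("see you", 0.95)],
    "b4": [("before", 0.9)],
}

def translate(sentence: str) -> str:
    tokens = sentence.lower().split()
    out, i = [], 0
    while i < len(tokens):
        # Greedy longest match: try a two-token phrase before a single token
        if i + 1 < len(tokens) and " ".join(tokens[i:i + 2]) in PHRASE_TABLE:
            span, i = " ".join(tokens[i:i + 2]), i + 2
        else:
            span, i = tokens[i], i + 1
        options = PHRASE_TABLE.get(span)
        out.append(max(options, key=lambda t: t[1])[0] if options else span)
    return " ".join(out)

print(translate("c u b4 noon"))  # see you before noon
```

Unlike the word-per-word corrector, the phrase table can map "c u" to "see you" and "otw" to "on the way" — exactly the one-to-many rewrites the section credits to the SMT view.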
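The third metaphor can be sketched with a crude hand-rolled phonetic key — not Soundex, Metaphone, or the algorithm from [4], just an illustrative assumption: an OOV spelling and its in-vocabulary (IV) counterpart should collapse to the same code:

```python
import re

# Digit homophones often used in phonetic spellings (illustrative subset)
DIGITS = {"2": "to", "4": "for", "8": "ate"}

def phonetic_key(word: str) -> str:
    """Crude phonetic code: expand digits, normalize digraphs, drop vowels."""
    w = word.lower()
    for digit, sound in DIGITS.items():
        w = w.replace(digit, sound)
    w = w.replace("ph", "f").replace("gh", "")   # foto/photo, nite/night
    w = re.sub(r"(.)\1+", r"\1", w)              # collapse repeated letters
    return w[0] + re.sub(r"[aeiou]", "", w[1:]) if w else w

# Index an in-vocabulary lexicon by key, then look OOV tokens up
LEXICON = ["before", "night", "photo", "cool"]
INDEX = {}
for iv in LEXICON:
    INDEX.setdefault(phonetic_key(iv), []).append(iv)

for oov in ["b4", "nite", "foto"]:
    print(oov, "->", INDEX.get(phonetic_key(oov), ["<no match>"]))
```

As the section notes, this breaks down for acronyms: phonetic_key("otw") shares nothing with "on the way", so a phonetic lattice alone cannot normalize everything.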

Our paper [4] shows that microtext normalization improves sentiment classification accuracy by 4%. The procedure is discussed in detail in the paper titled “Phonetic-Based Microtext Normalization for Twitter Sentiment Analysis”.

I’m currently working on aspects of deep learning in microtexts and will share our ideas with you soon!

References

[1] Pang, Bo, and Lillian Lee. “Opinion mining and sentiment analysis.” Foundations and Trends® in Information Retrieval 2.1–2 (2008): 1–135.

[2] F. Liu, F. Weng, and X. Jiang, “A broad-coverage normalization system for social media language,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Long Papers), 2012, pp. 1035–1044.

[3] Brandwatch: 44 Twitter Statistics for 2016 (2016)

[4] Satapathy, R., Guerreiro, C., Chaturvedi, I. and Cambria, E., 2017, November. Phonetic-Based Microtext Normalization for Twitter Sentiment Analysis. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW) (pp. 407–413). IEEE.

[5] E. Cambria and B. White, “Jumping NLP curves: a review of natural language processing research,” IEEE Computational Intelligence Magazine, vol. 9, no. 2, pp. 48–57, 2014.

[6] D. R. Recupero, V. Presutti, S. Consoli, and A. Gangemi, “Sentilo: frame-based sentiment analysis,” Cognitive Computation, vol. 7, no. 2, pp. 211–225, 2015.

[7] C. J. Hutto and E. Gilbert, “VADER: A parsimonious rule-based model for sentiment analysis of social media text,” in Eighth International AAAI Conference on Weblogs and Social Media, 2014, pp. 216–225.

[8] Z. Li and D. Yarowsky, “Unsupervised translation induction for Chinese abbreviations using monolingual corpora,” in Proceedings of ACL/HLT, 2008.

[9] B. Han and T. Baldwin, “Lexical normalisation of short text messages: Makn sens a #twitter,” in ACL, 2011, pp. 368–378.

[10] K. W. Church and W. A. Gale, “Probability scoring for spelling correction,” Statistics and Computing, vol. 1, no. 2, pp. 93–103, 1991.

Ranjan Satapathy
NLP advisor and consultant with a Ph.D. and 7 years of experience in building products.