NLP feature extraction from LIWC in Python

Published in Bright AI · 3 min read · Oct 28, 2022

What is LIWC and why is it important?

LIWC (Linguistic Inquiry and Word Count) is a lexicon used in Natural Language Processing to extract not only the emotions and sentiments behind a text but also a wide variety of psycholinguistic characteristics of its author. It is a helpful tool for inferring the mindset and behavior of the people behind the text.

For instance, are people talking about work or family? Does their text show certainty or trust, or are they talking about a vacation or an achievement? All these subtleties can be captured using the LIWC lexicon (dictionary), which assigns English words to different categories, or buckets.

How LIWC is more than just sentiment extraction

Extracting features using LIWC can be very valuable in machine learning classification tasks where two classes sound very similar yet are different. For example, both of the following texts express positive sentiment, but LIWC gives us a deeper understanding of how they differ:

“Congratulations! you must be a happy sister” → family

“Congratulations! you must be a happy boss” → work
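To make the idea concrete, here is a minimal sketch of category lookup. The two-entry lexicon is a toy that only mirrors the two sentences above; it is an assumption for illustration and is not taken from the real LIWC dictionary file.

```python
from collections import Counter

# Toy lexicon for illustration only; NOT the real LIWC dictionary.
TOY_LEXICON = {
    "sister": ["family"],
    "boss": ["work"],
}

def categorize(text):
    """Count the toy-lexicon categories hit by the words in `text`."""
    words = text.lower().replace("!", " ").split()
    return Counter(cat for word in words for cat in TOY_LEXICON.get(word, []))

print(categorize("Congratulations! you must be a happy sister"))
print(categorize("Congratulations! you must be a happy boss"))
```

Both sentences would score as positive in a plain sentiment model, yet the category counts separate them: the first one hits family and the second one hits work.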

There are 69 different categories, which makes LIWC a gold mine for linguistic feature extraction. The family and work categories used above are two of them.

Implementing LIWC feature extraction in Python

Step 1: Install a LIWC parser for Python (for example, the liwc package from the liwc-python project referenced in Step 4) and import the required libraries.

Step 2: Read the text dataset, clean it, and save the result as a separate column, clean_text.
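The exact cleaning rules depend on your data. As a rough sketch, assuming English text where URLs, digits, and punctuation carry no signal you want to keep, a cleaning function could look like this (the name clean_text and the rules themselves are assumptions, not a fixed recipe):

```python
import re

def clean_text(text):
    """Lowercase, drop URLs, and keep only letters, apostrophes and spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z'\s]", " ", text)      # remove digits, punctuation, emoji
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

print(clean_text("Congratulations! You must be a HAPPY sister :) https://example.com/x"))
```

With pandas, this could be applied as df_train["clean_text"] = df_train["text"].apply(clean_text), where "text" stands for whatever your raw text column is called.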

Step 3: Tokenize the cleaned text and store the result as df_train["tokens"].
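Any word-level tokenizer works here, as long as it produces lowercase tokens that can be matched against the lexicon. A simple regex-based sketch, which is an assumption about the tokenizer rather than a requirement of LIWC, might be:

```python
import re

def tokenize(text):
    """Yield lowercase word tokens; apostrophes are kept inside words."""
    for match in re.finditer(r"[a-z]+(?:'[a-z]+)?", text.lower()):
        yield match.group(0)

tokens = list(tokenize("you must be a happy sister"))
print(tokens)
```

With pandas, the tokens column could then be built as df_train["tokens"] = df_train["clean_text"].apply(lambda t: list(tokenize(t))).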

Step 4: Download LIWC2015_English.dic. The dictionary requires a LIWC license, which can be purchased; how to obtain the lexicon is discussed at https://github.com/chbrown/liwc-python/issues/5. If you would like to use the 2007 version of LIWC, it can be found at https://github.com/nikiparmar/Twitter-Sentiment-Analysis/blob/master/LIWC2007dictionary%20poster.xls

Step 5: Parse LIWC2015_English.dic and count how many of the tokens in df_train["tokens"] belong to one category. Here the category is family (Family), which counts words related to family, such as son, daughter, and thanksgiving. Other categories of the LIWC lexicon can be extracted from df_train["tokens"] in the same way.
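The counting step can be sketched as follows, assuming the liwc package from the liwc-python project and a licensed copy of the dictionary file. The helper names load_parser and count_category are made up for this example. The counting helper takes the parser as an argument, so it can be tried out with any function that maps a token to its categories.

```python
from collections import Counter

def load_parser(dic_path):
    """Load a LIWC .dic file with the liwc package (pip install liwc)."""
    import liwc  # imported here so count_category works without the package
    parse, category_names = liwc.load_token_parser(dic_path)
    return parse, category_names

def count_category(tokens, parse, category):
    """Return how many tokens fall into `category` according to `parse`."""
    counts = Counter(cat for token in tokens for cat in parse(token))
    return counts[category]

# Usage, assuming LIWC2015_English.dic is in the working directory:
# parse, category_names = load_parser("LIWC2015_English.dic")
# print(category_names)  # look up the exact label of the family category
# df_train["family"] = df_train["tokens"].apply(
#     lambda toks: count_category(toks, parse, family_label))
```

Here family_label is a placeholder: the exact category string depends on the dictionary file, so it is worth picking it from category_names rather than typing it from memory. Repeating the last line with a different label gives one feature column per category.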