Feature Extraction from Text (text data preprocessing)

Adarsh Verma

Published in

Deep Data Science

5 min read · Apr 27, 2019

[ This is part of 100 Days of ML ]

0) About

The main objective of this post is to explain feature extraction from text. The dataset used in this project is a collection of tweets from thousands of users on the trending topic #AvengersEndgame. It contains around 10 thousand tweets scraped from Twitter. Code for this project can be found at the end of this post. The dataset can be found here:
Dataset: https://www.kaggle.com/kavita5/twitter-dataset-avengersendgame

Tweets are text with rich features like keywords, usernames, sentiments, etc. Here’s a look at the tweets:

RT @HelloBoon: Man these #AvengersEndgame ads are everywhere https://t.co/Q0lNf5eJsX
RT @Marvel: We salute you, @ChrisEvans! #CaptainAmerica #AvengersEndgame https://t.co/VlPEpnXYgm
RT @MCU_Direct: The first NON-SPOILER #AvengersEndgame critic reactions are here and nearly all are exceptionally positive, with many prais…
RT @Renner4Real: Ready to rock  !   #excited #avengersendgame #presstourcontinues #worldpremiere #endgame https://t.co/KXpKNJl9aq
RT @Avengers: We’re with him ‘til the end of the line. #WinterSoldier #AvengersEndgame https://t.co/Xi4cYqWgDR
RT @Variety: #AvengersEndgame first reactions: 'Most emotional, most epic MCU film' https://t.co/w4cojZzhPl
RT @HelloBoon: Man these #AvengersEndgame ads are everywhere https://t.co/Q0lNf5eJsX
RT @Avengers: Destiny has arrived, Josh Brolin! #Thanos #AvengersEndgame https://t.co/klb2Zrk0pr
RT @Marvel: We salute you, @ChrisEvans! #CaptainAmerica #AvengersEndgame https://t.co/VlPEpnXYgm
RT @itsjustanx: Scarlett Johansson and Brie Larson will defeat Thanos. #AvengersEndgame https://t.co/LFBfYk0yxV
"RT @softerstark: heyy y'all please share these pics and make sure to use the tag!

Feature Extraction

1. Number of keywords: Keywords are powerful words used for specific purposes, and they give some idea of what the text is about. We can extract both the keywords themselves and their count from the text. This feature could be helpful in many ways; for example, when classifying tweets with and without keywords, or classifying them by type of keyword. We will build a keyword extractor for tweets in this post.

2. Number of users tagged: When users are tagged in a tweet, the tweet may be a conversation between two parties, some kind of acknowledgment, or a way of gaining attention. The number of users tagged can be calculated by counting the words that start with @ in the tweet. Make a new feature in the data frame and store the user count. The username extractor program can be found here.

3. Number of numerical values: Numeric values in a tweet could refer to anything from a year to an amount of money. Depending on the problem, this feature may or may not be useful.

4. Number of UPPERCASE words: UPPERCASE words could be abbreviations like OMG, TGIF, TTYL, etc., or could be used to express excitement, anger, or rage, which makes identifying these words a worthwhile step. Combined with a sentiment score, they can help classify tweets into different categories of emotion.
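
The four count-based features above can be sketched with the standard library alone. This is a minimal illustration, not the author's code: it assumes that "keywords" means hashtags, the helper name `count_features` and the regular expressions are hypothetical choices, and it strips links and ignores the retweet marker "RT" so they do not inflate the counts.

```python
import re

# Hypothetical patterns chosen for illustration; adjust to your data.
HASHTAG_RE = re.compile(r"#\w+")                    # assumed meaning of "keywords"
MENTION_RE = re.compile(r"@\w+")                    # words that start with @
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")        # standalone integers or decimals
URL_RE = re.compile(r"https?://\S+")


def count_features(tweet):
    """Return simple count features for one raw (uncleaned) tweet."""
    # Drop links first so fragments of a URL are not counted as words or numbers.
    text = URL_RE.sub(" ", tweet)
    # Strip surrounding punctuation from each whitespace-separated token.
    tokens = (tok.strip(".,!?:;'\"") for tok in text.split())
    # Uppercase words: alphabetic, all caps, at least 2 letters, and not "RT".
    upper = [t for t in tokens
             if t.isalpha() and t.isupper() and len(t) > 1 and t != "RT"]
    return {
        "n_hashtags": len(HASHTAG_RE.findall(text)),
        "n_mentions": len(MENTION_RE.findall(text)),
        "n_numbers": len(NUMBER_RE.findall(text)),
        "n_uppercase": len(upper),
    }


sample = ("RT @Marvel: We salute you, @ChrisEvans! "
          "#CaptainAmerica #AvengersEndgame https://t.co/VlPEpnXYgm")
print(count_features(sample))
```

If the tweets live in a pandas data frame with a column named, say, `text` (an assumed name), each count could be stored as a new column with something like `df["n_mentions"] = df["text"].apply(lambda t: count_features(t)["n_mentions"])`.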

Before moving further, we need to clean the data a little bit to remove punctuation and the “\r” and “\n” characters.

Text Cleaning

We are going to keep the words and spaces and remove everything else for further feature processing. This step should be done after extracting features like hashtags and tagged users, because it will also remove ‘#’ and ‘@’.
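
A cleaning step along these lines might look like the sketch below. The function name `clean_tweet` is hypothetical, removing links is an extra assumption not stated above, and the character class keeps only ASCII letters and digits, so accented or non-Latin words would be dropped.

```python
import re


def clean_tweet(tweet):
    """Keep letters, digits and spaces; drop everything else."""
    text = re.sub(r"https?://\S+", " ", tweet)   # assumption: links are noise
    text = re.sub(r"[\r\n]+", " ", text)         # line breaks become spaces
    text = re.sub(r"[^A-Za-z0-9 ]+", "", text)   # removes punctuation, '#', '@'
    return re.sub(r"\s+", " ", text).strip()     # collapse repeated spaces


raw = "RT @Marvel: We salute you, @ChrisEvans!\r\n#CaptainAmerica https://t.co/VlPEpnXYgm"
print(clean_tweet(raw))
# RT Marvel We salute you ChrisEvans CaptainAmerica
```

Because '#' and '@' disappear here, hashtags and mentions become ordinary words, which is why the counts that rely on those symbols have to be taken from the raw tweet first.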

5) Average word length: Different kinds of text use different kinds of words; scientific reports usually have a higher average word length than normal conversation or daily news. So average word length could help differentiate the type of text.
Average word length = (sum of the lengths of all the words in the tweet or doc) / (total number of words in the tweet or doc)

6) Number of words: The basic idea is to extract the number of words in each tweet and use the count as a feature. The main intuition behind this technique is that some text needs more words than others to express itself. For example, people may write more when they are happy than when they are angry.
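
Both the word count and the average word length can be computed directly from a cleaned tweet. The sketch below assumes words are separated by whitespace; the function names are hypothetical, and an empty tweet is given an average of 0.0 to avoid dividing by zero.

```python
def word_count(text):
    """Number of whitespace-separated words in the text."""
    return len(text.split())


def average_word_length(text):
    """Sum of word lengths divided by the number of words (0.0 if empty)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


cleaned = "Man these ads are everywhere"
print(word_count(cleaned))            # 5
print(average_word_length(cleaned))   # 4.8  (3 + 5 + 3 + 3 + 10 = 24, and 24 / 5 = 4.8)
```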

7) Sentiment score calculation
Polarity, subjectivity, and intensity can be calculated from the tweets and used as features. In this project I calculated the polarity and sentiment of each tweet.
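
To make the idea of a polarity score concrete, here is a toy lexicon-based sketch. It is not the method used in this project: the word lists are tiny, hand-made, and purely illustrative, and the function name `polarity` is hypothetical. In practice a sentiment library with a proper lexicon or model would replace these sets.

```python
# Tiny hand-made lexicon, for illustration only (not a real sentiment resource).
POSITIVE = {"positive", "epic", "excited", "salute", "ready", "love"}
NEGATIVE = {"hate", "boring", "worst", "sad", "angry"}


def polarity(text):
    """Score in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = [w.strip(".,!?:;'\"#").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    # A tweet with no lexicon words is treated as neutral.
    return 0.0 if total == 0 else (pos - neg) / total


print(polarity("Most emotional, most epic MCU film"))   # 1.0
print(polarity("Man these ads are everywhere"))         # 0.0
```

The resulting score can be stored per tweet like any other numeric feature.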

8) Bag of Words and TF-IDF calculation

Bag of words is a technique that uses each word’s frequency in a doc/sentence/tweet as a feature. TF-IDF usually improves on bag of words because it weighs how important a word is to a document, down-weighting words that appear in most documents, instead of relying on raw frequency alone.
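
As a rough sketch of the difference, the code below builds bag-of-words counts and then a plain TF-IDF weighting with the standard library. It assumes the simplest textbook definitions, tf = count / words in the doc and idf = log(N / number of docs containing the word); libraries often use smoothed variants, so their numbers will differ. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(doc):
    """Word -> raw count for one document (lowercased, whitespace tokens)."""
    return Counter(doc.lower().split())


def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return weights


docs = ["thanos wins", "thanos loses", "avengers assemble"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, w)
```

In the first document, "wins" receives a higher weight than "thanos" because "thanos" also appears in another document; a word that appears in every document gets a weight of zero under this definition.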

These features need more text preprocessing, so we will discuss them in detail in a future post.

Code can be found here.

#NLP #textprocessing #featureextraction #machinelearning
