Keywords & Usernames Extraction From Tweets [Text Mining]

Adarsh Verma
Published in Deep Data Science
3 min read, Apr 28, 2019
Photo by José Alejandro Cuffia on Unsplash

[ This is part of 100 Days of ML ]

Keywords and usernames extracted from tweets can be useful in many ways. In this post, I will explain why they are useful and how I built a small keyword and username extractor in Python. The code can be found at the end of this post.

I used a dataset of tweets from thousands of users on the trending topic #AvengersEndgame. It contains around 10 thousand tweets scraped from Twitter. The dataset can be found here:
Dataset: https://www.kaggle.com/kavita5/twitter-dataset-avengersendgame

A look at the tweets from the dataset:

Keywords Extraction: Keywords are important words that provide information about the text. In our dataset, the keyword #AvengersEndgame means the tweet is about the movie Avengers: Endgame. Similarly, keywords can be used for market research, for finding trending topics, for seeing what people are gossiping about, and so on. Combined with a sentiment score, they can show what people are talking about and how they feel about a certain product or company. For example, if a tweet contains #AppleWatch and the sentiment of the tweet is positive, there is a pretty good chance that the user is saying something good about the Apple Watch. On the other hand, if the sentiment score is negative, the user is probably saying something bad about the watch.

In tweets, keywords can be easily found through hashtags. For our dataset, I went through every tweet and extracted the words that start with the ‘#’ character. The keywords are then stored in the data frame as a new column. Here’s the code and a look at the new column:
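The hashtag step can be sketched roughly as follows. This is a minimal illustration, not the exact code from the project: it uses a regular expression instead of splitting on whitespace, the function name `extract_hashtags` is hypothetical, and the sample tweets are made up rather than taken from the dataset.

```python
import re

# Assumption: a hashtag is '#' followed by letters, digits, or underscores,
# and it must not be glued to the end of another word.
HASHTAG_PATTERN = re.compile(r"(?<!\w)#\w+")


def extract_hashtags(tweet):
    """Return every hashtag in the tweet, keeping the leading '#'."""
    return HASHTAG_PATTERN.findall(tweet)


# Made-up sample tweets, not rows from the actual dataset.
sample_tweets = [
    "Just watched #AvengersEndgame, what a ride! #Marvel",
    "No spoilers please",
]

keywords = [extract_hashtags(tweet) for tweet in sample_tweets]
print(keywords)
```

With a pandas data frame, the same function could be applied to the tweet column to create the new column, for example `df["keywords"] = df["text"].apply(extract_hashtags)`, where the column name `text` is an assumption about the dataset.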

Usernames Extraction: Tweets very often contain usernames, which means that the tweet could be a conversation between two parties, some kind of acknowledgment, or a way of gaining attention. How you use them depends on your use case, but usernames can be a very useful feature. They can help fetch more information and find out the interests of people. One pretty good application of this is in marketing. For example, if someone follows Twitter Marketing (username: @TwitterMktg), that suggests the user is interested in information and news from Twitter’s marketing team.

Since usernames start with the ‘@’ character, I fetched all the words that start with this character; these are the usernames. Here’s the code and a glimpse of the usernames:
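A rough sketch of the username step is below. Again, this is an illustration under assumptions rather than the project's own code: the name `extract_usernames` is hypothetical, the sample tweet is made up, and the pattern skips an ‘@’ that sits inside a word so that email addresses are not picked up as usernames.

```python
import re

# Assumption: a mention is '@' followed by letters, digits, or underscores.
# The lookbehind rejects an '@' that directly follows a word character,
# which is the case inside an email address.
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def extract_usernames(tweet):
    """Return every @username mentioned in the tweet, keeping the '@'."""
    return MENTION_PATTERN.findall(tweet)


# Made-up sample tweet, not a row from the actual dataset.
sample = "RT @TwitterMktg: thanks @some_user, mail me at bob@example.com"
print(extract_usernames(sample))
```

The trailing colon and comma are left out because `\w+` stops at the first character that is not a letter, digit, or underscore.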

More can be done with this strategy:

  • Links extraction: Website links can be extracted from the tweets in the same way as described in the code (link can be found at the end). People usually include a link in a tweet to promote their products or brand.
  • Abbreviations: Words like OMG, TBH (To Be Honest), AMA (Ask Me Anything), IMO/IMHO (In My Opinion / In My Humble Opinion), and NSFW (Not Safe For Work) are short but powerful. Each of them says a lot about the text and about what the user wants to say. So try building your own abbreviation miner.
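Both ideas can be sketched in a few lines. This is only a starting point under stated assumptions: the function names are hypothetical, the abbreviation list is just the handful mentioned above, the URL pattern is deliberately simple (it will keep any punctuation attached to the end of a link), and the sample tweet is made up.

```python
import re

# Assumption: a link starts with http:// or https:// and runs to the next space.
URL_PATTERN = re.compile(r"https?://\S+")

# Only the abbreviations mentioned in this post; extend as needed.
ABBREVIATIONS = {"OMG", "TBH", "AMA", "IMO", "IMHO", "NSFW"}


def extract_links(tweet):
    """Return every http(s) link found in the tweet."""
    return URL_PATTERN.findall(tweet)


def extract_abbreviations(tweet):
    """Return known abbreviations in the tweet, upper-cased, in order."""
    tokens = re.findall(r"[A-Za-z]+", tweet)
    return [token.upper() for token in tokens if token.upper() in ABBREVIATIONS]


# Made-up sample tweet, not a row from the actual dataset.
sample = "OMG the new trailer is out https://example.com/trailer tbh I cried"
print(extract_links(sample))
print(extract_abbreviations(sample))
```

Matching abbreviations case-insensitively catches "tbh" as well as "TBH", at the cost of occasional false positives when a lowercase word happens to coincide with an abbreviation.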

Many other features can be extracted from the text, like word count, average word length, and sentiment score. Find how and why here.
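As a small illustration of the first two of these features, here is a sketch that treats a word as anything separated by whitespace. The function names are hypothetical, and the sentiment score is left out because it needs a separate sentiment model or library.

```python
def word_count(text):
    """Number of whitespace-separated words in the text."""
    return len(text.split())


def average_word_length(text):
    """Mean number of characters per word, or 0.0 for empty text."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


# Made-up sample text, not a row from the actual dataset.
sample = "I love this movie"
print(word_count(sample))           # 4
print(average_word_length(sample))  # 3.5
```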

Code can be found on GitHub.
