Working with Twitter Data in Python

Published in

Analytics Vidhya

5 min read · Sep 25, 2019

Twitter is a social network that helps people share ideas quickly and concisely. In recent years it has sparked women's rights movements (#metoo, #balancetonporc, etc.) and influenced political views and the economic outlook. Journalists often post a short message on Twitter with breaking news headlines before they publish an article. Twitter makes tweets available through its API services: https://developer.twitter.com/en/docs.html.

So, what happens after you scrape the tweets? It usually depends on your objective (e.g., sentiment analysis, topic modelling, keyword extraction). In most cases, though, you must first clean the tweets to reduce noise as much as possible (the “text normalization” phase).

Let’s get started. Imagine you have scraped the following tweets with the API:

What do we see here?

LINKS

Assume that the links are not useful for our intended purpose. In that case it is better to remove them first. Regular expression matching usually works well here. Your code snippet may look as follows:

import re

# Remove links that start with http:// or https:// from the tweet
tweet = re.sub(r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)', '', tweet, flags=re.MULTILINE)
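One thing to watch with the pattern above: it expects at least two characters before the first dot of the host name, so shortened links such as https://t.co/... are not matched. As a rough sketch of an alternative, the snippet below swaps in a simpler pattern (my assumption, not the regex from this post) that removes anything starting with http:// or https:// up to the next whitespace. The function name remove_links and the sample tweets are made up for illustration.

```python
import re

# Simplified pattern (an assumption, not the article's regex): "http://" or
# "https://" followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def remove_links(text: str) -> str:
    """Return text with HTTP/HTTPS links removed and spacing tidied."""
    without_links = URL_PATTERN.sub("", text)
    # Collapse the runs of whitespace left behind where a link used to be.
    # Note: this also turns line breaks inside the tweet into single spaces.
    return " ".join(without_links.split())


# Hypothetical tweets, invented for this example.
sample_tweets = [
    "Breaking: markets rally after the announcement https://t.co/AbC123 #economy",
    "No links here, just an opinion.",
]

cleaned = [remove_links(t) for t in sample_tweets]
for tweet in cleaned:
    print(tweet)
```

The trade-off is precision: because the pattern grabs everything up to the next space, punctuation glued to the end of a link (a closing bracket or a full stop, say) is removed along with it. If that matters for your task, a stricter pattern may be the better choice.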


Written by Svitlana Galeshchuk