Collecting Tweets with Python

Lucas Martiniano
Published in Analytics Vidhya · May 27, 2021

Collect, process and store Twitter data using the Tweepy Python module

Photo by Brett Jordan on Unsplash

Twitter is a widely used channel for sharing thoughts, opinions and experiences around the world, which makes it a great source of media and text content for analysis and insight.

Furthermore, Twitter offers a feature for collecting tweets about a given subject: by tracking tweets related to specific words, you can gather information about trending topics, people, hashtags or any other theme.

This article describes a way to consume this feature from Python using the Tweepy library.

Tracking

In order to access Twitter data programmatically, you need to apply for a Twitter Developer account and obtain your own API keys. This process is a little time consuming, but it is required to proceed.

To start coding, create a Python script file and set the variables below using your keys.

CONSUMER_KEY = 'XXXXXXX'
CONSUMER_SECRET = 'XXXXXXX'
ACCESS_TOKEN = 'XXXXXXX'
ACCESS_TOKEN_SECRET = 'XXXXXXX'
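
As a side note, hardcoding credentials in the script is not ideal. A minimal alternative sketch, assuming the keys are exported as environment variables (the variable names below are just an example convention, not a Twitter standard):

import os

# read credentials from the environment instead of hardcoding them
# (these environment variable names are an assumption of this sketch)
CONSUMER_KEY = os.environ['TWITTER_CONSUMER_KEY']
CONSUMER_SECRET = os.environ['TWITTER_CONSUMER_SECRET']
ACCESS_TOKEN = os.environ['TWITTER_ACCESS_TOKEN']
ACCESS_TOKEN_SECRET = os.environ['TWITTER_ACCESS_TOKEN_SECRET']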

Tweepy

There are many ways to access the Twitter API with Python. In this article, the Tweepy library will be used. To install this Python module with pip, run:

$ pip install tweepy

Then, import the Tweepy module and use your keys for authentication, creating a Twitter API object that provides access.

import tweepy

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)
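
Optionally, before streaming, a quick sanity check can confirm that the keys work; this sketch assumes Tweepy's verify_credentials call behaves as in Tweepy 3.x:

# optional: confirm that authentication succeeded
me = api.verify_credentials()
print('Authenticated as:', me.screen_name)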

Streaming

Using the Tweepy module, it's possible to access and customize the tweet streaming feature, which is useful for obtaining a high volume of tweet data, since it returns tweets in real time as they are published.

Setting tracking behavior

To define what the program will do whenever a tweet arrives, create a class that extends Tweepy's StreamListener and override the on_status method with the desired behavior. Below is an example that just prints the tweet text.

class TweetListener(tweepy.StreamListener):
    def on_status(self, tweet):
        print(tweet.text)

Tweepy offers a class called Stream that requires authentication credentials and a listener to be instantiated. So, create a Stream object that receives the auth attribute from the api variable defined earlier and an instance of the TweetListener class above.

listener = TweetListener()
stream = tweepy.Stream(auth=api.auth, listener=listener)

Start stream

There are several streaming options available through Tweepy. To start streaming tweets, you can use the filter method of the stream object. With it, it's possible to track tweets containing a list of words, follow tweets from specific users and even select the languages that will be considered (a follow example appears after the code below).

The code below, for example, starts printing tweets written in English that contain words related to COVID-19 ("coronavirus", "covid", "covid19", "covid-19"). This is just an example; feel free to change the filter parameters.

# filter parameters
words = ['coronavirus', 'covid', 'covid19', 'covid-19']
languages = ['en']
# streaming...
stream.filter(track=words, languages=languages)
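
As a variation, the same filter method also accepts a follow parameter with a list of user ID strings (not screen names). A minimal sketch, where the IDs are placeholders:

# follow tweets from specific accounts instead of tracking keywords
# (the IDs below are placeholders -- replace them with real user IDs)
user_ids = ['123456789', '987654321']
stream.filter(follow=user_ids, languages=languages)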

So far, the script only prints tweets. Once started, it won't end until it is manually stopped (by pressing CTRL + C or killing the system process), and it does not record any information. Thus, for further analysis, it's necessary to label and store the data.

Auto cancel

A way to achieve the recording feature is to update the TweetListener class, adding a list attribute that is filled by the on_status method. Since the streaming process is infinite, it's also necessary to set a threshold that automatically cancels the stream: once it's reached, on_status returns False.

# set default threshold value
DEFAULT_THRESHOLD = 10

# previous listener with changes
class TweetListener(tweepy.StreamListener):
    def __init__(self, threshold=DEFAULT_THRESHOLD):
        super().__init__()
        self.threshold = threshold
        self.tweets = []

    def on_status(self, tweet):
        if len(self.tweets) < self.threshold:
            print(tweet)
            self.tweets.append(tweet)
        else:
            return False

Labels and fields

A single tweet carries a lot of data, such as the text content, media, favorite count, author and so on. For more details, take a look at the Tweet object page in the Twitter Developer docs. Each use case requires different information, so choose the fields that matter for your case and discard the rest.
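
For instance, a small helper like the hypothetical describe function below could be called from on_status to inspect a few commonly available attributes of a Tweepy Status object (assuming the standard v1.1 Tweet payload):

def describe(tweet):
    # a few fields commonly available on a streamed Status object
    print(tweet.id)                    # unique tweet id
    print(tweet.created_at)            # creation timestamp
    print(tweet.user.screen_name)      # author's handle
    print(tweet.favorite_count)        # number of likes so far
    print(tweet.entities['hashtags'])  # hashtags detected in the text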

It is important to mention that if the tweet text exceeds 140 characters, the text attribute will be truncated. In this case, the tweet object will have an extended_tweet attribute, so to access the full text, use extended_tweet['full_text'].

# previous listener with changes
class TweetListener(tweepy.StreamListener):
    def __init__(self, threshold=DEFAULT_THRESHOLD):
        super().__init__()
        self.threshold = threshold
        self.tweets = []

    def on_status(self, tweet):
        if len(self.tweets) < self.threshold:
            text = (
                tweet.extended_tweet['full_text']
                if hasattr(tweet, 'extended_tweet')
                else tweet.text
            )
            desired_fields = [tweet.id, text]
            print(desired_fields)
            self.tweets.append(desired_fields)
        else:
            return False

Storing

At this point, all tracked tweets are stored in the tweets attribute inside the TweetListener object. We can use Pandas to create a DataFrame and save it to a CSV file:

import pandas as pd

columns = ['id', 'text']
output_file = 'tweets.csv'
tweets = pd.DataFrame(listener.tweets, columns=columns)
tweets.to_csv(output_file, index=False)
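
To verify the output, the stored data can be read back into a DataFrame (a quick check, assuming the same file name defined above):

# quick check: load the stored tweets back
stored = pd.read_csv(output_file)
print(stored.head())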

Next Steps

Try analyzing the text data you've stored and extracting information from it. For example, study the users mentioned in tweets about a given subject, the most common hashtags, the sentiment of the messages and so on. As said earlier, Twitter is a great source of data: you can collect a large amount of messages in a short time. A simple starting point is sketched below.
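
As one possible starting point (a minimal sketch, not part of the tutorial above), hashtags and mentions can be counted directly from the stored text column with regular expressions:

import re
from collections import Counter

import pandas as pd

# count hashtags and mentions in the stored tweets
tweets = pd.read_csv('tweets.csv')
hashtags = Counter()
mentions = Counter()
for text in tweets['text'].astype(str):
    hashtags.update(re.findall(r'#\w+', text.lower()))
    mentions.update(re.findall(r'@\w+', text.lower()))

print(hashtags.most_common(10))
print(mentions.most_common(10))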
