A treasure map to Twitter Data via Tweepy

Pınar Ece Aktan · Published in Analytics Vidhya · Feb 19, 2020

Hi there! Twitter data is a truly scrumptious source for anyone who wants to build something great, perform data analysis, test SOTA deep learning models, or learn data science. For those reasons, I’ve been using the Twitter API for quite some time. In this article, I want to provide a comprehensive guide for people who are just starting out with Twitter data. You may enjoy it even if you’re already familiar with the Twitter API, because I will also cover the recent policies for applying for a Twitter developer account.

This article walks the reader through:

  1. Creating a developer account on Twitter.
  2. Identifying a real-life problem and finding ways to solve it with Twitter data.
  3. Grasping the logic behind streaming and collecting past data.
  4. Fetching data from the user timeline, with code samples.

Finding the motivation

We all use Twitter all day, every day, and as technology people we can’t stop ourselves from thinking, “what would happen if I could reach even the slightest piece of it?” The motivation here is to analyze something from daily life and have the guts to apply your technical skills to a real-life problem. So let’s motivate ourselves to solve one. The problem explained in this article is fetching all tweets by The New York Times in order to analyze the celebrities mentioned in news headlines. You may, of course, come up with a more interesting problem and develop a brighter motivation to solve it.


Creating the Twitter API dev account

Twitter offers several APIs, each constructed for specific use cases. Briefly, the standard API is free and good for beginners and data enthusiasts who want to test Twitter. The premium API is one step further, for people who want to experiment more: it is enhanced with options to reach 30 days of data or the full archive. Details of the premium API can be found here. The Enterprise and Ads APIs are for businesses and establishments that depend heavily on Twitter data. As a developer, your access will fall between the standard API and the premium API.

Recipe to set up a developer account.

  1. Log in with your Twitter account. Your account must be affiliated with a valid phone number.
  2. Go to Twitter’s developer application page.
  3. Click the Apply for a developer account section.
  4. Choose your primary reason on the next page. Twitter wants to know why you want to reach its data, so be open and specific here. In the next section, Twitter will ask more about your intention. For our fun task here I chose the Hobbyist option; let’s say we will make a bot and explore the Twitter API.
  5. On the next page you will see a summary of your account. Twitter just wants to make sure that you are an actual human being, and it may ask for details such as phone number, country, etc. if you haven’t provided them yet.
  6. The next page is the intended-use page. You may think of this phase as a visa application (Turkish nationals can relate), where you need to convince the authorities that you are a responsible, decent person and that your use of developer access will not violate the Twitter terms. In the first text box, explain how you plan to use Twitter data or the API, and try to be explicit. You may tell them that you will perform a specific analysis on a specific subject, and that for this you need access to the search API with data no more than one year old in order to calculate the popularity of the topic. Remember this is just an example; do not copy it, put it in your own words.
  7. In the next text box, where they ask you to describe how you will analyze Twitter data (if that is your plan), again try to be concrete and open. You may tell them that for a specific political subject you will use the search API and report the usage of exact words in tweets, broken down by hours or days, to see whether they relate to a specific political incident, speech, organization, etc. For instance, you may want to analyze the popularity of the term “climate change” right after Greta’s speech in the European Parliament. Or the term could be “How dare you”.

8. The next text box asks about the scope of your API usage. Write down the appropriate option, such as “yes, I will just use the Tweet functionality”.

9. Some individuals or organizations display the tweets of their official accounts on their websites. If you plan to add such a feature to your website, elaborate on it in the “Do you plan to display or aggregate data about Twitter content outside of Twitter” section.

10. In the next section, you will be asked whether you will share your analysis, or anything else derived from Twitter data, with a government entity. Answer this question openly.

11. So much for the application. In the next step, you will review your application and be expected to read and accept the terms of use. Then you can send your application.

If your answers were clear enough to evoke the impression that you actually have an idea of what you’re going to do with Twitter data, or at least that you are willing to learn, Twitter normally approves developer accounts within a day. If you follow the hints in this article, I can’t imagine a reason why you wouldn’t be approved.

When you are approved, you will be notified by email and you will be able to create an app. You are expected to create an app for each specific analysis or task you perform via Twitter. The app will have access tokens that you will use to authenticate against the Twitter API. In the app creation process you will again be asked about your intended usage, so I suggest you copy the answers from your application form so you don’t have to explain yourself all over again.


The logic behind streaming and collecting past data

I have tried to give a glimpse of what you can achieve using Twitter data; however, this article only mentions a few possibilities. If you feel you could use more imagination to motivate yourself, I highly recommend visiting the Twitter doc for use cases. There are lots of utilities defined in the API: you can follow or search users, search tweets containing a specific word or phrase, get information about users such as their followers, manage an account, mute or block users, curate a collection of tweets, get tweets from a user timeline, and so on. You can check all the functionality of the Twitter API here. I actually recommend strolling around it while you are waiting for your API approval. A tiny taste of the search utility is sketched below.
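For instance, here is a minimal sketch of searching recent tweets with tweepy, assuming an authenticated api object as constructed later in this article; the query string is just an example:

# Standard search: recent tweets (roughly the last week) containing a phrase
for status in api.search(q="climate change", count=10):
    print(status.created_at, status.text)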

Whatever you expect from Twitter data, you have two options when dealing with a bunch of tweets: you either go for a timeline of tweets, meaning you may need some tweets from the past, as in our example with Greta’s speech above, or you measure something in real time, such as the public influence of a certain brand, as in the sentiment analysis mentioned in the use cases.

Streaming is a bit easier than searching for past tweets, though of course there are some limitations depending on your API tier. For instance, you may lose some data while consuming real-time tweets with the standard API. To stream Twitter data you just connect to a streaming endpoint; you can think of the entire process as a constant download from a window that rarely closes. Of course, sometimes the streaming API stops responding, which can be either a malfunction or a consequence of exceeding the rate limits. For more details, you can visit here. A minimal streaming sketch follows.
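To make the idea concrete, here is a small sketch using tweepy 3.x’s StreamListener, assuming the authenticated auth handler built later in this article; the track term is just an example:

import tweepy

class MyListener(tweepy.StreamListener):
    # Called for each tweet that matches the filter
    def on_status(self, status):
        print(status.text)

    # Returning False on HTTP 420 disconnects instead of retrying
    def on_error(self, status_code):
        if status_code == 420:
            return False

stream = tweepy.Stream(auth=auth, listener=MyListener())
stream.filter(track=["climate change"])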

Collecting past data, on the other hand, requires you to run a historical PowerTrack job or a full-archive search. Since PowerTrack and archive search are offered in the enterprise APIs, I will leave this link for further reading, because in this article we’ll take the cheapest route: fetching historical tweets from the user timeline.


Fetching Tweets from User Timeline

So we are finally here to talk about the article’s main attraction. Return to our initial problem: “Estimating the popularity of celebrities according to their mentions in New York Times articles.” Let’s imagine we will fetch the timeline tweets of The New York Times’ official Twitter account. To do so, what we obviously need is the screen name (how an account appears on Twitter) or the Twitter id of the account, and a methodology to fetch the tweets belonging to this user. Let’s brainstorm about the methodology. We can all tell that we need some pagination to fetch the tweets, but how, considering the streaming nature of Twitter, where new items are constantly added to the page? Also note that the page structure on Twitter is like a stack, where new tweets are added to the top while old ones reside at the bottom, i.e. last in, first out. So fetching tweets, by all means, is reading from top to bottom. Below is a picture taken from the official Twitter doc to depict the entire process.

Taken from Twitter API doc

Each time we fetch from a user timeline, there is a certain tweet we shouldn’t go beyond. This is because newer tweets are added to the top of the stack after we start the process, and our process walks down, not up. Later in this section I’ll also explain how to fetch those newer tweets in the upper rows. For now, let’s keep it simple: our job is just to fetch from the user timeline. One ancient way to implement this is an endless while loop that exits only when there are no items left at the bottom or the window size has been pushed to its limit. The window size here is 3,200 tweets per user, and retweets count toward it even if we set include_rts to False. For more information about the limits and usage, visit here.

Data from a window is retrieved in batches, and the maximum batch size for this method is 200. So the idea is to fetch tweets from top to bottom, a maximum of 200 tweets per call, until you reach 3,200 tweets (the window size). Note that by the end of the process you may have fewer than 3,200 tweets if you set the include_rts param to False, but the API will throw you out anyway. Also, in each batch you have to keep the minimum id, because the minimum id of the previous batch becomes the maximum id of the next batch. Here:

Depicting the window and batches, taken from the twitter official doc

So if the above explanation was clear, we can come up with pseudo-code for our method:

tweet_ids_arr = list()
keep_going = True
min_id_of_previous_batch = None  # no upper bound on the first call

while keep_going:
    # Fetch one batch of up to 200 tweets, starting just below the
    # oldest tweet we have seen so far
    tweets = fetch_user_time_line(max_id=min_id_of_previous_batch,
                                  batch_size=200)
    tweet_ids_arr.extend([tweet.id for tweet in tweets])
    if tweets:
        # max_id is inclusive, so step one below the minimum id seen
        min_id_of_previous_batch = min(tweet_ids_arr) - 1
    else:
        keep_going = False

Of course, there must be some additions, like exception handling and a timer to avoid rate limit errors. We’ll get there soon. Now it’s time to talk about fetching the tweets in the upper rows.

So let’s say you fetched the tweets on February 14; the maximum date in your store is February 14, so you need to fetch tweets newer than February 14, and you have to do it automatically every time you send a request to the Twitter API. So we need to combine the usage of max_id with since_id. Let’s depict the logic in the pseudo-code below:

tweet_ids_arr = list()
keep_going = True
min_id_of_previous_batch = None  # no upper bound on the first call
# The max id in the tweets stored by a previous fetch-and-save run
max_id_in_db = max_id_in_stored_tweets

while keep_going:
    # Only tweets newer than max_id_in_db and older than the
    # previous batch's minimum are returned
    tweets = fetch_user_time_line(max_id=min_id_of_previous_batch,
                                  since_id=max_id_in_db,
                                  batch_size=200)
    tweet_ids_arr.extend([tweet.id for tweet in tweets])
    if tweets:
        min_id_of_previous_batch = min(tweet_ids_arr) - 1
    else:
        keep_going = False

The picture below is taken from the official Twitter doc and hopefully illustrates the reason for using since_id and max_id together.

Taken from Twitter API doc

In the picture, you have processed tweets 1–10, and while you were processing them, new tweets 11–18 were added to the stack; so passing the id of tweet 10 as since_id and the id of tweet 18 as max_id is the best practice.


Embracing the cursor utility

Though I explained the whole fetching-a-user-timeline process with explicit parameters and infinite while loops for the sake of clarity, there is actually a more powerful programmatic way of achieving this: Cursors. Thanks to tweepy cursors, you don’t have to write boilerplate code that manages pagination with loops and sends the max_id and since_id parameters for each batch; the pagination is performed by the cursor. For more details please visit here.

This is how cursors work:

import tweepy

timeline_ids = list()
for status in tweepy.Cursor(api.user_timeline, user_id=user_id,
                            since_id=max_id_in_town).items():
    timeline_ids.append(status.id)

See, the cursor way is much simpler and takes far fewer lines of code.
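If you prefer to work batch by batch rather than tweet by tweet, Cursor also supports page-wise iteration; a small sketch, assuming the same api and user_id as above:

# Each page is one batch of up to 200 statuses; count is passed
# through to the underlying user_timeline call
for page in tweepy.Cursor(api.user_timeline, user_id=user_id,
                          count=200).pages():
    timeline_ids.extend(status.id for status in page)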

Talking code

  1. API authentication and enabling access tokens

Each time you create an app in your Twitter dev account, you are given user access tokens: consumer_key, consumer_secret, access_token, and access_token_secret. These ensure your secure access to the Twitter API. To reach your credentials, go to your app’s details in the Twitter developer dashboard, click your app’s name, and switch to the Keys and Tokens tab. You will use these credentials during authentication, like so:

import tweepy

# Authenticate with Twitter
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

But since we shouldn’t place the credentials straight into the code, I suggest creating a credentials.json file holding a dictionary with them. Here:

{
    "consumer_key": "#",
    "consumer_secret": "#",
    "access_token": "#-#",
    "access_token_secret": "#"
}

After you have created the credentials file, you can load it and use the credentials in your code without exposing them:

import json
import os

# Define the path to the credentials
path_to_credentials = os.path.join(os.getcwd(), "credentials.json")

# Load the credentials
with open(path_to_credentials) as file:
    credentials = json.load(file)

Now, authenticate and create the API object:

import tweepy

# Authenticate with Tweepy
auth = tweepy.OAuthHandler(
    credentials["consumer_key"], credentials["consumer_secret"])
auth.set_access_token(
    credentials["access_token"], credentials["access_token_secret"])
api = tweepy.API(auth)
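If you want to make sure the tokens actually work, tweepy offers verify_credentials; a small optional check:

# Sanity check: returns your own User object if the credentials
# are valid, False otherwise
me = api.verify_credentials()
if me:
    print("Authenticated as:", me.screen_name)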

2. Getting id of a user using the Twitter screen name

In order to fetch tweets from a user timeline, we need to identify the user. Although you can use screen_name as the identifier, using the Twitter id is more explicit. For more details please refer to the tweepy doc.

# Fetch the Twitter user id of an account whose screen name you know
user_name = "nytimes"
user_object = api.get_user(screen_name=user_name)
print("User screen name is: ", user_object.screen_name)
print("User's id in the Twitter db is: ", user_object.id)
print("User's name shown on Twitter is: ", user_object.name)
print("User was created at: ", user_object.created_at)

user_id = user_object.id

3. Getting the timeline ids with the cursor

The non-commercial Twitter APIs enforce rate limits per 15-minute window: you can send requests for a while, and once you hit the limit they expect you to show some manners and let it be for the next 15 minutes. I’m having fun with the manners part, but the 15 minutes part holds. So I added some lines to serve as a stopwatch, and some more lines to put the code to sleep for 15 minutes. Below you’ll also see try-except blocks that keep the process from quitting on a rate limit error, just in case. You may elaborate on or alter the entire process.

import time
from datetime import datetime

import numpy as np
import tweepy

# Fetch tweet ids from a timeline.
# Use this block if there are no saved tweets from this account
# or you don't plan to use the since_id functionality.
counter = 0
timeline_ids = list()
start_time = datetime.now()

try:
    for status in tweepy.Cursor(api.user_timeline, user_id=user_id).items():
        # Process the status here
        timeline_ids.append(status.id)
        end_time = datetime.now()
        if np.ceil((end_time - start_time).seconds / 60) >= 12:
            print("Worked for 12 minutes, waiting for 15 minutes now")
            print(datetime.now())
            time.sleep(60 * 15)
            start_time = datetime.now()
        counter += 1
except tweepy.RateLimitError:
    print("Rate limit exceeded, waiting for 15 minutes")
    print(datetime.now())
    print("You may want to save the timeline ids here, so you don't lose them during execution")
    time.sleep(60 * 15)

print("You may want to save the timeline ids here, so you don't lose them during execution")

Use this block if you have the ids from a previous run and can utilize since_id.

import time
from datetime import datetime

import numpy as np
import tweepy

# Get the max id that already resides in the db
max_id_in_town = some_max_id
counter = 0
timeline_ids = list()
start_time = datetime.now()

try:
    for status in tweepy.Cursor(api.user_timeline, user_id=user_id,
                                since_id=max_id_in_town).items():
        # Process the status here
        timeline_ids.append(status.id)
        end_time = datetime.now()
        if np.ceil((end_time - start_time).seconds / 60) >= 12:
            print("Worked for 12 mins, waiting for 15 mins now")
            print(datetime.now())
            print("You may want to save the timeline ids here, so you don't lose them during execution")
            time.sleep(60 * 15)
            start_time = datetime.now()
        counter += 1
except tweepy.RateLimitError:
    print("Rate limit exceeded, waiting for 15 minutes")
    print(datetime.now())
    print("You may want to save the timeline ids here, so you don't lose them during execution")
    time.sleep(60 * 15)

print("You may want to save the timeline ids here, so you don't lose them during execution")

4. Getting the tweets affiliated with the ids

Now we have all the tweet ids, so we can finally fetch the extended statuses. Here:

# Set looping params
start_time = datetime.now()
raw_tweets = list()

for id_ in timeline_ids:
    try:
        res = api.get_status(id=id_, tweet_mode="extended")
        raw_tweets.append({"tweet_id": id_,
                           "created_at": res.created_at,
                           "text": res.full_text})
        end_time = datetime.now()

        if np.ceil((end_time - start_time).seconds / 60) >= 12:
            print("Worked for 12 mins, waiting for 15 mins now")
            print(datetime.now())
            print("Save tweets somewhere")
            time.sleep(60 * 15)
            start_time = datetime.now()
        del res

    except tweepy.RateLimitError:
        print("Rate limit exceeded, waiting for 15 minutes")
        print(datetime.now())
        print("Save tweets somewhere")
        time.sleep(60 * 15)
        start_time = datetime.now()

print("Save tweets somewhere")

This is the end of the article; comment if you have any suggestions, and let me know if you want to learn more about how to store and analyze these tweets.
