0 to 280: Getting Twitter timeline data for your awesome NLP project

Part 1: Five steps to transforming any Twitter user’s timeline into a csv or dictionary with Python

If you enjoy Natural Language Processing like I do, you might have some ideas for analyses you could do using Twitter. What are the word patterns and hashtags that the tweeter uses? How do these change over time? What sites do they link to? What words appear around others? Who or what do they talk about?

This is part 1 of a 2 part project to show you how to:

  • Obtain and clean the data from a Twitter user’s timeline
  • Export into a csv file and a Python dictionary
  • Apply analysis tools from NLTK to get summary data and explore hypotheses about the user’s tweets
  • Use Python libraries to visualize the data you collect

In this tutorial, we’ll cover the first two items from the list above in five steps. I’ll show you how to gather any user’s timeline using just Python and a few libraries. Then, we’ll get ready to do the fun part of the project, the analysis, by cleaning the data and putting it into your choice of a csv file or Python dictionary. Which is also sort of fun. 😉

Technical notes 💻

I’m using Python 3.6 in this project, as well as an Anaconda virtual environment. To complete the steps below, you should already have Python 3.6 and Anaconda installed and the virtual environment activated. You will also need a Twitter account and an internet connection to complete this workbook.

Since I code on a Mac, some of the terminology I use doesn’t translate directly to a Windows machine; for example, I say terminal instead of shell.

If you’re not too familiar with Jupyter Notebooks, they are awesome, and this quick start guide is well worth your time. Knowing the basics will help you through this notebook and be a useful tool when you do your own projects!

All of the code shown in this tutorial lives on Github. The Jupyter Notebook in the repo includes all of the text from this blog post, with some slight modifications.

Step 1: Install the required packages 👩🏻‍💻

From your virtual environment, you can run the requirements.txt file in the repo, rather than searching out and installing each library separately. Even if you don’t use the Jupyter Notebook where the code is located, I recommend cloning the repo and just using the parts you want.

To install packages from the requirements file, go to the folder you’ll be working from, and run the command pip install -r requirements.txt.

Once that’s installed, type jupyter notebook twitter_timeline_to_csv in your terminal, and the notebook will open in the browser.

Running the cell below in the notebook will import all of the needed packages into your script and make them ready for use.

Step 2: Get ready to talk to the Twitter API 📞

If you don’t already have a Twitter account, you can create one here.

If you do have a Twitter account, head over to the Twitter API documentation — to get the keys needed to access the API, you’ll need to create an application. Navigate to the apps page while signed into your account and click on the Create New App button on the top right.

Here, you’ll need to get four things that will identify you and give you access to the Twitter API:

From the top of the page:

  • Consumer key (API key)
  • Consumer secret key (API secret key)

From the bottom of the page:

  • Access token
  • Access secret token

If you are going to put your code online anywhere, you should set environment variables for all four of these values; pasting them directly in your code is likely to result in you accidentally committing them and pushing them into your online repo at some point. If someone gets a hold of these values they can access Twitter’s API as you, which you don’t ever want.

Instructions for setting environment variables in an Anaconda virtual environment are here.

The code below accesses the environment variables I’ve created in my virtual environment to hold my consumer keys and access tokens, and saves each one as a variable in my code. If you copy this workbook and run it locally on your computer, change the variable names in the quotes to the names you’ve given them in your virtual environment.
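Here's a minimal sketch of reading the four values from environment variables. The environment variable names (TWITTER_CONSUMER_KEY and so on), the dictionary keys, and the load_credentials helper are all placeholders invented for this illustration, so swap in whatever names you chose.

```python
import os

# Hypothetical environment variable names; use the names you actually set.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}


def load_credentials(env=None):
    """Return the four credentials as a dict, failing loudly if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}


# Usage: credentials = load_credentials()
```

Raising an error when a value is missing is a design choice in this sketch: it surfaces a typo in a variable name right away, rather than later as a confusing authentication failure.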

Now that that’s done, we’re going to use the python-twitter library to access the Twitter API. I find this a lot easier than writing the API calls from scratch.

We imported the python-twitter library as import twitter above. It shows up in the requirements.txt file as python-twitter==3.4.2, and you can read the documentation at https://python-twitter.readthedocs.io/en/latest/, which also happens to have a great step-by-step guide on getting started with the Twitter API!

The first thing we’ll do is create an object that will hold all of the credentials for accessing the API that we defined above.
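As a rough sketch, creating that object with python-twitter could look like the code below. The make_api helper, the credentials dictionary and its keys are hypothetical names used only for this example. Passing tweet_mode="extended" is an assumption too: it asks the API for untruncated text, which is what the full_text attribute used later relies on.

```python
def make_api(credentials, api_cls=None):
    """Build the API client from a dict holding the four credentials."""
    if api_cls is None:
        import twitter  # the python-twitter package

        api_cls = twitter.Api
    return api_cls(
        consumer_key=credentials["consumer_key"],
        consumer_secret=credentials["consumer_secret"],
        access_token_key=credentials["access_token"],
        access_token_secret=credentials["access_secret"],
        tweet_mode="extended",  # assumption: request full, untruncated tweet text
    )


# Usage (needs real credentials and an internet connection):
# t = make_api(credentials)
```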

That’s it! You’re ready to talk to the Twitter API!

Step 3: Get some Tweets! 📦

Now that we can access the API, we can get some data.

If you haven’t already, choose a Twitter account you want to analyze.

Next, you’ll need to create a variable to hold that account’s Twitter handle. Input this in the cell below without the ‘@’ (e.g. use NASA, not @NASA). If you’re starting with the URL of the account, take only the handle after the last ‘/’ (e.g. from https://twitter.com/NASA, use ‘NASA’).

Now that we have that, let’s get some data! We’ll call the Twitter API with the GetUserTimeline method on our t object that we created in the last step.

Notice that we are passing in the screen_name variable and setting the count parameter to 200. You must pass the screen_name to make the API call, and if you omit the count parameter, you'll get the latest 20 tweets back by default. You can't get more than 200 tweets back at a time.

This first_200 variable holds a list of 200 tweet objects representing the last 200 tweets posted to the NASA Twitter account.
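A minimal sketch of that first call is below. GetUserTimeline with screen_name and count comes from python-twitter; the get_first_batch wrapper is a made-up name for illustration, and t is assumed to be the API object from step 2.

```python
def get_first_batch(api, screen_name, count=200):
    """Fetch the most recent tweets (up to `count`) for one account."""
    return api.GetUserTimeline(screen_name=screen_name, count=count)


# Usage:
# screen_name = "NASA"
# first_200 = get_first_batch(t, screen_name)
```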

But we want more data!

We already know we can only get 200 tweets at a time. The other constraint we’re working with, at the time of this writing, is that the Twitter API is rate limited, meaning Twitter puts restrictions on how much data you can take at a time. For my app, I’m rate limited to 900 requests — or batches of 200 — every 15 minutes. So, what I want to do is write a function that will make those 900 requests, each time getting the 200 tweets posted before the last batch.

To do this, I wrote the get_tweets function, below. Here’s what it does:

  1. Takes the first_200 and screen_name variables as arguments, as well as something called last_id. As you'll see when we call the function, this is the ID number of the last/oldest tweet in the first_200 list.
  2. Makes a new list and adds the first_200 tweets to it
  3. For 900 iterations (because of my rate limit):
       • Gets 200 of the user’s tweets, starting with a max_id one smaller than last_id in the previous list of tweets; the new variable this data is stored in gets overwritten each time
       • Adds the list of tweet objects obtained to the all_tweets list
       • If there’s anything in the list (i.e. if we got any data back), grabs a new last_id value to feed into the API call in the next iteration
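The steps above can be sketched roughly as follows. This is a reconstruction from the description, not the original function, so line positions will differ. The api and max_requests parameters are additions for this example, and the code assumes each tweet object exposes its ID as .id. It also departs from the description in one way: it stops as soon as a request comes back empty instead of always running all 900 iterations.

```python
def get_tweets(api, first_200, screen_name, last_id, max_requests=900):
    """Page backwards through a timeline, up to 200 tweets per request."""
    all_tweets = list(first_200)  # copy so the caller's list is left untouched
    for _ in range(max_requests):
        # max_id is inclusive, so subtract 1 to skip the tweet we already have
        new = api.GetUserTimeline(
            screen_name=screen_name, max_id=last_id - 1, count=200
        )
        if not new:
            break  # nothing older came back, so stop spending requests
        all_tweets.extend(new)
        last_id = new[-1].id  # oldest tweet in this batch
    return all_tweets


# Usage:
# all_tweets = get_tweets(t, first_200, screen_name, first_200[-1].id)
```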

❓Why is max_id set to last_id MINUS 1?

In the 4th line of the get_tweets function, the max_id is set to last_id-1. Why not just set it to last_id?

Each tweet has a unique whole number, stored as the ID attribute; in the sample printed above it looks like this: "id": 1013529203990581249. If we set the max_id to that number, the first tweet we get in the next 200 tweets will be that exact tweet, meaning it will appear twice in the dataset. By subtracting one from this ID, we are saying the max ID can be anything less than the last tweet in our previous list of 200, which is exactly what we want.
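To see the duplicate problem without calling the API, here's a toy illustration. The page function and the small ID numbers are invented for this example; the function only mimics the API's behavior of returning tweets whose ID is less than or equal to max_id.

```python
# Toy timeline: tweet IDs from newest to oldest.
ids = [105, 104, 103, 102, 101]


def page(ids, max_id, count):
    """Mimic the API: return up to `count` IDs that are <= max_id."""
    return [i for i in ids if i <= max_id][:count]


first = page(ids, max_id=105, count=2)      # [105, 104]
last_id = first[-1]                         # 104

with_dupe = page(ids, max_id=last_id, count=2)       # [104, 103]: 104 comes back again
no_dupe = page(ids, max_id=last_id - 1, count=2)     # [103, 102]: picks up where we left off
```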

Let’s call the function. We’re passing in first_200, the list of the 200 most recent tweet objects, our screen_name variable, and the last ID number in that first_200 list. This mega list will be called all_tweets.

When I ran this function on July 1, I got the following results — yours will vary slightly.

There are 3245 tweets stored in a list as the all_tweets variable.
The most recent tweet in our collection was sent Sun Jul 01 23:04:00 +0000 2018 and the oldest tweet was sent Wed Oct 11 15:31:37 +0000 2017.
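A summary like the one above can be produced with a few lines of Python. The summarize helper is a hypothetical name, and the sketch assumes the list is ordered newest first and that each tweet object has a created_at attribute.

```python
def summarize(all_tweets):
    """Describe how many tweets were collected and the date range they cover."""
    newest, oldest = all_tweets[0], all_tweets[-1]
    return (
        f"There are {len(all_tweets)} tweets stored in a list as the all_tweets variable.\n"
        f"The most recent tweet in our collection was sent {newest.created_at} "
        f"and the oldest tweet was sent {oldest.created_at}."
    )


# Usage: print(summarize(all_tweets))
```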

Step 4: Clean the data! 🛁

In this step, we’re going to condense and clean our data to get it into a more analysis-friendly format.

First, the condensing part. We need to decide what’s important for the analysis.

Let’s take another look at the data that comes back with each tweet from the Twitter API. There’s a lot here, much of it not specific to the tweet itself. For example, since we’re grabbing all of the tweets from a single person’s timeline, the whole "user" attribute isn't very useful to us for this analysis, as it's repeated every time.

In this tutorial we’ll focus on keeping and cleaning the following attributes, but you can choose your own and modify the code:

  • id: the unique identifier for the tweet
  • created_at: when the tweet was sent
  • full_text: the text included in the tweet
  • hashtags: the hashtags (e.g. "#space" appears as "space") included in the tweet
  • urls: the expanded version of urls included in the tweet (e.g. "https://t.co/sYCFHKxzBf" is the shortened URL in the tweet but we'll get the full url of what it points to, https://twitter.com/NASA/status/1013529203990581249/video/1)
  • favorite_count: number of times the tweet was favorited
  • retweet_count: number of times the tweet was retweeted
  • source: from what platform/app the tweet was posted

One problem is that the data we get back isn’t totally clean, so we need to process it a little bit first. Here are some examples from the data above that could create problems for us later because they include formatting we don’t need. In every case except the created_at attribute, we want a string, or list of strings, with just the important parts; we don't need <a href="https://www.sprinklr.com" rel="nofollow">Sprinklr</a>, just Sprinklr.

For the created_at attribute, when using it in Python, we'll want to convert it to a datetime object.
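One way to do that conversion is with datetime.strptime from the standard library. The format string below matches timestamps shaped like Sun Jul 01 23:04:00 +0000 2018; the clean_created_at name is made up for this sketch.

```python
from datetime import datetime

# Matches strings like "Sun Jul 01 23:04:00 +0000 2018"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def clean_created_at(created_at):
    """Turn Twitter's created_at string into a timezone-aware datetime."""
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT)


# Usage: clean_created_at("Sun Jul 01 23:04:00 +0000 2018")
```

Note that %a and %b parse English day and month abbreviations, so this assumes Python is running with an English (or default C) locale.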

Let’s take a look at some fields that need cleaning.

To clean these fields up, I wrote the following helper functions — we’ll see them used in a moment.
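For a sense of what such helpers can look like, here's a minimal sketch of three of them. These are illustrations rather than the original functions: the names are invented, and the code assumes python-twitter's objects, where each hashtag has a .text attribute and each URL has an .expanded_url attribute.

```python
import re


def clean_source(source):
    """Strip the HTML anchor tag, keeping only the app name (e.g. 'Sprinklr')."""
    return re.sub(r"<[^>]+>", "", source or "").strip()


def clean_hashtags(hashtags):
    """Return the hashtag text only, e.g. ['space'] for '#space'."""
    return [tag.text for tag in (hashtags or [])]


def clean_urls(urls):
    """Return the expanded URLs rather than the shortened t.co links."""
    return [url.expanded_url for url in (urls or [])]
```

Each helper treats a missing value (None) as empty, so a tweet with no hashtags or links yields an empty list instead of an error.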

Step 5: Reformat the data as a csv or Python dictionary 🗃

This is the last step of this tutorial, where I’ll show you how to get this data, which we now know how to retrieve and clean, into a format that we need to start the analysis.

I’m showing both how to make this into a csv and a Python dictionary because:

  • A lot of people like seeing the data all at once as a csv — there are neat ways to print it in Python but this is easier to absorb for a lot of people
  • csv is a great format to use if you need to share the data with people who don’t program, as they can open it up in any notepad or spreadsheet program, no coding required
  • If you want to continue the analysis with Python but in a different project file, you can always use the dictionary or read in a csv in a couple of lines of code

So let’s start with the csv file! The write_to_csv function will create the file and pull the data we want out of the all_tweets list of tweet objects we made in step 3.
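Here's a rough, self-contained sketch of a function along those lines, using the csv module from the standard library. It's an approximation, not the original write_to_csv: the filename parameter, the tweet_to_row helper, and the choice to join hashtags and URLs with commas are all assumptions made for this example, and the cleaning is done inline.

```python
import csv
import re

FIELDS = ["id", "created_at", "full_text", "hashtags", "urls",
          "favorite_count", "retweet_count", "source"]


def tweet_to_row(tweet):
    """Pull the fields we care about out of one tweet object."""
    return {
        "id": tweet.id,
        "created_at": tweet.created_at,
        "full_text": tweet.full_text,
        "hashtags": ", ".join(tag.text for tag in (tweet.hashtags or [])),
        "urls": ", ".join(url.expanded_url for url in (tweet.urls or [])),
        "favorite_count": tweet.favorite_count,
        "retweet_count": tweet.retweet_count,
        "source": re.sub(r"<[^>]+>", "", tweet.source or ""),  # drop the <a> tag
    }


def write_to_csv(tweets, filename):
    """Write one header row, then one row per tweet."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for tweet in tweets:
            writer.writerow(tweet_to_row(tweet))


# Usage: write_to_csv(all_tweets, screen_name + "_tweets.csv")
```

Opening the file with newline="" lets the csv module handle line endings itself, which matters because tweet text can contain line breaks.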

Now we can call the function, passing in the all_tweets list. I’ve set this up so that the csv’s file name will be the screen_name variable defined in step 3 with “_tweets” after it (e.g. NASA_tweets.csv), but you can change it to whatever you like.

Now you should see your csv file in the same directory as this notebook! Check that it has all of the right fields and the right number of rows; there should be one for every tweet in the list plus a header row.

You can see the file, current as of July 1, 2018, on Google Drive here, as well as in the Github repo.

If you’re more interested in continuing to analyze the file in Python, I’d recommend putting the data in a dictionary. Parts of the function I’ve written to do this (create_dict, below) are pretty similar to write_to_csv, although notice that we need to put each tweet’s data under a key (here I’ve chosen the string version of the ID field). Notice I’m running all four of the cleaning functions from step 4 on the data as I add it.
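As a sketch of the idea, a dictionary-building function could look like the one below. This is not the original create_dict: the cleaning is done inline so the example stands alone, and the attribute names assume python-twitter's tweet objects.

```python
import re
from datetime import datetime


def create_dict(tweets):
    """Map each tweet's ID (as a string) to a dict of its cleaned fields."""
    tweet_dict = {}
    for tweet in tweets:
        tweet_dict[str(tweet.id)] = {
            "created_at": datetime.strptime(
                tweet.created_at, "%a %b %d %H:%M:%S %z %Y"
            ),
            "full_text": tweet.full_text,
            "hashtags": [tag.text for tag in (tweet.hashtags or [])],
            "urls": [url.expanded_url for url in (tweet.urls or [])],
            "favorite_count": tweet.favorite_count,
            "retweet_count": tweet.retweet_count,
            "source": re.sub(r"<[^>]+>", "", tweet.source or ""),
        }
    return tweet_dict


# Usage:
# tweet_dict = create_dict(all_tweets)
# tweet_dict["1013529203990581249"]  # look up one tweet by its ID string
```

Unlike the csv version, hashtags and urls stay as Python lists and created_at becomes a datetime object, since nothing needs to be flattened into text here.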

Let’s run it, passing in the all_tweets list.

We can take a look at a single entry by using a string containing the ID of a tweet as the key.

That’s it for now! Watch for part two, here on agatha.codes and on Github very soon.

As always, please add feedback and questions in the comments section.🎉