Retrieving Twitter Data

Sofia Rubio-Topete
6 min read · Oct 16, 2019


For this module, I was really interested in exploring different methods of retrieving Twitter data from the web and structuring it so that it could then be analyzed. I decided to explore Elon Musk's account because it is active, has many followers, and gets lots of engagement. My goals for this module were to understand the capabilities of the Twitter library and to explore the Twitter API, pulling data from the web to create my own dataset to aggregate and analyze. How can I merge different columns using the Twitter API to create a dataset? What is Elon Musk's most popular tweet? How has his Twitter usage changed over time?

First, I created a developer account on Twitter. Once it was approved, I created a new app in my Twitter Developer account, which contained the credentials needed to authenticate the Twitter library in Jupyter notebooks.

To authenticate the credentials from the app I created, I defined them at the top of my code.

api_key = ''
api_secret = ''
access_token = ''
access_secret = ''

The following block of code actually performs the authentication, which allows you to access the Twitter API.
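A minimal sketch of that authentication step, assuming the tweepy library (a common Python wrapper for the Twitter API) and the credential variables defined above; running it requires real credentials from the developer app:

```python
import tweepy

# Credentials from the Twitter Developer app (left blank here on purpose)
api_key = ''
api_secret = ''
access_token = ''
access_secret = ''

# OAuth 1a authentication with the app credentials
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_secret)

# wait_on_rate_limit pauses automatically when Twitter throttles requests
api = tweepy.API(auth, wait_on_rate_limit=True)
```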

Next, I wanted to get the tweets from Elon Musk's Twitter account. I was able to do so by calling .user_timeline and specifying a count to pull only a few tweets, to see what we were working with.
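A sketch of that call, assuming the authenticated tweepy api object from the step above (the count of 5 here is illustrative):

```python
# Pull the 5 most recent tweets from @elonmusk; api is the authenticated
# tweepy.API object created during the authentication step
tweets = api.user_timeline(screen_name='elonmusk', count=5)

for tweet in tweets:
    print(tweet.text)
```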

The output printed as one unformatted block, so I used the json library to put it into a more readable format. Calling ._json on the first tweet (index [0]) returns its JSON representation, and the dumps function from the json library converts that JSON into a string that can be printed. Adding an indent argument adds indentation to make it easier to read.
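The json.dumps pattern looks like this; the dictionary below is a trimmed stand-in for a real tweet's ._json payload, which carries many more fields:

```python
import json

# A small stand-in for tweets[0]._json (field values here are from the
# article's results; a real payload has dozens more keys)
sample_tweet = {
    "id": 0,  # placeholder id
    "text": "Haha https://t.co/hY3yDsrVqk",
    "favorite_count": 316220,
}

# json.dumps turns the dict into a printable string; indent adds the
# line breaks and indentation that make it readable
print(json.dumps(sample_tweet, indent=4))
```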

We can use the authenticated API to create an “extractor” object that pulls tweets. Extracting the tweets lets us build an organized, structured dataset that we can then analyze more closely. This method also allows us to format the output.

To store these tweets into a dataset, I used pandas to create a data frame containing the tweets.
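A sketch of that step; the placeholder texts below stand in for the tweet.text values pulled from the timeline:

```python
import pandas as pd

# Placeholder strings standing in for the extracted tweet texts
texts = ["first tweet text", "second tweet text", "third tweet text"]

# One column of tweet text; more columns get added below
data = pd.DataFrame(data=texts, columns=['Tweets'])
print(data.head())
```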

Given that the Twitter API also contains so much additional information on these tweets, I decided to add some of these additional values to the dataset. To find out the internal attributes of a single tweet object, I wrote print(dir(tweets[0])), which returned the different attributes available. I took some of the relevant data points, such as likes, date, and ID, and added them to the data frame.
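A sketch of those extra columns, using hypothetical placeholder values in place of the attributes a real tweet object exposes (tweet.favorite_count, tweet.created_at, tweet.id):

```python
import pandas as pd
from datetime import datetime

# Placeholder rows; in the notebook each value comes from a tweet object
data = pd.DataFrame({'Tweets': ['first tweet text', 'second tweet text']})
data['Likes'] = [452, 118]                                      # tweet.favorite_count
data['Date'] = [datetime(2019, 10, 1), datetime(2019, 10, 2)]   # tweet.created_at
data['ID'] = [1184000000000000001, 1184000000000000002]         # tweet.id
print(data.head())
```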

Now that the data frame was organized and had some important data points, I was able to use numpy to compute some summary statistics that I could visualize later on: for instance, the average length of the tweets, the number of likes, etc.
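The kind of numpy summary involved can be sketched like this, on placeholder values standing in for the data frame's columns:

```python
import numpy as np

# Placeholder values; in the notebook these come from the data frame columns
tweet_lengths = np.array([28, 140, 97])
likes = np.array([316220, 1500, 42000])

mean_length = np.mean(tweet_lengths)   # average tweet length in characters
most_liked = np.argmax(likes)          # index of the tweet with the most likes

print(f"The average tweet length is {mean_length:.1f} characters.")
print(f"Tweet {most_liked} has the most likes: {likes.max()}")
```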

The tweet with the most likes is:
Haha https://t.co/hY3yDsrVqk
Number of likes: 316220
28 characters.

The tweet with the most retweets is:
Haha https://t.co/hY3yDsrVqk
Number of retweets: 63937
28 characters.

Now that I was able to get the tweets from Elon Musk's account, I was curious as to what other people were tweeting at him. To search Twitter for tweets, I began by finding recent tweets that use the #elonmusk hashtag. The .Cursor method returns an object containing the tweets that match a query. To create this query, I defined the search term and the start date, and I restricted the number of tweets returned by passing a number to the .items() method. .Cursor() returns an object that you can iterate or loop over to access the collected data. Each item in the iterator has various attributes that you can access to get information about each tweet, including:

  1. the text of the tweet
  2. who sent the tweet
  3. the date the tweet was sent

The code below loops through the object and prints the text associated with each tweet.
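A sketch of that loop, assuming tweepy's Cursor, the authenticated api object from earlier, and an illustrative start date; it needs live credentials to run:

```python
import tweepy  # api is the authenticated tweepy.API object from earlier

# Page through recent tweets matching the hashtag, capped at 30 items
for tweet in tweepy.Cursor(api.search,
                           q='#elonmusk',
                           since='2019-10-01').items(30):
    print(tweet.text)
```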

The above code uses a standard for loop. Below, I included the code for using a Python list comprehension. A list comprehension provides an efficient way to collect object elements contained within an iterator as a list.
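The same query as a list comprehension, under the same assumptions (tweepy, an authenticated api object, an illustrative start date):

```python
import tweepy  # api is the authenticated tweepy.API object from earlier

# Collect the text of each matching tweet into a plain Python list
all_tweets = [tweet.text
              for tweet in tweepy.Cursor(api.search,
                                         q='#elonmusk',
                                         since='2019-10-01').items(30)]
```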

One thing I wanted to remove from my data was retweets. I was more interested in seeing tweets from or about Elon Musk than retweets. The Twitter API documentation has lots of information on how to customize queries; I was able to ignore all retweets by adding -filter:retweets to my query, as in the code below.
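A sketch of the filtered query, again assuming tweepy and the authenticated api object:

```python
import tweepy  # api is the authenticated tweepy.API object from earlier

# Appending -filter:retweets to the query drops retweets from the results
new_search = '#elonmusk -filter:retweets'

tweets = tweepy.Cursor(api.search,
                       q=new_search,
                       since='2019-10-01').items(30)
```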

After getting the tweets that included #elonmusk, I wanted more information on the users behind them. Two questions that came up while going through this data were: where are most users tweeting from, and are there any potential megafans? I created a new query that pulled the screen name and user location for the tweets containing #elonmusk. Again, I used .items(30) to have only 30 items returned.
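That query can be sketched as follows, assuming tweepy and the authenticated api object; each tweet's author is reached through its .user attribute:

```python
import tweepy  # api is the authenticated tweepy.API object from earlier

# For each matching tweet, keep the author's screen name and profile location
users_locs = [[tweet.user.screen_name, tweet.user.location]
              for tweet in tweepy.Cursor(api.search,
                                         q='#elonmusk -filter:retweets',
                                         since='2019-10-01').items(30)]
```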

To make a data set out of this data, I used the DataFrame method. Putting this data into a data frame will make it much easier to analyze and create visualizations with.
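A sketch of that step, with hypothetical placeholder rows standing in for the [screen_name, location] pairs collected above:

```python
import pandas as pd

# Placeholder pairs; in the notebook these come from the users_locs query
users_locs = [["userA", "Los Angeles, CA"], ["userB", ""], ["userC", "Berlin"]]

tweet_data = pd.DataFrame(data=users_locs, columns=['user', 'location'])
print(tweet_data)
```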

The pandas library has its own object for time series. Given that we have a vector of the dates on which tweets were made, we can further visualize the time series with respect to tweet lengths, likes, and retweets.
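The idea can be sketched on a small placeholder series of tweet lengths indexed by hypothetical creation dates; resampling by day then gives a time series ready for plotting:

```python
import pandas as pd

# Placeholder data: tweet lengths indexed by (hypothetical) creation dates
dates = pd.to_datetime(['2019-10-01', '2019-10-02', '2019-10-02', '2019-10-05'])
lengths = pd.Series([28, 140, 97, 64], index=dates)

# Resample by day to see how average tweet length varies over time;
# days with no tweets show up as NaN
daily_mean = lengths.resample('D').mean()
print(daily_mean)
```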

Overall, I found this module really interesting. I enjoyed working with the Twitter API and found the documentation very helpful. One challenge was that, since there are so many different ways to pull data from this API, I sometimes struggled to apply the documentation to my code, given the way it was structured.

It was really interesting to see that the tweet with the most likes is also the tweet with the most retweets:

Number of likes: 316220
Number of retweets: 63937

In the future, I would like to continue this analysis by looking for correlations between people tweeting at Elon Musk and his own tweets. For instance, I wonder whether the release of new products, or news stories about Elon Musk, coincided with spikes in tweets mentioning #elonmusk. I would also like to work more with visualizations of geolocation, specifically exploring how to visualize the locations of users whose tweets include #elonmusk.
