TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Mining replies to Tweets: A Walkthrough

7 min read · Jan 16, 2021


A snapshot of my “news” from Twitter

Twitter’s API is loved by all of us, isn’t it? It gives us access to a treasure trove of information, misinformation, disinformation, and whatnot, almost in real time. It feeds the beginner NLP engineer’s or data scientist’s dream of collecting processable data efficiently. And it is ridiculously well documented by the community through blog posts, Stack Overflow answers, and wrappers. Last I counted from memory, I could recollect at least five different Python wrappers for accessing the API.

Despite all this, I stumbled upon a task with poor documentation and even poorer solution implementations (through no fault of the programmers): mining replies to a tweet.

For the rest of this post, I am going to assume that the reader is familiar with Twitter API terminology such as a tweet_id or a tweet object. If this is new, please refer to Twitter's documentation. From experience, tutorials that use wrappers such as Tweepy make it more intuitive to get started; for advanced usage, however, the source documentation might help.

This is a solution to this problem. So is this. Masterful implementations, and they work in most cases. Unfortunately, they were limited by Twitter itself. If you don’t want to dive into the code in those links, here is a quick overview of the process they follow:

Mining replies to tweets

1. Find the tweet_id of the tweet you want the replies to.
2. Query Twitter for tweets that have replied to this tweet_id. This can be done based on the in_reply_to_user_id attribute that every tweet object carries. (Every tweet is represented as a tweet object.)
3. Compile the replies, assert their correctness. Done.
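The three steps above amount to filtering candidate tweets against the root tweet. Here is a toy sketch with hypothetical data; I filter on the v1.1 tweet object's in_reply_to_status_id field (which names the tweet being replied to) purely as an illustration:

```python
def filter_replies(candidate_tweets, root_tweet_id):
    """Keep only the tweets that directly reply to the root tweet.

    candidate_tweets: dicts shaped like v1.1 tweet objects, each carrying
    an 'in_reply_to_status_id' field naming the tweet it replies to.
    """
    return [t for t in candidate_tweets
            if t.get("in_reply_to_status_id") == root_tweet_id]

# Hypothetical stand-ins for what a search query would return
candidates = [
    {"id": 2, "text": "nice take!", "in_reply_to_status_id": 1},
    {"id": 3, "text": "unrelated", "in_reply_to_status_id": None},
    {"id": 4, "text": "agreed", "in_reply_to_status_id": 1},
]
print([t["id"] for t in filter_replies(candidates, 1)])  # [2, 4]
```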

The drawback, however, is that when you retrieve these results, Twitter’s old API only returned recent or popular results, or a combination of both (you can specify which). And recency or popularity isn’t adjudged per tweet; it is adjudged per user who posted the root tweet you want replies to.

When you interact with my account, you can find almost every reply I have ever received by either metric. If you interact with The New York Times’ account, you cannot even recreate the replies to something they posted a few hours ago because of the sheer volume of “recent” and “popular” tweets they receive. Since only a certain number of replies are returned, recent and popular replies keep changing so rapidly that it is impossible to recreate reply activity for high-activity accounts.

Not all’s lost. In fact, it’s better now!

By a stroke of luck, I realized that Twitter launched its v2 API late last year. “It’s just updated, nothing too fancy I guess” (except that is the most wrong I have ever been when it comes to APIs).

Here is a nice little post explaining what’s new.

I couldn’t find many Python wrappers using the v2 endpoints (and I didn’t like my prior experiences of interacting with the API without a wrapper), but by sheer coincidence, and with the grit of a desperate last attempt, I found the saviour wrapper: TwitterAPI.

Of all the interesting features in the new API, the one that matters most to this task is the conversation_id attribute that now comes with every tweet. In a convoluted thread of tweets, with replies, replies to replies, replies to replies which are also replies, and so on, every tweet is allocated the same conversation_id to thread everything together.
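To illustrate what conversation_id buys you, here is a toy grouping over hand-made v2-style tweet dicts (the data is invented; only the conversation_id field is the point):

```python
from collections import defaultdict

def group_by_conversation(tweets):
    """Bucket v2-style tweet dicts by their shared conversation_id."""
    threads = defaultdict(list)
    for t in tweets:
        threads[t["conversation_id"]].append(t)
    return dict(threads)

# Toy v2-style tweets: ids "1"-"3" form one thread, "9" stands alone
tweets = [
    {"id": "1", "conversation_id": "1", "text": "root tweet"},
    {"id": "2", "conversation_id": "1", "text": "a reply"},
    {"id": "3", "conversation_id": "1", "text": "a reply to the reply"},
    {"id": "9", "conversation_id": "9", "text": "a different thread"},
]
threads = group_by_conversation(tweets)
print(len(threads["1"]))  # 3
```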

TwitterAPI’s example code (which I leaned on heavily and modified a bit for this post) relies on this very attribute. Let’s code it up stepwise. You can copy and paste the code and it will work as described. Remember to set up the authentication tokens from your own Twitter developer portal.

NOTE: The gists are not meant to function as independent .py files. They are just meant for representation. You can copy the code from here sequentially as described to recreate the file however (or use it in a notebook form here).

1. Import the magic.

# Install TwitterAPI (!pip install TwitterAPI) and import the necessary packages
# TwitterAPI documentation: https://github.com/geduldig/TwitterAPI/
from TwitterAPI import TwitterAPI, TwitterOAuth, TwitterRequestError, TwitterConnectionError, TwitterPager
import pandas as pd

2. Authenticate your app. Get your tokens here. Remember to pass the api_version='2' argument when instantiating the API.

consumer_key = "CONSUMER KEY"
consumer_secret = "CONSUMER SECRET"
access_token_key = "ACCESS TOKEN KEY"
access_token_secret = "ACCESS TOKEN SECRET"

api = TwitterAPI(consumer_key, consumer_secret, access_token_key, access_token_secret, api_version='2')

3. Figure out the “conversation_id” of the tweet. Here is what I did; I found it the most convenient approach.

  • Find recent tweets from the account that posted the tweet you want replies to. Use Tweepy to collect the tweet objects; it uses the v1 API, but for retrieving tweet objects (and therefore tweet_ids), it is the simplest and fastest way in my opinion. (You can find them any other way too; we need the tweet_ids at the least, and the conversation_ids at most.)
  • Look up the tweet by its tweet_id using TwitterAPI and retrieve its conversation_id.
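The lookup in the second bullet can be sketched like this. The commented request line assumes TwitterAPI's v2 single-tweet lookup syntax, and `extract_conversation_id` is a helper name of my own, not from the post's gists:

```python
def extract_conversation_id(response_json):
    """Pull conversation_id out of a v2 single-tweet lookup payload."""
    return response_json["data"]["conversation_id"]

# The live lookup needs an authenticated `api` object, roughly:
# r = api.request(f'tweets/:{tweet_id}', {'tweet.fields': 'conversation_id'})
# conv_id = extract_conversation_id(r.json())

# A sample body shaped like the v2 tweet-lookup response:
sample = {"data": {"id": "1349843764408414213",
                   "text": "some tweet",
                   "conversation_id": "1349843764408414213"}}
print(extract_conversation_id(sample))
```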

4. Retrieve the tweets as described above. With the code, you should have their “conversation_ids” as well, all neatly packed into a dataframe.

# RETRIEVE TWEETS FROM THE SCREEN NAME(S)
tweets = get_new_tweets(names)
# RETRIEVE CONVERSATION IDs OF THE RETRIEVED TWEETS
tweets = add_data(tweets)
# WRITE THE RETRIEVED TWEETS TO CSV
tweets.to_csv("tweets.csv")
# VIEW THE DATAFRAME HEAD WITH THE TWEETS RETRIEVED
tweets.head()

5. Go to the tweets.csv file and find the “conversation_id” of the tweet you need the replies to. Copy it. Come right back.

# Replace this placeholder with conversation IDs from above
conv_ids = ['1349843764408414213']

6. Build a data structure to handle the tweet hierarchy.

I am not going to dive into this step in technical detail. All that is happening here is that a tree data structure is created to keep track of tweets, where the root is the first tweet that starts a thread.

(Optional logic: every reply to the root is a level-1 reply; every reply to a level-1 reply is a level-2 reply, and so on. The interesting aspect to note is that every single tweet in the whole thread carries the conversation_id attribute. Using this together with the in_reply_to_user_id attribute, you can recreate the level of every reply in the thread; add the timestamp of each tweet and you can practically recreate the whole thread.)
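A minimal sketch of such a tree node, with my own naming rather than the gist's:

```python
class ReplyNode:
    """One tweet in the conversation tree; the root is the original tweet."""
    def __init__(self, tweet_id, level=0):
        self.tweet_id = tweet_id
        self.level = level      # 0 = root, 1 = direct reply, 2 = reply to a reply...
        self.children = []

    def add_reply(self, tweet_id):
        """Attach a reply one level below this node and return it."""
        child = ReplyNode(tweet_id, level=self.level + 1)
        self.children.append(child)
        return child

root = ReplyNode("root-tweet")
r1 = root.add_reply("reply-1")          # level 1
r2 = r1.add_reply("reply-to-reply-1")   # level 2
print(r2.level)  # 2
```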

7. Actually retrieve the replies.

We simply use the tree structure we created above to store every reply and find its position in the thread hierarchy by allocating every reply a level.

I have simplified this post a bit by adding some code that retrieves only immediate replies to a tweet and not subsequent replies to replies. If you want to see the whole conversation tree being recreated, uncomment line 27 from the gist above and lines 37 and 38 from the following gist, and comment out line 36 in that gist, to watch the magic happen.
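For reference, the retrieval itself goes through the v2 recent-search endpoint, queried by conversation_id. Here is a hypothetical sketch of the request; the field list is my own choice, not necessarily what the gist asks for:

```python
def reply_search_params(conv_id):
    """Build v2 recent-search parameters that fetch a whole conversation."""
    return {
        "query": f"conversation_id:{conv_id}",
        "tweet.fields": "author_id,conversation_id,created_at,in_reply_to_user_id",
    }

# With a live, authenticated `api`, TwitterPager walks every result page:
# pager = TwitterPager(api, 'tweets/search/recent', reply_search_params(conv_id))
# for tweet in pager.get_iterator(wait=2):
#     print(tweet["text"])

print(reply_search_params("1349843764408414213")["query"])
# conversation_id:1349843764408414213
```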

Optional logic: once you obtain the root and a list of tweets with the same conversation_id, all you need to do is use in_reply_to_user_id to find the parent of every tweet until you reach the root by brute force. A recursive implementation, however, is far more efficient. The function find_parent_of() does exactly this.
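find_parent_of() itself isn't reproduced here, but the recursive level-finding idea can be sketched with a plain parent map, a hypothetical stand-in for the reply links recovered from each tweet:

```python
def reply_levels(root_id, parents):
    """Assign a thread level to every tweet by walking parents up to the root.

    parents: dict mapping each reply's tweet_id to the tweet_id it replies to.
    """
    levels = {root_id: 0}

    def level_of(tid):
        if tid not in levels:                        # memoized recursion
            levels[tid] = level_of(parents[tid]) + 1
        return levels[tid]

    for tid in parents:
        level_of(tid)
    return levels

# "a" and "c" reply to the root; "b" replies to "a"
parents = {"a": "root", "b": "a", "c": "root"}
print(reply_levels("root", parents))  # {'root': 0, 'a': 1, 'b': 2, 'c': 1}
```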

Run the gist's code using:

replies = reply_thread_maker(conv_ids)
# WRITE REPLIES TO FILE
replies.to_csv("replies.csv")
#VIEW SAMPLE REPLIES
replies.head()

8. Go to replies.csv and analyze your replies. Fin.

I have attempted to put this methodology in a Colab notebook to make it function like a web-based tool here. It will still need some programming knowledge and understanding of APIs, but I have tried to make it as plug-and-play as I could without making an entire project out of it.

I hadn’t seen many tutorials or blogs that handle this topic, especially after Twitter’s v2 API launched. The number of people I personally know who would have loved for this feature to exist earlier is more than I can count on my fingers and toes, which can only mean one thing: a blog post.

After spending almost a full day hunting for ways to retrieve replies to tweets, I stumbled upon this solution on my last attempt before giving up. The moral of the story is that perseverance really pays off. It also made for my greatest day as a programmer working with Twitter’s API. If you made it this far and read this line, I owe you a coffee, but only after I graduate and land a job. Conditions apply.

Future Directions

I don’t want you to have to ask what the point of getting these replies is, so here are some areas where this might be useful:

  • Measuring toxicity evoked by a tweet
  • Measuring bias or political leaning evoked by a tweet (left-/right-leaning response classification)
  • Studying the sentiment of public response to news
  • Studying public response to rumors and misinformation

For any comments, feedback, questions, chats etc. you can find me right here.

Adithya Narayanan is a Graduate Student at the University at Buffalo. He specializes in Operations Research and has a background in Computer Science. Reading news and commenting on it are his natural instincts.

He has an alter ego, who, when bored, loves creating digital content of all kinds but primarily prefers a pen and paper. He has created digital content for several larger-than-life sporting entities such as LaLiga Santander, Roland-Garros, and NBA Basketball School.

This is his LinkedIn, and his email.
