Complete Scraping of Twitter Data (Twitter Data Series)

Imam Muhajir
Published in Analytics Vidhya
11 min read · Sep 18, 2021

List of Content:
INTRODUCTION
BENEFIT
SCRAPING PREPARATION
SCRAPING DATA

A. DATA BASED ON SCREEN NAME
— 1. Get User LookUp
— 2. Get User Timeline
— 3. Get Favorites
— 4. Get Followers
— 5. Get Following/Friend
B. DATA BASED ON TWITTER POST
— 1. Get Post LookUp
— 2. Get User Retweeted Post
C. DATA BASED ON OWN ACCOUNT
— 1. Get Home Timeline
— 2. Get Retweeted
— 3. Get Mentioned
D. DATA BASED ON LIST
— 1. Get List Post
— 2. Get List Subscribers
— 3. Get List Subscriptions
— 4. Get Members List
E. OTHER DATA
— 1. Get Trending Twitter
— 2. Get Search User
— 3. Get Search

INTRODUCTION

We all know that Twitter is a big social media platform. Many people around us use it: close friends, family, artists, athletes, and companies. Did you know that with the Twitter API we can retrieve a lot of information about a Twitter account, such as posts and when they were posted, number of likes, number of retweets, number of followers, descriptions, and much more? In this article we will only discuss what data can be collected from a Twitter account; in future articles we may cover visualization, sentiment, networks, and other topics.

BENEFIT

What are the benefits of scraping Twitter account data? Among other things, we can use the data to study someone's habit patterns, the sentiment of their tweets, and their network, and there are many more insights we can extract from it.

SCRAPING PREPARATION

The tool we will use is a Python notebook. You can run one locally by installing Jupyter Notebook, or Anaconda, which includes Jupyter Notebook, or you can use a cloud service such as Google Colab or a Kaggle notebook.

Google Colab: https://colab.research.google.com/

Kaggle Notebook: https://www.kaggle.com/code

Before retrieving and analyzing the data, we need to install several packages. The main package used for this scraping is advertools; for complete documentation, please refer to the advertools docs.

Let's code. First, install the packages:

!pip install pandas 
!pip install numpy
!pip install advertools

Note: installation can be done in either a console or a notebook; if you install from a console, omit the "!" prefix. Next, import each package:

import pandas as pd 
import numpy as np
import advertools as adv

Furthermore, the scraping process requires a Twitter API key, which you can obtain by following the documentation:

https://developer.twitter.com/en/docs/twitter-api

For a more detailed step-by-step guide, you can also find tutorials on YouTube or in other articles. If you are having trouble getting a Twitter API key, you can use mine temporarily.

auth_params = {
    'app_key': 'xxxxxxxxxxxxxxxxxxxxxxxx',
    'app_secret': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    'oauth_token': 'xxxxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    'oauth_token_secret': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
}
adv.twitter.set_auth_params(**auth_params)
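Hard-coding keys into a notebook is risky if you ever share it. A safer sketch (the environment-variable names and the `load_auth_params` helper are my own, not part of advertools) is to read the credentials from environment variables:

```python
import os

def load_auth_params():
    """Build the auth_params dict from environment variables so
    credentials never appear in the notebook itself."""
    return {
        'app_key': os.environ.get('TWITTER_APP_KEY'),
        'app_secret': os.environ.get('TWITTER_APP_SECRET'),
        'oauth_token': os.environ.get('TWITTER_OAUTH_TOKEN'),
        'oauth_token_secret': os.environ.get('TWITTER_OAUTH_TOKEN_SECRET'),
    }

# With the variables exported in your shell or notebook, you would then run:
# adv.twitter.set_auth_params(**load_auth_params())
```

Set the four variables in your shell (or with `os.environ` at the top of the notebook) before calling `set_auth_params`.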

SCRAPING DATA

A. DATA BASED ON SCREEN NAME

We can analyze and retrieve data from any Twitter account we already know, whether it belongs to Marvel, a football club, a basketball club, an ex, a girlfriend, or a friend, provided we know the username.

For example, for the Marvel account, "Marvel Entertainment" is the Twitter name and "@Marvel" is the screen name. The parameter needed for the scraping process is the screen name, "Marvel". Several kinds of data can be collected from this key parameter:

1. Get User LookUp

User lookup contains a description of the user, such as name, screen_name, location, description, URL, total followers, total friends, number of lists, total statuses created, total statuses liked, and more. There are 44 columns/variables in this dataset. Let's code:

lookup_user = adv.twitter.lookup_user(screen_name = 'Marvel')
print(lookup_user.shape)
lookup_user.head()

Besides screen_name, several parameters customize the returned data: include_entities, trim_user, map, include_ext_alt_text, include_card_uri, and tweet_mode.

2. Get User Timeline

The user timeline consists of tweets made by the user, whether original tweets, replies, or retweets. The main parameters are screen_name and count; count is the number of tweets to retrieve. The columns we get include the time the tweet was created, the full text, user_name, mentions, hashtags, total favorites, total retweets, the user description, and others: 75 columns/variables in total.

timeline = adv.twitter.get_user_timeline(screen_name='Marvel',count=100)
print(timeline.shape)
timeline.head()

Besides screen_name and count, several parameters customize the returned data: user_id, since_id, max_id, trim_user, exclude_replies, include_rts, and tweet_mode.

3. Get Favorites

Get favorites retrieves the tweets and replies that the user has liked. The parameters and the variables obtained are the same as for the user timeline.

get_favorites = adv.twitter.get_favorites(screen_name='Marvel', count=100)
print(get_favorites.shape)
get_favorites.head(3)

4. Get Followers

In addition to tweets, we can also retrieve user information, such as information about a Twitter account's followers. There are two ways to scrape this:

  • Scraping follower ids. This retrieves only ids, so a relatively large amount of data can be fetched quickly: one request returns up to 5,000 ids, and we can make several requests as needed. The drawback is that we don't get detailed user information such as the screen name. The main parameter is screen_name; additional parameters are cursor, stringify_ids, and count.
get_followers_ids = adv.twitter.get_followers_ids(screen_name="Marvel", count=10000)
print(len(get_followers_ids['ids']))
get_followers_ids

Output:
1000
{'previous_cursor': 0,
'next_cursor': 1710813885960930399,
'ids': [1437949410021978117,
1437948541901713409,
1437947905420058628,
1336494026925740032,
1437948713834483713,
782688591331766272,
1381602541243310085,
1437948842566160385,
1437949440740904961, … }

  • Scraping follower user descriptions. This returns complete data: name, screen_name, location, description, followers count, friends count, listed count, and more. The drawback is the relatively small amount of data compared to scraping ids: one request returns 100 users. The main parameters are screen_name and count; additional parameters are user_id, cursor, skip_status, and include_user_entities. The data has 48 columns.
get_followers_list = adv.twitter.get_followers_list(screen_name= "Marvel" , count=100)
print(get_followers_list.shape)
get_followers_list.head()
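Since one request returns a limited number of ids, collecting a full follower list means requesting page after page until next_cursor comes back as 0. A minimal sketch of that loop (`collect_follower_ids` and `max_pages` are my own names; any callable with the same return shape as `adv.twitter.get_followers_ids` will work):

```python
def collect_follower_ids(fetch_page, screen_name, max_pages=3):
    """Accumulate follower ids page by page.

    fetch_page is a callable returning a dict with 'ids' and
    'next_cursor' keys, like adv.twitter.get_followers_ids.
    """
    ids, cursor = [], -1  # -1 asks Twitter for the first page
    for _ in range(max_pages):
        page = fetch_page(screen_name=screen_name, cursor=cursor)
        ids.extend(page['ids'])
        cursor = page['next_cursor']
        if cursor == 0:  # cursor 0 signals there are no more pages
            break
    return ids

# Usage (requires valid API keys):
# all_ids = collect_follower_ids(adv.twitter.get_followers_ids, 'Marvel')
```

Keeping the fetch function as a parameter also makes the loop easy to test without touching the API.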

5. Get Following/Friend

Get friends retrieves data on the accounts the user follows. The parameters and variables obtained are the same as for getting followers. Again there are two ways to scrape this:

  • Scraping following ids
get_friends_ids = adv.twitter.get_friends_ids( screen_name="Marvel", count =10000 )
print(len(get_friends_ids['ids']))
get_friends_ids

Output:
781
{'previous_cursor': 0,
'next_cursor': -1,
'ids': [155709853,
1152753054782910464,
1086709184874377217,
2401098489,
1148979262659145728,
1152749683938222080, … }

  • Scraping following user descriptions
get_friends_list = adv.twitter.get_friends_list(screen_name="Marvel", count=200)
print(get_friends_list.shape)
get_friends_list.head(3)

B. DATA BASED ON TWITTER POST

After analyzing Twitter based on a username by collecting information such as the timeline, likes, followers, following, and lookups, we will next analyze Twitter based on Twitter posts. We can access a Twitter post on the web using the tweet_id column that we obtained in the previous scraping. As an example:

If you have tweet_id: 1426928601610416135.

So to access the tweet, it is enough to insert the tweet_id into the following URL:

https://twitter.com/anyuser/status/<tweet_id>

Ex: https://twitter.com/anyuser/status/1438874044107948039


Above is the URL pattern used to access a tweet by its tweet_id. Next, let's scrape some of the information available from a Twitter post:
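The URL pattern is easy to generate in code. A small helper (`tweet_url` is my own name, not part of advertools) might look like this:

```python
def tweet_url(tweet_id, screen_name='anyuser'):
    """Build a tweet URL from a tweet id; Twitter resolves the
    'anyuser' placeholder to the tweet author's actual profile."""
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"

print(tweet_url(1438874044107948039))
# https://twitter.com/anyuser/status/1438874044107948039
```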

1. Get Post LookUp

By knowing the tweet_id we can retrieve a description of the tweet, such as the full text, who is mentioned, which hashtags are used, the number of retweets, the number of likes, replies, and more. There are 72 columns in this dataset.

lookup_status = adv.twitter.lookup_status(id = 1439010395587555333 )
print(lookup_status.shape)
lookup_status

2. Get User Retweeted Post

We can also scrape retweet data from a tweet_id, namely information about the users who retweeted the post. The data comes in two forms: user ids and a list of user descriptions. Conceptually, scraping retweeters by tweet_id is the same as scraping followers by username.

  • Scraping retweeter ids
retweet_id = adv.twitter.get_retweeters_ids(id=1439010395587555333)
print(len(retweet_id['ids']))
retweet_id

Output:
{'previous_cursor': 0,
'next_cursor': 0,
'ids': [816170048,
391409829,
1066800427423944705,
1054503376514441217,
870775262694318080,
1421682772364513282,…]}

  • Scraping retweeter user descriptions
retweet = adv.twitter.get_retweets(id =1439010395587555333)
print(retweet.shape)
retweet.head(3)

C. DATA BASED ON OWN ACCOUNT

We can also retrieve data based on our own account. Which account gets scraped is determined by the developer account used to access the Twitter API: if you use my developer API key, the data retrieved will be mine. To retrieve data about yourself, you need your own developer account and your own API key.

1. Get Home Timeline

This scraping is unique, and different from the other types, because each run can produce different data, matching the current timeline view in your Twitter app.

home_timeline = adv.twitter.get_home_timeline()
print(home_timeline.shape)
home_timeline.head(3)

2. Get Retweeted

This is data on my posts that were retweeted by other people. Note: these are not statuses that I retweeted, but my statuses that other users retweeted.

retweet_ofme = adv.twitter.retweeted_of_me(count = 100 )
print(retweet_ofme.shape)
retweet_ofme.head()

3. Get Mentioned

The mention timeline is tweet data that mentions me, whether in a tweet, a reply, or a quote retweet.

mention_timeline = adv.twitter.get_mentions_timeline(count=100)
print(mention_timeline.shape)
mention_timeline.head()

D. DATA BASED ON LIST

Twitter Lists allow you to customize, organize and prioritize the Tweets you see in your timeline. You can choose to join Lists created by others on Twitter, or from your own account, you can choose to create Lists of other accounts by group, topic, or interest. Viewing a List timeline will show you a stream of Tweets from only the accounts on that list. You can also pin your favorite Lists to the top of your Home timeline so you never miss a Tweet from the accounts that are most important to you.

How to discover lists? In your Home timeline on Twitter for iOS and Android apps, you might see a prompt to Discover new Lists. If we suggest a List to you that’s of interest, simply tap Follow. From the prompt, you can also tap Show more to browse through our Lists discovery page. There, we will show you more Lists we might think you’d like to follow and you can search for additional Lists in the search box at the top of the page.

We’ll also show you top Tweets from the Lists you follow right in your Home timeline.

The next data we can scrape is lists. There are two main parameters for scraping a list: the list_id and the list owner's screen name.

The list_id can be found in the Twitter list URL; for example, the list with list_id 1296872799932473345 belongs to @LearnedVector. After finding these two parameters, let's code.
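If you only have the list's URL, the numeric id can be pulled out with a small helper (`list_id_from_url` is my own name; it assumes the id is the last path segment, as in https://twitter.com/i/lists/1296872799932473345):

```python
def list_id_from_url(url):
    """Extract the numeric list id from a Twitter list URL."""
    return int(url.rstrip('/').split('/')[-1])

print(list_id_from_url('https://twitter.com/i/lists/1296872799932473345'))
# 1296872799932473345
```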

1. Get List Post

This scrapes the statuses posted in the list. The parameter used is list_id.

list_statuses = adv.twitter.get_list_statuses(list_id=1214912982846590976)
print(list_statuses.shape)
list_statuses.head(3)

2. Get List Subscribers

list_subscribers = adv.twitter.get_list_subscribers(list_id =1214912982846590976)
print(list_subscribers.shape)
list_subscribers.head(3)

3. Get List Subscriptions

list_subscriptions = adv.twitter.get_list_subscriptions(screen_name='benthecoder1')
print(list_subscriptions.shape)
list_subscriptions.head()

4. Get Members List

list_members = adv.twitter.get_list_members(list_id=1214912982846590976)
print(list_members.shape)
list_members.head()

E. OTHER DATA

1. Get Trending Twitter

A Twitter trend is something widely mentioned or discussed on the internet, especially on Twitter. Each place has different trending topics: every country and region has its own trends. Therefore, before scraping trending data, we must determine which area to scrape and the id for that area. Let's check the list of available locations with the following code:

available = adv.twitter.get_available_trends()
print(available.shape)
available

There are 8 columns and 467 rows; the 467 rows are trending locations around the world, one for each country or region. This time we will scrape data for Indonesia, which has id 23424846.
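Looking up a location's id is just a DataFrame filter. A sketch with a tiny sample frame standing in for the real `get_available_trends()` output (the column names here follow Twitter's trends response, but treat them as assumptions and check `available.columns` on your own data):

```python
import pandas as pd

# Small sample mimicking the shape of get_available_trends():
# each row is a location plus its WOEID (the id for get_place_trends).
sample = pd.DataFrame({
    'name': ['Worldwide', 'Indonesia', 'Jakarta'],
    'country': ['', 'Indonesia', 'Indonesia'],
    'woeid': [1, 23424846, 1047378],
})

# Filter by place name to find the id to pass to get_place_trends
indonesia_id = sample.loc[sample['name'] == 'Indonesia', 'woeid'].iloc[0]
print(indonesia_id)
# 23424846
```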

trend = adv.twitter.get_place_trends(ids = 23424846 )
print(trend.shape)
trend

There are 50 top trends, with several columns including the trend name, location, tweet_volume, local_rank, and others.

2. Get Search User

Search user lets us retrieve user search results. For example, searching for "marvel" returns several accounts related to "marvel", as below:

search_user = adv.twitter.search_users(q = 'marvel', count=100)
print(search_user.shape)
search_user.head(3)

3. Get Search

The search method returns tweets related to the given keyword.

df = adv.twitter.search(q = "avengers", count = 100)
print(df.shape)
df.head()

Scraping tweets with search is covered in more detail in another of my articles.

Thank you for reading this article. I hope it is useful and helps you in your work with data. Don't forget to follow and give plenty of claps. See you next time.


Imam Muhajir
Data Scientist at KECILIN.ID || Physicist || Writer about Data Analysis, Big Data, Machine Learning, and AI. LinkedIn: https://www.linkedin.com/in/imammuhajir92/