Simple Twitter Data Scraping (Twitter Data Series)

Imam Muhajir
Published in Analytics Vidhya
5 min read · Aug 15, 2021

Twitter is a free social networking and microblogging service that allows registered members to broadcast short posts called tweets. Twitter members can broadcast tweets and follow other users' tweets from multiple platforms and devices. You can access Twitter via the web or your mobile device. To share information on Twitter as widely as possible, Twitter also provides companies, developers, and users with programmatic access to Twitter data through an API.

To access the Twitter API, we first need:

1. API Key

2. API Secret Key

3. Access Token

4. Access Token Secret

To get these access tokens, follow the guide here:

https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api
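Once you have the four credentials, one common practice (my suggestion here, not a requirement of Twitter or advertools) is to keep them out of your script by reading them from environment variables. A minimal sketch, where the variable names are arbitrary choices for this example:

```python
import os

# Sketch: load the four Twitter credentials from environment variables
# instead of hardcoding them in the notebook. The variable names below
# are arbitrary examples, not names required by Twitter or advertools.
auth_params = {
    "app_key": os.environ.get("TWITTER_API_KEY", ""),
    "app_secret": os.environ.get("TWITTER_API_SECRET", ""),
    "oauth_token": os.environ.get("TWITTER_ACCESS_TOKEN", ""),
    "oauth_token_secret": os.environ.get("TWITTER_ACCESS_TOKEN_SECRET", ""),
}
```

This way the notebook can be shared without exposing your keys.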

LET'S GO TO CODE

1. Setting Up the Environment

Open a code script or a notebook. We will use a notebook; you can access one for free in Google Colaboratory:

https://colab.research.google.com/

Create a new notebook and you are ready to go.

2. Install and Import Library

The library used for this scraping is advertools, so first install it on your machine.

In a notebook:

!pip install advertools

In cmd/terminal:

pip install advertools

Then import the libraries:

import pandas as pd
import numpy as np
import advertools as adv

3. Setting the Twitter Access Tokens

auth_params = {
    'app_key': "xxxxxxxxxxxxxxxxxxxxxxx",
    'app_secret': "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    'oauth_token': "xxxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx",
    'oauth_token_secret': "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
}
adv.twitter.set_auth_params(**auth_params)

If you want an easy way to get started with the API, you can use my API keys.

The data we want to get in this article is:

1. Get data from a search keyword

This method gets all the data about a keyword. For example, if you search “avengers”, the results are tweets that contain the “avengers” topic, so you can collect tweet data like this.

So let's scrape:

input_keyword = "avengers"
count_data = 200
df = adv.twitter.search(
    q=input_keyword,
    geocode=None,
    lang=None,
    locale=None,
    result_type=None,
    count=count_data,
    until=None,
    since_id=None,
    max_id=None,
    include_entities=None,
    tweet_mode=None,
)

You can customize the parameters of the scraper. Let's go through each parameter in detail, because this is what makes the function so powerful.

q — the main keyword parameter for scraping.

(str — required) A UTF-8, URL-encoded search query of 500 characters maximum, including operators. Queries may additionally be limited by complexity.

The keyword can take many kinds of values, including search operators.

  • geocode — (lat long dist — optional) Returns tweets by users located within a given radius of the given latitude/longitude. The location is preferentially taken from the Geotagging API but will fall back to the user's Twitter profile. The parameter value is specified by “latitude,longitude,radius”, where the radius units must be specified as either “mi” (miles) or “km” (kilometers). Note that you cannot use the near operator via the API to geocode arbitrary locations; however, you can use this geocode parameter to search near geocodes directly. A maximum of 1,000 distinct “sub-regions” will be considered when using the radius modifier.
  • lang — (str — optional) Restricts tweets to the given language, given by an ISO 639–1 code. Language detection is best-effort.
  • locale — (str — optional) Specify the language of the query you are sending (only ja is currently effective). This is intended for language-specific consumers and the default should work in the majority of cases.
  • result_type — (str — optional) Specifies what type of search results you would prefer to receive. The current default is “mixed”. Valid values include: * mixed: include both popular and real-time results in the response * recent: return only the most recent results in the response * popular: return only the most popular results in the response.
  • count — (int — optional) Specifies the number of results to retrieve.
  • until — (date — optional) Returns tweets created before the given date. The date should be formatted as YYYY-MM-DD. Keep in mind that the search index has a 7-day limit. In other words, no tweets will be found for a date older than one week.
  • since_id — (int — optional) Returns results with an ID greater than (that is, more recent than) the specified ID. There are limits to the number of Tweets that can be accessed through the API. If the limit of Tweets has occurred since the since_id, the since_id will be forced to the oldest ID available.
  • max_id — (int — optional) Returns results with an ID less than (that is, older than) or equal to the specified ID.
  • include_entities — (bool — optional) The entities node will not be included when set to False.
  • tweet_mode — (str — optional) Valid request values are compat and extended, which give compatibility mode and extended mode, respectively, for Tweets that contain over 140 characters.
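To see how these optional parameters combine in practice, here is a sketch of a more specific query. The values are illustrative assumptions, and the actual search call (commented out) still requires valid credentials:

```python
# Illustrative parameter values; adjust them for your own query.
search_params = {
    "q": "avengers -filter:retweets",  # keyword plus a standard search operator
    "lang": "en",                      # ISO 639-1 code: English tweets only
    "result_type": "recent",           # only the most recent results
    "count": 100,                      # number of results to retrieve
    "until": "2021-08-14",             # tweets created before this date (YYYY-MM-DD)
    "tweet_mode": "extended",          # full text for tweets over 140 characters
}
# df = adv.twitter.search(**search_params)  # requires valid auth params
```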

After all that, let's display the scraped data:

print(df.shape)
print(df.head())

The data has 79 columns representing the information for one tweet, and 200 rows representing the 200 tweets we retrieved.
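Since a real search call needs live credentials, here is a sketch using a small mock DataFrame to show how you might inspect and save the results. The column names are plausible examples, not a guaranteed advertools schema:

```python
import pandas as pd

# Mock a tiny result frame to illustrate the inspection steps; a real
# adv.twitter.search() call returns one row per tweet with many columns.
df = pd.DataFrame({
    "tweet_created_at": ["2021-08-14", "2021-08-15"],
    "tweet_full_text": ["A tweet about avengers", "Another avengers tweet"],
    "user_screen_name": ["user_one", "user_two"],
})

print(df.shape)                                 # (rows, columns)
print(df.columns.tolist())                      # the available columns
df.to_csv("avengers_tweets.csv", index=False)   # save for later analysis
```

Saving to CSV lets you analyze the tweets later without re-running the scraper.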

Thank you very much.

Source:

https://whatis.techtarget.com/definition/Twitter

https://advertools.readthedocs.io/en/master/advertools.twitter.html

https://developer.twitter.com/


Data Scientist at KECILIN.ID || Physicist ||Writer about Data Analysis, Big Data, Machine Learning, and AI. Linkedln: https://www.linkedin.com/in/imammuhajir92/