Simple Twitter Data Scraping (Twitter Data Series)
Twitter is a free social networking and microblogging service that allows registered members to broadcast short posts called tweets. Members can post tweets and follow other users' tweets from multiple platforms and devices: you can access Twitter via the web or your mobile device. To share information on Twitter as widely as possible, Twitter also gives companies, developers, and users programmatic access to Twitter data through its API.
To access the Twitter API, we first need:
1. API Key
2. API Secret Key
3. Access Token
4. Access Token Secret
To get these credentials, follow the guide here:
https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api
LET'S GO TO CODE
1. Setting Environment
Open a code script or a notebook; we will use a notebook. You can access a free notebook in Google Colaboratory:
https://colab.research.google.com/
Create a new notebook and you are ready to go.
2. Install and Import Library
The library we will use for this scraping is advertools, so first install it on your machine.
in notebook code:
!pip install advertools
In cmd/terminal:
pip install advertools
Then import the libraries:
import pandas as pd
import numpy as np
import advertools as adv
3. Setting the Twitter Access Tokens
auth_params = {
    'app_key': 'xxxxxxxxxxxxxxxxxxxxxxx',
    'app_secret': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    'oauth_token': 'xxxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx',
    'oauth_token_secret': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
}
adv.twitter.set_auth_params(**auth_params)
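Hard-coding credentials in a notebook is risky if you ever share it. As a sketch (this is not part of advertools, and the environment-variable names are my own choice), you could load the four credentials from environment variables instead:

```python
import os

# Read the four Twitter credentials from environment variables so they are
# never hard-coded in the notebook. The variable names are assumptions,
# not an advertools convention.
auth_params = {
    'app_key': os.environ.get('TWITTER_APP_KEY', ''),
    'app_secret': os.environ.get('TWITTER_APP_SECRET', ''),
    'oauth_token': os.environ.get('TWITTER_OAUTH_TOKEN', ''),
    'oauth_token_secret': os.environ.get('TWITTER_OAUTH_TOKEN_SECRET', ''),
}
print(sorted(auth_params))
```

You would then pass this dict to adv.twitter.set_auth_params(**auth_params) exactly as above.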
If you want an easier way to get started, you can use my API keys.
The data we want to get in this article:
1. Get data from search keyword
This method gets all the available data about a keyword. For example, if you search for “avengers”, the result is tweets that contain the “avengers” topic, so you can collect tweet data this way.
So let's start scraping:
input_keyword = "avengers"
count_data = 200
df = adv.twitter.search(q=input_keyword, geocode=None, lang=None, locale=None, result_type=None, count=count_data, until=None, since_id=None, max_id=None, include_entities=None, tweet_mode=None)
You can customize the scraper through its parameters, so let's explain each parameter in detail, because this is where the real power lies.
- q — (str — required) The main keyword parameter for scraping: a UTF-8, URL-encoded search query of 500 characters maximum, including operators. Queries may additionally be limited by complexity. Many kinds of values can be used for the keyword.
- geocode — (lat long dist — optional) Returns tweets by users located within a given radius of the given latitude/longitude. The location is preferentially taken from the Geotagging API but will fall back to the user's Twitter profile. The parameter value is specified by “latitude,longitude,radius”, where radius units must be specified as either “mi” (miles) or “km” (kilometers). Note that you cannot use the near operator via the API to geocode arbitrary locations; however, you can use this geocode parameter to search near geocodes directly. A maximum of 1,000 distinct “sub-regions” will be considered when using the radius modifier.
- lang — (str — optional) Restricts tweets to the given language, given by an ISO 639–1 code. Language detection is best-effort.
- locale — (str — optional) Specify the language of the query you are sending (only ja is currently effective). This is intended for language-specific consumers and the default should work in the majority of cases.
- result_type — (str — optional) Specifies what type of search results you would prefer to receive. The current default is “mixed”. Valid values include: * mixed: include both popular and real-time results in the response. * recent: return only the most recent results in the response. * popular: return only the most popular results in the response.
- count — (int — optional) Specifies the number of results to retrieve.
- until — (date — optional) Returns tweets created before the given date. The date should be formatted as YYYY-MM-DD. Keep in mind that the search index has a 7-day limit. In other words, no tweets will be found for a date older than one week.
- since_id — (int — optional) Returns results with an ID greater than (that is, more recent than) the specified ID. There are limits to the number of Tweets that can be accessed through the API. If the limit of Tweets has occurred since the since_id, the since_id will be forced to the oldest ID available.
- max_id — (int — optional) Returns results with an ID less than (that is, older than) or equal to the specified ID.
- include_entities — (bool — optional) The entities node will not be included when set to False.
- tweet_mode — (str — optional) Valid request values are compat and extended, which give compatibility mode and extended mode, respectively, for Tweets that contain over 140 characters.
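To tie the parameters above together, here is a sketch (pure Python, no API call) of how you might assemble only the parameters you actually set before passing them to adv.twitter.search. The values are illustrative examples, not recommendations:

```python
# Illustrative parameter values; None means "leave at the API default".
params = {
    'q': 'avengers',           # required search query
    'geocode': None,           # e.g. '37.78,-122.40,10km'
    'lang': 'en',              # ISO 639-1 language code
    'result_type': 'recent',   # 'mixed', 'recent', or 'popular'
    'count': 100,              # number of results to retrieve
    'until': None,             # e.g. '2021-05-01' (7-day search window)
    'tweet_mode': 'extended',  # full text for tweets over 140 characters
}

# Drop the parameters left as None so only explicit choices are sent.
active = {k: v for k, v in params.items() if v is not None}
print(sorted(active))
# The call would then be: adv.twitter.search(**active)
```

Keeping the None defaults out of the call makes it easy to see at a glance which options you have actually chosen.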
After all that, let's display the scraped data:
print(df.shape)
print(df.head())
The data has 79 columns, each representing a piece of information about one tweet, and 200 rows, one for each of the 200 tweets we retrieved.
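Once the DataFrame is in hand, a common next step is to keep a few useful columns and save them. A minimal sketch using a mock DataFrame (the column names here are illustrative; the real result has 79 columns whose exact names come from the Twitter response):

```python
import pandas as pd

# Mock of a scraped result; the real df from adv.twitter.search is much wider.
df = pd.DataFrame({
    'tweet_created_at': ['2021-05-01', '2021-05-02'],
    'user_screen_name': ['fan_one', 'fan_two'],
    'tweet_full_text': ['Avengers assemble!', 'Watching Avengers again'],
})

# Keep only the columns of interest and save them for later analysis.
subset = df[['tweet_created_at', 'user_screen_name', 'tweet_full_text']]
subset.to_csv('avengers_tweets.csv', index=False)
print(subset.shape)  # (2, 3)
```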
Thank you very much
Source:
https://whatis.techtarget.com/definition/Twitter
https://advertools.readthedocs.io/en/master/advertools.twitter.html