Analyzing Twitter Data in Python (Part 2)
The basics of analyzing Twitter data through the Twitter APIs: loading and accessing tweets with the Tweepy library, and analyzing JSON data
Introduction
Continuing from part 1, the introduction article…
Numerous marketers see success in their social media marketing strategies by paying closer attention to Twitter analytics.
Whether it’s Tweets, impressions, engagements or clicks, there are several metrics that give you additional insights into how well you’re resonating with your audience. A common problem for companies is that they don’t know the best route to take when analyzing Twitter data. This means brands have no idea why a piece of content did extremely well or fell flat.
Looking into the data not only shows you what worked, but also gives you insights for making your next campaign successful. That’s why brands are increasingly turning to social data as a key source of business intelligence to help drive marketing decisions and provide crucial insights across the organization.
Here we shall see how to use the Twitter API to collect ‘sample’ data (1% of all Twitter data) and ‘filter’-specific data, and how to analyze them to gain insights for business.
Twitter API
How to collect data?
Many social media companies expose APIs for developers and researchers to use. An API is a way for businesses and government organisations to expose their data to outside users or third-party vendors. Twitter has more than one API for this purpose:
a. Search API: allows access to tweets from the past week
b. Ads API: for Twitter ads
c. Streaming API: collects a sample of tweets in real time based on keywords, user IDs and locations. It has 2 endpoints:
- filter: to request data on a few hundred keywords, a few thousand usernames and 25 location ranges
- sample: returns a 1% sample of all Twitter data
Tweepy: used to collect data from the Streaming API.
It abstracts away much of the work needed to set up a stable Twitter Streaming API connection.
We need 2 steps before we can start using Tweepy:
1. Set up a Twitter developer account
2. Create your application and get API keys for authentication
You will need:
- api_key
- api_secrets
- access_token
- access_secret
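One way to keep these four values out of the script itself is to read them from environment variables. The sketch below assumes hypothetical variable names (TWITTER_API_KEY and so on); they are not defined by Twitter or Tweepy, and any names would do.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
CREDENTIAL_VARS = {
    "api_key": "TWITTER_API_KEY",
    "api_secrets": "TWITTER_API_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any are missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}


# Example with a stand-in environment (placeholder values, not real keys)
fake_env = {var: "placeholder" for var in CREDENTIAL_VARS.values()}
credentials = load_credentials(fake_env)
print(sorted(credentials))
```

Passing the environment in as an argument keeps the function easy to try out without touching the real environment.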
Next, authenticate your app:
import tweepy
# Credentials obtained from the Twitter developer site
api_key = "..."
api_secrets = "..."
access_token = "..."
access_secret = "..."
# Authenticate to Twitter
auth = tweepy.OAuthHandler(api_key, api_secrets)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
# Check that Twitter accepts the credentials
try:
    if api.verify_credentials():
        print('Successful Authentication')
    else:
        print('Failed authentication')
except Exception as exc:
    print('Failed authentication:', exc)
SListener: Tweepy needs a listener object, here called SListener, that tells it how to handle incoming data.
SListener code:
from tweepy import API
from tweepy.streaming import StreamListener
import time
class SListener(StreamListener):
    def __init__(self, api=None):
        # Open a new timestamped file in which to store tweets
        self.output = open('tweets_%s.json' %
                           time.strftime('%Y%m%d-%H%M%S'), 'w')
        # Use the API object passed in, or fall back to a default one
        self.api = api or API()
    ...
The SListener class inherits from the general StreamListener class included with Tweepy (version 3.x). It opens a new timestamped file in which to store tweets and takes an optional API argument.
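The rest of the listener is elided above. As a rough illustration of the idea only, the sketch below uses a plain stand-in class (hypothetical, not part of Tweepy and not the author's code) that opens a timestamped file and appends each incoming tweet as one line of JSON text. The on_data name mirrors the callback name used by Tweepy listeners, but nothing here talks to Twitter.

```python
import json
import os
import tempfile
import time


class FileTweetWriter:
    """Hypothetical stand-in for a listener: stores raw tweets in a file."""

    def __init__(self, directory="."):
        filename = "tweets_%s.json" % time.strftime("%Y%m%d-%H%M%S")
        self.path = os.path.join(directory, filename)
        self.output = open(self.path, "w")

    def on_data(self, raw_data):
        # Write one tweet per line so the file can be read back line by line
        self.output.write(raw_data.strip() + "\n")
        return True

    def close(self):
        self.output.close()


# Feed the writer two made-up tweets instead of a live stream
writer = FileTweetWriter(tempfile.mkdtemp())
writer.on_data(json.dumps({"id": 1, "text": "hello #python"}))
writer.on_data(json.dumps({"id": 2, "text": "hello #rstats"}))
writer.close()

# Read the file back, one JSON document per line
with open(writer.path) as f:
    stored = [json.loads(line) for line in f]
print(len(stored), "tweets stored in", os.path.basename(writer.path))
```

Storing one tweet per line is an assumption made here for convenience; it lets the file be parsed line by line later.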
Tweepy authentication:
from tweepy import OAuthHandler
from tweepy import API
auth = OAuthHandler(api_key, api_secrets)
auth.set_access_token(access_token, access_secret)
api = API(auth)
OAuth, the authentication protocol that the Twitter API uses, requires four tokens, which we obtain from the Twitter developer site: the consumer key and consumer secret (api_key and api_secrets in the code), and the access token and access token secret. We pass the OAuthHandler our consumer key and consumer secret, then set the access token and the access token secret. Finally, we pass the auth object to the Tweepy API object.
More information on authentication is available in the Tweepy documentation.
Collecting data with Tweepy:
We use the sample endpoint to collect a random sample of all Twitter data.
a. First, instantiate the SListener object
b. Then instantiate the Stream object
c. Call the sample method to begin collecting data
from tweepy import Stream
listen = SListener(api)
stream = Stream(auth, listen)
stream.sample()
Collecting data on keywords
Now that we’ve set up the authentication, we can begin to collect Twitter data. Through the Streaming API, we collect real-time Twitter data either as a random sample or filtered by keyword.
In our example, we use the filter endpoint to collect any tweet mentioning #rstats or #python in the tweet text, username, or user description.
from tweepy import Stream
# Set up words to track
keywords_to_track = ['#rstats', '#python']
# Instantiate the SListener object
listen = SListener(api)
# Instantiate the Stream object
stream = Stream(auth, listen)
# Begin collecting data
stream.filter(track = keywords_to_track)
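After a filtered stream has run, it can be useful to check locally which of the tracked keywords each collected tweet mentions. The sketch below is a simple case-insensitive substring check on the tweet text only; it does not reproduce Twitter's own matching rules, and the tweets are made up for illustration.

```python
keywords_to_track = ['#rstats', '#python']


def matched_keywords(tweet, keywords):
    """Return the keywords that appear in the tweet text (case-insensitive)."""
    text = tweet.get('text', '').lower()
    return [kw for kw in keywords if kw.lower() in text]


# Made-up tweets standing in for collected data
collected = [
    {'text': 'Plotting with #Python today'},
    {'text': 'Comparing #rstats and #python data frames'},
    {'text': 'No hashtags here'},
]

for tweet in collected:
    print(matched_keywords(tweet, keywords_to_track))
```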
Understanding Twitter JSON
After collecting data with the filter we defined, we look at the structure of this data: JSON objects. JSON is a data format that is both human-readable and easily transferred between machines. It is structured much like Python objects, as a combination of dictionaries and lists.
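To illustrate how JSON maps onto Python dictionaries and lists, here is a small sketch using the standard json module. The JSON string is invented for the example and is not real Twitter output.

```python
import json

# A small made-up JSON document, not real Twitter output
raw = '{"text": "hello", "hashtags": ["python", "rstats"], "user": {"name": "Ada"}}'

parsed = json.loads(raw)

print(type(parsed).__name__)              # JSON object -> dict
print(type(parsed["hashtags"]).__name__)  # JSON array  -> list
print(parsed["user"]["name"])             # nested object -> nested dict

# json.dumps goes the other way, from Python objects back to JSON text
round_trip = json.loads(json.dumps(parsed))
```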
Contents of Twitter JSON:
Before we can analyze the collected data any further, it’s important to understand the JSON.
A single original tweet (one that is not a retweet or quoted tweet) carries a lot of data:
- foundational information such as the text, when it was created, and the unique tweet ID.
- information such as how many retweets or favorites it has at the time of collection, what language it’s in, and whether it’s a reply and, if so, to which tweet and to which user.
- child JSON objects, which behave like dictionaries stored inside other dictionaries.
‘user’ is one of them and contains all the useful information about the user who tweeted, including their name, their Twitter handle, their Twitter bio, their location, and whether they’re verified.
‘place’ contains information on the geolocation of the tweet.
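To make this layout concrete, here is a hand-written, heavily trimmed dictionary that mimics the shape of a tweet. The field names follow the tweet JSON described above, but every value is invented, and a real tweet contains many more fields.

```python
# Hand-written example; every value is made up for illustration
example_tweet = {
    'created_at': 'Mon Jan 01 12:00:00 +0000 2018',
    'id': 1,
    'text': 'Learning to analyze tweets with #python',
    'lang': 'en',
    'retweet_count': 0,
    'favorite_count': 0,
    'in_reply_to_status_id': None,   # None because this is not a reply
    'in_reply_to_user_id': None,
    'user': {
        'name': 'Example User',
        'screen_name': 'example_user',
        'description': 'A made-up bio',
        'location': 'Nowhere',
        'verified': False,
    },
    'place': None,                   # assumed here: no geolocation attached
}

# Separate the simple fields from the child objects (nested dictionaries)
simple_fields = [k for k, v in example_tweet.items() if not isinstance(v, dict)]
child_objects = [k for k, v in example_tweet.items() if isinstance(v, dict)]
print('simple fields:', simple_fields)
print('child objects:', child_objects)
```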
An example result from the Streaming API (from GitHub)
Accessing JSON:
import json
# Read the raw JSON text from the file
with open('tweet-example.json', 'r') as f:
    tweet_json = f.read()
# Convert the JSON text to a Python dictionary
tweet = json.loads(tweet_json)
tweet['text']
1. Use the open() and read() methods to load the JSON file into a string
2. Use the json package and its loads method to convert the JSON string to a Python dictionary
3. Use the appropriate key to access the value of interest in this dictionary
4. Child Twitter JSON can be accessed as nested dictionaries, e.g. ‘user’:
tweet['user']['screen_name']
tweet['user']['name']
tweet['user']['created_at']
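Nested lookups like these raise an error when a key is absent or a child object is empty, which can happen with fields such as ‘place’. One possible way to guard against that is a small helper; get_nested is a hypothetical name introduced here, and the tweet is made up.

```python
def get_nested(data, *keys, default=None):
    """Follow keys through nested dictionaries; return default if a step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up tweet with a user but no place information
sample_tweet = {'text': 'hello', 'user': {'screen_name': 'example_user'}, 'place': None}

print(get_nested(sample_tweet, 'user', 'screen_name'))
print(get_nested(sample_tweet, 'place', 'full_name', default='unknown'))
print(get_nested(sample_tweet, 'user', 'location', default='unknown'))
```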
Putting it all together with an example:
# Import the json package
import json
# Convert from JSON to Python object
tweet = json.loads(tweet_json)
# Print tweet text
print(tweet['text'])
# Print tweet id
print(tweet['id'])
Accessing user info:
In our example, the user info can be accessed as follows:
# Print user handle
print(tweet['user']['screen_name'])
# Print user follower count
print(tweet['user']['followers_count'])
# Print user location
print(tweet['user']['location'])
# Print user description
print(tweet['user']['description'])
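Accessing retweet data:
A retweet stores the original tweet under the ‘retweeted_status’ key. Here rt is the dictionary for a retweet, loaded in the same way as tweet above.
# Print the text of the tweet which has been retweeted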
print(rt['retweeted_status']['text'])
# Print the retweet count of the tweet
print(rt['retweeted_status']['retweet_count'])
# Print the user handle of the tweet which has been retweeted
print(rt['retweeted_status']['user']['screen_name'])
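Not every tweet is a retweet, so the ‘retweeted_status’ key is not always present. A minimal sketch of one way to handle both cases follows; original_text is a hypothetical helper name, and both tweets are made up.

```python
def original_text(tweet):
    """Return the retweeted tweet's text for a retweet, else the tweet's own text."""
    if 'retweeted_status' in tweet:
        return tweet['retweeted_status']['text']
    return tweet['text']


# Made-up examples: one original tweet and one retweet of it
original = {'text': 'Tweepy makes streaming easy', 'user': {'screen_name': 'author'}}
rt_example = {
    'text': 'RT @author: Tweepy makes streaming easy',
    'user': {'screen_name': 'fan'},
    'retweeted_status': original,
}

print(original_text(original))
print(original_text(rt_example))
```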
In the next post of the series, we shall see how to process Twitter text.