Twitter text analysis - basic practice (1)

Emily Chen

4 min read · Dec 19, 2019

Setting the Twitter environment (Computer > Advanced settings > Environment Variables > Add…)

Before importing the nltk Twitter package, we should set up the environment. Since I am using Windows 10, I first add a variable to the user environment variables. The variable must be named TWITTER (this is what NLTK's credsfromfile() looks for), and its value should be the path to the folder where you save the credentials.txt file.
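To confirm that Python can actually see the variable, a quick sanity check:

import os
print(os.environ.get('TWITTER'))  # should print the folder that contains credentials.txt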

We create the credentials.txt file ourselves. First register an account on the Twitter developer website and create an app; you will then receive keys and tokens on the app's page.

Twitter app keys and tokens

Open up any plain-text editor (I'm using Notepad). Copy the keys and tokens from the app page in the order shown below, then save the file as credentials.txt in the folder your TWITTER variable points to.

app_key = API key
app_secret = API secret key
oauth_token = Access token
oauth_token_secret = Access token secret

After setting up the environment, we open a Jupyter notebook to start practicing how to access tweets from Twitter and do further analysis.

from nltk.twitter import Twitter
tw = Twitter()
tw.tweets(keywords='happy, angry', limit=10)

The Twitter class is a means of interacting with the Twitter data stream. Here we look for tweets containing either 'happy' or 'angry' and print only the first 10 tweets that meet the constraint.

test 1 result
tw = Twitter()
tw.tweets(follow=['71026122', '167421802'], limit=10)

We filter the live public stream for McDonald's and Burger King by following their numeric user IDs. One way to convert a company's screen name to a user ID is to enter the name on the TweeterID website, which returns the numeric ID immediately.

TweeterID converter
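If you prefer to stay in Python, the lookup can also be done through the API. A minimal sketch, assuming your credentials are set up as above; show_user() comes from Twython, the library NLTK's Query class builds on:

from nltk.twitter import Query, credsfromfile
oauth = credsfromfile()
client = Query(**oauth)
# Query subclasses Twython, so Twython's REST helpers are available
info = client.show_user(screen_name='McDonalds')
print(info['id_str'])  # numeric user ID as a string, e.g. '71026122'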
test 2 result
from nltk.twitter import Query, Streamer, Twitter, TweetViewer, TweetWriter, credsfromfile
oauth = credsfromfile()
userids = ['71026122', '167421802']
client = Query(**oauth)
# Look up profile information for both user IDs
user_info = client.user_info_from_id(userids)
for info in user_info:
    name = info['screen_name']
    followers = info['followers_count']
    following = info['friends_count']
    print("{}, followers: {}, following: {}".format(name, followers, following))

The Search API lets us query for past tweets. If we want to retrieve the screen names and other profile information for McDonald's and Burger King, we can simply reuse the user IDs above.

The oauth = credsfromfile() call looks for the credentials.txt file we set up previously, using the TWITTER environment variable to locate it, reads its contents, and returns the result as a dictionary. To initialize the client, we pass this dictionary as keyword arguments.
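Conceptually, it does something like the following. This is a simplified sketch for illustration, not NLTK's actual implementation:

import os

def creds_from_file_sketch(filename='credentials.txt'):
    # Locate the folder through the TWITTER environment variable
    path = os.path.join(os.environ['TWITTER'], filename)
    creds = {}
    with open(path) as f:
        for line in f:
            if '=' in line:
                key, _, value = line.partition('=')
                creds[key.strip()] = value.strip()
    return creds  # e.g. {'app_key': '...', 'app_secret': '...', ...}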

test 3 result
client = Streamer(**oauth)
client.register(TweetViewer(limit=10))
client.statuses.filter(follow=userids)

The Streaming API lets us access near real-time Twitter data. The user IDs can be passed directly to the Streaming API client.
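The same client can also track keywords instead of following specific accounts. A small variation (the keywords and limit are just examples):

client = Streamer(**oauth)
client.register(TweetViewer(limit=10))
# Track tweets mentioning keywords rather than following user IDs
client.statuses.filter(track='mcdonalds, burgerking')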

test 4 result
client = Query(**oauth)
client.register(TweetWriter())
client.user_tweets('mcdonalds', 10)

The above code demonstrates how to store the data Twitter sends: registering a TweetWriter instead of a TweetViewer writes the 10 most recent tweets from the mcdonalds timeline to a timestamped JSON file. Note that user_tweets() goes through the Query client (the REST API), not the Streaming API.
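TweetWriter works with the Streaming API too. A small sketch that stores live tweets from the two accounts instead of displaying them (the limit of 100 is just an example):

client = Streamer(**oauth)
client.register(TweetWriter(limit=100))  # write streamed tweets to a JSON file
client.statuses.filter(follow=userids)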

test 5 result (copy the file path from this result; we'll use it later)
from nltk.corpus import twitter_samples
input_file = twitter_samples.abspath("C:/Users/user/twitter-files/tweets.20191218-215745.json")

If we want to work directly with the file and run other analyses on the tweets, we can use the code above rather than the corpus reader. abspath() gives us the full path name of the relevant file. The path is the one we copied previously, with every "\" changed to "/".
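Since each line of the file is one tweet serialized as JSON, we can also peek at it with the standard library. A minimal sketch (the file name comes from my run; yours will differ):

import json

with open(input_file, encoding="utf-8") as f:
    tweets_json = [json.loads(line) for line in f if line.strip()]

print(len(tweets_json))        # number of tweets collected
print(tweets_json[0]['text'])  # text of the first tweet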

from nltk.twitter.common import json2csv
from nltk.twitter.common import json2csv_entities
from nltk.corpus import twitter_samples
from nltk.twitter import Query, Streamer, Twitter, TweetViewer, TweetWriter, credsfromfile
with open(input_file) as fp:
    json2csv(fp, 'tweets_mctext.csv', ['text'])

Since we just want the text of each tweet here, we pass ['text'] as the third parameter. json2csv() takes a file-like object containing line-delimited JSON objects and writes out a CSV file. The second parameter is the name we give the output file (tweets_mctext.csv).

with open(input_file) as fp:
    json2csv(fp, 'tweets.20191218-215745.mctweet.csv',
             ['created_at', 'favorite_count', 'id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweet_count', 'retweeted', 'text', 'truncated', 'user.id'])

However, if we also want the tweet metadata, we can modify the third argument to list the relevant metadata fields. We save this data into a new file called 'tweets.20191218-215745.mctweet.csv'.

for line in open('tweets.20191218-215745.mctweet.csv', encoding="utf-8").readlines()[:5]:
    print(line)

If we want to check what the stored data looks like, we can print the first five lines of the CSV file with the code above.

test 6 result

Now that we have the CSV file, we want to convert it into a data frame; the pandas library offers a convenient way to do so.

import pandas as pd
tweets = pd.read_csv('tweets.20191218-215745.mctweet.csv', index_col=2, header=0, encoding="utf8")
tweets.head(10)
test 7 result
test 8 result

Finally, we retrieve only the 'text' column from the data frame, and we can now start the text analysis process: Twitter text analysis - basic practice (2)
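A quick sketch of that last step (tweets is the data frame created above; 'text' is one of the CSV columns we wrote):

texts = tweets['text']  # a pandas Series holding just the tweet text
print(texts.head())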

References:

  1. NLTK Twitter HOWTO (https://www.nltk.org/howto/twitter.html)
  2. TweeterID (https://tweeterid.com)
