Thoughts on #VisionZero: first steps with the Twitter API and Word2Vec for text analysis

A Gordon
Published in DataExplorations
Oct 28, 2018

With shortening days and the upcoming end of Daylight Saving Time here in Ontario, pedestrian safety is top of mind for many people. So this week I decided to analyse recent Tweets about #VisionZero and experiment with some basic Natural Language Processing (NLP) techniques.

Use Twython to get Twitter Data

I decided to use the Twython wrapper to simplify calling Twitter’s API.

# pip install twython
from twython import Twython

The first step is to instantiate a Twython object and pass in your Twitter credentials (fair warning: it can take quite a while to get approved to use the Twitter API!). I stored my credentials in a dictionary named creds.

# Instantiate an object
python_tweets = Twython(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'])

Next I defined my base search query term (#VisionZero), specified how many times I wanted to go back to get more search results, and set up the base dictionary that I'll use to store the results.

base_query_term = "#VisionZero" # term we are searching for
max_iters = 50 # controls how many times we go back to get additional pages
dict_ = {'id': [], 'date': [], 'text': [], 'favorite_count': [], 'retweet_count': [], 'location': []}

A few notes on using Twitter Search:

  • Twitter search only has about a week’s worth of tweets available for searching, so it’s not going to retrieve any tweets that are older than that.
  • You can also only retrieve up to 100 tweets in a given call (controlled by the parameter count: 100). I nested my query in a loop to retrieve 100 Tweets at a time: on each loop, I pass in the id of the last retrieved Tweet (max_id) so Twitter will return the next 100 older Tweets
  • In order for the looping scenario above to work, the Tweets need to be ordered from newest to oldest, which can be accomplished by setting result_type: 'recent'
  • Note: the above looping approach is a bit of a hack. Alternatively, there is a Twython cursor that you can use to loop through all the possible results (i.e. results = python_tweets.cursor(python_tweets.search, **query); a minimal sketch appears after this list), but it was giving me intermittent errors when reading certain Tweets, so I went back to the regular search, which seems more consistent
  • Twitter recently increased the tweet length limit from 140 to 280 characters, and I noticed that longer Tweets were getting cut off in status['text']. The solution is to set 'tweet_mode': 'extended' and retrieve the status['full_text'] value
  • We can exclude Retweets by filtering for the prefix (RT @)
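For reference, here is roughly what the cursor alternative mentioned in the list above looks like (a minimal sketch based on Twython's documented cursor usage; I didn't end up using it because of the intermittent errors):

# sketch of the Twython cursor approach (not used in the final code)
# the cursor pages through results for you and yields one status dict at a time
results = python_tweets.cursor(python_tweets.search, q=base_query_term,
                               result_type='recent', count=100, lang='en')
for status in results:
    print(status['id'], status.get('full_text', status.get('text', '')))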
max_id = ""for call in range(0,max_iters):query = {'q': base_query_term,  
'result_type': 'recent',
'count': 100,
'lang': 'en',
'max_id': max_id, # what tweet id to start retrieving from
'tweet_mode':'extended',
'include_entities': False
}
for status in python_tweets.search(**query)['statuses']:
if 'RT @' not in (status['text']):
dict_['id'].append(status['id'])
# dict_['user'].append(status['user']['screen_name'])
dict_['date'].append(status['created_at'])
dict_['text'].append(status['full_text']) # > 128 chars
dict_['favorite_count'].append(status['favorite_count'])
dict_['retweet_count'].append(status['retweet_count'])
dict_['location'].append(status['user']['location'])
max_id = status['id'] # store last tweet accessed

We can then stuff the retrieved Tweets into a data frame for further analysis

import pandas as pd

df = pd.DataFrame(dict_)
df.sort_values(by='id', inplace=True, ascending=False)
Dataframe with retrieved Tweets

Tokenize the Tweets

Now that we have a collection of Tweets, our next step will be to clean up the text. First we’ll import the necessary libraries:

import nltk
nltk.download('stopwords') # run once
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

from nltk.tokenize import TweetTokenizer
import string
import re

Next we can define a function to clean up the tweets, which

  • uses the NLTK TweetTokenizer to convert each Tweet into a list of individual words (setting strip_handles to True removes mentions of other Twitter users, i.e. “@user”, from the results)
  • removes stop words (relatively meaningless words such as “a” and “the”), non-tweet-specific punctuation (leaving the # symbol) and URLs
  • converts everything to lowercase
  • lemmatizes the remaining words (lemmatization, a close relative of stemming, reduces words to their root form so that, for example, “cat” and “cats” are counted as the same word)
def tweet_preprocessor(line, ret_type='string'):
    # use the Tweet Tokenizer to get a list of words
    tknzr = TweetTokenizer(preserve_case=False, strip_handles=True)
    word_list = tknzr.tokenize(line.strip().lower())

    # remove stopwords, urls and leftover punctuation
    stops = set(stopwords.words('english'))
    # define a few extra stop words to include
    extra_stops = ['‘', '’', 'rt', '…', '—', '.', ',', '-', '(', ')', '&', '?', '!', '$', '<', '>', '/', '*']
    meaningful_words = [w for w in word_list
                        if w not in stops and not w.startswith('http') and w not in extra_stops]

    # stem/lemmatize (look for root words)
    meaningful_words = [lemmatizer.lemmatize(i) for i in meaningful_words]

    # join the words back into one string separated by spaces, or return the list
    if ret_type == 'string':
        return " ".join(meaningful_words)
    else:
        return meaningful_words

This function takes the original tweet, like this:

‘Our @NYPDauxiliary Unit assisting with crossing pedestrians on a busy Friday Night at Atlantic Av and Logan St. #VisionZero https://t.co/rcZcofWIlf’

and returns a “cleaned” version like this that contains only the meaningful words:

unit assisting crossing pedestrian busy friday night atlantic av logan st #visionzero

Now we’ll use a list comprehension to loop through all our retrieved Tweets and clean them up:

cleaned_tweets = [tweet_preprocessor(row) for row in df['text']]

Vectorizing the Tweets

So now that we have a fairly clean and simplified version of our Tweets, we might want to Vectorize the results in order to extract the most commonly used words or to feed the Tweets to a clustering algorithm.

With basic count vectorization, we create a very sparse matrix where the columns are all the individual words appearing anywhere in our collection of Tweets (the “corpus”) and the rows are the individual tweets. So if our first tweet only contained two words (“pedestrian” and “safety”), row one would have zeros in all columns except the pedestrian and safety columns. This is called a “bag of words” and is only concerned with which words appear in the text (and how often), ignoring things like position, emphasis, weighting etc.
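As a quick illustration (a toy example I've made up, not part of the #VisionZero data), here's what that matrix looks like for two tiny “tweets”:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

toy_tweets = ["pedestrian safety", "pedestrian crossing busy street"]
toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_tweets)  # sparse matrix: rows = tweets, columns = words
print(pd.DataFrame(toy_counts.toarray(), columns=toy_vec.get_feature_names()))
#    busy  crossing  pedestrian  safety  street
# 0     0         0           1       1       0
# 1     1         1           1       0       1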

For the purposes of this analysis, I just used the basic CountVectorizer from scikit-learn.

from sklearn.feature_extraction.text import CountVectorizer

The token_pattern below tells it to preserve the hashtags (#) in the tweets (credit to the examples page on ProgramCreek for the regex).

#remove mentions but keep hashtags with their sign
token_pattern = r'(?u)(?<![@])#?\b\w\w+\b'
cvec = CountVectorizer(stop_words='english',token_pattern=token_pattern)

We can then fit and transform our set of cleaned tweets

cvec.fit(cleaned_tweets)
X_train_counts=cvec.transform(cleaned_tweets)

The transform call returns a sparse matrix with the columns representing words and the rows representing tweets.
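A quick sanity check on the shape confirms this: one row per cleaned tweet, one column per word in the learned vocabulary.

print(X_train_counts.shape)  # (number of tweets, number of distinct words in the corpus)
print(X_train_counts.nnz)    # non-zero entries only; everything else is implicitly zero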

The sparse matrix can be stuffed into a dataframe for easier analysis

all_tweets_df = pd.DataFrame(X_train_counts.toarray(), columns=cvec.get_feature_names())
our “Bag of Words”

We can use this dataframe to find the top 10 most frequently occurring words in our tweet collection

# skip the first entry (likely the #visionzero search term itself) and keep the next 99 words
top_words = all_tweets_df.sum().sort_values(ascending=False).iloc[1:100]
top_words = top_words.to_frame()
top_words = top_words.reset_index().rename(columns={'index': 'word', 0: 'freq'})
top_words.head(10)

Create a WordCloud Diagram

The above tabular representation of the most commonly used words is ok, but perhaps it would be nicer to turn this into a WordCloud. To do that, we’ll import the WordCloud module

#pip install wordcloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

WordCloud accepts a dictionary where the key is the word and the value is the number of times it appears. So we’ll convert our dataframe with the top 100 words to a dictionary

freq_dict = top_words.set_index('word').to_dict()['freq']

And create our WordCloud

import matplotlib.pyplot as plt

wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate_from_frequencies(freq_dict)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
WordCloud showing the Top 100 most commonly used words in Tweets containing #VisionZero

Use Word2Vec to analyze word similarity

Word2Vec is a way to determine how similar words are to each other. Very simplistically, if you often find word A together with word C and word B together with word C, then perhaps words A and B are similar. Word2Vec is a neural network that embeds words in a vector space, which lets you do math on words, e.g. brother - boy + girl = sister. It works best on a very large dataset, so it can learn all the subtleties of word interactions, but I wanted to experiment with training it on this Tweet collection.
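Once a model is trained (we train one on the tweets below, though this sort of analogy really needs a model built on a large general-purpose corpus), the brother - boy + girl arithmetic maps onto gensim's API like this:

# brother - boy + girl: positive words are added, negative words are subtracted
# (only meaningful for a model trained on a large, general corpus)
model.wv.most_similar(positive=['brother', 'girl'], negative=['boy'], topn=1)
# on a well-trained model this should return something like [('sister', ...)]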

I’m using the Gensim version of Word2Vec

import gensim
from gensim.models.word2vec import Word2Vec

Word2Vec requires that the sentences be structured as a list of lists where we end up with a list of each tweet containing the words in that tweet. For example:

  • sentences = [[‘this’, ‘is’, ‘the’, ‘first’, ‘sentence’, ‘for’, ‘word2vec’], [‘this’, ‘is’, ‘the’, ‘second’, ‘sentence’], [‘yet’, ‘another’, ‘sentence’], [‘one’, ‘more’, ‘sentence’], [‘and’, ‘the’, ‘final’, ‘sentence’]]

So let's convert our cleaned tweets into lists of tokens

list_tweets = []
for tweet in cleaned_tweets:
    list_tweets.append(tweet_preprocessor(tweet, 'list'))

Now we can train Word2Vec

model = Word2Vec(list_tweets, min_count=0, workers=2)  # min_count=0 keeps even rarely used words
model.train(list_tweets, total_examples=len(list_tweets), epochs=10)
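Before querying, it's worth checking that a word actually made it into the model's vocabulary, since asking about an unseen word raises a KeyError. A minimal check (the vocab attribute below applies to gensim 3.x, which was current when this was written):

print(len(model.wv.vocab))             # how many distinct words the model learned
print('pedestrian' in model.wv.vocab)  # True if the word can be queried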

Once our model is trained, we can query various words to find what Word2Vec has learned are the most similar words

model.most_similar('accident')
model.similar_by_word('pedestrian')

Intriguingly you can also look for similarities for groups of words:

model.wv.most_similar(positive=['traffic', 'accident', 'pedestrian'], topn=1)

Word2Vec has done an OK job on our very limited set of Tweets, so it would be interesting to train it on a very large volume of text.

In theory, you should also be able to use Word2Vec to find words that don’t belong, but it hasn’t quite got the hang of it for this small set of text:

model.wv.doesnt_match("bike trail orange cycle".split())
>> bike

Sentiment Analysis using TextBlob

As a final step, I wanted to run some simple Sentiment Analysis against the gathered tweets. I used TextBlob for this (note that you'll need to install the package and download its corpora from the command line, outside of your Jupyter notebook).

# pip install textblob
# download corpora: python -m textblob.download_corpora
from textblob import TextBlob

I defined a function to analyse a single Tweet

def get_sentiment(tweet):
    # create a TextBlob object from the passed tweet text
    analysis = TextBlob(tweet)
    # classify sentiment based on polarity
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'
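A quick spot check on a couple of made-up examples (not from the dataset) shows how the function behaves:

get_sentiment('great improvements for pedestrian safety')    # 'positive'
get_sentiment('another dangerous intersection for walkers')  # most likely 'negative'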

Then I ran it for all my tweets and stuffed the results into a dataframe

tweet_sentiments = [get_sentiment(tweet) for tweet in cleaned_tweets]
sentiment_dict = {'tweet': cleaned_tweets,
                  'sentiment': tweet_sentiments}
tweet_sentiments_df = pd.DataFrame(sentiment_dict)

Looking at a sample of the labelled tweets, the results are a bit mixed, but our data probably needs some more cleaning for this type of analysis to work well.

According to this analysis, 52.4% of the Tweets about #VisionZero were positive, 27% were neutral and 20.5% were negative.
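Those percentages come straight from the sentiment column; something along these lines reproduces the breakdown:

# share of positive / neutral / negative tweets, as percentages
print(tweet_sentiments_df['sentiment'].value_counts(normalize=True) * 100)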

Final Thoughts

The types of analyses done in this post would benefit from having a much larger dataset than a single week’s worth of Tweets. But it was still illuminating to try out the techniques on some real world data.

The code used in this post can be found on GitHub.
