Using Tweepy to Retrieve Elon Musk’s Tweets and Analysis

Published in Web Mining [IS688, Spring 2021]

9 min read · Feb 4, 2021

According to CNN Business, Tesla’s stock price rose by more than 5% in early trading on January 7th, approaching a record high of $800. Bloomberg reported that Musk’s net worth rose to $188.5 billion. That was enough to overtake Bezos, who had held the position of the world’s richest man since 2017.

The 49-year-old Musk’s net worth rocketed in 2020. At the beginning of 2020, it stood at 27 billion U.S. dollars, which ranked him only within the top 50 in the world. However, as the company’s profit performance kept improving, Tesla’s stock price surged 743% last year, giving Musk the fastest growth in wealth in history.

So how did Elon Musk perform on social media last year, what does he post every day, and what is he concerned about?

Tesla stock price overview for the past year (Feb 2020 to Feb 2021)

This article uses Tweepy to analyze Elon Musk’s Twitter account. I used a Twitter developer account and Tweepy to download his tweets from 2020-03-20 18:06:45 to 2021-02-02 08:45:48. The dataset contains 3249 tweets and includes the following fields, which we may need: tweet.id_str, tweet.favorite_count, tweet.retweet_count, tweet.created_at, and tweet.text.encode("utf-8").

In this dataset, favorite_count and retweet_count record how many users liked each of the 3249 tweets and how many times each was retweeted. The table below shows how influential Elon Musk’s tweets are.

From the table, we can see that each tweet he sends receives on average 2869.4 retweets and 28329.5 likes. This suggests he is an opinion leader in electric vehicles and clean energy. That makes me curious about what else, beyond these well-known areas, he cares about on social media.

Then I retrieved the list of accounts he follows to see what and whom Elon Musk cares about. Right now he follows 103 accounts, including ‘@PyTorch’, ‘@NASA’, ‘@TeslaRoadTrip’, ‘@The New York Review of Books’, and ‘@Khan Academy’. I used pandas in this analysis. pandas is a Python library for data manipulation and analysis.

import pandas as pd

df = pd.read_csv('elonmusk_friends.csv')

The file contains the following fields:

The general statistics are as follows:

Statistical features of the accounts Elon Musk follows
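As a rough sketch of how the fields and summary statistics can be listed with pandas, the example below reads a few made-up rows from an in-memory string instead of elonmusk_friends.csv. The column names (screen_name, followers_count, friends_count) are assumptions for illustration; the real file may use different ones.

```python
import io

import pandas as pd

# Hypothetical rows standing in for elonmusk_friends.csv; the real columns may differ.
csv_text = """screen_name,followers_count,friends_count
account_a,1000,50
account_b,250000,300
account_c,42,10
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())  # the fields available in the file
print(df.describe())        # count, mean, std, min, quartiles, max per numeric column
```

Calling describe() only summarizes numeric columns by default, so text fields such as the account name are left out of the statistics.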

Before we do the analysis, we need to understand how to get somebody’s tweet information. To retrieve account information from Twitter, we need a Twitter developer account (https://developer.twitter.com). After signing up, we get the consumer_key, consumer_secret, access_key, and access_secret. Then we use the Twitter API and the Tweepy package in Python to get the tweets we want.

import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

At first, it only allowed me to retrieve the most recent 200 tweets from his account, because 200 is the maximum count allowed for one account in a single request.

new_tweets = api.user_timeline(screen_name=screen_name, count=200)

For count=200, no matter how large a number above 200 I enter, the call only returns the most recent 200 tweets. Since he sends lots of tweets daily, that is not enough for the analysis. To get past the limit, I repeat the request for older tweets and add each batch with alltweets.extend(new_tweets), which collects up to around 3200 tweets.

Using alltweets.extend to retrieve the most recent 3200 tweets of each account
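The loop behind this step might look like the sketch below. It assumes the common pattern of passing max_id to user_timeline so that each request returns only tweets older than the ones already collected. The function name fetch_all_tweets is hypothetical, and api is expected to be an authenticated tweepy.API object.

```python
def fetch_all_tweets(api, screen_name, batch_size=200):
    """Collect a user's timeline in batches, walking backwards with max_id."""
    alltweets = []
    # First request: the most recent tweets (the API caps count at 200).
    new_tweets = api.user_timeline(screen_name=screen_name, count=batch_size)
    while len(new_tweets) > 0:
        alltweets.extend(new_tweets)
        # Ask only for tweets older than the oldest one collected so far.
        oldest = alltweets[-1].id - 1
        new_tweets = api.user_timeline(
            screen_name=screen_name, count=batch_size, max_id=oldest
        )
    return alltweets
```

The loop stops when a request comes back empty, which for a real account happens once the API's limit of roughly 3200 recent tweets is reached.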

Also, the tweets initially come in JSON format. What is JSON? JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and array data types. Nested data like this is hard to use and analyze directly, so we need to reformat the dataset.

I use the following code to write the tweets into a CSV file, which is easy to read and write.

import csv
import json
import requests

# append the collected rows to <screen_name>_elonmusk.csv
with open('%s_elonmusk.csv' % screen_name, 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["id", "favorite_count", "retweet_count", "created_at", "text"])
    writer.writerows(outtweets)

The ‘requests’ Python module takes care of both retrieving JSON data and decoding it, due to its built-in JSON decoder.
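A minimal sketch of how the rows in outtweets could be built and written is shown below. The helper names tweets_to_rows and write_tweets are hypothetical, and the sample object only imitates a Tweepy status with made-up values. One deliberate difference from the field list earlier: the sketch keeps the text as a str. In Python 3, passing tweet.text.encode("utf-8") to csv.writer writes the bytes representation, so each cell starts with b'.

```python
import csv
import io
from types import SimpleNamespace


def tweets_to_rows(tweets):
    """Turn tweet objects into flat rows matching the CSV header."""
    return [
        [t.id_str, t.favorite_count, t.retweet_count, t.created_at, t.text]
        for t in tweets
    ]


def write_tweets(f, rows):
    """Write a header line followed by one row per tweet to an open file."""
    writer = csv.writer(f)
    writer.writerow(["id", "favorite_count", "retweet_count", "created_at", "text"])
    writer.writerows(rows)


# Hypothetical stand-in for a Tweepy status object, with made-up values.
sample = [
    SimpleNamespace(
        id_str="1",
        favorite_count=10,
        retweet_count=2,
        created_at="2021-01-01 00:00:00",
        text="example tweet, with a comma",
    )
]
outtweets = tweets_to_rows(sample)

buffer = io.StringIO()  # in-memory file, so the sketch leaves nothing on disk
write_tweets(buffer, outtweets)
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a file opened with open(path, "w", newline="", encoding="utf-8").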

After storing the dataset in a .csv file, we get a clear view of it, as the picture below shows.

This dataset includes the contents of the tweets, which are the most important part for showing his attitudes and opinions over the last year, since he is very active on Twitter. Over the 319 days covered by our dataset, from 2020-03-20 18:06:45 to 2021-02-02 08:45:48, he sent 3249 tweets in total, or about ten tweets a day. He also influenced the stock market.
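The tweets-per-day figure can be checked from the two timestamps and the tweet count given above. This is only a quick arithmetic sketch; the variable names are illustrative.

```python
from datetime import datetime

first = datetime(2020, 3, 20, 18, 6, 45)  # timestamp of the oldest tweet
last = datetime(2021, 2, 2, 8, 45, 48)    # timestamp of the newest tweet
n_tweets = 3249

span_days = (last - first).total_seconds() / 86400  # elapsed time in days
tweets_per_day = n_tweets / span_days
print(f"{span_days:.1f} days, {tweets_per_day:.1f} tweets per day")
```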

GameStop stock price overview for the past month, peaking on 01/27/2021

For example, shares of GameStop, the nation’s largest video game retailer, soared 92.61 percent on Tuesday to close at $147.98. In after-hours trading on Tuesday, GameStop’s share price rose further, by as much as 67% at one point, because Tesla CEO Musk mentioned the company in a Twitter post on 01/26/2021.

Musk tweeted a link to the Reddit group ‘wallstreetbets’ with the comment: Gamestonk!!!

Right now this single tweet has 252.4K likes and 50.9K comments. It shows us that Elon Musk’s tweets move markets. The tweet appeared to help GameStop’s valuation skyrocket to more than $10 billion in after-hours trading and resulted in some amateur trading apps pausing trading. But some people stand to lose a lot of money if GameStop’s share price comes crashing down. In the end, his tweets can have devastating consequences for retail investors while he and his friends enrich themselves at the expense of the little guy.

Musk has faced problems with the SEC for tweeting about Tesla’s stock. In August 2018, he said he wanted to take Tesla private at $420 per share and that he had secured the funding to do so. Musk and Tesla each had to pay the SEC a $20 million fine to settle the suit, and Musk has since agreed to submit his public statements about Tesla’s finances and other topics to vetting by its legal counsel. He infamously tweeted last year that Tesla’s stock was “too high,” sending shares down more than 10% immediately, though they more than rebounded within a week. While Musk’s Twitter actions have had a particularly pronounced effect this week, he has been shifting stocks and cryptocurrencies for a while now. Earlier this month, Musk urged his 48.3 million followers to use the encrypted messaging app Signal, which is operated by a nonprofit.

So I decided to analyze the contents of his tweets: the unique words, their distribution, and the topics found by topic modeling.

There are 8308 unique tokens in this dataset. Counting the most frequent words gives: Counter({‘to’: 630, ‘the’: 624, ‘is’: 573, ‘a’: 497, ‘of’: 456, ‘&’: 348, ‘in’: 311, ‘for’: 269, ‘be’: 228, ‘will’: 211, ‘on’: 200, “b’RT”: 198, ‘with’: 170, ‘I’: 169, ‘but’: 166, ‘@SpaceX’: 164, ‘that’: 161, ‘are’: 158}). Most of these words carry little meaning on their own, so I use a stop word list to remove them.
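A count like this can be produced with collections.Counter. The sketch below assumes a plain whitespace split, which would be consistent with tokens such as ‘@SpaceX:’ and ‘@SpaceX’ being counted separately. The sample texts are made up to show the mechanics, and count_tokens is a hypothetical helper name.

```python
from collections import Counter


def count_tokens(texts):
    """Split each text on whitespace and count every token."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


# Made-up sample texts, only to show the mechanics.
sample_texts = [
    "RT @SpaceX: Falcon 9 launch is go",
    "Starship is the future of @SpaceX",
]
counts = count_tokens(sample_texts)
print(len(counts))            # number of unique tokens
print(counts.most_common(3))  # the most frequent tokens first
```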

So, what are stop words? Stop words are a set of commonly used words in a language, and every language has them, not just English. Removing stop words matters in many applications because, once the very common words of a language are gone, we can focus on the important words instead.

Stop words are often thought of as a “single set of words”, but the right set really depends on the application. For example, in some applications an appropriate stop word list removes everything from determiners (e.g. the, a, an) to prepositions (e.g. above, across, before) to some adjectives (e.g. good, nice).

After cleaning the dataset with the stop word list, the count gives me this:

{‘@SpaceX’: 164, “b’@PPathole”: 145, ‘@Tesla’: 141, “b’@flcnhvy”: 140, ‘Tesla’: 135, “b’@Erdayastronaut”: 128, ‘@SpaceX:’: 104, ‘@Erdayastronaut’: 87, ‘@flcnhvy’: 86, ‘@thirdrowtesla’: 78, ‘@PPathole’: 69, “b’@teslaownersSV”: 68, ‘@NASASpaceflight’: 61, “b’@Teslarati”: 46, ‘launch’: 43, “b’@cleantechnica”: 43, ‘Falcon’: 43, “b’@WholeMarsBlog”: 39, ‘Dragon’: 38, ‘@NASA’: 38, ‘@teslaownersSV’: 38, ‘high’: 38, ‘Model’: 37, ‘SpaceX’: 35, ‘@WholeMarsBlog’: 35, ‘@SciGuySpace’: 35, ‘Starship’: 35.}
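The filtering step can be sketched as follows. The stop word set here is a small hand-picked one built from the frequent words listed earlier; the list actually used for this analysis is not shown in the article, so treat both the set and the helper name remove_stop_words as assumptions.

```python
from collections import Counter

# A tiny hand-picked stop word set, for illustration only.
STOP_WORDS = {"to", "the", "is", "a", "of", "&", "in", "for", "be", "will",
              "on", "with", "i", "but", "that", "are"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


# Made-up sentence, only to show the mechanics.
tokens = "Tesla is the future of clean energy & I will launch Starship".split()
kept = remove_stop_words(tokens)
print(Counter(kept).most_common())
```

Comparing the lowercase form means ‘I’ and ‘The’ are removed along with ‘i’ and ‘the’, while the original casing of the remaining tokens is preserved.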

From this set of frequent words, we can say that Elon Musk still focuses on his career: not just Tesla’s electric cars but also SpaceX and its rocket projects in cooperation with NASA. He really cares about space exploration, and he mentions and retweets WholeMarsBlog’s content many times. We can also see how much he cares about clean and sustainable energy.

I also ran Latent Dirichlet Allocation (LDA) topic modeling on the contents of his tweets. LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics.
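To make the "mixture of topics" idea concrete, here is a toy sketch of the generative story that LDA assumes, written with only the standard library. It generates a document from given topics; it does not fit a model. The two topics and their word probabilities are hypothetical, not taken from the fitted model, and the function names are illustrative.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate (word, topic_index) pairs following LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    doc = []
    for _ in range(length):
        # Pick a topic for this word position according to the mixture.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word from the chosen topic's word distribution.
        words, probs = zip(*topics[z].items())
        w = rng.choices(words, weights=probs)[0]
        doc.append((w, z))
    return theta, doc


# Hypothetical toy topics (word -> probability); not taken from the fitted model.
toy_topics = [
    {"launch": 0.5, "Falcon": 0.3, "Dragon": 0.2},
    {"Tesla": 0.6, "Model": 0.3, "battery": 0.1},
]
rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)  # topic proportions for this document, summing to 1
print(doc)    # each word paired with the topic it was drawn from
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.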

Here we treat each of Elon Musk’s 3249 tweets as one document, giving 3249 documents, and remove stop words from them as we did before when counting frequent words. We can then print the top 6 topics with their top 10 keywords. The number of topics and keywords can be set as needed. In this article, we show just six topics.

Latent Dirichlet Allocation (LDA) topic modeling of tweet contents

From topic 3 and topic 6, we can see that he is concerned about Tesla’s super factories around the world, including those in Berlin, Germany, and in China. These will definitely influence productivity and the delivery time of Tesla vehicles. Other topics suggest that he is happy with and proud of his third-row electric Tesla and Falcon.

To run LDA topic modeling, we need the gensim and pyLDAvis.gensim packages.

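import gensim
import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
from gensim import corpora

NUM_TOPICS = 6
# texts: one list of stop-word-filtered tokens per tweet
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# use gensim to create the LDA model
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
# print the top 10 keywords of each topic
topics = ldamodel.show_topics(num_topics=NUM_TOPICS, num_words=10)
for topic in topics:
    print(topic)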

In this article, I wanted to know how Elon Musk’s Twitter content was distributed over the last year, as Tesla became one of the world’s most valuable companies. In conclusion, taking his tweets from last year as a whole, we can see how much he focuses on space exploration and clean energy plans, which helps explain why the company’s value is growing so fast. It is surely because the company meets environmentally friendly standards that matter for the long-term future of all human beings. We also know that Musk is developing the Starlink program, a whole set of satellites in space meant to give everyone an equal chance to use high-speed internet. It will be another revolution in the electronic communication area.

The main limitation of this analysis is the lack of sufficient data for a more comprehensive and deeper comparison. Furthermore, a comparative analysis using stock market data for Tesla and the NASDAQ would be more convincing.

Due to the Twitter API limit, I can currently access only his most recent 3200 tweets. I also did only some basic content analysis of his tweets, which may be insufficient. Adding more features to the dataset would let us extract more useful information. In addition, I could examine how his tweet content might influence Tesla’s stock price by lining up periods of stock price change with the timeline of his tweets. Next, I can do some graph analysis and use other features from the JSON file, such as the interaction features (‘in_reply_to_screen_name’, ‘in_reply_to_status_id’, ‘in_reply_to_status_id_str’, ‘in_reply_to_user_id’, ‘in_reply_to_user_id_str’), to draw his social network. This could better visualize his network connections. Finally, since I have the list of accounts he follows, I can do graph analysis with NetworkX in the future to see how he interacts with them.

References

Swayambhu Chatterjee, Shuyuan Deng, Jun Liu, Ronghua Shan & Wu Jiao, Classifying facts and opinions in Twitter messages: a deep learning-based approach, Aug 2018.
Twitter developer, https://developer.twitter.com/en
David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Jan 2003.
Kavita Ganesan, What are Stop Words? AI Implementation, Text Mining Concepts
JSON, Wikipedia, https://en.wikipedia.org/wiki/JSON
Elon Musk Twitter, https://twitter.com/elonmusk

Written by Tao Yao