Extracting and analyzing tweets related to global warming using Tweepy

Whether global warming is occurring has been at the center of public debate for many years. One side claims that the earth's temperature is rising fast and that unprecedented natural disasters lie ahead; the other claims that global warming is a hoax and that no such events will occur. Amidst this, it is interesting to see what most people are saying about global warming on social media and which side of the debate the majority takes.

Social media platforms let users share feedback on any issue. The data posted there are raw, direct opinions, which can provide invaluable insights into how people perceive a situation. Among social media platforms, Twitter holds a large mine of data: posts on the site are public and can be accessed through the Twitter API. Additionally, Twitter's hashtag culture makes it easy to collect data related to a specific topic of interest.

Utilizing the data available on Twitter, I have analyzed tweets posted with the hashtag #globalwarming. The tweets collected from Twitter are personal posts of the platform's users, and in my opinion it is unethical to disseminate any personal information of the users without their direct consent. Hence, only the high-level findings of the analysis are presented in this post.

The first step of the analysis is collecting relevant tweets through the Twitter API. The steps to access data from Twitter are as follows:

  1. Create a developer account on Twitter to access the Twitter API. After the developer account has been created, create an app. This provides the "Keys and Access Tokens" for the project, which should be saved for use in a later step.
  2. Next, install the Tweepy library in Python by running "pip install tweepy" on the command line.
  3. Once Tweepy is successfully installed, open a Python notebook and start programming to collect tweets.
  • First, import the Tweepy library into Python:
    import tweepy
  • Save the keys and tokens obtained earlier in the respective variables.
consumer_key = "*****************"
consumer_secret = "*************************"
access_token = "****************************"
access_token_secret = "************************"
  • Create an API object to access the Twitter API.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
  • Collect tweets and save them in a CSV file for later use. For simplicity, only 11,888 tweets were collected and analyzed.
import csv

file = open('globalwarming.csv', 'w', encoding='utf-8')
csvWriter = csv.writer(file)
# count is the per-request page size (the API caps it at 100); Cursor handles pagination
for tweet in tweepy.Cursor(api.search, q="#globalwarming", lang="en", count=100, since="2020-01-01").items():
    csvWriter.writerow([tweet.created_at, tweet.text])
file.close()

After collecting tweets, one can read them from the CSV file and start analyzing them. Since the raw tweet text is not very clean, it should be cleaned to get good results. The analysis proceeded as follows:

  • Firstly, the tweets saved in the CSV file were read using the pandas library. A dataframe with two columns was created: 'Time', the time when the tweet was posted, and 'Tweets', the tweet text itself.
import pandas as pd

data = pd.read_csv("globalwarming.csv", header=None, encoding='utf-8', names=['Time', 'Tweets'])
  • Observing the text, it was found that most of the tweets were not plain text: symbols and links greatly reduced their readability.
  • Therefore, the next step was to clean the text. First, the #globalwarming hashtag itself was removed from the tweets, as it provides no additional information about their content. Twitter usernames tagged in the tweets were also removed, since their contribution to the sentiment of a tweet was assumed to be insignificant given the limited scope and size of the analysis. Hyperlinks were removed to simplify the analysis. Finally, all punctuation symbols were removed and the text was converted to lower case.
import re

# removing hashtags related to globalwarming
def rem_hashtags(text):
    processed_text = re.sub(r"#globalwarming", "", text)
    processed_text = " ".join(processed_text.split())
    return processed_text
data['Tweets'] = data['Tweets'].apply(rem_hashtags)
# removing tagged users from the tweets
def remove_users(text):
    processed_text = re.sub(r'@\w+ ?', "", text)
    processed_text = " ".join(processed_text.split())
    return processed_text
data['Tweets'] = data['Tweets'].apply(remove_users)
# removing hyperlinks mentioned in the tweets
def remove_links(text):
    processed_text = re.sub(r"(?:@|https?://|www)\S+", "", text)
    processed_text = " ".join(processed_text.split())
    return processed_text
data['Tweets'] = data['Tweets'].apply(remove_links)
# removing punctuation and numbers in the tweets
def remove_punct(text):
    punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    text = "".join(char for char in text if char not in punctuations)
    text = re.sub('[0-9]+', '', text)
    return text
data['Tweets'] = data['Tweets'].apply(remove_punct)
# making all tweets lowercase
def lowercase_word(text):
    return text.lower()
data['Tweets'] = data['Tweets'].apply(lowercase_word)
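As a quick sanity check, the cleaning steps above can be chained into one function and run on a single example. The sample tweet and URL below are hypothetical, just for illustration:

```python
import re

def clean_tweet(text):
    # drop the #globalwarming hashtag itself
    text = re.sub(r"#globalwarming", "", text)
    # drop tagged usernames
    text = re.sub(r"@\w+ ?", "", text)
    # drop hyperlinks
    text = re.sub(r"(?:https?://|www)\S+", "", text)
    # drop punctuation and digits, lowercase, normalize whitespace
    text = "".join(ch for ch in text if ch not in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
    text = re.sub(r"[0-9]+", "", text)
    return " ".join(text.lower().split())

sample = "Wildfires are spreading!! #globalwarming @someuser https://example.com/article1"
print(clean_tweet(sample))  # -> "wildfires are spreading"
```

Running each tweet through a single function like this makes the pipeline easier to test than applying five separate `apply` calls.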

Making WordClouds

After cleaning the text, a word cloud was generated to visualize the most frequently repeated words in the tweets. Frequent words often give an idea of the topics people are most interested in or concerned about.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

tweet_All = " ".join(tweet for tweet in data['Tweets'])
fig, ax = plt.subplots(1, 1, figsize=(30, 30))
# Create and generate a word cloud image:
wordcloud_ALL = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(tweet_All)
# Display the generated image:
ax.imshow(wordcloud_ALL, interpolation='bilinear')
ax.axis('off')
plt.show()
Top 100 words in tweets

From the word cloud, it can be seen that the majority of people were tweeting mostly about "climate change". The name of one professor ('Richard') also appeared frequently in the tweets. Words like "phd exposes", "phd", "scientist", and "professor" indicate that there was some discussion of research findings related to global warming. There was also talk of "fire" and "wildfire", capturing recent events in the US which some people consider an after-effect of global warming. Words like "face lie", "bald face", and "bestlie" show some people's strong opinions against the global warming phenomenon.
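The word-cloud observations can also be cross-checked programmatically by counting word frequencies directly. A minimal sketch with a toy corpus and a tiny illustrative stop-word list (both hypothetical, standing in for the cleaned `data['Tweets']`) looks like:

```python
from collections import Counter

# hypothetical cleaned tweets standing in for data['Tweets']
tweets = [
    "climate change is real",
    "phd exposes climate change research",
    "wildfire season and climate change",
]

stopwords = {"is", "and", "the", "a"}  # tiny illustrative stop-word list
words = [w for t in tweets for w in t.split() if w not in stopwords]
print(Counter(words).most_common(2))  # -> [('climate', 3), ('change', 3)]
```

`Counter.most_common` gives an exact ranking of terms, which complements the qualitative impression a word cloud provides.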

Sentiment Analysis

After identifying the most discussed topics in the tweets related to global warming, it is interesting to know how the tweets are polarized. Therefore, sentiment analysis was performed using the TextBlob library, which provides a simple API for natural language processing tasks, including sentiment analysis of text.

from textblob import TextBlob

def get_tweet_sentiment(polarity):
    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    else:
        return 'negative'

tweets = [TextBlob(tweet) for tweet in data['Tweets']]
data['polarity'] = [b.sentiment.polarity for b in tweets]
data['subjectivity'] = [b.sentiment.subjectivity for b in tweets]
data['sentiment'] = data['polarity'].apply(get_tweet_sentiment)
data['sentiment'].value_counts()

The sentiment analysis found that 63.4% of the tweets were neutral, 22.8% were positive, and 13.7% were negative. The majority of the collected tweets did not express polarized views.
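For reference, percentage breakdowns like the one above can be computed directly with pandas using `value_counts(normalize=True)`. A small sketch with a toy sentiment column (not the actual dataset) would be:

```python
import pandas as pd

# toy sentiment labels standing in for data['sentiment']
sentiments = pd.Series(['neutral', 'neutral', 'positive', 'negative', 'neutral'])
shares = sentiments.value_counts(normalize=True) * 100
print(shares.round(1))  # neutral: 60.0%, positive: 20.0%, negative: 20.0%
```

Using `normalize=True` avoids dividing raw counts by the total by hand and keeps the labels attached to their shares.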

Ethical Dilemma and Limitations

This simple analysis of tweets associated with the hashtag #globalwarming gives a tentative idea of people's opinions on global warming on Twitter. However, the analysis covered only a limited number of tweets, so it may not represent the opinion of the broader population. More tweets could be collected to make the study more robust, but there is always an ethical concern attached. Twitter users' consent to data collection rests only on the terms and conditions they agreed to with the platform, which most users may not read line by line. It is therefore difficult to say whether users are even aware that their data are being collected and analyzed, and when a lot of data is collected, it is impractical to contact each user for consent.

Additionally, using only one hashtag (i.e., #globalwarming) may not retrieve all data related to global warming. Opinions may have been posted under a different hashtag or without one at all, in which case the sample collected from Twitter is biased and will not yield accurate findings. Moreover, only tweets posted in English were collected, so the findings represent only the subsection of the population that tweets in English.
