Can People’s Sentiments on Social Media Predict Stock?

Published in

Social Media: Theories, Ethics, and Analytics

Nov 17, 2020

Successfully predicting a stock’s future price could yield significant profit. However, many people believe that the stock market is not predictable. Why? Because the stock market is the daily summation of human opinions and world events that investors observe and react to minute by minute. Too many factors affect a stock’s price to take them all into account. Even so, plenty of people are still obsessed with building models that can capture these complicated features of the stock market. From my perspective, there are three reasons why people keep doing so: 1) as technology develops, many new, more comprehensive models perform well and can be tested on the stock market; 2) perhaps a missing piece of this jigsaw puzzle can be found; 3) the power of money. We already know that people’s sentiment is a strong factor that can change the trend of the entire market, but how can we know people’s sentiment without asking them? The answer is social media. People post their thoughts and opinions about the stock market on different social media platforms every day, and to some extent those posts reflect their sentiment about investing in the stock market. The purpose of this article is to use people’s sentiment to predict stock price changes.

Sentiment Analysis

Sentiment analysis, also known as opinion mining or emotion AI, is a branch of natural language processing. Basically, it uses algorithms, either machine-learned or rule-based, to classify posts as positive, negative, or neutral. Several popular ready-made sentiment analysis tools are available in Python NLP libraries, such as NLTK’s VADER sentiment analyzer and TextBlob’s sentiment analysis.
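
To make the idea of polarity classification concrete, here is a toy sketch of a lexicon-based classifier. The word lists and the classify_polarity function are made up for illustration; real tools such as VADER rely on a much larger, weighted lexicon plus rules for things like negation and emphasis.

```python
# Toy lexicon-based polarity classifier (illustrative only, not VADER or TextBlob).
# The word sets below are hypothetical and far too small for real use.
POSITIVE_WORDS = {"good", "great", "bullish", "buy", "strong"}
NEGATIVE_WORDS = {"bad", "weak", "bearish", "sell", "crash"}


def classify_polarity(text):
    """Return 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_polarity("Strong earnings, time to buy"))
print(classify_polarity("Looks weak, I would sell"))
print(classify_polarity("Earnings call is tomorrow"))
```

Note that this sketch does not strip punctuation, so a word like "weak," is not matched; that is one reason real pipelines clean the text first.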

Sentiment can affect the stock market both directly and indirectly. The direct process is when people post negative or positive opinions about a specific stock or the stock market on social media. The indirect process is when people post opinions about events that are closely linked to the stock market. For example, the stock market kept falling while people were panicking about the pandemic. By searching and analyzing a large amount of data on social media platforms or microblogging websites, we can understand how people feel about a particular stock. In this article, I will talk about how to use people’s opinions on Twitter to predict changes in Amazon’s stock.

Data Collection

Since this experiment needs a large amount of data, tools built on the official Twitter API, such as Tweepy, are not practical because the API restricts the number of tweets that can be extracted. Instead, Twint is used to extract historical tweets, as it can bypass the restrictions Twitter sets. Because indirect events that could affect Amazon stock are very difficult to trace, only direct tweets are considered here. The stock ticker, $AMZN, is the most important keyword when searching for tweets that talk about Amazon stock. The basic usage of the Twint package in Python is shown below. The date range is from January 2010 to October 2020. It took a while to extract the data because the date range is quite long. In total, 1,453,758 tweets were downloaded between the start and end dates.

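import twint

def get_tweets(start_date, end_date, keyword1, keyword2):
    """Download tweets matching the keywords between two dates into data.csv."""
    c = twint.Config()
    # Twint expects one query string. Joining the keywords with a space
    # (both terms required) is an assumption about the intended query.
    c.Search = f'{keyword1} {keyword2}'
    c.Since = start_date
    c.Until = end_date
    c.Hide_output = True   # do not print every tweet to the console
    c.Lang = 'en'
    c.Count = True         # report the number of tweets scraped at the end
    c.Store_csv = True
    c.Output = 'data.csv'
    twint.run.Search(c)

get_tweets('2010-01-01', '2020-10-31', '$amzn', 'amazon')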

The other dataset is Amazon’s daily stock price. I used the Wharton Research Data Services (WRDS) Python package to extract it from the Compustat database. An easier way to get stock data is to download it from Yahoo Finance.
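
For readers taking the Yahoo Finance route, the sketch below shows one way to read a downloaded historical-data CSV with only the standard library. The rows are made-up values in the usual column layout of such a file, and load_prices is a hypothetical helper, not something from the original experiment.

```python
import csv
import io

# Made-up rows in the column layout of a Yahoo Finance historical-data CSV.
SAMPLE_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2020-10-01,3208.0,3224.0,3172.0,3221.3,3221.3,4971900
2020-10-02,3153.6,3195.8,3123.0,3125.0,3125.0,5613100
"""


def load_prices(csv_text):
    """Parse CSV text into a list of dicts with the date, open and adjusted close."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "date": row["Date"],
            "open": float(row["Open"]),
            "adj_close": float(row["Adj Close"]),
        }
        for row in reader
    ]


prices = load_prices(SAMPLE_CSV)
print(len(prices), "trading days loaded")
```

With a real file, the same function can be fed the result of open(path).read(); only the open and adjusted close columns are kept because those are the two prices the label later depends on.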

Data Cleaning

Some steps are required before building the model. First, the tweets have to be cleaned. I defined two functions to remove patterns and URLs from each tweet.

import re

def remove_pattern(input_txt, pattern):
    # Delete every match of the regex pattern from the text.
    return re.sub(pattern, '', input_txt)

def remove_urls(text):
    # Delete http(s) links, including their query strings.
    return re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text, flags=re.MULTILINE)

Remove “@user” mentions and “#hashtag” tags, because they have little to do with sentiment:

import numpy as np

data['tidy_text'] = np.vectorize(remove_pattern)(data['tweet'], r"@[\w]*")
data['tidy_text'] = np.vectorize(remove_pattern)(data['tidy_text'], r"#[\w]*")

Remove URLs:

data['tidy_text'] = np.vectorize(remove_urls)(data['tidy_text'])

Remove special symbols:

data['tidy_text'] = data['tidy_text'].str.replace("[^a-zA-Z]", " ", regex=True)

Remove words that have only one letter:

data['tidy_text'] = data['tidy_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>1]))
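
The cleaning steps above can also be checked on a single tweet without pandas. The sketch below chains them in one function; clean_tweet and the sample tweet are hypothetical, and the URL pattern is a simplified stand-in for the one used earlier.

```python
import re


def clean_tweet(text):
    """Apply the cleaning steps in order to a single tweet string."""
    text = re.sub(r"@\w*", "", text)           # drop @user mentions
    text = re.sub(r"#\w*", "", text)           # drop #hashtags
    text = re.sub(r"https?://\S+", "", text)   # drop URLs (simplified pattern)
    text = re.sub(r"[^a-zA-Z]", " ", text)     # replace non-letters with spaces
    # Drop one-letter words and collapse repeated whitespace.
    return " ".join(w for w in text.split() if len(w) > 1)


sample = "@trader $AMZN is a great buy! #stocks https://example.com/x?id=1"
print(clean_tweet(sample))  # AMZN is great buy
```

The ticker’s dollar sign is removed along with the other special symbols, so "$AMZN" survives only as "AMZN".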

[Figure: tweets after cleaning]

People post many tweets per day, but there is only one stock price for each business day, so the tweets need to be grouped by date.

def join_tweets(texts):
    # Combine all of one day's tweets into a single comma-separated string.
    return ','.join(texts.values)

data = data.groupby('date')['tidy_text'].apply(join_tweets)
data = data.reset_index()
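
To see what this grouping produces, here is a small standard-library sketch of the same idea. The (date, text) pairs are made up and stand in for rows of the real dataframe.

```python
from collections import defaultdict

# Hypothetical (date, cleaned_text) pairs standing in for the dataframe rows.
rows = [
    ("2020-10-01", "AMZN looks strong"),
    ("2020-10-01", "buying more AMZN"),
    ("2020-10-02", "AMZN dipped today"),
]

tweets_by_date = defaultdict(list)
for day, text in rows:
    tweets_by_date[day].append(text)

# One comma-joined document per date, mirroring the groupby/apply step.
daily_text = {day: ",".join(texts) for day, texts in tweets_by_date.items()}
print(daily_text["2020-10-01"])  # AMZN looks strong,buying more AMZN
```

Each date ends up with one long string, which is what the sentiment analyzer scores in the next step.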

Sentiment Analysis

In this experiment, the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analyzer is used because it provides not only a polarity score but also the intensity of the emotion. Besides, it is very easy to use, as it is available in the NLTK package and can be applied directly to unlabeled text data. VADER takes in a string and returns a dictionary of scores in four categories: negative, neutral, positive, and compound. The compound score is a normalized sum of the word-level valence scores and ranges from -1 (most negative) to +1 (most positive).

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

# Create the analyzer once and reuse it for every call.
sid_obj = SentimentIntensityAnalyzer()

def sentiment_scores(sentence):
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound'.
    sentiment_dict = sid_obj.polarity_scores(sentence)
    neg = sentiment_dict['neg']
    neu = sentiment_dict['neu']
    pos = sentiment_dict['pos']
    compound = sentiment_dict['compound']
    return neg, neu, pos, compound

After applying VADER, we can use the compound score to separate negative and positive tweets. In this case, the dataset is highly imbalanced. It seems people tend to post positive opinions about Amazon stock, which is consistent with, though not proof of, the stock’s rise over the past ten years.
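
One common way to turn the compound score into a label uses the ±0.05 cutoffs that VADER’s documentation suggests as typical thresholds. The article does not say which cutoff was used, so the threshold, the label_from_compound helper, and the sample scores below are all assumptions.

```python
from collections import Counter


def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores, only to show how class imbalance can be checked.
demo_scores = [0.62, 0.31, -0.44, 0.0, 0.87]
label_counts = Counter(label_from_compound(s) for s in demo_scores)
print(label_counts)
```

Counting the labels this way is a quick check of how imbalanced the positive and negative classes are before any model is trained.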

Merge Data

Now both datasets are ready to be merged on date. Because this experiment focuses on predicting stock change rather than stock price, I added a new column named pos_neg that holds the stock change information. If the next day’s open price ≥ the current day’s adjusted close price, pos_neg equals 1; otherwise it equals -1. After dropping weekends, which have no stock data, each remaining row corresponds to one trading day.
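
The labeling rule can be sketched in a few lines. The prices below are made up, label_stock_change is a hypothetical helper, and attaching each label to the earlier of the two days is an assumption about how the column was built.

```python
# Hypothetical daily prices: (date, open, adjusted close). Values are made up.
prices = [
    ("2020-10-05", 3145.0, 3199.0),
    ("2020-10-06", 3165.0, 3099.0),
    ("2020-10-07", 3135.0, 3195.0),
]


def label_stock_change(prices):
    """Label each day 1 if the next day's open >= that day's adjusted close, else -1."""
    labels = []
    for today, tomorrow in zip(prices, prices[1:]):
        day, _, adj_close = today
        next_open = tomorrow[1]
        labels.append((day, 1 if next_open >= adj_close else -1))
    return labels


print(label_stock_change(prices))
# [('2020-10-05', -1), ('2020-10-06', 1)]
```

The last day in the series gets no label because there is no following open price to compare against.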

Model Building

The problem now becomes a binary classification with labels -1 and 1. Because the data is indexed by date, we cannot simply split it randomly: a random split would let the model train on days that come after the ones it is tested on. I manually assigned the training data from 2010-01-01 to 2020-03-31 and the test data from 2020-04-01 to 2020-10-31. After running different models such as SVM, neural network, and logistic regression, the best test accuracy I got was 0.6376.
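
A date-based split like this can be sketched as follows. The rows are made up, chronological_split is a hypothetical helper, and using a single compound score as the feature is only an assumption for the demo.

```python
from datetime import date


def chronological_split(rows, first_test_day):
    """Split (day, feature, label) rows so every test day follows every training day."""
    train = [row for row in rows if row[0] < first_test_day]
    test = [row for row in rows if row[0] >= first_test_day]
    return train, test


# Made-up rows: (trading day, compound score, label).
rows = [
    (date(2020, 3, 30), 0.41, 1),
    (date(2020, 3, 31), -0.12, -1),
    (date(2020, 4, 1), 0.27, 1),
    (date(2020, 4, 2), 0.05, -1),
]
train_rows, test_rows = chronological_split(rows, first_test_day=date(2020, 4, 1))
print(len(train_rows), len(test_rows))  # 2 2
```

The features and labels for fitting a model can then be taken from each part separately.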

from sklearn import metrics
from sklearn.neural_network import MLPClassifier

# Three hidden layers; random_state fixes the weight initialization for reproducibility.
NN_model = MLPClassifier(hidden_layer_sizes=(300, 200, 100), max_iter=1000,
                         activation='relu', solver='adam', random_state=0)
NN_model.fit(x_train, y_train)
y_pred = NN_model.predict(x_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)
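
One way to put an accuracy figure in context is to compare it with a baseline that always predicts the most frequent training label. The article does not report such a baseline, so the sketch below uses made-up labels and a hypothetical majority_baseline_accuracy helper.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    correct = sum(1 for y in y_test if y == majority_label)
    return correct / len(y_test)


# Made-up labels for illustration only.
y_train_demo = [1, 1, -1, 1, -1, 1]
y_test_demo = [1, -1, 1, 1]
print(majority_baseline_accuracy(y_train_demo, y_test_demo))  # 0.75
```

If a model’s test accuracy is not clearly above this number on the real labels, the model has learned little beyond the class imbalance.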

Limitations

Although the accuracy is not high, it suggests that there is a relationship between sentiment and the stock market. Many parts of this research can be improved. Other sentiment analysis tools such as TextBlob produce different polarity scores as well as subjectivity scores, which could be added to the model. However, it is still difficult to extract precise information from tweets. For example, people who invest in Amazon stock may not use Twitter, and people who post opinions on Twitter may never have invested in Amazon stock. Another limitation of this research is that it is hard to tell whether negative or positive opinions on Twitter induce the fall or rise in the stock, or the other way round. The causality is very complicated in this case. One possible approach is to use other social media, such as StockTwits and Facebook, as comparison groups and compare how sentiment changes over time.

Written by Francis Zhang