Multivariate Stock Market Analysis: Financial + Sentiment Variables

Abhijit Menon
Published in Analytics Vidhya
4 min read · Jun 28, 2020

Stock market prediction has been an active area of research for quite a while. However, building a model that takes every factor into consideration is still a challenging problem. Apart from historical prices, the stock market is affected by news articles about the company, general news, and many other microeconomic and macroeconomic factors. Several models predict stock prices from general news or from historical prices, but not many take company-specific news into account, or combine all of these factors.

In this project, we aim to see how taking company-specific news into account can help predict stock prices.

Step 1: Data Collection

Our data collection was split into three major parts. The first was to get the stock information for the various companies. The second was to collect company-specific news for the companies we were analyzing. The third was to get general news data covering politics, the environment, economics, and so on.

1) Stock Data

The stock market data was collected by calling the Yahoo Finance API for the 30 companies that were part of the DJIA index at the time. We collected data from 2006 to 2016. It included the opening price, the closing price, and the high and low values for each day.

2) Company News Data

The company-specific news data was pulled from the NYTimes website using a script that made multiple API calls for each company's news across 2006–2016.

The dataset included the following attributes:

1) Created Time

2) Snippet

3) Headline

4) Company Name

5) News Desk (the kind of news: Business, Entertainment, Tech, etc.)
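As a rough illustration of how such a collection script might be organized, the sketch below builds one set of query parameters per company and year and reduces a returned article record to the five attributes listed above. The endpoint, the parameter names (q, begin_date, end_date, page, api-key) and the response fields (pub_date, snippet, headline.main, news_desk) are assumptions based on the public NYTimes Article Search API and should be checked against its documentation. The function names and the two example companies are hypothetical, and no network call is made here.

```python
from typing import Any, Dict, List

# Assumed endpoint of the public NYTimes Article Search API; verify before use.
ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_query_params(company: str, year: int, page: int, api_key: str) -> Dict[str, str]:
    """Build the query parameters for one page of results for one company and year."""
    return {
        "q": company,
        "begin_date": f"{year}0101",  # assumed YYYYMMDD format
        "end_date": f"{year}1231",
        "page": str(page),
        "api-key": api_key,
    }


def flatten_article(doc: Dict[str, Any], company: str) -> Dict[str, str]:
    """Reduce one article record to the five attributes used in the dataset."""
    headline = doc.get("headline") or {}
    return {
        "created_time": doc.get("pub_date", ""),
        "snippet": doc.get("snippet", ""),
        "headline": headline.get("main", ""),
        "company_name": company,  # taken from the query, not from the response
        "news_desk": doc.get("news_desk", ""),
    }


# One request per (company, year) pair for 2006 through 2016; paging is omitted.
requests_to_make: List[Dict[str, str]] = [
    build_query_params(company, year, page=0, api_key="YOUR_KEY")
    for company in ["Apple", "Boeing"]  # hypothetical subset of the 30 companies
    for year in range(2006, 2017)
]
```

Each parameter dictionary would then be sent to ARTICLE_SEARCH_URL with an HTTP client of your choice, and every returned record passed through flatten_article.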

3) General News Data

The general news data came from a ready-to-use dataset found on Kaggle. It contains the top 25 news headlines for each day, collected from Reddit, across multiple years.

Step 2: Data Pre-processing

Once we had collected the data, we had to pre-process the three datasets and combine them so that we could analyze the influence of the news data on the stock prices.

From the news data, we wanted to understand the sentiment of the news for each day. To do this, we combined all the news from the same day into a single section, computed the overall sentiment for that section, and assigned that sentiment to the day. We did this for both the company-specific data and the general news data.
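The per-day aggregation can be sketched as follows. This is a minimal illustration, not the project's actual scorer: the word lists and the scoring rule are toy stand-ins for whatever sentiment tool is used, and the function names and sample headlines are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy word lists, purely illustrative; a real sentiment library would replace them.
POSITIVE = {"gain", "gains", "growth", "profit", "record", "strong"}
NEGATIVE = {"loss", "losses", "lawsuit", "recall", "weak", "decline"}


def sentiment_score(text: str) -> float:
    """Return (positive words - negative words) / total words, in [-1, 1]."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def daily_sentiment(articles: List[Tuple[str, str]]) -> Dict[str, float]:
    """Join all text published on the same date, then score each day once."""
    by_day: Dict[str, List[str]] = defaultdict(list)
    for date, text in articles:
        by_day[date].append(text)
    return {date: sentiment_score(" ".join(texts)) for date, texts in sorted(by_day.items())}


# Made-up (date, headline) pairs.
articles = [
    ("2010-05-03", "Record profit reported"),
    ("2010-05-03", "Strong growth expected"),
    ("2010-05-04", "Product recall announced"),
]
scores = daily_sentiment(articles)
```

Running the same function once over the company news and once over the general news gives the two sentiment scores per day.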

This way, each day from 2006 to 2016 in our dataset had the stock data and two sentiment scores. For each day, the target value was the next day’s stock value.
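Pairing each day with the next day's value amounts to shifting the price column by one row. The sketch below assumes the rows are already sorted by date and that the closing price is the value being predicted, which the text does not specify. The column names and numbers are hypothetical.

```python
from typing import Dict, List


def add_next_day_target(rows: List[Dict[str, float]], price_key: str = "close") -> List[Dict[str, float]]:
    """Attach the following day's price to each day's features as "target".

    `rows` must be sorted by date. The last day is dropped because it has
    no next-day value to predict.
    """
    labelled = []
    for today, tomorrow in zip(rows, rows[1:]):
        labelled.append({**today, "target": tomorrow[price_key]})
    return labelled


# Made-up daily rows: stock data plus the two sentiment scores.
rows = [
    {"close": 100.0, "company_sentiment": 0.2, "general_sentiment": -0.1},
    {"close": 101.5, "company_sentiment": 0.0, "general_sentiment": 0.3},
    {"close": 99.0, "company_sentiment": -0.4, "general_sentiment": 0.1},
]
labelled = add_next_day_target(rows)
```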

Step 3: Data Analysis

Figures: Change in stock values of the companies across the years.

Figure: The red line shows the sentiment score and the blue line shows the stock value for the given day. In some places, a change in the sentiment score on a given day appears to affect the stock price.

Step 4: Data Modeling

Next, we wanted to test how much the sentiment values affect the stock values. We considered several prediction algorithms and settled on a recurrent neural network (RNN) and an LSTM (Long Short-Term Memory) network, which is a special type of recurrent neural network. The other algorithms we considered did not account well enough for the time-series nature of our data. (As part of our coursework guidelines, we were required not to perform time-series manipulations on the data.)

The LSTM required us to reshape the data into a 3D array before modeling. After this reshaping, we generated predictions with different combinations of input variables, some including the sentiment scores and some not, and then compared the results.
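A common layout for that 3D array is (samples, timesteps, features), where each sample is a short window of consecutive days. The sketch below builds such windows with plain lists. The window length of 3 and the feature values are assumptions for illustration, since the text does not say which settings were used.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def make_windows(features: Matrix, targets: List[float], timesteps: int) -> Tuple[List[Matrix], List[float]]:
    """Slice a (days, features) table into overlapping windows.

    Returns X with shape (samples, timesteps, features) and y with one
    target per window, taken from the last day in that window.
    """
    X: List[Matrix] = []
    y: List[float] = []
    for end in range(timesteps, len(features) + 1):
        X.append(features[end - timesteps:end])
        y.append(targets[end - 1])
    return X, y


# Five made-up days, three features each: price, company sentiment, general sentiment.
features = [
    [100.0, 0.2, -0.1],
    [101.5, 0.0, 0.3],
    [99.0, -0.4, 0.1],
    [98.5, 0.1, 0.0],
    [102.0, 0.5, 0.2],
]
targets = [101.5, 99.0, 98.5, 102.0, 103.0]  # next-day values, also made up
X, y = make_windows(features, targets, timesteps=3)
```

Here five days with a window of three give three samples, so X has shape (3, 3, 3). Dropping the sentiment columns from features before windowing gives the stock-only variant used for comparison.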

Step 5: Analysis of Results

Recurrent Neural Network (only stock value and volume): mean absolute error of 1.28.

Recurrent Neural Network (stock value, volume, and sentiment of common news): mean absolute error of 1.56.

Recurrent Neural Network (only sentiment columns): mean absolute error of 5.78.

Recurrent Neural Network (all columns): mean absolute error of 1.0002.

LSTM (all columns): mean absolute error of 0.45.
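For reference, mean absolute error is the average size of the gap between predicted and actual values, so lower is better. A minimal sketch with made-up numbers (not the project's data):

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |actual - predicted| over all days."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values for illustration only.
actual = [100.0, 102.0, 101.0]
predicted = [101.0, 101.0, 103.0]
mae = mean_absolute_error(actual, predicted)  # (1 + 1 + 2) / 3
```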

We noticed that the sentiment columns, although their effect was not large, did help when added into the mix: with all columns, the RNN's mean absolute error fell from 1.28 to 1.0002, although adding only the common-news sentiment raised it to 1.56. The LSTM on all columns performed best, at 0.45. However, since we did not perform all the pre-processing that a time-series dataset requires, a lot of improvement is still needed to get more accurate results and a better understanding of our dataset.

To get access to the code and the official project report, visit our GitHub link for the project:

https://github.com/akmenon1996/Multivariate-Stock-Market-Analysis

Originally published at https://www.abhijitkmenon.com on June 28, 2020.
