SAM: Strategic Asset Manager

Published in SFU Professional Computer Science, Apr 20, 2020

By Anuj Saboo, Ankita Kundra, Rishabh Jain

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.

1. Motivation & Related Work

Investment decisions in financial markets involve more than crunching numbers. Most of us have no formal training, which makes it hard to gather the information needed to invest well. An uninformed investor is left asking where to put their money and how much to risk. This calls for an intelligent system built on the hypothesis that stock prices are a function of information, rational expectations, and newly revealed information about a company’s prospects from news and financial reports. We therefore have an opportunity to use machines to build intelligence from a combination of numerical and textual features.

Stock prediction is a famous problem, but it is not solved yet; otherwise the richest person in the world would not be Mr. Bezos. A lot of existing work focuses on models that predict the next day’s price, but far less addresses forecasting over a longer period. Our strategy is to forecast future values and then feed those forecasts back in to predict further ahead. This addresses the limited value of a next-day prediction for an investor who needs information over a longer time frame to make decisions. SAM guides investment strategy by analysing market trends and helping you decide on BUY and SELL strategies to maximize profits.

2. Problem Statement

We aim to build a system that can evaluate an investment decision by taking into account the stock’s historical performance, global news sentiment and the company’s Edgar reports. While doing so, we have a few hypotheses that we aim to confirm. The questions we try to answer are:

Q. How can machine learning suggest investment decisions?
Q. How do changes in a company’s annual reports reflect a change in the company itself?
Q. How do uncertainty, sentiments and emotions help in analysis and prediction?
Q. Do global news and economic indicators play a role?

2.1 Challenges

Processing Edgar reports is a major challenge because the files are large and their syntax varies from company to company. [1] gave us a formal understanding of how to download and process these files.
It was tricky to merge our analytics and ML work with an AWS-backed chatbot into a single application that provides a fluid user experience for making stronger investment decisions.
Many short-term factors influence a company’s immediate stock price, and these are not easy for the model to capture accurately. In addition, feature performance varies by stock, so no single solution can forecast prices for all companies. Each company’s prediction model needs its own hyperparameter tuning to capture insights and make accurate predictions.

3. Data Science Pipeline

The pipeline can be broken into 4 components:

1. NLP on Edgar:

Using the quarterly IDX files, we generated a master file for the companies of interest. This file held the locations of the Edgar reports filed by those companies, which we downloaded programmatically from the Edgar servers. The files were then pre-processed to remove HTML formatting instructions, among other things, which reduced their size by up to 50%.
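
The HTML clean-up step can be sketched with Python's standard library alone. This is only an illustration of the idea, not the code used in SAM: the class and function names are ours, and a real filing would need more handling (tables, embedded documents, exhibits).

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips <script>/<style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(raw):
    """Return the visible text of a filing with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Joining with a space keeps words from adjacent tags apart, at the
    # cost of splitting a word that is styled mid-word.
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "<html><body><p>Item 3.&nbsp;Legal <b>Proceedings</b></p>"
    "<style>p{color:red}</style></body></html>"
)
print(strip_html(sample))
```

Comparing len(sample) with len(strip_html(sample)) gives a quick feel for how much of a file is markup.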

Uncertainty reflects the imperfect or unknown market factors a company faces, whereas sentiment captures its positive or negative outlook. We computed a polarity score and an uncertainty score from the reports and used both as features in the model.

Particular sections of the documents had to be extracted to test our other hypotheses. The Legal Proceedings section was extracted to perform text similarity and find whether changes in this section over the years reflect a change in the company itself. Similarly, the Management’s Discussion and Analysis section was extracted to analyse the emotions in the management’s outlook.

2. Data and Machine Learning

The stock price data extracted from the Yahoo Finance API consisted of open, close, high and low features. These four values were averaged to give the mean price for each day. All the NLP features were combined with this price to prepare time series data. After evaluation, an LSTM model was chosen to make the predictions because it produced a lower RMSE than XGBoost. The model’s hyperparameters, such as lookback days, batch size and optimizer, were tuned to improve prediction accuracy. Some features, such as the economic indicators, had to be dropped from the model to get a better output.
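
The daily mean price is a plain average of the four quoted prices. A minimal sketch, assuming each trading day is a dict with the hypothetical keys open, high, low and close (the actual API response is shaped differently):

```python
from statistics import fmean

PRICE_KEYS = ("open", "high", "low", "close")  # hypothetical field names


def daily_mean_price(row):
    """Average the four quoted prices of one trading day."""
    return fmean(row[key] for key in PRICE_KEYS)


day = {"open": 10.0, "high": 12.0, "low": 9.0, "close": 11.0}
print(daily_mean_price(day))  # 10.5
```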

3. News and Wikipedia scraper

Global news was accessed using Google Cloud Platform and mined using BigQuery. Sentiment processing was done on 100 articles per day over 5 years of data to generate the global news sentiment feature. Introductory paragraphs and logos were scraped from Wikipedia for S&P 100 companies to support a comparison tab in our dashboard, which lets us contrast the average stock price and top stockholders of the companies.

4. Chatbot

We designed a chatbot using AWS services to help the user gain more information from a company’s Edgar report. We used AWS Comprehend, a natural language processing service that uses machine learning to find insights and relationships in text. AWS Lex was used to build the conversational interface into the application. BERT was hosted on AWS EC2, and the files used to answer user questions on Edgar reports were stored on AWS S3. AWS Lambda ran our code without the need to provision or manage servers, and acted as the central coordinator that connects all the components with Lex and delivers the output. The UI was provided by Kommunicate IO, and its JavaScript was embedded into the application.

4. Methodology

4.1 Data Collection

Stock Data: The Yahoo Finance API was used to extract stock data for each company for the years 2015–2019. The data was stored in a PostgreSQL database and merged with company information scraped from Wikipedia. Data for the top 30 mutual funds was also accessed through the API and stored in the database.

Edgar Reports: The 10-K files from 2014–19 were accessed from Edgar servers. A total of 483 10-K reports were processed and analysed.

Economic Indicators: Data for leading indicators such as BCI (Business Confidence Index), CCI (Consumer Confidence Index) and CLI (Composite Leading Indicator) was downloaded from the OECD web portal.

4.2 Text Similarity

We used regular expressions to extract sections of interest from Edgar reports. To detect changes in a company’s legal proceedings, we computed text similarity cumulatively over the years using Jaccard similarity, cosine similarity and fastText’s pre-trained model accessed through Gensim.
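
To give a feel for the regular-expression extraction, here is a minimal sketch that pulls the text between two section headings. The patterns and names are our own assumptions for illustration, not the expressions used in the project; they rely on the usual 10-K layout where Item 3 (Legal Proceedings) is followed by Item 4.

```python
import re

# Hypothetical heading patterns for illustration only.
LEGAL_START = r"item\s+3\.?\s*legal\s+proceedings\.?"
LEGAL_END = r"item\s+4\."


def extract_section(text, start_pattern, end_pattern):
    """Return the text between two headings, or None if the start is missing."""
    start = re.search(start_pattern, text, flags=re.IGNORECASE)
    if start is None:
        return None
    rest = text[start.end():]
    end = re.search(end_pattern, rest, flags=re.IGNORECASE)
    body = rest[:end.start()] if end else rest
    return body.strip()


filing = (
    "Item 2. Properties. We lease offices. "
    "Item 3. Legal Proceedings. None. "
    "Item 4. Mine Safety Disclosures. Not applicable."
)
print(extract_section(filing, LEGAL_START, LEGAL_END))  # None.
```

Real filings repeat these headings in the table of contents, so a production version would have to pick the right occurrence, for example the last match or the longest resulting section.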

Jaccard similarity has an inherent flaw: as the size of the documents increases, the number of common words tends to increase even if the documents discuss different topics. Cosine similarity, on the other hand, measures the cosine of the angle between two document vectors. This is advantageous because two similar documents that are far apart in Euclidean distance (due to document size) may still be oriented close together. The smaller the angle, the higher the cosine similarity.
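
Both measures fit in a few lines of plain Python. The sketch below assumes a simple lowercase word tokenizer and raw term counts as the document vectors; the project may well have used a different tokenizer or weighting.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the combined vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two term-count vectors."""
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


short = "no material legal proceedings"
doubled = short + " " + short  # same content, twice the length
print(cosine_similarity(short, doubled))
print(jaccard_similarity("a b c", "b c d"))  # 0.5
```

Doubling a document changes the length of its count vector but not its direction, so the cosine similarity with the original stays at 1.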

During our evaluation, we found that cosine similarity gave more accurate results than the similarity obtained from the pre-trained fastText model.

4.3 Sentiment analysis

Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. It allows businesses to identify customer sentiment toward products, brands or services. We applied sentiment analysis to 10-K filings to calculate each report’s polarity for a specific company and year.

4.4 Emotion Analysis

We extracted the Management’s Discussion and Analysis section to analyse the management’s outlook. To do so, we used the Sentiment and Emotion Lexicons developed by the National Research Council of Canada, which associate text with categories of interest such as joy, trust and fear. This helps us evaluate whether the management is happy with, angry at or fearful of the company’s market positioning and targets.
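
A lexicon-based emotion profile boils down to counting how often words tagged with each emotion appear. The sketch below uses a tiny made-up lexicon purely for illustration; its entries are hypothetical and are not taken from the NRC lexicon, and normalising the counts into shares is our own choice.

```python
import re
from collections import Counter

# Hypothetical toy entries; the real lexicon is far larger.
TOY_LEXICON = {
    "growth": {"joy", "trust"},
    "confident": {"trust"},
    "risk": {"fear"},
    "litigation": {"fear", "anger"},
}


def emotion_profile(text, lexicon):
    """Return each emotion's share of all emotion hits found in the text."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for emotion in lexicon.get(token, ()):
            counts[emotion] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {emotion: count / total for emotion, count in counts.items()}


mdna = "We are confident in continued growth despite litigation risk."
profile = emotion_profile(mdna, TOY_LEXICON)
print(sorted(profile.items()))
```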

4.5 Machine Learning & Forecasting

All the features from the Edgar reports, sentiment analysis of news data, mean prices of financial instruments and economic indicators were combined to create data for time series analysis. The data was split into training and testing sets. Before converting it into time series data, the features were normalized using scikit-learn’s StandardScaler.
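
The scaling and windowing steps can be illustrated without any libraries. This sketch standardises a single feature the way a standard scaler does (subtract the mean, divide by the population standard deviation) and then cuts the series into lookback windows. Fitting the statistics on the training split only is our assumption, and all function names are illustrative.

```python
from statistics import fmean, pstdev


def fit_scaler(train):
    """Compute mean and population standard deviation of the training data."""
    mean = fmean(train)
    std = pstdev(train) or 1.0  # avoid dividing by zero for constant series
    return mean, std


def transform(values, mean, std):
    """Standardise values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


def make_windows(series, lookback):
    """Pair each run of `lookback` values with the value that follows it."""
    inputs, targets = [], []
    for i in range(len(series) - lookback):
        inputs.append(series[i:i + lookback])
        targets.append(series[i + lookback])
    return inputs, targets


train = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, std = fit_scaler(train)
scaled = transform(train, mean, std)
windows, targets = make_windows([1, 2, 3, 4, 5], lookback=3)
print(windows, targets)  # [[1, 2, 3], [2, 3, 4]] [4, 5]
```

The same mean and std would then be reused to transform the test split, so that no information from the test period leaks into training.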

LSTMs are very powerful in sequence prediction problems because they are able to store past information. This is important in our case because the previous price of a stock is crucial in predicting its future price. We modeled a neural network in Keras with two LSTM layers, two dropout layers, a dense output layer and the RMSprop optimizer.

Generally, a lookback value between 20 and 30 days was used, depending on the model’s evaluation for a particular company. With a lookback of 30, for example, the approach is to generate a prediction for one future time step from the 30 most recent values, append the new prediction to the array and remove the first entry, and then predict the next time step from the updated sequence of 30 steps. Predictions are made over a 90-day window to evaluate the returns from a financial instrument.
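
The rolling forecast loop described above can be sketched independently of the model. Here predict_next is a hypothetical stand-in for the trained network: any function that maps a window of past values to one new value will do.

```python
from collections import deque


def recursive_forecast(history, predict_next, lookback=30, horizon=90):
    """Forecast `horizon` steps by feeding each prediction back as input."""
    if len(history) < lookback:
        raise ValueError("need at least `lookback` past values")
    # A deque with maxlen drops the oldest entry whenever a new one is added.
    window = deque(history[-lookback:], maxlen=lookback)
    forecasts = []
    for _ in range(horizon):
        next_value = predict_next(list(window))
        forecasts.append(next_value)
        window.append(next_value)
    return forecasts


# Stand-in predictor for demonstration only: repeat the last trend step.
def naive_step(window):
    return window[-1] + (window[-1] - window[-2])


prices = [float(i) for i in range(1, 41)]
future = recursive_forecast(prices, naive_step)
print(len(future), future[0], future[-1])  # 90 41.0 130.0
```

Because every forecast becomes an input for the next one, errors can compound over the 90 steps, which is one reason the longer horizon is harder than a next-day prediction.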

4.6 Chatbot

BERT (Bidirectional Encoder Representations from Transformers) is used for a wide variety of NLP tasks, including question answering. We used a pre-trained model (BERT-Large) fine-tuned on the SQuAD v1.1 dataset to answer our specific questions from the company’s Edgar reports. BERT is deployed on an EC2 instance and interacts with an AWS Lambda function to provide the answer to the chatbot via AWS Lex.

Entity recognition is the process of identifying particular elements from text such as names, places, quantities, percentages and times/dates. Identifying the general content types can be useful to analyse the data in Edgar reports to compare them over the years and find changes.

Key phrase extraction returns key phrases, where a key phrase is a string containing a noun phrase that describes a particular thing. It generally consists of a noun and the modifiers that distinguish it. Each key phrase includes a score indicating how confident AWS Comprehend is that the string is a noun phrase. These scores are used to determine whether a detection has high enough confidence, and the top 10 values are returned.
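
The filtering step can be sketched as plain post-processing of a key phrase response. The sketch assumes a response dict with a KeyPhrases list whose items carry Text and Score fields; the 0.9 threshold and the function name are our own illustrative choices, and no AWS call is made here.

```python
def top_key_phrases(response, min_score=0.9, limit=10):
    """Keep confident key phrases and return the highest-scoring ones."""
    confident = [
        phrase for phrase in response.get("KeyPhrases", [])
        if phrase["Score"] >= min_score
    ]
    confident.sort(key=lambda phrase: phrase["Score"], reverse=True)
    return [phrase["Text"] for phrase in confident[:limit]]


# Made-up response for demonstration only.
example_response = {
    "KeyPhrases": [
        {"Text": "the company", "Score": 0.95},
        {"Text": "legal proceedings", "Score": 0.99},
        {"Text": "maybe", "Score": 0.40},
    ]
}
print(top_key_phrases(example_response))
```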

5. Evaluation

The features from the Edgar reports and from news sentiment analysis do have an impact on stock prediction. We compared our LSTM model against predictions obtained using the mean prices as the only feature. RMSE and the 90-day predictions were used to evaluate the model. Depending on the model’s predictions, its hyperparameters were tuned to improve accuracy.
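
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch of the metric, with made-up numbers standing in for real model output:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Made-up values for demonstration only.
actual = [100.0, 102.0, 101.0]
price_only = [98.0, 105.0, 99.0]
with_text_features = [99.0, 103.0, 101.0]
print(rmse(actual, price_only) > rmse(actual, with_text_features))  # True
```

A lower RMSE means the predictions sit closer to the actual prices on average, which is how the two model variants can be compared on the same test period.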

The model is able to follow the trend of stock prices, indicating whether the market will rise or fall in the future so that BUY and SELL strategies can be made. Each new prediction is made from the past 30 days, and a rolling window is applied so that every subsequent prediction takes the previous 30 days as input.

6. Data Product

Below is a demo of our user interface where we can run analytics and interact with SAM.

Video 1: Frontend Demo

7. Lessons Learnt & Future Work

This project allowed us to explore the background of financial markets and experiment with factors that may influence price forecasting. We were able to apply NLP techniques to work with metrics of text similarity, sentiment analysis and emotion extraction. The machine learning cycle is an iterative process of experimenting with a variety of features and parameters, which helped us build a stronger knowledge base. Using AWS services, we were able to build an intelligent automated system to parse Edgar files and mine relevant information for analysis. We also gained exposure to Dash, which makes it easy to build an application integrated with Python.

In addition, we conclude that more work can be done in this field to generate better features. A company’s 10-Q filings could be equally important for more accurate predictions. Company-specific news should also play a larger role in influencing the stock price, and we would look at ways to gather this information for better accuracy.

8. Summary

Our machine learning approach uses NLP features generated from Edgar reports, global news sentiment and historical price data to forecast future values. An LSTM model was used together with a rolling window approach to forecast values 90 days ahead. Based on the returns, BUY and SELL strategies are then offered to investors. SAM provides an easy-to-use interface for making investment decisions. It allows us to analyse a company’s historical performance and to compare its uncertainty and emotion results. Live execution of AWS services makes it possible to mine NLP features and to answer user questions on Edgar reports using a pre-trained BERT model. We gained a great deal of knowledge from this project, and future work includes building new features and further tuning the models. The problem of stock prediction is far from solved; more features can still be analysed to give stronger results and capture short-term volatility to secure investments.

References

1. Ashraf, Rasha, Scraping EDGAR With Python (June 1, 2017). Journal of Education for Business, 2017, 92:4, 179–185. Available at SSRN: https://ssrn.com/abstract=3230156

2. Time Series Forecasting: A Deep Dive

3. AWS Documentation: https://docs.aws.amazon.com/