STOCKZILLA

One-stop solution to stocks portfolio generation using unsupervised learning techniques.

Anchal Jain
SFU Professional Computer Science
18 min read · Apr 20, 2020


The Team: Abhishek Sundar Raman, Anchal Jain, Amogh Kallihal, Gayatri Ganapathy

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit here.

1. Motivation and Background

As a wise man once said:

“Rule 1: Never lose money in the Stock Market!

Rule 2: Never forget Rule 1!”

Yes, friends, you read that right. Sound knowledge of stock market data and the latest news trends is essential for any financial advisor or investor making investment decisions in the stock market. The stock market is a volatile domain, and stock trends keep varying over time in response to multiple factors. These factors include the emotions of people buying and selling stocks, the general news sentiment around the market, company-specific news, and the belief that a company will keep doing well because it has done well over the past few months or quarters. Despite knowing all this, the stocks we buy may plunge simply because news about COVID-19 spread negative sentiment across the market, or skyrocket because a company beat analysts' profit predictions for the period. Identifying the right set of moves to generate a diversified stock portfolio that mitigates the risk of financial loss is still an area of active research, and this forms the motivation for our project.

2. Problem Statement

As we began exploring the world of the financial stock market, particularly the generation of stable stock portfolios using state-of-the-art data science techniques, we stumbled upon a few questions. Here are the questions we intend to answer as part of our project:

Question 1: Can we generate stable stock portfolios for a given quarter or period of time using various clustering methods?

Question 2: Is it possible to do asset allocation using stock portfolios obtained through clustering methods?

Question 3: Can we use NLP methods to see whether news about the companies has any impact on the stock portfolios generated in the same period of time?

Proposal — We propose to answer the above questions with the help of an interactive web UI frontend and a set of Jupyter notebooks, which any potential stock investor can refer to in order to understand various aspects of the stock market, such as:

  1. Trends for each individual stock sector.
  2. News sentiment for the companies over the two-year period.
  3. Each stock's technical indicator variations and its Open, High, Low, Close and Volume movement over each quarter.
  4. Sets of companies that behaved similarly over the two-year period.
  5. Quarter-wise portfolios, letting users see the changing pattern with and without the sentiment feature.
  6. Different stock portfolios with and without the sentiment feature, and visualizations of volatility versus returns validated over the entire two-year period.

All of this gives the end-user the flexibility to choose the stock portfolios that best maximize their investment profits.

3. Data Science Pipeline

Figure 1 shows the complete data science pipeline we have chosen to address the above questions. Now let us get a deeper understanding of the different stages of our pipeline.

Figure 1: Data Science pipeline stages

3.1 Data Collection

This stage includes data acquisition and storage of OHLCV and news data into the Cassandra database.

3.1.1. OHLCV Data Collection — We used the publicly available Alpha Vantage API, which provides historical data for all the S&P 500 companies. We gathered data for 200 companies across ten sectors (Industrials, Health Care, Information Technology, Consumer Discretionary, Utilities, Financials, Materials, Real Estate, Consumer Staples and Energy) for the two-year period from April 2018 to March 2020, staying within the standard rate limits of 3–4 API requests per minute and 500 requests per day. The data contains the open, high, low, close, volume, adjusted close, dividend amount, and split coefficient for each day, and was stored along with its ticker symbol and industry for our analysis.
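
For illustration, here is a minimal sketch of this collection loop. The endpoint and TIME_SERIES_DAILY_ADJUSTED function are Alpha Vantage's documented ones; the ticker list and YOUR_API_KEY are placeholders, and our actual pipeline handled errors and batching more carefully.

```python
import time
import requests

API_URL = "https://www.alphavantage.co/query"

def fetch_daily_adjusted(symbol: str, api_key: str) -> dict:
    """Fetch the full daily adjusted history for one ticker."""
    params = {
        "function": "TIME_SERIES_DAILY_ADJUSTED",
        "symbol": symbol,
        "outputsize": "full",   # full history instead of the last 100 days
        "apikey": api_key,
    }
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("Time Series (Daily)", {})

for ticker in ["AMZN", "MSFT", "JNJ"]:          # placeholder tickers
    ohlcv = fetch_daily_adjusted(ticker, "YOUR_API_KEY")
    time.sleep(20)  # stay within the free tier's 3-4 requests per minute
```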

3.1.2. News Data Collection — We collected news data from the New York Times API, which is free for non-commercial use. Articles were queried using the Article Search API and the Archive API. The Article Search API was used to query data about specific organizations over the identified period, while the Archive API was used to fetch general news articles for understanding the overall sentiment of the stock market in areas like finance, economics, business, and investment. Queries have to be specific and can return up to 2,000 articles each, 10 per page. The NYTimes API allows 4,000 requests per day at 10 requests per minute.
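
A similar sketch for the company-news queries against the Article Search API follows; the query values and YOUR_NYT_KEY are illustrative placeholders.

```python
import time
import requests

SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def fetch_company_news(company: str, begin: str, end: str,
                       api_key: str, page: int = 0) -> list:
    """Fetch one page (10 articles) of news mentioning a company."""
    params = {
        "q": company,
        "begin_date": begin,   # format: YYYYMMDD
        "end_date": end,
        "page": page,          # up to 2,000 articles per query, 10 per page
        "api-key": api_key,
    }
    resp = requests.get(SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

docs = fetch_company_news("Amazon", "20180401", "20200331", "YOUR_NYT_KEY")
time.sleep(6)  # respect the 10 requests/minute limit
```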

3.1.3. Data Streaming — In our project, we used Kafka’s producer-consumer model to populate the database. A Kafka producer queries data from the Alpha Vantage and New York Times APIs and publishes it on different topics. The consumer then reads the data from these topics, formats it to match the table structure in the database, and stores it there. This flow decouples the pipeline and lets us store the data seamlessly.
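
A minimal sketch of the producer side using kafka-python is shown below; the topic names, broker address, and payload fields are illustrative, not our exact schema.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish each payload on its own topic so consumers stay decoupled.
producer.send("ohlcv-data", {"symbol": "AMZN", "close": 1900.10, "date": "2020-03-31"})
producer.send("news-data", {"company": "Amazon", "headline": "...", "date": "2020-03-31"})
producer.flush()
```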

3.1.4. Data Storage — The data read by the Kafka consumers from the various topics was stored in Cassandra. Since the OHLCV and news features needed data over a time window, the data from the Kafka consumers was stored directly into Cassandra first. We chose Cassandra as the database because we had to perform bulk writes from the Kafka streaming pipeline while ensuring fault tolerance in storage.
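
On the other side, a consumer sketch that drains a topic into Cassandra via the DataStax driver might look like the following; the keyspace, table, and column names are assumptions for illustration, not our actual schema.

```python
import json
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("stockzilla")   # assumed keyspace
insert = session.prepare(
    "INSERT INTO ohlcv (symbol, trade_date, open, high, low, close, volume) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)"
)

consumer = KafkaConsumer(
    "ohlcv-data",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for msg in consumer:
    r = msg.value
    session.execute(insert, (r["symbol"], r["date"], r["open"], r["high"],
                             r["low"], r["close"], r["volume"]))
```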

3.2 Data Engineering and Analysis

3.2.1. Data Cleaning and Pre-Processing

The news articles obtained from the New York Times API are read from Cassandra through Apache Spark. We performed text cleaning by removing stop words, non-English characters, non-ASCII characters, hyperlinks, and special characters, followed by tokenization and lemmatization.
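
A minimal sketch of these cleaning steps with NLTK is shown below; the exact regexes and token filters in our notebooks may differ.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_article(text: str) -> list:
    text = re.sub(r"https?://\S+", " ", text)     # strip hyperlinks
    text = re.sub(r"[^A-Za-z\s]", " ", text)       # drop non-ASCII and special characters
    tokens = word_tokenize(text.lower())           # tokenize
    return [lemmatizer.lemmatize(t) for t in tokens  # lemmatize
            if t not in STOP_WORDS and len(t) > 2]
```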

3.2.2. Feature Generation

We generate features from both the OHLCV data and the news data. For each company, we aggregate the data in a given quarter to generate these values. After studying their significance for financial market data, we chose technical indicators to represent the OHLCV data: a trend indicator, the exponential moving average (EMA); a volume indicator, the Chaikin oscillator (ADOSC); a volatility indicator, the average true range (ATR); and momentum indicators, the rate of change (ROC) and the relative strength index (RSI). These indicators were generated with TA-Lib, a library for technical analysis of financial market data. Along with these five technical indicators, average returns and the Sharpe ratio were also included. News sentiments were generated with NLTK-VADER (Valence Aware Dictionary and sEntiment Reasoner) from both the common news data and the company news data. A weighted sentiment score was computed with a weight of 0.25 on the common news sentiment and 0.75 on the company-specific news sentiment. Finally, the OHLCV-derived features and the weighted sentiment feature are integrated quarter-wise to form the final set of input features for our cluster analysis.
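
The sketch below illustrates this feature generation step with TA-Lib and VADER; the look-back periods are defaults chosen for illustration (ours may have differed), while the 0.25/0.75 weights are the ones described above.

```python
import talib
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def ohlcv_features(df):
    """df: one quarter of OHLCV rows for a single company (pandas DataFrame)."""
    high = df["high"].to_numpy("float64")
    low = df["low"].to_numpy("float64")
    close = df["close"].to_numpy("float64")
    volume = df["volume"].to_numpy("float64")
    return {
        "ema":   talib.EMA(close, timeperiod=20)[-1],                      # trend
        "adosc": talib.ADOSC(high, low, close, volume,
                             fastperiod=3, slowperiod=10)[-1],             # volume
        "atr":   talib.ATR(high, low, close, timeperiod=14)[-1],           # volatility
        "roc":   talib.ROC(close, timeperiod=10)[-1],                      # momentum
        "rsi":   talib.RSI(close, timeperiod=14)[-1],                      # momentum
    }

sia = SentimentIntensityAnalyzer()

def weighted_sentiment(common_news: list, company_news: list) -> float:
    common = sum(sia.polarity_scores(t)["compound"] for t in common_news) / max(len(common_news), 1)
    company = sum(sia.polarity_scores(t)["compound"] for t in company_news) / max(len(company_news), 1)
    return 0.25 * common + 0.75 * company   # weights from the project
```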

3.3 Machine Learning

In this stage, the input features were scaled and transformed into a new set of uncorrelated features using Principal Component Analysis (PCA). These new features are then passed to two clustering algorithms, K-Means and K-Medoids, for cluster analysis. We later performed cluster profiling to understand the distribution of the various features among the clusters.
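
A condensed sketch of this stage follows; `features` is the quarter-wise company-by-feature matrix from the previous step, and the PCA variance threshold and fixed cluster count shown here are placeholders, since the cluster counts were actually chosen by the evaluation in section 5.2.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

X = StandardScaler().fit_transform(features)       # scale to zero mean, unit variance
X_pca = PCA(n_components=0.95).fit_transform(X)    # keep 95% of the variance (assumed)

kmeans_labels = KMeans(n_clusters=3, random_state=42).fit_predict(X_pca)
kmedoids_labels = KMedoids(n_clusters=3, random_state=42).fit_predict(X_pca)
```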

3.4 Portfolio Generation

After clustering, we filtered the stocks based on the technical indicators that best define the nature of each stock. We then passed them through the efficient frontier portfolio generation technique of Modern Portfolio Theory to select the optimal portfolios and their respective stock allocations for that period. The portfolios returned had the best annualized returns and Sharpe ratio values for the corresponding risk-free rates at that time.

3.5 Visualization and Deployment

Our interactive web UI is an amalgamation of all the OHLCV plots for each sector and company, the distribution of the generated features, EDA plots, and the cluster analysis with cluster profiles. Finally, the stock portfolios are generated for each quarter and over the two-year period. The web UI is built with the Flask framework and deployed on an Amazon EC2 instance.

4. Methodology

4.1 EDA

Graphs were plotted to understand the influence of the technical indicators on the price of a stock. Figure 2 shows a set of sample graphs plotted for the stock price of Amazon (AMZN) during the period April 2018 to March 2020.

Figure 2: Plot of ADOSC, ATR, EMA, ROC, RSI indicator for Amazon (left to right row-wise)

To further understand our data, we performed a fundamental analysis of how each company behaved over the two-year period. Plots of the companies that traded the highest volumes and had the best opening values, best closing values, and largest daily fluctuations are shown in figures 3 and 4. These visualizations helped us understand the distribution of the top companies across the various features.

Figure 3: Top 10 companies based on open values (sample)
Figure 4: Top 10 companies based on open, close, daily fluctuation and stock volumes in a quarter(sample)

To understand which companies are more in the news in a given quarter and their cumulative sentiment, we plotted the top 10 companies for every quarter. From figure 5, we observed that top companies vary from quarter to quarter based on the news sentiment.

Figure 5: Quarterwise top 10 companies with the most positive cumulative sentiment starting from April 2018

4.2 Cluster Analysis

To generate portfolio candidates from 200 companies, we had to shortlist companies from this large pool. The first step was to perform quarter-wise clustering on the input features to form clusters of varied profiles. We used two clustering approaches, K-Means and K-Medoids. For both models, clustering was performed in two ways:

  1. OHLCV features (ATR, EMA, ADOSC, ROC, RSI, Sharpe ratio, average returns) + weighted news sentiment
  2. OHLCV features (ATR, EMA, ADOSC, ROC, RSI, Sharpe ratio, average returns)

To determine the optimal number of clusters, the clusters were evaluated using silhouette coefficient analysis and the elbow method. After performing quarter-wise clustering, data profiling was carried out in two ways.

  1. For each cluster, the mean values of the features were calculated, giving insight into each cluster's profile.
  2. Companies that stayed together across all the quarters were identified, which helped us spot companies with similar behavior over the entire two-year period. Figures 7 and 8 show the insights derived using the K-Medoids clustering algorithm. The two different lists show that the sentiment feature does play a role in cluster membership across the quarters.
Figure 6: Cluster profile (sample)
Figure 7: List of the sets of companies that were always together in all 8 quarters (without the sentiment feature)
Figure 8: List of the sets of companies that were always together in all 8 quarters (with the sentiment feature)

4.3 Portfolio Generation

Modern Portfolio Theory (MPT) is an investment theory developed by Harry Markowitz and published under the title “Portfolio Selection” in the Journal of Finance in 1952. It states that for any given level of risk there is a set of optimal portfolios offering the highest expected return.

By investing in more than one stock, an investor can reap the benefits of diversification. If there are multiple stocks in your portfolio, you have to ensure that these stocks do not exhibit similar behavior. The following steps were taken to generate portfolios:

4.3.1. Stocks Selection — The stocks for each quarter were selected by grouping the stocks from each cluster that satisfied the criteria below (a sketch of these filters follows the list):

1) Stocks with the maximum EMA values, indicating that the stock has been doing well recently if it is on a positive trend.

2) Stocks with the maximum ROC values, indicating that the stock is moving positively compared to its value some time back. By requiring a positive ROC we mitigate the risk of missing out on profits from those stocks.

3) Stocks whose RSI values lie between 30 and 70, ensuring that they fall into neither the overbought nor the oversold category and have an optimal trading trend. This was done to avoid trend reversals.

4) Stocks with higher ATR values and positive ROC, indicating that the stock is on an increasing trend and its returns can be higher.
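
A hypothetical pandas sketch of these four filters; `cluster_df` holds one cluster's companies with their indicator columns, and `top_n` is an illustrative cutoff, not a value from the project.

```python
def select_stocks(cluster_df, top_n=5):
    candidates = cluster_df[
        (cluster_df["roc"] > 0)                  # positive momentum (criteria 2 and 4)
        & (cluster_df["rsi"].between(30, 70))    # neither overbought nor oversold (criterion 3)
    ]
    # Favor strong recent trends (EMA, criterion 1) and higher ATR (criterion 4).
    return candidates.nlargest(top_n, ["ema", "atr"])
```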

4.3.2. Stocks Pair Generation — From the set of companies shortlisted by the above process, we form stock pairs that are negatively correlated and whose covariance is less than the mean covariance, to ensure diversification. These pairs were then grouped further by the same process to form the potential stock portfolios. This was done for the companies shortlisted in each quarter as well as for the companies aggregated by the stock selection process over the two-year period.
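
The pairing rule can be sketched as below, assuming `returns` is a DataFrame with one column of daily returns per shortlisted company.

```python
import itertools

corr = returns.corr()
cov = returns.cov()
mean_cov = cov.to_numpy().mean()

pairs = [
    (a, b)
    for a, b in itertools.combinations(returns.columns, 2)
    if corr.loc[a, b] < 0 and cov.loc[a, b] < mean_cov   # negatively correlated, low covariance
]
```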

4.3.3. Final Portfolio Selection & Generation — We employed two different techniques for the final portfolio selection.

4.3.3.1. Efficient frontier portfolio generation — The efficient frontier technique simulates random portfolios. For each potential portfolio, 25,000 iterations were performed, each assigning a random allocation of weights and calculating the resulting volatility, returns, and Sharpe ratio. Figure 9a shows the efficient frontier graph for a potential portfolio. The portfolio with the highest Sharpe ratio is marked with a red star and the portfolio with the lowest Sharpe ratio with a green star; the remaining portfolios are the blue dots, where a darker shade of blue means a higher Sharpe ratio for that combination of return and volatility. The blue dots form an arc: the efficient frontier. The portfolios on this arc have the maximum return for their level of volatility, while portfolios below the frontier have lower returns for the same volatility. For a portfolio to be considered safe, it needs low volatility and high returns, and it should lie on the efficient frontier line connecting the red star and the green star.

Figure 9b indicates the best distribution of the given stock considering the volatility and annualized returns.

Overall, figure 9 shows that all portfolios lying below the efficient frontier line (the dotted line on the graph) are suboptimal, as they give lower returns for the same risk. Portfolios to the right of the efficient frontier line carry higher risk for the same rate of return.

Figure 9: Efficient frontier and individual portfolio optimization of the potential stock portfolios
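
The random-allocation search behind figure 9 can be condensed into a sketch like the following; annualizing with 252 trading days and a zero risk-free rate are simplifying assumptions.

```python
import numpy as np

def best_allocation(returns, n_iter=25_000):
    """returns: DataFrame of daily returns for one candidate portfolio."""
    mean = returns.mean().to_numpy() * 252        # annualized expected returns
    cov = returns.cov().to_numpy() * 252          # annualized covariance matrix
    best_sharpe, best_weights = -np.inf, None
    for _ in range(n_iter):
        w = np.random.dirichlet(np.ones(len(mean)))   # random weights summing to 1
        ret = w @ mean
        vol = np.sqrt(w @ cov @ w)
        if ret / vol > best_sharpe:
            best_sharpe, best_weights = ret / vol, w
    return best_sharpe, best_weights
```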

4.3.3.2. Min-Max portfolio generation — The min-max method is a portfolio generation technique that keeps only those stocks which appeared consistently in each of the quarters, indicating that they have performed well over the entire two-year period. The idea is to keep every stock whose appearance count in the stock selection process exceeds (min_appearance_count + max_appearance_count)/2 over the two-year period. The resulting potential portfolio was then passed to the efficient frontier technique above to obtain its Sharpe distribution and stock allocation weights.
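
A short sketch of this reduction, assuming `quarterly_selections` is a list of per-quarter shortlists of ticker symbols:

```python
from collections import Counter

counts = Counter(t for quarter in quarterly_selections for t in quarter)
threshold = (min(counts.values()) + max(counts.values())) / 2
consistent_stocks = [t for t, c in counts.items() if c > threshold]
```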

4.3.4. Results obtained — Below are the observed results for the various combinations of the portfolio selection & generation procedures explained in section 4.3.3. and their corresponding inference:

1. The efficient frontier portfolio generation technique provided better annualized returns and Sharpe ratio values for the stock pairs than the min-max technique in the 3-cluster configurations of the clustering algorithms.

2. Portfolios created with K-Medoids (3 clusters) and the efficient frontier technique have a higher Sharpe ratio than the results obtained with the K-Means model using either the min-max or the efficient frontier technique, as shown in figures 10 and 11.

3. News sentiment affects quarter-wise portfolio creation. Figures 12 and 13 show that portfolios constructed using the sentiment feature have a higher portfolio Sharpe ratio in most cases.

4. For our final selected K-Medoids clustering model, figure 14 shows that the Sharpe ratios and annualized returns of the portfolios generated over the two-year period with and without sentiment were comparable. This supports the efficient market hypothesis, which holds that OHLCV signals already reflect the news.

Figure 10: Results (without the sentiment feature)
Figure 11: Results (with the sentiment feature)
Figure 12: Quarter-wise results for K-Medoids, 3 clusters (without the sentiment feature)
Figure 13: Quarter-wise results for K-Medoids, 3 clusters (with the sentiment feature)
Figure 14: Suggested portfolios with K-Medoids, 3 clusters

4.4 Tools Used

Kafka — For streaming and populating the data.

Cassandra — For data storage.

Apache Spark — For news sentiment analysis.

Pandas — For data engineering, machine learning, and portfolio generation.

NLTK-VADER — For sentiment score generation on company-related and common news data.

TA-Lib — For generation of the OHLCV technical indicators in feature engineering.

Scikit-Learn & Scikit-Learn-Extra — For machine learning and cluster analysis.

Matplotlib, Seaborn & Plotly Dash — For EDA and visualization of data.

AWS, Flask, HTML & CSS — For visualization and deployment of the final data product.

5. Evaluation Methodologies Used

5.1 EDA

The EDA task gave us a good grasp of the data we collected. Overall sector performance was analyzed before being included in further analysis. While common economic and business news was available daily, company-specific news was not available daily for every company. So we created a weighted sentiment feature giving more weight to company news, which helped us analyze the impact of company news more effectively.

5.2 Cluster evaluation

5.2.1. Silhouette method — The silhouette coefficient is calculated for each sample from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) as (b - a) / max(a, b). The score ranges from -1 to 1; a higher value means the data points in a cluster are closely matched to their own cluster and farther from other clusters. Figure 15 below shows the variation in the silhouette coefficient over the data of all 8 quarters.

Figure 15: Silhouette coefficient analysis for the K-Medoids algorithm for all quarters

5.2.2. Elbow Method — The idea behind the elbow method is to run k-means clustering on the dataset for a range of values of k, where k is the number of clusters, and to calculate the sum of squared errors (SSE) for each value. We then plot the SSE against k. If the line graph looks like an arm, the “elbow” of the arm marks the optimal k: a small value of k with low SSE. However, the elbow cannot always be identified unambiguously.

Figure 16: Elbow analysis for K-Medoids algorithm for all quarters

As observed in figures 15 and 16, the silhouette coefficient scores for all quarters favored 3 clusters, while the elbow method scores for the different quarters were less uniform and clear. So we performed clustering with both 3 and 4 clusters.
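
A sketch of the per-quarter evaluation behind figures 15 and 16, reusing the PCA-transformed features (X_pca) from section 3.3; it is shown with K-Means, and the K-Medoids loop is analogous.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 9):
    model = KMeans(n_clusters=k, random_state=42).fit(X_pca)
    sse = model.inertia_                              # SSE for the elbow plot
    sil = silhouette_score(X_pca, model.labels_)      # silhouette coefficient
    print(f"k={k}: SSE={sse:.2f}, silhouette={sil:.3f}")
```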

5.3 Portfolio Evaluation

Companies shortlisted from cluster analysis are grouped into portfolio candidates only if their correlation is below 0.5 and their covariance is less than the mean covariance. The higher the Sharpe ratio, the better a portfolio's returns relative to the risk of the investment. Portfolios generated by the min-max and efficient frontier techniques are therefore evaluated on the Sharpe ratio with the following criteria.

Sharpe Ratio >= 3 implies that it is an “Excellent Portfolio”

Sharpe Ratio >= 2 implies that it is a “Very Good Portfolio”

Sharpe Ratio >= 1 implies that it is a “Good Portfolio”

Sharpe Ratio < 1 implies that it is a “Bad Portfolio”
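
These thresholds translate directly into a small rating helper; the annualized return and volatility come from the efficient frontier step, and a zero risk-free rate is a simplifying assumption here.

```python
def rate_portfolio(annual_return: float, annual_volatility: float) -> str:
    sharpe = annual_return / annual_volatility
    if sharpe >= 3:
        return "Excellent Portfolio"
    if sharpe >= 2:
        return "Very Good Portfolio"
    if sharpe >= 1:
        return "Good Portfolio"
    return "Bad Portfolio"
```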

6. Data Product

Our data product “StockZilla” is a Web UI containing a collection of various visualizations of OHLCV and News data, cluster analysis and portfolio generation. It is a good combination of dynamic and static plots to give the user multiple insights. It consists of 5 main modules as follows:

6.1 View OHLCV plots

This is the first section of the web app. As shown in figure 17, the user can select a sector and a company in that sector to see the Open, High, Low, Close trends for the selection at a glance, for any quarter or over the entire two-year period.

Figure 17: OHLCV data visualization

6.2 View Feature Plots

In this section, the user can select a sector, a company in that sector, and any one of the derived features to visualize. Similar to the previous section, figure 18 shows the sector-wise and company-specific pattern of the selected feature over a particular quarter and over the entire two-year period.

Figure 18: Features behavior for a company

6.3 Cluster Visualization

This section is the heart of the application. Figure 19 shows the K-means and K-medoids cluster formation with and without news sentiments over any selected quarter. By hovering over any of the stocks in the cluster, the user can see the quarterly trends of the selected feature for that particular stock. This would help the user see similar trends of the stocks belonging to the same cluster and how they vary over different clusters. It also shows a cluster profile table that represents the distribution of the mean values of the features in the formed cluster.

Figure 19: Cluster visualization

6.4 Portfolio Generation

This section includes the final result of this project. It shows the final portfolios generated using K-Means and K-Medoids clustering, with and without sentiment, via three stock selection methods: min-max, the efficient frontier top 5 portfolios over the two years, and quarter-wise generated portfolios. Users can apply their own discretion to choose among these stock portfolios for the best annualized returns. We suggest that the portfolios generated from K-Medoids with 3 clusters are the best.

Figure 20: Portfolio with K-Medoids and 3 clusters (sample)

6.5 Static Visualization

This section includes various static bar plots and line plots. The bar plots show the top 10 company trends for a selected feature over all the quarters. The line plots show the top 10 companies with the highest value of a feature over the two-year period. The cluster EDA section shows the constant membership of companies across the clusters.

Figure 21: Stock visualization

7. Lessons Learnt

This project gave us a great deal of learning about the financial stock market sector and helped us learn to think like data scientists. We understood that the best way to learn data science is by doing data science. Here is the list of things we learned from this project:

1. Improved understanding of the financial sector and the intricacies involved in stock portfolio generation, such as:

· Identifying technical indicators derived from the OHLCV data.

· Understanding the effect of news data on stock portfolio generation.

· Stock portfolio allocation methodologies.

2. Acquisition and streaming of data through Kafka using publicly available APIs such as Alpha Vantage and the New York Times.

3. We learned to use natural language processing to process textual data and to integrate it with numerical data.

4. While designing the UI, we learned to use Plotly Dash to generate dynamic graphs by reading data from the database.

5. To decide on a database, we explored options like MySQL, HBase, MongoDB, and Cassandra. We found Cassandra the best fit for our application because of its efficiency in bulk writes and its highly fault-tolerant nature.

8. Summary

Our project is a one-stop solution for settling on the right stock portfolio by utilizing historical stock market data and news information. We created a lightweight application that employs the K-Medoids clustering model along with the efficient frontier portfolio generation technique. While the clustering algorithm significantly reduced the number of companies considered for portfolio generation, and with it the time complexity, the efficient frontier technique helped us optimize the stock allocation strategy. We successfully employed NLP methods to process text and generate sentiments from news data. The deployed web UI is a useful tool for investors and financial advisors, saving the time and effort of gathering and analyzing data from different sources. Overall, we suggest to users a diverse stock portfolio with the best annualized returns. The interactive web UI provides the end-user with visualizations of each technical indicator and the cluster distribution, and the suggested stock portfolio allocations, based on the efficient frontier technique, enable users to make informed decisions.

9. References

1. https://towardsdatascience.com/efficient-frontier-portfolio-optimisation-in-python-e7844051e7f

2. https://medium.com/python-data/efficient-frontier-portfolio-optimization-with-python-part-2-2-2fe23413ad94

3. https://www.investopedia.com/technical-analysis-4689657

4. http://cs229.stanford.edu/proj2017/final-reports/5212256.pdf

5. http://www.iaeng.org/publication/IMECS2016/IMECS2016_pp317-321.pdf
