Machine Learning for Stock Price Forecasting (1/3)

This post takes you inside the workings of a four-month project on developing a machine learning algorithm for stock predictions, carried out under the supervision of Schulich Professor Zhepong (Lionel) Li. This is part 1 of a 3-part series. All code is written in Python. The data science and model construction were done with the scikit-learn, NumPy, and pandas packages.

  • Part 1 — Overview: Using Machine Learning to make data-driven decisions
  • Part 2 — The Math: Applying Supervised Machine Learning (code attachment here)
  • Part 3 — The Finance: Inferences in Stock Behavior

Why did I do this?

Since my late high school years, I’ve had a deep interest in financial markets. I’ve tried many different strategies and put a lot of thought into finding one that consistently makes money in the stock market. I’m a believer that over the short term (under 1 year) stock prices move in wave patterns, and that understanding these patterns can help us understand stock price movements. The efficient market hypothesis tells us that all relevant information is already factored into a stock’s price, meaning that neither fundamental nor technical analysis can be used to achieve superior gains in the short or long term. Lots of research has already gone into figuring out how stock prices move, like here and here. This experiment challenges the efficient market hypothesis. My hypothesis is simple: by mining for patterns in data using supervised machine learning techniques, I can construct a model and trading strategy that beats the market. After all, I am only following the many algorithmic and speculative traders who continue to exploit the market in the short term. This has always been a passion topic for me, so here goes nothing…

Data Collection

The data used in my project was collected from the Quandl database. I used one of my favorite stocks, Amazon (NASDAQ: AMZN), to build the model. The data contains daily stock information from 12/1/2000 to the present (because I’m on the free plan, the data is always at least one week behind). I used 80% of the data for training and 20% for testing.
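The 80/20 split can be sketched as follows. This is a minimal illustration, not my project code: I am assuming a chronological split (oldest 80% for training, newest 20% for testing), which is a common choice for time series because it keeps future prices out of the training set. The function name `chronological_split` and the synthetic prices are made up for the example.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, train_fraction: float = 0.8):
    """Split a date-indexed frame into train/test sets without shuffling."""
    df = df.sort_index()  # make sure rows are ordered oldest to newest
    cutoff = int(len(df) * train_fraction)
    return df.iloc[:cutoff], df.iloc[cutoff:]


# Synthetic stand-in for daily price data (these are not real AMZN quotes).
dates = pd.bdate_range("2000-12-01", periods=10)
prices = pd.DataFrame(
    {"Close": [10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.3, 11.1, 11.6, 11.4]},
    index=dates,
)

train, test = chronological_split(prices)
print(len(train), len(test))  # 8 2
```

Every training row is older than every test row, so the model is always evaluated on days it has not seen.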

Summary of data inputs

In essence, I will apply binary classification to produce a bull or bear signal at time t (where t = 1, 2, 3, …). To keep things short, I used daily labeling as follows: label “1” if the next day’s closing price is higher than the current day’s closing price, otherwise label “0”. Finally, I will feed this labeled sequence to multiple models to evaluate performance. The most important metric for measuring performance is accuracy, which I have defined below.

Accuracy = Number of days model correctly classified the testing data / Total number of testing days
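To make the labeling rule and the accuracy formula concrete, here is a small sketch in pandas. The helper names `next_day_labels` and `accuracy` and the toy closing prices are assumptions for illustration only. I am also assuming that the final day is dropped, since it has no next-day close to compare against.

```python
import pandas as pd


def next_day_labels(close: pd.Series) -> pd.Series:
    """Label 1 if the next day's close is above today's close, else 0."""
    next_close = close.shift(-1)  # align tomorrow's close with today's row
    labels = (next_close > close).astype(int)
    return labels.iloc[:-1]  # the last day has no next-day close


def accuracy(predicted, actual) -> float:
    """Correctly classified days divided by total number of days."""
    predicted, actual = list(predicted), list(actual)
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


close = pd.Series([10.0, 10.5, 10.2, 10.8, 10.8, 11.1])
labels = next_day_labels(close)
print(labels.tolist())  # [1, 0, 1, 0, 1]

model_predictions = [1, 1, 1, 0, 1]  # hypothetical model output
print(accuracy(model_predictions, labels))  # 0.8
```

Note that an unchanged close (10.8 to 10.8) gets label 0, because the rule only assigns 1 when the next close is strictly higher.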


It’s easy to observe that some models are steadier than others. Keep in mind, only 9 features were used to construct this model, and adding more information-rich features is one way to cancel out the noise and separate the good models from the bad.

The longer the forecast horizon, the more accurate our predictions become. It’s no surprise that next-day predictions are not much better than the odds of correctly predicting a coin toss (observed by tracking the null accuracy). The null accuracy is inversely related to the forecast period. The spread between the null model and the other models helps describe how efficient each model is.
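For readers unfamiliar with the term, null accuracy is usually defined as the accuracy of a model that always predicts the most frequent class. The sketch below assumes that definition; the function names and the toy labels and predictions are hypothetical and are not results from my project.

```python
import numpy as np


def null_accuracy(labels) -> float:
    """Accuracy of always predicting the most frequent class (0 or 1)."""
    labels = np.asarray(labels)
    share_up = labels.mean()  # fraction of days labeled 1
    return float(max(share_up, 1.0 - share_up))


def spread_over_null(predicted, labels) -> float:
    """How far a model's accuracy sits above (or below) the null model."""
    predicted, labels = np.asarray(predicted), np.asarray(labels)
    model_accuracy = float(np.mean(predicted == labels))
    return model_accuracy - null_accuracy(labels)


labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]       # toy test labels, 7 of 10 are up days
predictions = [1, 1, 0, 1, 0, 1, 1, 1, 1, 0]  # toy model output, 8 of 10 correct

print(round(null_accuracy(labels), 2))                  # 0.7
print(round(spread_over_null(predictions, labels), 2))  # 0.1
```

A spread near zero means the model is doing little more than guessing the majority class.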

Trading Strategy

Professional quant traders on Wall Street and Bay Street achieve up to 55% accuracy in predicting next-day stock prices, and up to 80% accuracy in predicting stock prices 30 days out. Adding more features can help me filter out the noise and reach these levels. For the time being, I have devised a trading strategy based on the current model to analyze how well I would do trading the stock. Since the SVC had the highest accuracy, I’ve used it to drive this strategy. The model tells me to sell the stock if the next day’s price is predicted to fall and to buy if it is predicted to rise. My model produced an average ROI of 1.3% per month from 2008–2010, equal to 31.2% in total (1.3% × 24 months, not compounded). Over the same time frame, a buy-and-hold strategy in the S&P 500 would have returned -6.9% (due to the recession), and holding Amazon’s stock would have returned 152%. My model does not produce the best results, but I do believe it’s a strong start. Stock prices during this time frame were greatly influenced by macroeconomic factors, which my model does not factor in. Better returns will come with more diverse and information-rich features.
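A rough sketch of how such a signal-driven strategy can be compared with buy and hold is shown below. It rests on several assumptions that go beyond what is described above: the strategy is long-or-flat (hold the stock on a predicted up day, hold cash otherwise), trades happen at the close, returns are compounded, and there are no transaction costs. The function names, prices, and signals are made up for illustration.

```python
import numpy as np


def strategy_return(close, signals) -> float:
    """Compounded return of holding the stock only on predicted up days.

    signals[t] is the prediction made at day t for the move from t to t+1,
    so it must contain one entry fewer than `close`.
    """
    close = np.asarray(close, dtype=float)
    signals = np.asarray(signals)
    daily_returns = close[1:] / close[:-1] - 1.0  # return from day t to t+1
    # Earn the stock's return when the signal is 1, otherwise stay in cash.
    strategy_daily = np.where(signals == 1, daily_returns, 0.0)
    return float(np.prod(1.0 + strategy_daily) - 1.0)


def buy_and_hold_return(close) -> float:
    """Return from buying on the first day and selling on the last."""
    close = np.asarray(close, dtype=float)
    return float(close[-1] / close[0] - 1.0)


close = [100.0, 110.0, 99.0, 108.9]  # toy prices: +10%, -10%, +10%
signals = [1, 0, 1]                  # hypothetical model output

print(round(strategy_return(close, signals), 4))  # 0.21
print(round(buy_and_hold_return(close), 4))       # 0.089
```

In this toy case the signals sidestep the one down day, so the strategy ends ahead of buy and hold. With real predictions that are wrong nearly half the time, and with trading costs included, the gap can easily go the other way.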

Challenges and Shortcomings

  • Because different stocks behave differently, prediction accuracy will vary from stock to stock. However hard you work at it, some data sets are simply easier to work with than others.
  • The current model is narrow in scope and does not factor in macroeconomic indicators such as real GDP, inflation, or interest rates, which can have drastic effects on some stocks. 9 features are certainly not enough to represent a stock’s price (but they did allow us to draw some conclusions).
  • Special events like quarterly earnings are not accounted for. NLP could be used to measure the sentiment of analyst reports and provide an indication of the stock’s direction.

My model only brings us a step closer to understanding stock price behavior. By improving the model, we can achieve higher accuracy and invest more confidently, all backed by data-driven insight.

If you’re a financier who is passionate about financial markets or a tech talker/data diver/machine learning maniac/#whateveryoucallyourself, feel free to get in touch. The opportunity to collaborate on advancing the model further can get us a step closer to reaching true alpha in understanding the market.

Connect with me on LinkedIn/follow me on Medium/fork me on Github