How to build a simple Stock Movement Classifier

This post shall explore the steps involved in building a Stock Movement Classifier.

Zain Farrukh
Analytics Vidhya
Jan 4, 2021


Is it possible to master the financial markets? This age-old question has occupied investors since the advent of the stock market back in 1602.

Business Case

Ever since financial markets emerged, people have tried to predict them and to make market psychology work in their favor. As technology advances, we have unprecedented computational power to uncover patterns in financial data that have so far eluded traders. The purpose of this project is to harness that computing power by building an ML classifier that predicts whether a stock will go up or down on any given day. Such a classifier could be particularly useful to investment managers, giving them an additional predictive signal in their investment management practices.

In this post, we shall attempt to answer the following questions as a first step in our journey towards creating trading strategies using the latest AI/ML and Data Science tools:

Q1. Can historical price and volume data serve as potential features to predict the respective stock prices?

Q2. Does overall historical stock market volatility correlate with the price data of a given stock, and can it therefore be used as a feature as well?

Q3. Can we use supervised ML classifier models to predict whether the stock price will go up or down on any given day?

Please note that this post does not contain any investment advice. Investments are subject to market risk, and any investment decision should be made with those risks in mind. The objective of this post is to walk readers through the typical process of building a classifier. Many steps not covered here should be carried out before deploying the final model in a live trading environment.

To check the predictive power of our ML model, we will ask the classifier to predict, for each day, whether the stock will go up or down, compare that prediction with the stock's actual movement by the end of the day, and calculate the model's accuracy as the share of days it got right.

I used Python with numerical libraries such as pandas and NumPy, working in a Jupyter notebook, for this analysis.

Process adopted

To answer the questions above, we will explore price data for Google stock (ticker: ‘GOOG’). We will adopt the following approach to create a stock classifier:

  1. Get the pricing data
  2. Clean and visualize the data to identify any missing and outlier data
  3. Create the potential features which can be used to train the model
  4. Check the correlation of the features to view how closely features are related to each other
  5. Split the feature data into training and testing sets and check whether it needs scaling
  6. Train the classifier on the training data using scikit-learn
  7. Test and measure the performance
  8. Tune the hyperparameters
  9. Compare the models and validate the chosen model on out-of-sample data

Data Understanding

We will use the Yahoo Financials library to get the stock market data. We will use price data from 1 July 2019 to 1 October 2020 and split it in a 70:30 ratio, with 30% held back to test the model. Refer to the library's documentation for details on the available calls.
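As a rough sketch of this step, the snippet below shows one way the download could look. It assumes the `yahoofinancials` package and its `get_historical_price_data` call, and it assumes the response is a dictionary keyed by ticker with a `prices` list whose entries carry a `formatted_date` field. The helper names `prices_to_frame` and `fetch_prices` are made up for illustration, and the sample values are placeholders, not real quotes.

```python
import pandas as pd


def prices_to_frame(raw: dict, ticker: str) -> pd.DataFrame:
    """Turn a yahoofinancials-style response into a date-indexed DataFrame."""
    frame = pd.DataFrame(raw[ticker]["prices"])
    frame["date"] = pd.to_datetime(frame["formatted_date"])
    columns = ["open", "high", "low", "close", "volume"]
    return frame.set_index("date").sort_index()[columns]


def fetch_prices(ticker: str, start: str, end: str) -> pd.DataFrame:
    """Download daily prices (needs network access and the yahoofinancials package)."""
    # Imported lazily so the offline example below runs without the package.
    from yahoofinancials import YahooFinancials

    raw = YahooFinancials(ticker).get_historical_price_data(start, end, "daily")
    return prices_to_frame(raw, ticker)


# Offline illustration with a hand-made response of the assumed shape.
# The numbers are placeholders, not real GOOG prices.
sample = {
    "GOOG": {
        "prices": [
            {"formatted_date": "2019-07-02", "open": 101.0, "high": 103.0,
             "low": 100.0, "close": 102.0, "volume": 1200},
            {"formatted_date": "2019-07-01", "open": 100.0, "high": 102.0,
             "low": 99.0, "close": 101.0, "volume": 1000},
        ]
    }
}
goog = prices_to_frame(sample, "GOOG")
print(goog)
```

With network access, `fetch_prices("GOOG", "2019-07-01", "2020-10-01")` would request the date range used in this post.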

Clean and Visualize

Once the data has been obtained, it should be checked for outliers and missing values. We can use the built-in pandas ‘describe’ method to inspect summary statistics (mean, standard deviation, quartiles, minimum and maximum), which give a first impression of how the data is distributed and whether it contains extreme values. We can count missing values using the pandas ‘isna’ method, and we can also plot the data to spot any anomalies visually.
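A minimal sketch of these checks is shown below. The ‘describe’ and ‘isna’ calls are the pandas methods mentioned above; the interquartile-range rule used to flag outliers is my own assumption for illustration, not necessarily the rule used in the notebook, and the values are placeholders rather than market data.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Per-column count of missing values and of points outside the IQR fences."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # A point is flagged when it lies more than k * IQR beyond a quartile.
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return pd.DataFrame({"missing": frame.isna().sum(), "outliers": outside.sum()})


# Placeholder values with one obvious spike and one gap.
prices = pd.DataFrame({
    "close": [100.0, 101.0, 99.0, 100.0, 102.0, 98.0, 100.0, 500.0],
    "volume": [10.0, 11.0, np.nan, 10.0, 12.0, 9.0, 10.0, 11.0],
})

print(prices.describe())   # summary statistics per column
report = quality_report(prices)
print(report)              # flags the 500.0 spike and the missing volume
```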

Volume data
Price data

Once the data is clean, we can move on to creating the features that we think are relevant to the questions identified above.

Feature Engineering

This is a step we can iterate on until we arrive at features that make economic sense and also serve as good predictors of future price movements. In this post, we will use the following features to train our classifier model:

  1. Open: Opening Price of any given day
  2. Volume: Prior day volume of the stock
  3. SMA_20: Simple Moving Average over a 20-day window
  4. Std_dev: Standard Deviation over the 20-day window
  5. Band_1: Bollinger band created using SMA_20 (+) Std_dev
  6. Band_2: Bollinger band created using SMA_20 (-) Std_dev
  7. ON_returns: Overnight return, i.e. whether there was an up or down move from the prior day's closing price to the current day's opening price
  8. dist_from_mean: How far the stock price is from the mean
  9. vix_data: CBOE Volatility index price from the prior day
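The feature list above can be sketched in pandas roughly as follows. Several details are assumptions on my part, since the post does not pin them down: the rolling statistics are computed on prior closing prices so that each row only uses information available at the open, dist_from_mean is taken as the opening price minus SMA_20, and the label is 1 when the close is above the open. The function name `build_features` and the random input data are purely illustrative.

```python
import numpy as np
import pandas as pd


def build_features(prices: pd.DataFrame, vix_close: pd.Series, window: int = 20) -> pd.DataFrame:
    """Build the nine features plus an up/down label from daily prices."""
    prev_close = prices["close"].shift(1)
    feats = pd.DataFrame(index=prices.index)
    feats["Open"] = prices["open"]
    feats["Volume"] = prices["volume"].shift(1)              # prior-day volume
    feats["SMA_20"] = prev_close.rolling(window).mean()      # assumption: based on prior closes
    feats["Std_dev"] = prev_close.rolling(window).std()
    feats["Band_1"] = feats["SMA_20"] + feats["Std_dev"]
    feats["Band_2"] = feats["SMA_20"] - feats["Std_dev"]
    feats["ON_returns"] = (prices["open"] > prev_close).astype(int)   # 1 = gap up
    feats["dist_from_mean"] = prices["open"] - feats["SMA_20"]        # assumption
    feats["vix_data"] = vix_close.reindex(prices.index).shift(1)      # prior-day VIX
    feats["target"] = (prices["close"] > prices["open"]).astype(int)  # assumption: 1 = up day
    return feats.dropna()  # drops the warm-up rows of the rolling window


# Synthetic random-walk data, only to show the shapes involved.
rng = np.random.default_rng(7)
dates = pd.bdate_range("2020-01-01", periods=60)
close = pd.Series(100 + rng.normal(0, 1, 60).cumsum(), index=dates)
prices = pd.DataFrame({
    "open": close.shift(1).fillna(100.0) + rng.normal(0, 0.5, 60),
    "close": close,
    "volume": pd.Series(rng.integers(1_000, 5_000, 60).astype(float), index=dates),
}, index=dates)
vix = pd.Series(20 + rng.normal(0, 1, 60), index=dates)

features = build_features(prices, vix)
print(features.tail())
```

Shifting the inputs by one day is a deliberate choice in this sketch: it keeps same-day closing information out of the features, which would otherwise leak the answer into the model.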

You can check my GitHub repository to access the full notebook:

Are the features correlated?

Once we have all the necessary features, we can use a correlation matrix to check how strongly the features are correlated with each other and with the closing price.
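In pandas, a correlation matrix comes from the `corr` method of a DataFrame. The sketch below runs it on synthetic columns (a trending price, a noisy copy of it, and unrelated noise), so the numbers it prints say nothing about GOOG. The helper `correlation_with` is a hypothetical convenience for ranking features against one column.

```python
import numpy as np
import pandas as pd


def correlation_with(frame: pd.DataFrame, column: str) -> pd.Series:
    """Pearson correlation of every other column with `column`, strongest first."""
    ranked = frame.corr()[column].drop(column)
    return ranked.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic data: a trending series, a noisy copy of it, and pure noise.
rng = np.random.default_rng(0)
n = 250
close = np.linspace(100.0, 150.0, n) + rng.normal(0, 1, n)
frame = pd.DataFrame({
    "Close": close,
    "Open": close + rng.normal(0, 0.5, n),
    "Noise": rng.normal(0, 1, n),
})

corr = frame.corr()              # full correlation matrix
ranked = correlation_with(frame, "Close")
print(corr.round(2))
print(ranked.round(2))
```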

Correlation matrix of the features

Can historical price and volume data serve as potential features?

Yes. Historical pricing data such as the opening price and the moving average are strongly correlated (above 0.8) with the closing price, which suggests they may carry some predictive power for future prices.

Does market volatility correlate with the prices of a given stock, and can it be used as a feature?

Again, yes. The volatility index has a negative correlation (-0.23) with the stock price, and it is very strongly correlated with the stock’s own volatility as well as its volume. This answers our second question, and we will therefore use it as a feature for our classification model.

Scaling and splitting the features

Now that the features are created, we move on to splitting them. I decided to split the data into training and testing sets at a ratio of 70:30.
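One way to do the split and the scaling is sketched below. Two things here are assumptions rather than the notebook's confirmed approach: the split is chronological (the first 70% of rows train, the last 30% test), and the scaler is scikit-learn's `StandardScaler`, fitted on the training rows only so that nothing about the test period leaks into training. The data is random and `chronological_split` is an illustrative helper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def chronological_split(X, y, test_fraction=0.30):
    """Keep the time order: the first rows train the model, the last rows test it."""
    cut = int(round(len(X) * (1 - test_fraction)))
    return X[:cut], X[cut:], y[:cut], y[cut:]


# Random stand-in for the feature matrix and the up/down labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = chronological_split(X, y)

scaler = StandardScaler().fit(X_train)       # learn mean and std from training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```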

Training the classifier

Once we have scaled and split our data, we feed the training portion to our selected classifier model.

I selected the Random Forest Classifier for this purpose.

Once the model is done training, we can check its accuracy on the testing data.
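For readers who want to see the shape of this step, here is a minimal sketch using scikit-learn's `RandomForestClassifier`, `accuracy_score` and `confusion_matrix`. The data is synthetic and the hyperparameters (100 trees, a fixed random seed) are assumptions, so the accuracy it prints is unrelated to the figures quoted in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic features and labels standing in for the real training data.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# 70:30 split that keeps the row order.
X_train, X_test = X[:210], X[210:]
y_train, y_test = y[:210], y[210:]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
# Rows are actual classes (0 = down, 1 = up), columns are predicted classes.
matrix = confusion_matrix(y_test, predictions, labels=[0, 1])

print(f"accuracy: {accuracy:.2%}")
print(matrix)
```

The diagonal of the confusion matrix holds the correct predictions, so accuracy is simply the diagonal sum divided by the number of test days.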

Confusion Matrix

Tuning the parameters

An accuracy of 52% is not that great, so I will now change the parameters of the model to see whether I can arrive at a better accuracy.
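A simple way to run such a search is to refit the model for several settings and record the test accuracy of each. The sketch below varies the number of trees (`n_estimators`); reading "tree sizes" as the number of trees is my assumption, and the data and the helper name `accuracy_by_tree_count` are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def accuracy_by_tree_count(X_train, y_train, X_test, y_test, tree_counts):
    """Fit one forest per tree count and return {tree_count: test accuracy}."""
    scores = {}
    for n_trees in tree_counts:
        model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        model.fit(X_train, y_train)
        scores[n_trees] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data, split 70:30 in row order.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)

scores = accuracy_by_tree_count(X[:210], y[:210], X[210:], y[210:], [10, 50, 100])
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Picking the best setting by its test accuracy tends to flatter the score, which is one reason the out-of-sample check later in the post matters.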

Accuracy scores at different tree sizes

Comparing Models

As you can see, I was able to increase the accuracy of the model to 58.42% by tuning the parameters. Next, I will check whether another classifier gives a better prediction.

I will now try the Support Vector Classifier. Using the RBF kernel, I was able to get an accuracy of 59.55%.

Certainly better than the previous model!
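A Support Vector Classifier with an RBF kernel can be set up in scikit-learn as sketched below. Wrapping it in a pipeline with `StandardScaler` is my own choice, since SVMs are sensitive to feature scale; `C=1.0` and `gamma="scale"` are simply the library defaults written out, not tuned values from the notebook. The data is synthetic.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with features on very different scales.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
X_train, X_test = X[:210], X[210:]
y_train, y_test = y[:210], y[210:]

# The scaler is fitted inside the pipeline, on the training rows only.
svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svc.fit(X_train, y_train)

svc_accuracy = svc.score(X_test, y_test)   # mean accuracy on the test rows
print(f"SVC accuracy: {svc_accuracy:.2%}")
```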

Confusion Matrix of Support Vector Classifier

We now have a model that tested at 59.55% accuracy, which is better than betting on the stock blindly, purely on chance. However, we still need to test the model on out-of-sample data to make sure it is not overfitting to the in-sample data.


Since the Support Vector Classifier gives better accuracy, I will use it with validation data to check the out-of-sample accuracy of the model. For validation, I will use pricing data from 2 October 2020 to 31 December 2020. The validation results are as follows:

Confusion Matrix of out-of-sample data
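The out-of-sample check amounts to scoring the already fitted model on rows it has never seen, without refitting anything. A minimal sketch follows; the helper `out_of_sample_report` is hypothetical, the data is synthetic, and reporting the share of up days alongside accuracy is my own addition, included because it shows what always predicting "up" would have scored.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def out_of_sample_report(model, X_val, y_val):
    """Score a fitted model on held-out rows; the model is not refitted."""
    predictions = model.predict(X_val)
    return {
        "accuracy": accuracy_score(y_val, predictions),
        "confusion_matrix": confusion_matrix(y_val, predictions, labels=[0, 1]),
        "share_up_days": float(np.mean(y_val)),  # accuracy of always guessing "up"
    }


# Synthetic stand-in: the first 300 rows are in-sample, the last 60 are "later" data.
rng = np.random.default_rng(5)
X = rng.normal(size=(360, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=360) > 0).astype(int)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X[:300], y[:300])

report = out_of_sample_report(model, X[300:], y[300:])
print(report)
```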

So, can we use ML Classifiers to predict the daily stock movement?

Yes, the option is worth exploring: we were able to get an accuracy score of 63.44% on the out-of-sample data, which is certainly better than a coin toss. Encouragingly, the model is more accurate on out-of-sample data than on the test data. However, it is still far from ready for a live trading environment, although ML classifiers are clearly worth exploring further for stock movement prediction.


In this post, we explored whether an ML classifier can be used to predict the daily movement of a stock. We asked whether historical pricing data can be used to predict the movement and whether market volatility (CBOE VIX, used as a proxy) can serve as a feature. Using the correlation table, we found that prior prices and closing prices are strongly correlated, which suggests that prior prices may have some predictive power. VIX, in contrast, was negatively correlated with prices, which makes it a distinct feature that can also be used to predict the stock movement.

Finally, we checked the accuracy of a few classifier models to test whether they can predict stock movements. Based on this preliminary exploration, it does look like, under some conditions, we can use ML classifiers to predict the expected movement of stock prices.

We can take several steps to improve the classifier. We can explore additional features to try to make the model more accurate. We can also build a stop-loss rule around the model before taking it into production, which should improve the chances that it is profitable in the long term. Finally, we should back-test the strategy to check whether it produces profitable results.

This analysis has given us several leads that we can follow to make better investment decisions using the latest ML techniques.


