Stock Trend Prediction with Technical Indicators
Feature engineering and Classification model with Python code
A predictive model that correctly forecasts future trends is crucial for investment management and algorithmic trading. Technical indicators are widely used by traders for financial forecasting. Many of these indicators require an input window length, a time-frame parameter that must be set before they can be calculated.
Here, we will investigate Dow Jones Industrial Average (DJI) data. This is time series data: the feature space is derived from the time series itself and is concerned with the potential movement of past prices. Here, I have performed a short-term prediction with a 1-day window; however, it is always advisable to experiment with a range of window sizes (e.g. 3, 5, 7, 10, 15, 20, 25 and 30 days) when predicting the price trend.
The data is collected through an API call to worldtradingdata. Let’s load the data and see what it looks like.
We will be predicting the ‘close’ price. From the plot we can see that the trend is highly non-linear and it is very difficult to capture the trend using this information.
Features of Technical Indicators
We need TA-Lib, which makes life much easier when it comes to technical indicators.
Simple Moving Average (SMA)
SMA is calculated by adding the price of an instrument over a number of time periods and then dividing the sum by the number of time periods. The SMA is basically the average price of the given time period, with equal weighting given to the price of each period.
Formula: SMA = ( Sum ( Price, n ) ) / n
Here: n = Time Period
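The SMA formula above can be sketched with a pandas rolling window. The prices and the helper name `sma` are made up for illustration; this is not the article's TA-Lib call.

```python
import pandas as pd

def sma(prices: pd.Series, n: int) -> pd.Series:
    """Simple moving average: equal-weighted mean of the last n prices."""
    # The first n-1 entries are NaN because the window is not yet full.
    return prices.rolling(window=n).mean()

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])  # made-up closing prices
sma_3 = sma(close, 3)
print(sma_3.tolist())  # [nan, nan, 11.0, 12.0, 13.0]
```

Each value is Sum(Price, n) / n over the most recent n rows, so a larger n gives a smoother but slower-reacting line.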
Exponential moving average (EMA)
Although the EMA can be calculated by hand, to keep it simple I have used the pandas “ewm” function here.
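A minimal sketch of the pandas `ewm` approach, assuming a span of 3 and made-up prices. The article does not state which span or `adjust` setting it used, so both are assumptions here.

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])  # made-up closing prices

# span=3 gives a smoothing factor alpha = 2 / (span + 1) = 0.5.
# adjust=False uses the recursive form: ema_t = alpha * price_t + (1 - alpha) * ema_{t-1}.
ema_3 = close.ewm(span=3, adjust=False).mean()
print(ema_3.tolist())  # [10.0, 10.5, 11.25, 12.125, 13.0625]
```

Unlike the SMA, recent prices carry more weight, so the EMA reacts faster to new moves.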
Average true range (ATR)
ATR measures market volatility. It is typically derived as the 14-day moving average of a series of true range values.
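As a rough sketch, the true range and its moving average can be computed with pandas as below. The helper names and prices are hypothetical, and a plain rolling mean is assumed for simplicity; Wilder's original ATR uses a smoothed average, so values may differ from a library implementation.

```python
import pandas as pd

def true_range(high: pd.Series, low: pd.Series, close: pd.Series) -> pd.Series:
    """Largest of: high-low, |high - previous close|, |low - previous close|."""
    prev_close = close.shift(1)
    ranges = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
    )
    # On the first row there is no previous close, so only high-low is used.
    return ranges.max(axis=1)

def atr(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 14) -> pd.Series:
    """Simple n-period moving average of the true range (assumption: no Wilder smoothing)."""
    return true_range(high, low, close).rolling(window=n).mean()

high = pd.Series([10.0, 12.0, 11.0, 13.0])   # made-up prices
low = pd.Series([8.0, 9.0, 9.0, 10.0])
close = pd.Series([9.0, 11.0, 10.0, 12.0])

tr = true_range(high, low, close)
atr_2 = atr(high, low, close, n=2)
print(tr.tolist())     # [2.0, 3.0, 2.0, 3.0]
print(atr_2.tolist())  # [nan, 2.5, 2.5, 2.5]
```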
Average Directional Index (ADX)
ADX indicates the strength of a trend in a price time series. It combines the negative and positive directional movement indicators computed over the n past days corresponding to the input window length (typically 14 days).
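The following is one possible sketch of ADX in pandas, assuming Wilder-style smoothing approximated with `ewm(alpha=1/n)`. The function name and the synthetic prices are illustrative, and the output will not match TA-Lib exactly because the initialisation differs.

```python
import numpy as np
import pandas as pd

def adx(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 14) -> pd.Series:
    up_move = high.diff()
    down_move = -low.diff()
    # Positive / negative directional movement: only the dominant move counts.
    plus_dm = pd.Series(
        np.where((up_move > down_move) & (up_move > 0), up_move, 0.0), index=high.index
    )
    minus_dm = pd.Series(
        np.where((down_move > up_move) & (down_move > 0), down_move, 0.0), index=high.index
    )

    prev_close = close.shift(1)
    tr = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
    ).max(axis=1)

    def smooth(s: pd.Series) -> pd.Series:
        # Assumption: Wilder smoothing approximated by an exponential mean.
        return s.ewm(alpha=1.0 / n, adjust=False).mean()

    atr = smooth(tr)
    plus_di = 100 * smooth(plus_dm) / atr
    minus_di = 100 * smooth(minus_dm) / atr
    di_sum = (plus_di + minus_di).replace(0.0, np.nan)  # avoid 0/0
    dx = 100 * (plus_di - minus_di).abs() / di_sum
    return smooth(dx)

# Synthetic, steadily rising market: the trend is as strong as it can be.
steps = np.arange(30, dtype=float)
adx_14 = adx(pd.Series(steps + 2), pd.Series(steps), pd.Series(steps + 1), n=14)
print(round(adx_14.iloc[-1], 2))
```

ADX only measures trend strength, not direction, so a steady decline would score just as high as this steady rise.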
Commodity Channel Index (CCI)
CCI is used to determine whether a stock is overbought or oversold. It assesses the relationship between an asset price, its moving average and deviations from that average.
- CCI = (typical price − ma) / (0.015 * mean deviation)
- typical price = (high + low + close) / 3
- p = number of periods (20 commonly used)
- ma = moving average of the typical price
- moving average = (sum of typical price over the last p periods) / p
- mean deviation = (sum of |typical price − ma| over the last p periods) / p
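A minimal pandas sketch of the CCI calculation, taking the moving average and the mean absolute deviation over a rolling window of p periods. The function name and the tiny three-row data set are made up so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

def cci(high: pd.Series, low: pd.Series, close: pd.Series, p: int = 20) -> pd.Series:
    tp = (high + low + close) / 3.0            # typical price
    ma = tp.rolling(window=p).mean()           # moving average of typical price
    # Mean absolute deviation of the typical price inside each window.
    mean_dev = tp.rolling(window=p).apply(
        lambda w: np.abs(w - w.mean()).mean(), raw=True
    )
    return (tp - ma) / (0.015 * mean_dev)

high = pd.Series([2.0, 3.0, 4.0])   # made-up prices, typical price = 1, 2, 3
low = pd.Series([0.0, 1.0, 2.0])
close = pd.Series([1.0, 2.0, 3.0])

cci_3 = cci(high, low, close, p=3)
# Last row: tp = 3, ma = 2, mean deviation = 2/3, so CCI = 1 / (0.015 * 2/3) = 100.
print(cci_3.iloc[-1])
```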
Rate of Change (ROC)
ROC measures the percentage change between the current price and the price a certain number of periods ago.
ROC = [(Close price today − Close price n days ago) / Close price n days ago] × 100
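A short sketch of the rate-of-change calculation in pandas. Since ROC is described as a percentage change, the ratio is multiplied by 100 here; the helper name and prices are illustrative.

```python
import pandas as pd

def roc(close: pd.Series, n: int) -> pd.Series:
    """Percentage change between today's close and the close n periods ago."""
    prev = close.shift(n)
    return (close - prev) / prev * 100.0

close = pd.Series([100.0, 110.0, 121.0, 108.9])  # made-up closing prices
roc_1 = roc(close, 1)   # roughly [nan, 10, 10, -10]
roc_2 = roc(close, 2)   # roughly [nan, nan, 21, -1]
print(roc_1.round(2).tolist())
```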
Relative Strength Index (RSI)
RSI compares the size of recent gains to recent losses. It is intended to reveal the strength or weakness of a price trend from a range of closing prices over a time period.
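One way to sketch RSI in pandas is shown below, assuming Wilder-style smoothing approximated with `ewm(alpha=1/n)` and the common formula RSI = 100 − 100 / (1 + RS), where RS is average gain divided by average loss. The function name and price series are hypothetical.

```python
import pandas as pd

def rsi(close: pd.Series, n: int = 14) -> pd.Series:
    delta = close.diff()
    gain = delta.clip(lower=0.0)        # up moves, 0 on down days
    loss = (-delta).clip(lower=0.0)     # down moves as positive numbers
    avg_gain = gain.ewm(alpha=1.0 / n, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1.0 / n, adjust=False).mean()
    rs = avg_gain / avg_loss            # becomes inf when there are no losses
    return 100.0 - 100.0 / (1.0 + rs)

rising = pd.Series([float(x) for x in range(1, 31)])
falling = rising[::-1].reset_index(drop=True)

rsi_up = rsi(rising)      # only gains: RSI saturates at 100
rsi_down = rsi(falling)   # only losses: RSI drops to 0
print(rsi_up.iloc[-1], rsi_down.iloc[-1])
```

Readings near the extremes are what traders usually interpret as overbought or oversold conditions.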
This shows the relationship between the current closing price and the high and low prices over the latest n days equal to the input window length.
It compares the close price with its price range over the n past days and signals whether a stock is oversold or overbought.
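The two descriptions above match oscillators such as Williams %R and Stochastic %K. The source does not name them here, so treating them as those two indicators is an assumption. A minimal pandas sketch of both, with made-up prices:

```python
import pandas as pd

def stochastic_k(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 14) -> pd.Series:
    """Position of the close inside the n-day high-low range, from 0 to 100."""
    highest = high.rolling(window=n).max()
    lowest = low.rolling(window=n).min()
    return 100.0 * (close - lowest) / (highest - lowest)

def williams_r(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 14) -> pd.Series:
    """Same range, measured down from the n-day high, from -100 to 0."""
    highest = high.rolling(window=n).max()
    lowest = low.rolling(window=n).min()
    return -100.0 * (highest - close) / (highest - lowest)

high = pd.Series([10.0, 12.0, 14.0])   # made-up prices
low = pd.Series([8.0, 9.0, 10.0])
close = pd.Series([9.0, 11.0, 13.0])

pct_k = stochastic_k(high, low, close, n=3)
pct_r = williams_r(high, low, close, n=3)
# Last row: highest = 14, lowest = 8, close = 13.
print(round(pct_k.iloc[-1], 2), round(pct_r.iloc[-1], 2))  # 83.33 -16.67
```

The two values always differ by exactly 100, so they carry the same information on different scales.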
ti = ti.dropna() # drop all NaN values
## Total data-set has 12581 samples, and 22 features.
The goal here is to predict the value at t+1 based on the previous N days of information. We therefore define the output value as pred_price, a binary variable that stores 1 when tomorrow’s closing price is higher than today’s and 0 otherwise. This turns the task into a classification problem.
Each row in the dataset contains the price of the DJI at t+1 and the constituent’s prices at T=t. The idea is to build a model to predict the financial market’s movements. The forecasting algorithm aims to foresee whether tomorrow’s close price is going to be lower or higher than today’s.
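The labelling step can be sketched as follows, assuming the data-frame is called `ti` (as elsewhere in the article) and has a `close` column. The four prices are made up.

```python
import pandas as pd

ti = pd.DataFrame({"close": [10.0, 11.0, 10.5, 12.0]})  # made-up closing prices

# shift(-1) aligns tomorrow's close with today's row.
next_close = ti["close"].shift(-1)
ti["pred_price"] = (next_close > ti["close"]).astype(int)

# The last row has no "tomorrow", so its label is meaningless and is dropped.
ti = ti.iloc[:-1].copy()
print(ti["pred_price"].tolist())  # [1, 0, 1]
```

Dropping the final row matters: comparing against a missing value silently yields 0, which would otherwise look like a genuine "down" label.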
The problems considered here are:
- Regression predictive modeling problem (trying to forecast the exact open price or return for the next day)
- Binary classification problem (price will go up [1; 0] or down [0; 1])
- Split the “ti” data-frame into inputs (X) and outputs (y)
In our data-set, all the columns except pred_price are inputs and the pred_price column is the output. We do not shuffle the data before splitting because we want to predict future prices by training our model on past data. We have to be careful when training and evaluating on time series data, as there is a high chance of over-fitting (and we do not use cross-validation for evaluation).
It is important that we do not pick training and testing samples at random. We have data from 1970-01-02 to 2019-12-17; we will choose 2014-01-01 as the split date:
- Training set: data from 1970-01-02 to 2013-12-31
- Test set: data from 2014-01-01 to 2019-12-17
In this case, no knowledge from the future is used in the training phase and we can use the model to predict the data in the test set.
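A chronological split like the one described above might look like this in pandas. The small frame, its column names `sma` and `rsi`, and the random values are placeholders standing in for the real `ti` data-frame; only the split date comes from the article.

```python
import numpy as np
import pandas as pd

# Placeholder data: a few business days around the split date.
dates = pd.bdate_range("2013-12-23", "2014-01-10")
rng = np.random.default_rng(0)
ti = pd.DataFrame(
    {
        "sma": rng.normal(size=len(dates)),
        "rsi": rng.normal(size=len(dates)),
        "pred_price": rng.integers(0, 2, size=len(dates)),
    },
    index=dates,
)

split_date = pd.Timestamp("2014-01-01")
train = ti.loc[ti.index < split_date]    # everything strictly before the split
test = ti.loc[ti.index >= split_date]    # the split date and everything after

X_train, y_train = train.drop(columns="pred_price"), train["pred_price"]
X_test, y_test = test.drop(columns="pred_price"), test["pred_price"]
print(len(X_train), len(X_test))
```

Because the rows are never shuffled, every training row is older than every test row.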
We will train the system with different classification algorithms, including tree-based models and neural networks, to select the best-fitting model for our data set.
We used 5-fold cross-validation on the training set to obtain the scores below.
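A minimal sketch of 5-fold cross-validation with scikit-learn, run on synthetic data because the article's feature matrix is not shown. The variable names and the choice of accuracy as the metric are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training set: 200 rows, 4 features.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(solver="lbfgs", max_iter=5000)

# With an integer cv and a classifier, scikit-learn uses stratified folds
# without shuffling.
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(scores.mean())
```

For time series, scikit-learn's `TimeSeriesSplit` is a possible alternative, since it keeps every validation fold later in time than the data used to fit it.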
Model Fitting and Results
We will try logistic regression.
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(solver='lbfgs', max_iter=5000)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=5000, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
Precision is the ability of a classifier not to label an instance positive that is actually negative. Recall is the ability of a classifier to find all positive instances. F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0. Support is the number of actual occurrences of the class in the specified data-set.
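To make these definitions concrete, here is a small worked example with scikit-learn's metric functions. The eight labels and predictions are invented so that the counts are easy to verify by hand.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # invented ground truth
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]   # invented predictions

# For class 1: TP = 3, FP = 1, FN = 1.
precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two
support = y_true.count(1)                     # actual occurrences of class 1

print(precision, recall, f1, support)
```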
The ROC curve (receiver operating characteristic curve) shows the performance of a classification model at all classification thresholds.
TPR = TP / (TP + FN), FPR = FP / (FP + TN)
Below we see that the AUC (area under the curve) is 85.99%, which provides an aggregate measure of performance across all possible classification thresholds. This means there is an almost 86% chance that the model ranks a randomly chosen positive example above a randomly chosen negative one.
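The sketch below computes TPR and FPR at a single threshold by hand and then the full curve and AUC with scikit-learn. The four labels and scores are toy values, not the article's results, and the helper `rates` is hypothetical.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]               # toy labels
y_score = [0.1, 0.4, 0.35, 0.8]     # toy predicted probabilities

def rates(labels, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

tpr_05, fpr_05 = rates(y_true, y_score, 0.5)      # one point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # every point on the curve
auc = roc_auc_score(y_true, y_score)

# 3 of the 4 (positive, negative) pairs are ranked correctly, so AUC = 0.75.
print(tpr_05, fpr_05, auc)
```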
If we analyze the relative importance of the features, we see that some of the features below have little or no importance. We can therefore re-run the model keeping only the relevant features to obtain a new score. You may try dropping the features marked below and re-running the system. You can also try other classifiers, e.g. an MLP, to check the score.
The system’s performance depends on the combination of the window size and the forecast horizon. Here, however, by learning from past data we are able to predict the trend over the next couple of days with above 75% accuracy.
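One way to rank and prune features is sketched below with a random forest's impurity-based importances. The article does not say which model produced its importance plot, so the choice of `RandomForestClassifier`, the synthetic columns and the 0.05 cut-off are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: one informative column and two pure-noise columns.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "useful": rng.normal(size=300),
        "noise_a": rng.normal(size=300),
        "noise_b": rng.normal(size=300),
    }
)
y = (X["useful"] + 0.3 * rng.normal(size=300) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Hypothetical cut-off: keep only features above 5% importance, then re-fit.
keep = importances[importances > 0.05].index.tolist()
X_reduced = X[keep]
```

After pruning, the model would be re-trained on `X_reduced` and scored again to see whether the simpler feature set holds up.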