Stock Trend Prediction with Technical Indicators

Feature engineering and Classification model with Python code

Sarit Maitra
Dec 25, 2019 · 7 min read

A predictive model that correctly forecasts the future trend is crucial for investment management and algorithmic trading. Technical indicators are widely used by traders for financial forecasting. Many of them require an input window length, a time-frame parameter that must be set before the indicator can be calculated.

Here, we will investigate Dow Jones Industrial Average (DJI) data. It is time series data; the feature space is derived from the time series itself and describes how past prices have moved. Here, I have performed a short-term prediction with a 1-day window; however, it is always advisable to experiment with a range of window sizes, e.g. 3, 5, 7, 10, 15, 20, 25 and 30 days, when predicting the price trend.

Data is collected through an API call to worldtradingdata. Let's load the data and see what it looks like.

We will be predicting the ‘close’ price. From the plot we can see that the trend is highly non-linear and it is very difficult to capture the trend using this information.

Features of Technical Indicators

Simple Moving Average (SMA)

Formula: SMA = ( Sum ( Price, n ) ) / n
Here: n = Time Period
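
As a rough illustration of the formula above, the SMA can be computed with a pandas rolling mean. The toy prices and the window n = 3 are assumptions made for this example; they are not the DJI data used in the article.

```python
import pandas as pd

# Toy closing prices (illustrative values, not DJI data)
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

n = 3  # time period (window length), chosen only for this example

# SMA = Sum(Price, n) / n, evaluated over a sliding window of n observations
sma = close.rolling(window=n).mean()

# The first n-1 entries are NaN because the window is not yet full
print(sma.tolist())
```

The leading NaN values are the reason the feature table is later cleaned with dropna().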

Exponential moving average (EMA)
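
The article does not show the EMA formula. A common definition, assumed here, gives the newest price a weight of alpha = 2 / (n + 1) and applies the recursion EMA_t = alpha * Price_t + (1 − alpha) * EMA_(t−1). The sketch below uses pandas `ewm` with `adjust=False`, which follows that recursion; the toy prices and n = 3 are illustrative only.

```python
import pandas as pd

# Toy closing prices (illustrative values, not DJI data)
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

n = 3                  # span, chosen only for this example
alpha = 2.0 / (n + 1)  # smoothing factor implied by the span

# Recursive EMA: the first value seeds the series, later values are blended in
ema = close.ewm(span=n, adjust=False).mean()

# Same recursion written out by hand, to make the weighting explicit
manual = [close.iloc[0]]
for price in close.iloc[1:]:
    manual.append(alpha * price + (1 - alpha) * manual[-1])

print(ema.tolist())
print(manual)
```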

Average true range (ATR)

Average Directional Index (ADX)

Commodity Channel Index (CCI)

  • CCI = (typical price − ma) / (0.015 * mean deviation)
  • typical price = (high + low + close) / 3
  • p = number of periods (20 commonly used)
  • ma = moving average of the typical price
  • moving average = (sum of typical price over the last p periods) / p
  • mean deviation = (sum of |typical price − ma| over the last p periods) / p
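
A minimal sketch of the CCI calculation, assuming the moving average and the mean deviation are both taken over the same window of p periods. The function name `cci` and the toy high/low/close values are hypothetical and only serve the example.

```python
import numpy as np
import pandas as pd

def cci(high, low, close, p=20):
    """Commodity Channel Index over a window of p periods (illustrative helper)."""
    tp = (high + low + close) / 3.0   # typical price
    ma = tp.rolling(window=p).mean()  # moving average of the typical price
    # Mean absolute deviation of the typical price from its own window mean
    mean_dev = tp.rolling(window=p).apply(
        lambda w: np.abs(w - w.mean()).mean(), raw=True
    )
    # A completely flat window gives mean_dev == 0 and therefore NaN or inf here
    return (tp - ma) / (0.015 * mean_dev)

# Toy data (not DJI data); p=3 keeps the arithmetic easy to follow
high = pd.Series([11.0, 12.0, 13.0, 14.0])
low = pd.Series([9.0, 10.0, 11.0, 12.0])
close = pd.Series([10.0, 11.0, 12.0, 13.0])

result = cci(high, low, close, p=3)
print(result.tolist())
```

For the third row the typical price is 12, the 3-period average is 11 and the mean deviation is 2/3, so the CCI is 1 / (0.015 * 2/3) = 100.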

Rate-of-change (ROC)

ROC = (Close price today − Close price n days ago) / Close price n days ago
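
The ROC formula maps directly onto a shifted pandas Series. This is a sketch with toy prices and n = 2, both chosen only for illustration.

```python
import pandas as pd

# Toy closing prices (illustrative values, not DJI data)
close = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0])

n = 2  # look-back in days, chosen only for this example

# close.shift(n) aligns each day with the close from n days earlier
past = close.shift(n)
roc = (close - past) / past

# The first n entries are NaN because there is no price n days earlier
print(roc.tolist())
```

Some charting tools multiply this ratio by 100 to express it as a percentage; the version here follows the formula as written above.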

Relative Strength Index (RSI)

Williams %R

Stochastic %K
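
The formulas for Williams %R and Stochastic %K are not shown in the article. The sketch below assumes the usual definitions, where HH and LL are the highest high and lowest low over the last n periods: %K = 100 * (close − LL) / (HH − LL) and %R = −100 * (HH − close) / (HH − LL). The function names, the default n = 14 and the toy data are assumptions made for the example.

```python
import pandas as pd

def stochastic_k(high, low, close, n=14):
    """Stochastic %K: position of the close within the n-period range, 0 to 100."""
    hh = high.rolling(window=n).max()  # highest high of the window
    ll = low.rolling(window=n).min()   # lowest low of the window
    return 100.0 * (close - ll) / (hh - ll)

def williams_r(high, low, close, n=14):
    """Williams %R: same range, measured down from the top, -100 to 0."""
    hh = high.rolling(window=n).max()
    ll = low.rolling(window=n).min()
    return -100.0 * (hh - close) / (hh - ll)

# Toy data (not DJI data); n=3 keeps the arithmetic easy to follow
high = pd.Series([10.0, 12.0, 11.0, 13.0])
low = pd.Series([8.0, 9.0, 9.0, 10.0])
close = pd.Series([9.0, 11.0, 10.0, 12.0])

k = stochastic_k(high, low, close, n=3)
r = williams_r(high, low, close, n=3)
print(k.tolist())
print(r.tolist())
```

Under these definitions the two indicators carry the same information, since %R = %K − 100 on every row where both are defined.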

ti = ti.dropna() # drop all NaN values

## Total data-set has 12581 samples, and 22 features.

The goal here is to predict the value at t+1 from the information of the N previous days. We therefore define the output as pred_price, a binary variable that stores 1 when tomorrow's closing price is higher than today's and 0 otherwise. This turns the task into a classification problem.
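
A minimal sketch of how such a label could be built, assuming the data-frame is called `ti` and has a `close` column as in the rest of the article. The toy prices are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the real `ti` data-frame (illustrative prices)
ti = pd.DataFrame({"close": [100.0, 101.0, 99.0, 99.0, 103.0]})

# shift(-1) brings tomorrow's close onto today's row;
# the label is 1 when tomorrow's close is strictly higher than today's, else 0
ti["pred_price"] = (ti["close"].shift(-1) > ti["close"]).astype(int)

# The last row has no "tomorrow", so its label is meaningless and is dropped
ti = ti.iloc[:-1]

print(ti)
```

Dropping the final row matters: without it, the comparison against a missing value silently yields 0 and adds one wrong "down" label.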

Each row in the dataset pairs the DJI price at t+1 with the input features at time t. The idea is to build a model that predicts the financial market's movements: the forecasting algorithm aims to foresee whether tomorrow's close price will be higher or lower than today's.

The problem can be framed in two ways:

  • Regression predictive modeling problem (trying to forecast the exact open price or return for the next day)
  • Binary classification problem (price will go up [1; 0] or down [0; 1])

We then split the "ti" data-frame into inputs (X) and outputs (y).

In our data-set, all the columns except pred_price are inputs and the pred_price column is the output. We do not shuffle the data before splitting because we want to predict future prices by training the model on past data. Care is needed when training and evaluating on time series data, as there is a high risk of over-fitting (and we do not use cross-validation for evaluation on the test set).

It is important that we do not pick training and testing samples at random. We have data from 1970-01-02 to 2019-12-17; we will choose 2014-01-01 as the split date:

  • Training set: data from 1970-01-02 to 2013-12-31
  • Test set: data from 2014-01-01 to 2019-12-17

In this case, no knowledge from the future is used in the training phase and we can use the model to predict the data in the test set.
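
The date-based split described above can be sketched as follows, assuming `ti` is indexed by date. The toy frame, its `feature` column and the ten-day date range are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with a daily DatetimeIndex straddling the split date (stand-in for `ti`)
dates = pd.date_range("2013-12-27", "2014-01-05", freq="D")
ti = pd.DataFrame(
    {
        "feature": np.arange(len(dates), dtype=float),  # hypothetical input column
        "pred_price": [0, 1] * (len(dates) // 2),       # hypothetical labels
    },
    index=dates,
)

# Label-based slicing on a DatetimeIndex is inclusive at both ends
train = ti.loc[:"2013-12-31"]
test = ti.loc["2014-01-01":]

# Inputs are every column except pred_price; the output is pred_price
train_x, train_y = train.drop(columns="pred_price"), train["pred_price"]
test_x, test_y = test.drop(columns="pred_price"), test["pred_price"]

print(len(train_x), len(test_x))
```

Because the rows are never shuffled, every training date is strictly earlier than every test date.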

Normalize data
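
The scaling code is not shown in the article, so the following is only a sketch under the assumption that standardization (zero mean, unit variance) is used. The key point is that the mean and standard deviation are estimated on the training set only and then reused on the test set, so no information from the test period leaks into training. The toy arrays are illustrative; the names `train_x_scaled` and `test_x_scaled` are borrowed from the model-fitting code later in the article.

```python
import numpy as np

# Toy inputs: rows are days, columns are features (illustrative values)
train_x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_x = np.array([[4.0, 40.0]])

# Statistics come from the training set only
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)  # population standard deviation (ddof=0)

# The same training statistics are applied to both sets
train_x_scaled = (train_x - mean) / std
test_x_scaled = (test_x - mean) / std

print(train_x_scaled)
print(test_x_scaled)
```

scikit-learn's StandardScaler performs the equivalent steps when it is fitted on the training inputs and then used to transform the test inputs.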

Classification models

We used 5-fold cross-validation on the training set to obtain the scores below.

Model Fitting and Results

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='lbfgs', max_iter=5000)

log_reg.fit(train_x_scaled, train_y)  # fit on the scaled training inputs

log_reg

output:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=5000, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

Precision is the ability of a classifier not to label an instance positive that is actually negative. Recall is the ability of a classifier to find all positive instances. F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0. Support is the number of actual occurrences of the class in the specified data-set.
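
To make these definitions concrete, here is a small sketch that computes them from raw labels. The helper name `precision_recall_f1` and the toy labels are hypothetical; in practice scikit-learn's classification_report produces the same quantities.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1, support) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    support = sum(1 for t in y_true if t == positive)
    return precision, recall, f1, support

# Toy labels: 1 = price goes up, 0 = price goes down (illustrative only)
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]

precision, recall, f1, support = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1, support)
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision and recall are both 3/4 and the support of the positive class is 4.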

ROC curve

TPR = TP / (TP + FN), FPR = FP / (FP + TN)
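
Each point on a ROC curve is one (FPR, TPR) pair obtained at a particular decision threshold. The sketch below computes that pair for two thresholds; the helper name `tpr_fpr`, the toy labels and the toy scores are assumptions made for illustration.

```python
def tpr_fpr(y_true, scores, threshold):
    """True-positive rate and false-positive rate at a given score threshold."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1 and predicted == 0:
            fn += 1
        elif truth == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    # Assumes both classes are present, so neither denominator is zero
    return tp / (tp + fn), fp / (fp + tn)

# Toy labels and predicted probabilities of the positive class (illustrative)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

point_strict = tpr_fpr(y_true, scores, threshold=0.5)
point_loose = tpr_fpr(y_true, scores, threshold=0.3)
print(point_strict, point_loose)
```

Lowering the threshold from 0.5 to 0.3 catches both positives (TPR rises from 0.5 to 1.0) at the cost of one false alarm (FPR rises from 0.0 to 0.5). Sweeping the threshold over all values traces the full curve.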

Below we see that the AUC (area under the curve) is 85.99%, which provides an aggregate measure of performance across all possible classification thresholds. This means there is almost an 86% chance that the model will rank a randomly chosen positive instance above a randomly chosen negative one.
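
One way to read an AUC value is as a ranking probability: the fraction of (positive, negative) pairs in which the positive instance receives the higher score. The sketch below computes it that way on toy data. The helper name `auc_by_ranking` is hypothetical, and this brute-force loop is meant for illustration rather than for large data-sets.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of positive/negative pairs that are ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities of the positive class (illustrative)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_by_ranking(y_true, scores)
print(auc)
```

Of the four positive/negative pairs, three are ordered correctly (only 0.35 versus 0.4 is not), giving an AUC of 0.75.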

Feature importance

System performance depends on the combination of window size and forecast horizon. Even so, by learning from past data we obtain better than 75% accuracy in predicting the trend over the next couple of days.


Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by Sarit Maitra, Data Science Practice Lead at KSG Analytics Pvt. Ltd.
