Part 4: Searching for Signals

Corey Auger
12 min read · May 11, 2018

Sign up for the Beta at http://daytrader.ai to be among the first to receive AI generated stock data.

In the last post, we discussed how daytrader.ai uses high level patterns to generate training data. In this post we will start to examine and apply some data cleaning and normalization techniques. Finally, we'll try to find some signal in our data using a Multi Layer Perceptron and then an LSTM model.

Artificial Neural Network Review

For this and future posts we will be diving into machine learning. If you are new to machine learning and want to understand the basics needed to follow along, I would suggest the following resources:

The first thing we will need to do in order to start the learning process is some data prep.

Vectorizing our Data

The data I have provided contains 2420 minutes of 1-minute stock data. The entry point, or high level condition, was detected at point 2400 in the set. From there I provide 20 minutes of future history. Therefore, one of the first things we are going to want to do is split the data into our feature matrix and extract our label vector. The following util code will produce a numpy feature matrix with shape M x N, where M=2400 and N is the number of training examples.
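Since the util code itself is not reproduced here, the following is a minimal sketch of that split, assuming the raw data is loaded as an iterable of 2420-minute close-price arrays (the function name and loading format are mine, not the provided utilities):

```python
import numpy as np

def split_features_and_labels(examples, entry_index=2400, horizon=20):
    """Split raw 2420-minute price series into a feature matrix and labels.

    `examples` is an iterable of 1-D arrays of closing prices, each
    2420 minutes long.  Returns:
        X: (entry_index, N) feature matrix -- the 2400 minutes up to entry
        y: (N,) label vector -- the close 20 minutes after the entry point
    """
    feature_cols, labels = [], []
    for prices in examples:
        prices = np.asarray(prices, dtype=np.float64)
        feature_cols.append(prices[:entry_index])
        labels.append(prices[entry_index + horizon - 1])
    X = np.stack(feature_cols, axis=1)   # shape (2400, N), as described above
    y = np.asarray(labels)
    return X, y
```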

Visualizing Some Data

At this point it's a good idea to take a look at some of the data samples. Do they look OK? I have included a utility method to plot individual training examples. Here is an example:

You can visualize any training example in the training set using the following method.
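The original utility method is not shown in this text, so this matplotlib sketch stands in for it (the function name is mine):

```python
import matplotlib.pyplot as plt

def plot_example(X, index):
    """Plot a single training example (column `index` of the feature matrix)."""
    plt.figure(figsize=(10, 4))
    plt.plot(X[:, index])
    plt.xlabel("Minute")
    plt.ylabel("Close price")
    plt.title("Training example %d" % index)
    plt.show()
```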

Centering Data

One of the key challenges in a data science problem like this is how to correctly phrase the problem to the machine. The raw data that I have provided includes closing price and volume for a number of stock symbols. One of the first things to notice is that these stocks are valued very differently. For example, Amazon shares trade at over 1000 dollars per share, whereas Intel stock trades under 100 dollars. We will first have to normalize these values in order to properly compare them. Here are a few ways one could do this:

Rate of change centering

Here we take the price at time T, divide it by the close price at time T-1, and then subtract 1 from this value.

(price at time T) / (price at time T-1) - 1.0

Eg:

2017-10-17T14:18:00.000Z,201.87,55800.0

2017-10-17T14:19:00.000Z,201.21,137786.0

Price at 14:19 is 201.21

Price at 14:18 is 201.87

(201.21 / 201.87) - 1.0 ≈ -0.00327

After this, our data will be positive if the price increased from the last tick and negative if the price decreased from the last tick. You may also note that in most cases the value is small and close to zero. This is because we should not see massive price jumps when looking at the 1 minute time domain.
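As a quick illustration, here is a minimal numpy sketch of this rate-of-change centering (the function name is mine; this is not the centering we end up using):

```python
import numpy as np

def rate_of_change_center(prices):
    """(price at time T) / (price at time T-1) - 1.0 for each tick.

    The first tick has no prior close, so the output is one element
    shorter than the input.
    """
    prices = np.asarray(prices, dtype=np.float64)
    return prices[1:] / prices[:-1] - 1.0
```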

This looks pretty good on the surface, but it has some major problems. Think about how a daytrader thinks, using ideas of support and resistance. We want to capture the fact that price may return to, and then bounce off, certain price points. With the above centering math we lose this correlation, since ticks that have the same price in our raw data will no longer map to the same value after centering. Instead we want to preserve these values in our new vector so that they have the same magnitude.

Entrypoint Centering

Here we calculate the rate of change with respect to the entry point of our data. Again the entry point for the example data is when the EMA-15 crossed over the EMA-65. This is going to be the last column of price data in our feature matrix. So now we divide each price by this entry point price and subtract 1 from it.

(price at time T) / (ENTRY POINT PRICE) - 1.0

This will again give us values that are positive if the price was above the entry point and negative when the price was below the entry point; however, we preserve the fact that prices of the same value transform to values that are also the same in our centered data set. This is the centering method that I use, and I have provided the following utility method to center your own data.
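The utility method itself is not reproduced in this text, so here is a minimal sketch of entry-point centering under the assumption that the feature matrix has shape (2400, N) as built above, with the entry-point close in the last row (the function name is mine):

```python
import numpy as np

def entrypoint_center(X, y):
    """Centre each example relative to its entry-point price.

    X: (2400, N) matrix of raw close prices, one example per column.
    y: (N,) raw label vector (close 20 minutes after entry).
    """
    entry = X[-1, :]                  # entry-point close for each example
    X_centered = X / entry - 1.0      # broadcasts across the time axis
    y_centered = y / entry - 1.0      # apply the same centering to labels
    return X_centered, y_centered
```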

Normalizing Data

In this step we will be using scikit-learn’s Standardization. From the documentation:

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
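In code this is a one-liner with scikit-learn's StandardScaler. A minimal sketch, assuming X_centered is the output of the centering step above (StandardScaler expects samples as rows, hence the transpose):

```python
from sklearn.preprocessing import StandardScaler

# Remove the mean of each time-step feature and scale it to unit variance.
# The (2400, N) matrix is transposed to (N, 2400) so samples are rows.
scaler = StandardScaler()
X_std = scaler.fit_transform(X_centered.T)
```

In a stricter setup the scaler would be fit on the training split only and then reused to transform the test split.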

Creating Classes from our Labels

Recall that our label vector is simply the value of the stock 20 minutes after the entry point. The above centering code also applied the same centering to our labels. Let's examine what this really means. First we can plot the distribution curve for our labels.

min: -0.02869251474715062 or 2.9%

max: 0.037285348922177386 or 3.7%

Most of our data is centered around zero. This is what we expect to see, with the worst possible outcome being a loss of 2.9% of the stock price and the best possible scenario being a gain of 3.7%. It is worth noting that in reality we would always have a stop loss placed with any trade. In most cases our trade would be around $40,000 and we would risk about $150. So any time we "lose", this is equivalent to just triggering our STOP. The flip side of this is that our best possible gain of 3.7% yields around $1500 profit.

The first few models we create are going to be classifiers, so we want to create N classes that represent equal sized buckets from the distribution above. The larger the number of classes, the harder it is going to be to learn. For now we are going to keep the number of classes at 5. This should put 2 classes in the positive range and 2 classes in the negative range, with the last class representing the large chunk in the middle with values close to zero. Using 5 classes also gives us a baseline for our performance, since guessing at random would yield 20% accuracy on average. We are therefore looking to achieve an accuracy greater than 20% as proof of "some" signal. Once we have located a signal we can start to tune.

Our class labels fall into these buckets:

  • Class 1: -0.02869 to -0.00204
  • Class 2: -0.00204 to -0.00058
  • Class 3: -0.00058 to 0.000453
  • Class 4: 0.000453 to 0.001992
  • Class 5: 0.001992 to 0.037285

We then take our label vector and one-hot encode the values. This gives us a final label matrix in which each row is a vector with 5 entries, containing a single 1 in the class column and zeros in every other column.
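A minimal sketch of the bucketing and one-hot encoding, assuming "equal sized buckets" means quantile edges so that each class holds roughly the same number of examples (the function name is mine):

```python
import numpy as np

def bucket_and_one_hot(y_centered, n_classes=5):
    """Bucket centred labels into equal-count classes and one-hot encode them."""
    edges = np.quantile(y_centered, np.linspace(0.0, 1.0, n_classes + 1))
    classes = np.digitize(y_centered, edges[1:-1])   # indices 0..n_classes-1
    one_hot = np.eye(n_classes)[classes]             # (N, n_classes) matrix
    return classes, one_hot
```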

Training and Test Set

Now that we have formatted our data and normalized it we can begin to experiment with some different models. To do this we will split our data into a training and test set. We will use the training data to train the model. We will hold out the test set to run through the model and determine how well we are doing on data the model hasn’t ever seen.
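A sketch of the split with scikit-learn, using the standardized features and one-hot labels (here called y_one_hot) from the steps above; the 80/20 ratio and fixed seed are assumptions, not values from the post:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the examples as an unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y_one_hot, test_size=0.2, random_state=42)
```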

This will result in our model producing 2 scores: one for the training set and one for the test set. Generally we can refer to how well the model fits the training data as Bias: a model that fits the training data better has a higher accuracy and thus a lower Bias. We can then refer to how well the model fits the test data as Variance; again, higher accuracy on the test set would indicate a lower Variance. Our goal is to have a low Bias and low Variance for our model.

For our data it is often easy to achieve a high level of accuracy on the training set (low bias) but very difficult to maintain a high level of accuracy on the test set. We thus have high variance. This is classically referred to as overfitting the data: the model is essentially memorizing the training data rather than generalizing to unseen data. There are lots of ways to combat overfitting and we will take a look at a few in the next section. If you would like to learn more about overfitting you can take a look at this post: https://elitedatascience.com/overfitting-in-machine-learning

Multi Layer Perceptron

I like to start with simple model designs and build them out slowly, always scaling back when the simpler design does as well as or better than a more complex model.
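The exact architecture is not listed in this text, so the following Keras snippet is only a sketch of the kind of simple starting model described; the layer sizes, dropout rate and training settings are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# A small fully connected classifier over the 2400 price features.
model = Sequential([
    Dense(128, activation="relu", input_shape=(2400,)),
    Dropout(0.5),
    Dense(64, activation="relu"),
    Dense(5, activation="softmax"),   # one output per class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train,
          epochs=50, batch_size=64,
          validation_data=(X_test, y_test))
```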

[Figure: MLP training and test accuracy]

[Figure: MLP training and test loss]

We are able to achieve 30% accuracy on our test set. This is pretty good considering that the only feature at this point is price. It also provides some evidence that there is a signal in our dataset. One problem with our network above is that it is unable to remember anything it has seen in the time series that might potentially help to make a prediction later on.

Recurrent Neural Networks (RNN)

A recurrent neural network is a special kind of neural network that possesses a bit of "memory". These networks are well suited for time series data, having had a lot of success in NLP and speech tasks. The type of RNN we will apply is called a Long Short-Term Memory (LSTM) network. Here is some more information on these types of architecture:

There are a number of things to consider when using an RNN, including vanishing gradients and how far back the network can remember. With really high dimensional data (2400 time steps) we will not be able to remember anything useful that far back. One thing we can try to do is reduce the dimensionality of our data.

Dimensionality Reduction

There are a number of methods for reducing the dimensionality of data. One such method is Principal Component Analysis, or PCA. PCA projects points from a higher dimensional space onto a subspace with fewer dimensions. An example of this would be a set of points in 3 dimensions: we could find a plane that best approximates the data and project all the 3D points onto the surface of that plane, so we can now represent the points in 2D. We are going to apply PCA to our data to reduce the dimension before feeding it into the RNN network. PCA, however, remains a poor choice for reducing our data. Since each feature is a 1 minute time slice, PCA will be looking for features over the entire space that provide little variance to the data set. Said another way, PCA might find that minute 55 provides the least significant change over the entire training set, yet there might still be a small number of examples for which minute 55 was key to understanding that specific example.

We will first try our LSTM model using PCA to reduce the dimension and in later posts we will talk about smarter methods for reducing the dimensionality.
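A sketch of the reduction with scikit-learn; the choice of 100 components is an assumption for illustration:

```python
from sklearn.decomposition import PCA

# Project the 2400 time-step features down to 100 principal components,
# fitting on the training set only.
pca = PCA(n_components=100)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
```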

LSTM Model
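The exact model is not shown in this text either, so here is a sketch of an LSTM classifier over the PCA-reduced sequences in Keras; the layer size is an assumption, while the 35 epochs match the training cutoff mentioned below:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

timesteps = X_train_pca.shape[1]            # 100 components, one per "step"

model = Sequential([
    LSTM(64, input_shape=(timesteps, 1)),   # each step carries a single value
    Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train_pca.reshape(-1, timesteps, 1), y_train,
          epochs=35, batch_size=64,
          validation_data=(X_test_pca.reshape(-1, timesteps, 1), y_test))
```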

Our LSTM model fails to do any better than our MLP. I suspect this is due to our poor choice of dimensionality reduction at this point. These models are far more time consuming to train, so I also stopped the training at 35 epochs. There are a ton of things we could tune before giving up on this style of model, and we will be looking at that in future posts.

Results

Despite the low accuracy on our test set, these results provide a good base. Let's think again about what we are trying to do. We only want to take on trades that we think will yield high returns (we will talk about shorting in another post). This means we want to take trades that have class label 5. If we use our model to predict this class, we should be right almost 30% of the time. This leaves us to determine just how wrong we are the other 70% of the time. If we assume an equal distribution over the other classes 1, 2, 3 and 4, then 35% of the time the actual outcome falls in class 1 or 2 and the other 35% of the time it falls in class 3 or 4. In which case we are only hitting our STOP 35% of the time, and the other 35% we are yielding small or no profits rather than a STOP loss.

We can test this by singling out class 5 from our test set and plotting the actual predictions that were made for that class.
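A sketch of that check, assuming `model` is the trained MLP and reusing variable names from the earlier snippets (class 5 is index 4 once the labels are zero-indexed):

```python
import numpy as np
import matplotlib.pyplot as plt

true_classes = np.argmax(y_test, axis=1)
pred_classes = np.argmax(model.predict(X_test), axis=1)

mask = true_classes == 4                    # examples whose true class is 5
plt.hist(pred_classes[mask] + 1, bins=np.arange(0.5, 6.5, 1.0), rwidth=0.8)
plt.xlabel("Predicted class")
plt.ylabel("Count")
plt.title("Predictions for test examples with true class 5")
plt.show()
```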

As you can see the most common mistake for class 5 is in fact class 1. This is rather unfortunate, but challenges me to think of why that might be. Let’s dig in a bit more to see what might be going on here.

Here is what the distribution looks like for some of the other classes.

In particular we notice the expected distributions for all classes except 1 and 5. This leads me to believe that we are skewing the data with outliers. These are the stocks at the far ends of the positive and negative return distribution. The fact that they are outliers leads me to believe that these price movements were in response to some external market driving force (e.g. Trump putting tariffs on Chinese goods). One thing we can try is to crop the positive and negative outliers out of the class they are in and add them to their own class. We will look into this more in a future post.

At this point it might be worth forward testing our model to see how well it does on real time data. I will post results from this and other tests in future posts.

Tuning

Now that we have what appears to be a weak signal in our data, we can begin to apply some techniques to tune into this signal. One of the first things to do will be to get more data. If we expand our search using more market symbols we should be able to double or even triple the amount of training data. From there we can spend more time trimming outliers and try to identify further indicators (for example, computing indicators such as RSI, or adding index data). There are a number of other things we can do to try to fight the high variance in our model.

Summary

This post covered a lot of ground. We first talked about loading the data and the importance of centering it. We then dove into phrasing some categorical machine learning problems. We applied a simple Multi Layer Perceptron and later an LSTM model. Early results indicate that there is a signal in our data set, and that even at this low accuracy on our test set we can still use the results to swing the odds in our favour.

One of my motivations in releasing my own data and learning in a series of blog posts was the realization that — though I’m certain that experimentation in machine learning applied to stock market data is taking place — most of that experimentation is not being shared. I don’t believe that there is another place on the web that offers a machine learning model that enables you to apply real-world market predictions, and I’m excited to offer this to you. Make sure you sign up here to be among the first to help transform the cosmos of intraday trading.

In the next post we will introduce some new features (volume and index data), as well as try to exploit the power of a Convolutional Neural Net (CNN) on our data. We will also begin to explore some better techniques for doing smart dimensionality reduction. Finally, I will post results of forward testing the above models we created in a production setting.

Next: Part 5: Stock Market Latent Time Shifting with AutoEncoders
