Part 5: Stock Market Latent Time Shifting with AutoEncoders

Corey Auger
7 min read · May 30, 2018

In the last post we got some good early results from our models. One thing that I did was to use PCA to reduce the dimensionality of our data. In this post we will talk about other ways of compressing our data and ultimately reducing the noise so our machine learning models can learn faster. By the end we will have a number of techniques for encoding our data, as well as a way to map between these encodings.

Noisy Data

Let’s first talk about what it means to have noisy data. It implies that there is a stronger, coherent signal underneath, and that rather than observing that signal directly we are observing oscillations around it. There are a lot of different ways to combat noise and/or reduce the dimensionality of our data. Let’s take a look at a few:

Moving Averages

We already talked about day traders using Simple Moving Averages; indeed, our test data was generated from an Exponential Moving Average crossover pattern. Both of these moving averages are ways of reducing noise in the market signal to get at an underlying trend.
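As a quick illustration (the prices and window lengths below are made up, not values from our data set), both averages are one-liners in pandas:

```python
import pandas as pd

# Hypothetical 1-minute close prices; any pandas Series of prices will do.
close = pd.Series([101.2, 101.4, 101.1, 101.8, 102.0, 101.7, 101.9, 102.3])

# Simple Moving Average: unweighted mean over a fixed window.
sma = close.rolling(window=5).mean()

# Exponential Moving Average: more weight on the most recent prices.
ema = close.ewm(span=5, adjust=False).mean()

print(pd.DataFrame({"close": close, "sma_5": sma, "ema_5": ema}))
```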

Principal Component Analysis (PCA)

I talked briefly about this in my last post. We used it to reduce the 2400 time samples down to both 128 and 256 components. Both of these yielded far better results than training on the entire 2400 timesteps, and the more we compressed the data, the less noise we had to deal with, which led to much faster convergence times.
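For reference, a minimal scikit-learn sketch of that reduction looks like this (the 5,000-row random matrix is a stand-in for our real training data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in training matrix: one row per observation, 2400 time steps each.
X = np.random.randn(5000, 2400)

# Project each observation onto the first 128 principal components.
pca = PCA(n_components=128)
X_reduced = pca.fit_transform(X)   # shape: (5000, 128)

# How much of the original variance the 128 components retain.
print(pca.explained_variance_ratio_.sum())
```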

Fast Fourier Transform (FFT)

An FFT is an algorithm that samples a signal over a period of time (or space) and divides it into its frequency components. Here is an awesome YouTube explanation of how the Fourier transform works.

But what is the Fourier Transform? A visual introduction.

In the case of market data, an FFT should let us reduce signal noise before feeding our machine learning models. There are a lot of ways to experiment with FFTs, and I will try to cover some of them in future posts.
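As a rough sketch of one such experiment (the synthetic series and the 64-frequency cutoff are placeholders, not values from this project), a crude low-pass filter with NumPy’s FFT looks like this:

```python
import numpy as np

# Stand-in noisy 1-minute price series (2400 samples).
t = np.arange(2400)
prices = np.sin(2 * np.pi * t / 400) + 0.3 * np.random.randn(2400)

# Real FFT of the series: its frequency components.
spectrum = np.fft.rfft(prices)

# Crude low-pass filter: drop everything above the first 64 frequencies.
cutoff = 64
spectrum[cutoff:] = 0

# Back to the time domain: a smoothed version of the original series.
denoised = np.fft.irfft(spectrum, n=len(prices))
```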

AutoEncoders

“An autoencoder is an artificial neural network used for unsupervised learning of efficient codings. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction.”

https://en.wikipedia.org/wiki/Autoencoder

An autoencoder has more power than PCA for dimensionality reduction because it can produce non-linear transformations. If you would like to read more about the differences, I found this article helpful:

https://towardsdatascience.com/pca-vs-autoencoders-1ba08362f450.

The other attractive difference is the ability to travel back, decoding to the original data set with some loss. This is different from PCA, which is a one-way trip.

Our autoencoder is made from two networks: an encoder network and a decoder network. The autoencoder is trained with those two networks joined. The idea is that we are training weights to learn an encoding such that the reconstruction, or decoding, minimizes the loss against the original data.

Here is an example network.

Source code for the autoencoder can be found here:
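As a simplified sketch of the joined encoder/decoder idea (the layer sizes and optimizer here are placeholders, not the exact architecture I used), a Keras version might look something like this:

```python
from tensorflow.keras import layers, models

TIMESTEPS = 2400   # length of each input window
LATENT = 120       # size of the compressed representation

# Encoder: compress the 2400-step window down to the latent vector.
encoder = models.Sequential([
    layers.Input(shape=(TIMESTEPS,)),
    layers.Dense(600, activation="relu"),
    layers.Dense(LATENT, activation="linear"),
])

# Decoder: reconstruct the original window from the latent vector.
decoder = models.Sequential([
    layers.Input(shape=(LATENT,)),
    layers.Dense(600, activation="relu"),
    layers.Dense(TIMESTEPS, activation="linear"),
])

# Joined autoencoder: trained end-to-end to minimize reconstruction error.
autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_train, X_train, epochs=50, batch_size=128, validation_split=0.1)
```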

AutoEncoder Results

Our original set contained observations with 2420 data points each. I trained an autoencoder to reduce this to just 121 points, which is approximately 5% of the original size.

AutoEncoder Intuition

When I started to think about how I (a human) might encode a sequence, all of my methods involved slicing the data into equal-size portions and then trying to minimize the error within each slice of time. Another way I thought about the problem was to imagine an encoded value acting like a control point on a Bézier curve. Both of these intuitions led me to think an encoded point would have much more influence based on its locality, which turns out to be wrong.

I started to play with some of the values to get a better idea of what is going on with the encoding. To my surprise, changing a local value only has a small impact on that region. It’s as if the temporal dependencies of the curve are being encoded across the entire sequence; or, said another way, each encoded value holds a bit of information about the entire curve.
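A rough way to reproduce this kind of experiment, assuming the encoder and decoder from the sketch above and a single window x of shape (1, 2400), is to nudge one latent value and compare reconstructions:

```python
import numpy as np

# x: one original window, shape (1, 2400); encoder/decoder as sketched above.
z = encoder.predict(x)            # latent code, shape (1, 120)

z_perturbed = z.copy()
z_perturbed[0, 60] += 0.5         # nudge a single "local" latent value

baseline = decoder.predict(z)     # reconstruction from the original code
perturbed = decoder.predict(z_perturbed)

# If influence were purely local, the difference would be concentrated
# around one region; in practice it spreads across the whole curve.
diff = np.abs(perturbed - baseline)[0]
print(diff.argmax(), diff.mean())
```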

With this guiding my intuition, I decided to play with a mapping between two of the latent spaces derived from autoencoding: namely, from the past latent space to the future latent space.

Latent Time Shifting with AutoEncoders

For this learning setup we are going to require 3 AutoEncoders:

  • The first encoder will produce a representation of the “past”
  • The second encoder will produce a representation of the “future”
  • Last we will have a network that can produce a mapping from the past to the future

Let’s outline the approach to our problem as follows. We have 2420 data points, each representing 1 minute of trading. Points 0–2400 contain all the trading leading up to the EMA crossover event. The next 20 minutes represent the “future”: what happened after that trade.

NOTE: The first thing we need to do is carve out a holdout set from our data. This will be data that NONE of the encoders get to see. This is very important as any data that is used to train any one of the 3 encoders will bias our results.

I have structured things as follows:

Past Encoder

I take data points 0–2400 and encode them down to a 120-sample encoding. This encodes our understanding of the past.

Future Encoder

Next we take data points 20–2420 and encode them down to a 120-sample encoding. Note that there is a large overlap of points between the past and future windows. The encodings, however, will not take on the same values, since each encoder learns weights that pertain to looking either 20 minutes into the past or 20 minutes into the future.
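In code, the two views are just shifted slices of the same 2420-point windows (the array shapes and split below are placeholders, not our actual data set):

```python
import numpy as np

# Stand-in data: one row per EMA-crossover event, 2420 one-minute points.
windows = np.random.randn(5000, 2420)

# Hold out rows that none of the three networks ever sees.
holdout, train = windows[:500], windows[500:]

past = train[:, 0:2400]      # everything leading up to the crossover
future = train[:, 20:2420]   # the same series shifted 20 minutes forward

# A past encoder and a future encoder are trained separately on these
# slices, each compressing 2400 points down to a 120-value latent code.
```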

Mapping Encoder

Finally, since we have all our past and future encoded values, we now need to learn a mapping from the past encoding to the future encoding.
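A minimal sketch of that mapping network (the hidden-layer size is a placeholder, and z_past, z_future, past_encoder, and future_decoder are the hypothetical encoded arrays and models from the sketches above):

```python
from tensorflow.keras import layers, models

LATENT = 120

# Maps a past latent code to a predicted future latent code.
mapper = models.Sequential([
    layers.Input(shape=(LATENT,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(LATENT, activation="linear"),
])
mapper.compile(optimizer="adam", loss="mse")

# z_past / z_future: encoded training windows from the two autoencoders.
# mapper.fit(z_past, z_future, epochs=50, batch_size=128, validation_split=0.1)

# At prediction time: encode the past, map it forward, and decode with the
# future decoder to get a full 2400-point "future" curve.
# predicted_future = future_decoder.predict(
#     mapper.predict(past_encoder.predict(x_past)))
```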

Results

There are a number of ways we could measure the accuracy of this kind of model. One of the better methods is to calculate the area between the two curves: the smaller the area, the more accurate the model. Another way, which lets us compare against our earlier simple MLP model, is to once again take the final minute (2420), divide the distribution into 5 buckets or classes, and check how often the predicted value lands in the correct class. When I do this, the model achieves about 25% accuracy. This is down from the 30% that the MLP achieved, and only 5 percentage points better than random guessing.
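One way to compute that bucketed score (a sketch, with hypothetical arrays y_true and y_pred holding the actual and predicted minute-2420 values, and quantile-based class edges assumed) is:

```python
import numpy as np

def bucket_accuracy(y_true, y_pred, n_buckets=5):
    # Class boundaries at the quantiles of the true distribution,
    # so each bucket holds roughly 20% of the observations.
    edges = np.quantile(y_true, np.linspace(0, 1, n_buckets + 1)[1:-1])
    true_class = np.digitize(y_true, edges)
    pred_class = np.digitize(y_pred, edges)
    return (true_class == pred_class).mean()

# Random guessing over 5 equally likely buckets would score about 0.20.
```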

However, there seems to be a very interesting correlation between the fluctuations of the predicted price line and the original. In a lot of cases the model appears to give a fairly accurate account of when the price will move up or down, even if the magnitude of that movement is off.

Here are some of the results, zoomed in to minutes 2360 through 2420. Note that anything after 2400 is our predicted value, whereas anything before that is a reconstruction of the encoding. Again, I think there is more research to be done here, as some of the captured movement appears to mimic the actual price movement.

Conclusion

Although the model achieves lower accuracy on the test set, the way it tracks the actual price movement looks promising. It would be interesting to see how well models like the above could perform in other domains.

In the next post we will start to explore Seq2Seq models using LSTM.
