Part 3: Generating Training Data

Corey Auger
6 min read · May 7, 2018

Sign up for the Beta at http://daytrader.ai

Here’s where you can start to get your hands dirty.

The last post discussed the daytrader.ai system architecture for forward testing, back testing, and pattern matching on our data. In this post I want to walk through some of the steps for tuning my Flink cluster to identify our base-level pattern. After this I will use the client to report patterns to a Training Data Generator, which is just another module that uses Flink Triggers to write out CSV files to disk. Finally, I will provide some links and resources for experimenting with this data.

Using Apache Flink

Apache Flink is a scalable, general-purpose computation platform. In many ways it is similar to the very popular Apache Spark project. One of the main differences is how it handles streaming and time series data.

“The main difference is Flink was built from the ground up as a streaming product. Spark added Streaming onto their product later.”

Source: http://www.developintelligence.com/blog/2017/02/comparing-contrasting-apache-flink-vs-spark/

Flink supports a number of features that make it ideal for financial time series data. I highly encourage you to at least take a preliminary look through the documentation here.

Flink Time Windows:

Flink provides windows over time series data and a number of tools to perform operations on these windows.

There are a number of useful window types. Here are some descriptions pulled straight from the documentation.

Tumbling Windows

A tumbling windows assigner assigns each element to a window of a specified window size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, the current window will be evaluated and a new window will be started every five minutes as illustrated by the following figure.

Sliding Windows

The sliding windows assigner assigns elements to windows of fixed length. Similar to a tumbling windows assigner, the size of the windows is configured by the window size parameter. An additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows can be overlapping if the slide is smaller than the window size. In this case elements are assigned to multiple windows.
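The assignment arithmetic behind both of these is simple. Here is a sketch of it in plain Scala (my own illustration of the idea, assuming a zero window offset; this is not Flink code):

```scala
object WindowAssign {
  // Start of the tumbling window (size in ms) containing a timestamp.
  // Tumbling windows tile the timeline with no overlap.
  def tumblingStart(ts: Long, size: Long): Long =
    ts - (ts % size)

  // Starts of all sliding windows (size, slide in ms) containing a
  // timestamp. When slide < size a point falls into size / slide windows.
  def slidingStarts(ts: Long, size: Long, slide: Long): Seq[Long] = {
    val lastStart = ts - (ts % slide)
    (lastStart to math.max(lastStart - size + slide, 0L) by -slide)
      .filter(s => s + size > ts)
  }
}
```

For example, with a 10-minute window sliding every 5 minutes, each element lands in two overlapping windows.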

Session Windows

The session windows assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time, in contrast to tumbling windows and sliding windows. Instead, a session window closes when it does not receive elements for a certain period of time, for example when a gap of inactivity occurred. A session window assigner can be configured with either a static session gap or with a session gap extractor function which defines how long the period of inactivity is. When this period expires, the current session closes and subsequent elements are assigned to a new session window.

Global Windows

A global windows assigner assigns all elements with the same key to the same single global window. This windowing scheme is only useful if you also specify a custom trigger. Otherwise, no computation will be performed, as the global window does not have a natural end at which we could process the aggregated elements.

At daytrader.ai I am using a Global Window with a custom trigger. There are definitely some “gotchas” when working with the various window types. Another area that looked promising, but that I eventually abandoned, was Flink’s Complex Event Processing (CEP) library. CEP would have been a cleaner way to define my patterns, but in the end I lost some of the flexibility of looking further back into my window history.

Flink Driver

This is an example of the Flink driver application that we use at daytrader.ai.

Note: this Flink driver application was used to generate the training data that I link to later in the article.

Here we define some parameters to connect to the daytrader.ai gateway and create a Strategy that will contain some technical indicators:

  • EMA 65 at a 1min resolution: m.q.EmaParams(period = 65, interval = m.q.Interval.`1Min`)
  • EMA 15 at a 1min resolution: m.q.EmaParams(period = 15, interval = m.q.Interval.`1Min`)
  • Stochastic Oscillator with lookback period K = 14 and smoothing D = 3: m.q.StochFParams(interval = m.q.Interval.`1Min`, periodK=14, periodD=3, maType = m.q.MaType.SMA)
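For reference, here is roughly what those indicators compute. These are my own naive implementations for illustration, not the daytrader.ai code, and the stochastic here uses closing prices only (the classic formula uses per-bar highs and lows, but close and volume are all this dataset contains):

```scala
object Indicators {
  // Exponential moving average over closing prices (non-empty input)
  // with the usual smoothing factor alpha = 2 / (period + 1).
  def ema(closes: Seq[Double], period: Int): Seq[Double] = {
    val alpha = 2.0 / (period + 1)
    closes.tail.scanLeft(closes.head)((prev, c) => alpha * c + (1 - alpha) * prev)
  }

  // Stochastic %K over the last periodK bars: where the latest close
  // sits in the recent low..high range, scaled to 0..100.
  def stochK(closes: Seq[Double], periodK: Int): Double = {
    val window = closes.takeRight(periodK)
    val (lo, hi) = (window.min, window.max)
    if (hi == lo) 50.0 else 100.0 * (closes.last - lo) / (hi - lo)
  }

  // %D is a simple moving average of the last periodD %K values
  // (matching maType = SMA above).
  def stochD(ks: Seq[Double], periodD: Int): Double =
    ks.takeRight(periodD).sum / periodD
}
```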

Next we define an Execution of that Strategy. This includes the starting date for the execution (or this could be real-time), ending date (again this could be real-time), and finally a list of symbols that we want to perform the analysis on.

Finally we key our stream by the symbols and pass execution over to our custom global window function.
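Conceptually, the keyed global window behaves like this. Below is a toy, Flink-free model I wrote purely for illustration (the real job uses Flink’s keyBy, GlobalWindows, and a custom Trigger): every element is appended to its key’s window, and a trigger predicate decides when the window is fired downstream.

```scala
object KeyedGlobalWindow {
  // Toy model of a keyed global window with a custom trigger: each
  // element joins its key's ever-growing window; when the trigger
  // predicate holds, the window's contents are emitted.
  def run[K, E](events: Seq[(K, E)])(fire: Seq[E] => Boolean): Seq[(K, Seq[E])] = {
    val state = scala.collection.mutable.Map.empty[K, Vector[E]]
    events.flatMap { case (k, e) =>
      val window = state.getOrElse(k, Vector.empty) :+ e
      state(k) = window
      if (fire(window)) Some(k -> (window: Seq[E])) else None
    }
  }
}
```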

In keeping with the previous example, I have created an EMA crossover windowing function. This is where we actually detect our pattern.

Detecting Patterns

The goal at daytrader.ai is to allow people to define custom patterns via a web interface and JavaScript. These patterns will then be pushed into the backend as Flink jobs to run on the cluster. Before we can do that, however, I will show you what my code looks like to manually create one of these windowing functions.

Here we are generating Buy and Sell triggers based on some simple conditions.

This simply says that the most recent element in your window has an EMA-15 value that is greater than the EMA-65 value, and that ALL past EMA-15 values are below the EMA-65. Another way of saying this is that we are detecting the instant our EMA-15 crosses over our EMA-65.
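If you strip away the Flink plumbing, the core of that check is tiny. Here is a sketch with made-up names (IndicatorBar is not the real type in my code):

```scala
case class IndicatorBar(ema15: Double, ema65: Double)

object Crossover {
  // True iff the newest bar has EMA-15 above EMA-65 while every earlier
  // bar in the window had EMA-15 at or below EMA-65; i.e. the exact
  // instant the fast EMA crosses over the slow one.
  def isBuyCross(window: Seq[IndicatorBar]): Boolean =
    window.nonEmpty &&
      window.last.ema15 > window.last.ema65 &&
      window.init.forall(b => b.ema15 <= b.ema65)
}
```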

There is other logic to control not entering the trade too early or too late in the day. Also, we only accept the first crossover for any symbol on any given day, e.g. we keep only the first crossover for Facebook on a day when it might cross over multiple times.
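That extra logic looks roughly like this. This is a sketch only: the Trigger type and the session cut-off times are my own placeholders, not the actual daytrader.ai values.

```scala
import java.time.{Instant, LocalTime, ZoneOffset}

case class Trigger(symbol: String, time: Instant)

object TriggerFilter {
  // Hypothetical session bounds: skip the first and last 30 minutes of
  // the US session (14:30-21:00 UTC) to avoid entering too early or too late.
  private val earliest = LocalTime.of(15, 0)
  private val latest   = LocalTime.of(20, 30)

  // Drop out-of-session triggers, then keep only the first trigger
  // per (symbol, trading day).
  def keepFirstPerDay(triggers: Seq[Trigger]): Seq[Trigger] =
    triggers
      .filter { t =>
        val tod = t.time.atZone(ZoneOffset.UTC).toLocalTime
        !tod.isBefore(earliest) && !tod.isAfter(latest)
      }
      .groupBy(t => (t.symbol, t.time.atZone(ZoneOffset.UTC).toLocalDate))
      .values.map(_.minBy(_.time))
      .toSeq
}
```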

The Data

You can download the data here: https://www.kaggle.com/daytrader/ema-65-crossover

There are over 6,000 training examples in this dataset.

The following pattern has been run on over 5 years of stock history for these symbols:

  • FB — Facebook
  • BABA — Alibaba
  • GOOG — Google class C
  • AAPL — Apple
  • TSLA — Tesla
  • MSFT — Microsoft
  • NVDA — NVidia
  • AMZN — Amazon
  • CRM — Salesforce
  • GOOGL — Google class A
  • ADBE — Adobe
  • NFLX — Netflix
  • INTC — Intel
  • BIDU — Baidu

Once the pattern has been detected, I give you 2,400 minutes (40 hours) of previous history, as well as 20 minutes of future history.

The data will be formatted as follows.

File name: data_N_SYM.csv

N is an incrementing integer

SYM is the stock symbol ticker (FB, BABA, etc.)

Inside each of the CSV files you will find 2,420 lines of comma-separated values, with the format:

ISO-formatted date, closing price, volume.

E.g.:

2017-10-17T14:18:00.000Z,201.87,55800.0
2017-10-17T14:19:00.000Z,201.21,137786.0
2017-10-17T14:20:00.000Z,201.852,103695.0
2017-10-17T14:21:00.000Z,201.6,81362.0
2017-10-17T14:22:00.000Z,201.54,30183.0
2017-10-17T14:23:00.000Z,201.43,72405.0
2017-10-17T14:24:00.000Z,201.15,79411.0
2017-10-17T14:25:00.000Z,201.48,125713.0
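If you want to start playing with the files, a minimal parser is all you need. Here is one in plain Scala with no dependencies (Tick is just a name I made up):

```scala
import java.time.Instant

case class Tick(time: Instant, close: Double, volume: Double)

object DataFile {
  // Parse one CSV row: ISO date, closing price, volume.
  def parseLine(line: String): Tick = {
    val parts = line.split(",")
    Tick(Instant.parse(parts(0)), parts(1).toDouble, parts(2).toDouble)
  }

  // Each file holds 2,420 rows: 2,400 minutes of history before the
  // detected crossover, then 20 minutes of "future" to label against.
  def splitPastFuture(ticks: Seq[Tick]): (Seq[Tick], Seq[Tick]) =
    ticks.splitAt(2400)
}
```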

Summary

In this post we talked about the tools that I used to generate training data, and gave away a large dataset to experiment with. In the next post I will provide Python code for loading, normalizing, and starting to explore the data. The possibilities are exciting, and I’m looking forward to seeing what you can do with this data.

Again here is a link to the data: https://www.kaggle.com/daytrader/ema-65-crossover


Previous: Part 2: Building an Analysis Platform

Next: Part 4: Searching for Signals
