Stock trader with Q-Learning

Qian Liu
Apr 1, 2019


Project Definition

Traders around the world try to make money from the stock market by making buy, sell or sit decisions. This is a very difficult task for human beings: the market changes rapidly and is influenced by many factors, so traders need to adapt their strategy to real-time variables such as the closing price, the number of shares they hold, and so on. With the development of machine learning, much work has been dedicated to building computer traders that perform financial trading strategies for us. This project was inspired by the Machine Learning for Trading course.

The goal is to create a stock trader capable of learning from the market variables, generating buy, sell or sit actions, and evaluating its own performance. The tasks involved are as follows:

  1. Fetch and preprocess historical stock data from Yahoo! Finance using the ‘pandas_datareader’ package
  2. Train a trader that can decide which action to take given the current stock market environment
  3. Evaluate performance of the trader

Metric

As the goal is to maximize profits, the ratio of profits to invested capital is a good measure of how profitable the trader is. It is defined as follows:

profitability = (total capital at the end of trading - invested capital)/invested capital

Analysis

Data Exploration

In this project, I fetched the historical stock data for Apple (AAPL) from 2014/01/01 to 2018/01/01. More precisely, 80% of the data is used to train the trader, and the remaining 20% is kept unseen as test data.
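
For reference, a minimal sketch of the fetch-and-split step, assuming the ‘pandas_datareader’ Yahoo! Finance reader (the exact calls in the project may differ, and the Yahoo endpoint has changed over the years):

```python
import pandas_datareader.data as pdr

# Fetch daily AAPL data from Yahoo! Finance.
df = pdr.DataReader('AAPL', 'yahoo', start='2014-01-01', end='2018-01-01')

# Chronological 80/20 train/test split -- no shuffling for time series.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
```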

The dataset contains the standard Yahoo! Finance fields: Date, Open, High, Low, Close, Volume, and Adj Close (the adjusted closing price).

Based on the definitions of the above fields, the Adjusted Close value, which reflects the value of the stock more accurately, is a predictor one wants to include as the trader’s input.

Besides the Adjusted close, there are other common stock indicators that could also serve as inputs:

  1. Simple Moving Average (SMA): the average price over a given time window
  2. Bollinger Band: a measure of the volatility of the stock price, commonly defined as +/- 2 standard deviations from the SMA (both indicators are sketched below)
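
A minimal pandas sketch of these two indicators, assuming a 20-day window (the window size is one of the tunable preprocessing parameters discussed later):

```python
import pandas as pd

def add_indicators(df: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """Add SMA, Bollinger band width and price/SMA ratio columns."""
    sma = df['Adj Close'].rolling(window).mean()
    std = df['Adj Close'].rolling(window).std()
    df['SMA'] = sma
    # Band width = (upper - lower) / SMA, with bands at SMA +/- 2 std.
    df['BB_width'] = (4 * std) / sma
    # Ratio of the adjusted close to its moving average (used later).
    df['Close_SMA_ratio'] = df['Adj Close'] / sma
    return df
```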

Exploratory Visualization

The figure below shows the distribution of the normalized Bollinger band width. Most of the time, the width stays within 2 standard deviations of the moving mean price. However, there are times when the market becomes very volatile and the band gets much wider. During these high-volatility periods, a trader should be more cautious about taking an action: although there is a chance of making a good fortune, the risk is also much higher than usual.

The distribution of normalized Bollinger band width.

The normalized Bollinger band width is a good predictor candidate, since it helps the trader stay aware of the risk while making profits.

Algorithm and Techniques

The trader is implemented using the Q-learning algorithm, a value-based reinforcement learning algorithm. It takes the current state as input, chooses an action based on a Q-table, receives a reward according to the action taken, and finally updates the Q-table using the Bellman equation. The algorithm is well described in this article.

The following parameters can be tuned to optimize the trader:

  1. Training parameters: training episodes, policy to take an action, and learning rate
  2. Preprocessing parameters: number of states, and window size (See the Data Preprocessing section)

Benchmark

To create an initial benchmark for the trader, I used the simple buy-and-hold scenario: buy on the first day and sell on the last day of the given data set, and compute the resulting profitability. The benchmark profitability on the train data is 106.8%.
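
In pandas terms, this buy-and-hold benchmark is a short sketch (column names follow the Yahoo! Finance data above):

```python
# Buy-and-hold benchmark: buy at the first price, sell at the last.
prices = train['Adj Close']
benchmark = (prices.iloc[-1] - prices.iloc[0]) / prices.iloc[0]
print(f'Benchmark profitability: {benchmark:.1%}')
```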

Methodology

Data Preprocessing

The preprocessing is done in the “data_process.py” file and consists of the following steps:

  1. Create the Bollinger Band width and Adjusted close to SMA ratio columns for a given time window, and drop rows with missing values caused by the rolling mean computation
  2. Normalize the Adjusted close, Bollinger Band width and Adjusted close to SMA ratio values by their values on the first available trading date
  3. Discretize the normalized values into integer states, by dividing the ordered values into equally sized chunks and representing each chunk with an integer
  4. Combine the per-indicator states into a single state column

During inference, steps 1, 2 and 4 are the same. For step 3, the test data is discretized using the same chunk boundaries generated from the train data, as sketched below.
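
A minimal sketch of the discretization step (function and column names here are illustrative, not necessarily those in “data_process.py”):

```python
import numpy as np

def make_bins(train_values: np.ndarray, n_states: int = 10) -> np.ndarray:
    """Equal-frequency bin edges computed from the training data only."""
    return np.percentile(train_values, np.linspace(0, 100, n_states + 1)[1:-1])

def discretize(values: np.ndarray, bins: np.ndarray) -> np.ndarray:
    """Map continuous values to integer states 0..n_states-1."""
    return np.digitize(values, bins)

# Bins come from the train split and are reused at inference time,
# so the test data maps onto the same integer states (step 3).
bins = make_bins(train['Close_SMA_ratio'].dropna().values)
train_states = discretize(train['Close_SMA_ratio'].dropna().values, bins)
test_states = discretize(test['Close_SMA_ratio'].dropna().values, bins)
# Step 4 then combines the per-indicator states into one state column,
# e.g. by concatenating the digits of the individual states.
```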

Implementation

The trader was trained on the preprocessed training data. This was done in the Jupyter Notebook named “stock_trader”, and can be divided into the following steps:

  1. Split the data into train and test data set, and preprocess them as described in the section above
  2. Implement helper functions:
  • initialize_q_mat(): initialize a Q-table with small random numbers
  • act(): generate an action signal (0: sit, 1: buy, 2: sell) based on the action policy. The policy either picks a random action or the best action for the current state according to the Q-table, depending on a given threshold. The threshold defines the chance that the trader explores the possible actions (see the sketch after this list).
  • get_return_since_entry(): compute the total capital if one sells all the stocks the trader bought
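
A sketch of the first two helpers (the signatures are assumptions; the notebook’s versions may differ):

```python
import numpy as np

def initialize_q_mat(n_states: int, n_actions: int = 3) -> np.ndarray:
    """Q-table seeded with small random numbers, as described above."""
    return np.random.uniform(low=0.0, high=0.01, size=(n_states, n_actions))

def act(state: int, q_table: np.ndarray, epsilon: float = 0.3) -> int:
    """Epsilon-greedy action: 0 = sit, 1 = buy, 2 = sell."""
    if np.random.random() < epsilon:
        return np.random.randint(q_table.shape[1])  # explore randomly
    return int(np.argmax(q_table[state]))           # exploit the Q-table
```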

3. Define the training procedure (train_q_learning function):

  • For each date in the stock data frame, get the current state
  • 30% of the time, take a random action; 70% of the time, take the best action for the current state according to the Q-table
  • Get the reward: if the action is buy, reward = 0; if the action is hold and there are no shares in inventory, reward = 0; if the action is hold and there are shares in inventory, reward = current price - previous price; if the action is sell with no shares in inventory, punish the trader with reward = -100; if the action is sell and there are shares in inventory, reward = current price - the first bought price
  • Update the Q-table with the Bellman equation:

Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max_a′ Q(s′, a′))
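
A sketch of the reward and update steps inside the training loop, assuming a learning rate alpha and discount factor gamma (the actual values are not stated in the article):

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.95  # assumed learning rate and discount factor

def get_reward(action, inventory, current_price, previous_price):
    """Reward rules exactly as listed above (inventory holds buy prices)."""
    if action == 1:                      # buy
        return 0.0
    if action == 0:                      # hold/sit
        return (current_price - previous_price) if inventory else 0.0
    if not inventory:                    # sell with nothing to sell
        return -100.0
    return current_price - inventory[0]  # sell: gain over first bought price

def update_q(q_table, state, action, reward, next_state):
    """One Q-learning update, matching the Bellman equation above."""
    best_next = np.max(q_table[next_state])
    q_table[state, action] = ((1 - ALPHA) * q_table[state, action]
                              + ALPHA * (reward + GAMMA * best_next))
```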

4. Visualize the results (visualize_results function):

  • plot the returns since entry and visualize the actions the trader took on each day
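
A possible shape for this visualization (a sketch; the notebook’s plotting code is not shown in the article):

```python
import matplotlib.pyplot as plt

def visualize_results(dates, returns, prices, actions):
    """Two panels: returns since entry on top, price and actions below."""
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
    ax1.plot(dates, returns)
    ax1.set_ylabel('Return since entry')
    colors = {0: 'blue', 1: 'green', 2: 'red'}  # sit / buy / sell
    ax2.plot(dates, prices, color='lightgray')
    ax2.scatter(dates, prices, c=[colors[a] for a in actions], s=12)
    ax2.set_ylabel('Normalized Adj Close')
    plt.show()
```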

Refinement

To get an initial result, only the discretized normalized adjusted close price was used as the input state, with a single training episode. This yields 118.93% profitability on the training data, slightly outperforming the benchmark profitability of 106.83%. On the test data, however, the profitability is 9.13%, much lower than the benchmark of 23.20%.

Improvements were made by:

  1. Including more predictors in the state: the Bollinger Band states, the Adjusted close price to SMA ratio states
  2. Increasing training episodes
  3. Optimizing the exploration threshold, i.e. the percentage of the time the trader explores the possible actions during training

After the tuning, the model achieves a profitability of 755.28% on the train data set, and 56.11% on the test data set.

The upper plot shows the returns since entry; the lower plot shows the actions taken by the trained trader.

Final model performance on the test data. The top plot shows the returns since entry as a function of time. The bottom plot shows the normalized adjusted close price and the actions made by the trained trader: blue dots are ‘hold’, green are ‘buy’ and red are ‘sell’. In general, the trader is able to buy the dip, sell at the top, and hold while the price is going up.

Model Evaluation and Validation

The final training parameters were chosen because they yield the best return/invest ratio (the highest profitability) on both the train and test data sets among the values tried. A complete description of the final model and training process:

  • The predictors are the Adjusted close price, the Bollinger band width, and the ratio of the Adjusted close price to the SMA
  • Each predictor is normalized by its value on the first available date
  • Each normalized predictor is discretized into integers from 0 to 9, where each integer bin holds the same number of values
  • The training process updates the Q-table over 4 episodes, and 30% of the time the trader takes a random action

To verify the robustness of the final model, two validation data sets are introduced, one short-term and one long-term. The short-term validation data set is the historical stock data for Google from 2017/11/01 to 2017/12/01. The model yields a return/invest ratio of -0.0097, i.e. a profitability of -0.97%, which is lower than the benchmark profitability of 2.16%.

For Google stock from 2017/11/01 to 2017/12/01.

The long-term validation data set is also Google stock, from 2018/11/01 to 2018/12/01. The model yields a profitability of 34.39%, much higher than the benchmark profitability of 10.51%.

For Google stock from 2018/11/01 to 2018/12/01.

Justification

Using the Q-learning algorithm on the stock data, I got the following results:

  • The Q-table converges very fast; the training process completed in 1.827 s.
  • The trained trader is capable of buying, selling and holding stock shares, and of making a profit from the stock market.
  • The trader yields higher profits than the benchmark on the train, test and long-term validation data sets.

This trader is useful for generating action signals to help humans make stock-trading decisions. However, it has limitations: it assumes limitless investment capital, and it cannot decide how many shares to buy or sell.

Reflection

The process used for this project can be summarized in the following steps:

  1. An initial problem was formed and background financial knowledge was gathered
  2. The data was gathered and preprocessed
  3. A benchmark was created for the trader
  4. The trader was trained on the training data for a number of training episodes, and the other training parameters were tuned until the best performance was achieved

The most interesting part of this project was exploring the financial knowledge. Once the commonly used stock indicators were included in the model, its performance improved considerably.

Improvement

Further improvements can be made by:

  1. Including daily stock-market-related tweets, since market mood has been shown to be an important influence on the stock market
  2. Addressing a limitation of Q-learning: one needs to enumerate all possible states before training, so the algorithm fails when a new state appears in the future. A deep Q-learning algorithm can help eliminate this limitation, since it can cover a continuous input space
