Improving Q-learning agent trading stock by adding recurrency and reward shaping

Alexey Burnakov
Jan 16, 2019

Reminder

Last time we built a Q-learning agent that makes trades on simulated and real stock time series in an attempt to check whether this task domain is suitable for reinforcement learning.

By the way, the next article provides the full code of the experiment, so check it out.

Just to remind you, we used the following synthetic data to test the concept:

synthetic data: sine with white noise

A sine function was the first starting point. The two curves simulate the bid and ask prices of an asset, where the spread is the minimal transaction cost.
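
If you want to reproduce the picture, a minimal sketch of such data could look like the snippet below (the amplitude, noise level, and spread values here are illustrative assumptions, not the exact ones used in the experiments):

set.seed(1)
n      <- 1000
mid    <- sin(seq(0, 20 * pi, length.out = n)) + rnorm(n, sd = 0.1) # sine plus white noise
spread <- 0.05                                                      # minimal transaction cost
ask    <- mid + spread / 2
bid    <- mid - spread / 2
plot(ask,  type = 'l', col = 'red', ylab = 'price')                 # ask curve
lines(bid, col = 'blue')                                            # bid curve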

This time, however, we want to complicate this simple task by elongating the credit-assignment horizon:

synthetic data: sine with white noise

The period of the sine was increased two-fold.

This means that the sparse rewards we use need to be propagated along longer trajectories. On top of that, the chance of getting a positive reward drops severely, because the agent must take a sequence of correct actions twice as long to overcome the transaction cost. Both factors make the task much more difficult, even in a setting as simple as a sine wave.
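
To get a feel for why the longer horizon hurts, here is a back-of-the-envelope discount calculation (the gamma value and step counts are assumptions for illustration, not the exact ones from the experiments):

gamma <- 0.99      # assumed discount factor
n     <- 50        # assumed number of steps to the reward in the original task
gamma ^ n          # ~0.61: the reward signal is still clearly visible
gamma ^ (2 * n)    # ~0.37: with a doubled horizon it is much weaker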

Besides, recall that we used this neural network architecture:

What was added and why

LSTM

First of all, we wanted to give the agent more understanding of the dynamics of changes inside a trajectory. In simple terms, the agent should understand its own behaviour better: what it has done just now and some time in the past, and how the state-action distribution has evolved. A recurrent layer addresses exactly this issue. Welcome the new architecture used to run the new set of experiments:

Note that I slightly improved the description too. The only difference from the older NN is the first hidden LSTM layer.

Note that with an LSTM at work we have to modify experience-replay sampling for training: we now need sequences of transitions instead of single examples. This is how it works (one algorithm, but not the only one). We had used point-wise sampling before:

a dummy replay buffer scheme

We use this scheme with LSTM:

sequences get sampled

Both before and now, the sampling is governed by a prioritized algorithm.

The recurrent LSTM layer lets time-series information propagate forward, catching additional signal hidden in past lags. The time series is a multidimensional tensor whose last dimension is the size of our state representation.
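
To make this concrete, here is a hypothetical sketch of sequence sampling for the LSTM (the function names and uniform sampling are my assumptions; the actual agent uses prioritized sampling, which is omitted here for brevity):

lstm_seq_length    <- 4
state_names_length <- 12

# replay_buffer is assumed to be a list of transitions, each holding a state vector
sample_sequence <- function(replay_buffer, seq_length = lstm_seq_length) {
  start <- sample(seq_len(length(replay_buffer) - seq_length + 1), 1)
  replay_buffer[start:(start + seq_length - 1)]
}

# stack the sampled states into the (batch, time, features) tensor the LSTM expects
to_lstm_input <- function(transitions) {
  states <- t(sapply(transitions, function(tr) tr$state)) # seq_length x state_names_length
  array(states, dim = c(1, nrow(states), ncol(states)))
}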

Presentations

Potential-based reward shaping, PBRS, is a powerful tool for improving the speed and stability of learning a policy that solves an environment, without breaking optimality. I recommend reading at least this seminal paper: https://people.eecs.berkeley.edu/~russell/papers/ml99-shaping.ps

A potential defines how good the state we are in is w.r.t. the goal state we want to enter. A simplistic view of how it works:

There are variations and complications, which you can come to understand through trial and error, and we omit these details.

One thing worth mentioning is that PBRS can be grounded using presentations, which are a form of expert or learned near-optimal behaviour of an agent in an environment. There is a way to find such presentations for our task using optimization schemes.

A potential-shaped reward takes the following form (eq. 1):

r' = r + gamma * F(s') - F(s)

where F stands for the potential of a state, and r is the original reward.
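
As a minimal sketch, assuming a toy potential based on unrealized PnL (the real potential in the experiments is grounded in the presentations mentioned above), eq. 1 can be applied like this:

# toy potential: how close the current unrealized PnL is to a target profit (assumption)
potential <- function(state, target_profit = 0.1) {
  -abs(target_profit - state$unrealized_pnl)
}

# eq. 1: shaped reward
shape_reward <- function(r, state, next_state, gamma = 0.99) {
  r + gamma * potential(next_state) - potential(state)
}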

With all this in mind, let's get to coding.

Implementation in R

Here is the neural network code based on the Keras API:

# configure critic NN ------------------------------------------------------
library('keras')
library('R6')
state_names_length <- 12 # just for example
lstm_seq_length <- 4
learning_rate <- 1e-3
a_CustomLayer <- R6::R6Class(
"CustomLayer"
, inherit = KerasLayer
, public = list(

call = function(x, mask = NULL) {
x - k_mean(x, axis = 2, keepdims = T)
}

)
)
a_normalize_layer <- function(object) {
create_layer(a_CustomLayer, object, list(name = 'a_normalize_layer'))
}
v_CustomLayer <- R6::R6Class(
"CustomLayer"
, inherit = KerasLayer
, public = list(

call = function(x, mask = NULL) {
k_concatenate(list(x, x, x), axis = 2)
}

, compute_output_shape = function(input_shape) {

output_shape = input_shape
output_shape[[2]] <- input_shape[[2]] * 3L

output_shape
}
)
)
v_normalize_layer <- function(object) {
create_layer(v_CustomLayer, object, list(name = 'v_normalize_layer'))
}
noise_CustomLayer <- R6::R6Class(
"CustomLayer"
, inherit = KerasLayer
, lock_objects = FALSE
, public = list(

initialize = function(output_dim) {
self$output_dim <- output_dim
}

, build = function(input_shape) {

self$input_dim <- input_shape[[2]]

sqr_inputs <- self$input_dim ** (1/2)

self$sigma_initializer <- initializer_constant(.5 / sqr_inputs)

self$mu_initializer <- initializer_random_uniform(minval = (-1 / sqr_inputs), maxval = (1 / sqr_inputs))

self$mu_weight <- self$add_weight(
name = 'mu_weight',
shape = list(self$input_dim, self$output_dim),
initializer = self$mu_initializer,
trainable = TRUE
)

self$sigma_weight <- self$add_weight(
name = 'sigma_weight',
shape = list(self$input_dim, self$output_dim),
initializer = self$sigma_initializer,
trainable = TRUE
)

self$mu_bias <- self$add_weight(
name = 'mu_bias',
shape = list(self$output_dim),
initializer = self$mu_initializer,
trainable = TRUE
)

self$sigma_bias <- self$add_weight(
name = 'sigma_bias',
shape = list(self$output_dim),
initializer = self$sigma_initializer,
trainable = TRUE
)

}

, call = function(x, mask = NULL) {

#sample from noise distribution

e_i = k_random_normal(shape = list(self$input_dim, self$output_dim))
e_j = k_random_normal(shape = list(self$output_dim))


#We use the factorized Gaussian noise variant from Section 3 of Fortunato et al.

eW = k_sign(e_i) * (k_sqrt(k_abs(e_i))) * k_sign(e_j) * (k_sqrt(k_abs(e_j)))
eB = k_sign(e_j) * (k_abs(e_j) ** (1/2))


#See section 3 of Fortunato et al.

noise_injected_weights = k_dot(x, self$mu_weight + (self$sigma_weight * eW))
noise_injected_bias = self$mu_bias + (self$sigma_bias * eB)
output = k_bias_add(noise_injected_weights, noise_injected_bias)

output

}

, compute_output_shape = function(input_shape) {

output_shape <- input_shape
output_shape[[2]] <- self$output_dim

output_shape

}
)
)
noise_add_layer <- function(object, output_dim) {
create_layer(
noise_CustomLayer
, object
, list(
name = 'noise_add_layer'
, output_dim = as.integer(output_dim)
, trainable = T
)
)
}
critic_input <- layer_input(
shape = list(NULL, as.integer(state_names_length))
, name = 'critic_input'
)
common_lstm_layer <- layer_lstm(
units = 20
, activation = "tanh"
, recurrent_activation = "hard_sigmoid"
, use_bias = T
, return_sequences = F
, stateful = F
, name = 'lstm1'
)
critic_layer_dense_v_1 <- layer_dense(
units = 10
, activation = "tanh"
)
critic_layer_dense_v_2 <- layer_dense(
units = 5
, activation = "tanh"
)
critic_layer_dense_v_3 <- layer_dense(
units = 1
, name = 'critic_layer_dense_v_3'
)
critic_layer_dense_a_1 <- layer_dense(
units = 10
, activation = "tanh"
)
# critic_layer_dense_a_2 <- layer_dense(
# units = 5
# , activation = "tanh"
# )
critic_layer_dense_a_3 <- layer_dense(
units = length(actions)
, name = 'critic_layer_dense_a_3'
)
critic_model_v <-
critic_input %>%
common_lstm_layer %>%
critic_layer_dense_v_1 %>%
critic_layer_dense_v_2 %>%
critic_layer_dense_v_3 %>%
v_normalize_layer
critic_model_a <-
critic_input %>%
common_lstm_layer %>%
critic_layer_dense_a_1 %>%
#critic_layer_dense_a_2 %>%
noise_add_layer(output_dim = 5) %>%
critic_layer_dense_a_3 %>%
a_normalize_layer
critic_output <-
layer_add(
list(
critic_model_v
, critic_model_a
)
, name = 'critic_output'
)
critic_model_1 <- keras_model(
inputs = critic_input
, outputs = critic_output
)
critic_optimizer = optimizer_adam(lr = learning_rate)
keras::compile(
critic_model_1
, optimizer = critic_optimizer
, loss = 'mse'
, metrics = 'mse'
)
train.x <- array_reshape(rnorm(10 * lstm_seq_length * state_names_length)
, dim = c(10, lstm_seq_length, state_names_length)
, order = 'C')
predict(critic_model_1, train.x)
layer_name <- 'noise_add_layer'
intermediate_layer_model <- keras_model(inputs = critic_model_1$input, outputs = get_layer(critic_model_1, layer_name)$output)
predict(intermediate_layer_model, train.x)[1,]
critic_model_2 <- critic_model_1

Debug your solution thoroughly…

Results and comparison

Let's dive straight into the final results. Note: all results are point estimates and may vary when run multiple times with different random seeds.

Comparison includes:

  • a prior version without LSTM and presentations
  • a simple 2-cell LSTM
  • a 4-cell LSTM
  • a 4-cell LSTM with PBRS
mean return per episode averaged over 1000 episodes
cumulative return per episode
learning progress of the agent

Well, the difference hits you right in the eye: with PBRS, the shaped agent converges much faster and more stably than in the earlier attempts. The speed is about 4-5 times higher than without presentations. The stability is remarkable.

When it comes to using an LSTM, 4 cells showed better results than 2 cells, and the 2-cell LSTM showed better results than the no-LSTM version.

Final words

We have witnessed that recurrency and potential-based reward shaping help. I especially liked how well PBRS performed.

Do not trust anyone, including me, who tells you it is easy to build an RL agent that behaves this nicely, because it is not true. Each new component added to the system makes it potentially less stable and requires a lot of tuning and debugging.

Nevertheless, there is clear evidence that the solution to a task can be improved just by improving the methods used (with the data left untouched). It is simply a fact that for any given task a specific range of parameters works better than others. With this in mind, welcome to the world of successful RL.
