Can Reinforcement Learning Trade Stock? Implementation in R.

Dec 13, 2018

Here we go. Let’s make a prototype of a reinforcement learning (RL) agent that masters a trading skill.

The next part, on improvements to the experiment logic, is linked at the end of this article.

A newer article, which you may want to read, contains the full experiment code.

Since the prototype is implemented in R, I encourage R users and programmers to get familiar with the ideas expressed in this material.

Intro to a Problem

Read this paper: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

It will introduce you to the idea of using a Deep Q-Network (DQN) to approximate value functions that are crucial to solving a Markov Decision Process.

I also recommend a deep dive into the RL math using this preprint of the book by Richard S. Sutton and Andrew G. Barto: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Later on I will introduce an advanced version of the original DQN which incorporates more ideas to help it converge well and fast, namely:

Deep Double Dueling Noisy neural networks with prioritized sampling from an experience replay buffer.

What makes this approach superior to the classic DQN?

Double: there are two networks, one that is trained and one that estimates the next-state Q values
Dueling: there are neurons that estimate the state value and the advantages explicitly
Noisy: noise matrices are applied to the weights of intermediate layers, and the noise parameters (means and standard deviations) are learnable weights
Prioritized: batches sampled from the replay buffer favor examples that produced large residuals in previous training steps; these residuals can be stored in an auxiliary array
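To make the "Double" and "Prioritized" ideas more concrete, here is a minimal sketch in base R. It is not the code used in the experiment: the function names, the discount `gamma`, and the exponent `alpha` are assumptions chosen for illustration. The first function builds a Double DQN target (one network picks the next action, the other evaluates it); the second turns stored residuals into sampling probabilities.

```r
# Double DQN target (sketch): the online network selects the next action,
# the target network supplies the value of that action.
# q_next_online, q_next_target: matrices [batch, n_actions] (hypothetical inputs)
double_dqn_target <- function(reward, q_next_online, q_next_target,
                              done, gamma = 0.99) {
  best_action <- max.col(q_next_online, ties.method = "first")
  idx <- cbind(seq_len(nrow(q_next_target)), best_action)
  next_value <- q_next_target[idx]
  # no bootstrapping from terminal states (done == 1)
  reward + gamma * next_value * (1 - done)
}

# Proportional prioritization (sketch): larger |residual| means a higher
# probability of being sampled; alpha and eps are illustrative values.
priority_probs <- function(residuals, alpha = 0.6, eps = 1e-6) {
  p <- (abs(residuals) + eps)^alpha
  p / sum(p)
}

q_online <- matrix(c(1.0, 2.0, 0.5,
                     0.2, 0.1, 0.9), nrow = 2, byrow = TRUE)
q_target <- matrix(c(0.8, 1.5, 0.7,
                     0.3, 0.4, 0.6), nrow = 2, byrow = TRUE)

targets <- double_dqn_target(
  reward = c(1, -1),
  q_next_online = q_online,
  q_next_target = q_target,
  done = c(0, 1),
  gamma = 0.9
)

# residuals would normally come from the auxiliary array mentioned above
probs <- priority_probs(c(0.1, 2.0, 0.5))
batch_idx <- sample(seq_along(probs), size = 2, prob = probs)
```

In the first row the online network prefers the second action, so the target uses the target network's value for that action; the second row is terminal, so its target is just the reward.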

Well, what about trading made by a DQN agent? It is an interesting topic per se.

Here are the reasons why it is interesting:

Absolute freedom to choose state representation, actions, rewards, and NN architectures. One can enrich the input space with anything they deem worth trying, from news to other stocks and indexes.
Trading logic fits reinforcement learning logic well: the agent takes discrete (or continuous) actions, the reward is intrinsically sparse (it arrives after a trade closes or a period expires), the environment is partially observable and may contain information about next steps, and trading is an episodic game.
One is able to compare DQN results with several benchmarks, such as indexes and technical trade systems.
Agent can learn new information in a non-stop fashion and thus adjust itself to changing game rules.

To get things done fast, get acquainted with the code of this NN, which I want to share since it is one of the most puzzling parts of the whole thing.

R code for a value neural network that uses Keras backend to build our RL agent.

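# configure critic NN ------------

library('keras')
library('R6')

learning_rate <- 1e-3
state_names_length <- 12 # just for example
acts <- seq_len(3) # assumed: three discrete actions; `acts` is not defined in the published snippet

a_CustomLayer <- R6::R6Class(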
          "CustomLayer"
          , inherit = KerasLayer
          , public = list(
            
            call = function(x, mask = NULL) {
                 x - k_mean(x, axis = 2, keepdims = T)
            }
            
     )
)

a_normalize_layer <- function(object) {
     create_layer(a_CustomLayer, object, list(name = 'a_normalize_layer'))
}

v_CustomLayer <- R6::R6Class(
     "CustomLayer"
     , inherit = KerasLayer
     , public = list(
          
          call = function(x, mask = NULL) {
               k_concatenate(list(x, x, x), axis = 2)
          }
          
          , compute_output_shape = function(input_shape) {
               
               output_shape = input_shape
               output_shape[[2]] <- input_shape[[2]] * 3L
               
               output_shape
          }
     )
)

v_normalize_layer <- function(object) {
     create_layer(v_CustomLayer, object, list(name = 'v_normalize_layer'))
}

noise_CustomLayer <- R6::R6Class(
     "CustomLayer"
     , inherit = KerasLayer
     , lock_objects = FALSE
     , public = list(
        
        initialize = function(output_dim) {
             self$output_dim <- output_dim
        }
     
       , build = function(input_shape) {
             
             self$input_dim <- input_shape[[2]]
             
             sqr_inputs <- self$input_dim ** (1/2)
             
             self$sigma_initializer <- initializer_constant(.5 / sqr_inputs)
             
             self$mu_initializer <- initializer_random_uniform(minval = (-1 / sqr_inputs), maxval = (1 / sqr_inputs))
             
             self$mu_weight <- self$add_weight(
                  name = 'mu_weight', 
                  shape = list(self$input_dim, self$output_dim),
                  initializer = self$mu_initializer,
                  trainable = TRUE
             )
             
             self$sigma_weight <- self$add_weight(
                  name = 'sigma_weight', 
                  shape = list(self$input_dim, self$output_dim),
                  initializer = self$sigma_initializer,
                  trainable = TRUE
             )
             
             self$mu_bias <- self$add_weight(
                  name = 'mu_bias', 
                  shape = list(self$output_dim),
                  initializer = self$mu_initializer,
                  trainable = TRUE
             )
             
             self$sigma_bias <- self$add_weight(
                  name = 'sigma_bias', 
                  shape = list(self$output_dim),
                  initializer = self$sigma_initializer,
                  trainable = TRUE
             )
             
        }
        
       , call = function(x, mask = NULL) {
             
             #sample from noise distribution
             
             e_i = k_random_normal(shape = list(self$input_dim, self$output_dim))
             e_j = k_random_normal(shape = list(self$output_dim))
             
             
             #We use the factorized Gaussian noise variant from Section 3 of Fortunato et al.
             
             eW = k_sign(e_i) * (k_sqrt(k_abs(e_i))) * k_sign(e_j) * (k_sqrt(k_abs(e_j)))
             eB = k_sign(e_j) * (k_abs(e_j) ** (1/2))
             
             
             #See section 3 of Fortunato et al.
             
             noise_injected_weights = k_dot(x, self$mu_weight + (self$sigma_weight * eW))
             noise_injected_bias = self$mu_bias + (self$sigma_bias * eB)
             output = k_bias_add(noise_injected_weights, noise_injected_bias)
                  
             output
             
        }
        
       , compute_output_shape = function(input_shape) {
             
             output_shape <- input_shape
             output_shape[[2]] <- self$output_dim
             
             output_shape
             
        }
     )
)

noise_add_layer <- function(object, output_dim) {
     create_layer(
          noise_CustomLayer
          , object
          , list(
               name = 'noise_add_layer'
               , output_dim = as.integer(output_dim)
               , trainable = T
          )
     )
}

critic_input <- layer_input(
     shape = c(as.integer(state_names_length))
     , name = 'critic_input'
)

common_layer_dense_1 <- layer_dense(
     units = 20
     , activation = "tanh"
)

critic_layer_dense_v_1 <- layer_dense(
     units = 10
     , activation = "tanh"
)

critic_layer_dense_v_2 <- layer_dense(
     units = 5
     , activation = "tanh"
)

critic_layer_dense_v_3 <- layer_dense(
     units = 1
     , name = 'critic_layer_dense_v_3'
)

critic_layer_dense_a_1 <- layer_dense(
     units = 10
     , activation = "tanh"
)

# critic_layer_dense_a_2 <- layer_dense(
#      units = 5
#      , activation = "tanh"
# )

critic_layer_dense_a_3 <- layer_dense(
     units = length(acts)
     , name = 'critic_layer_dense_a_3'
)

critic_model_v <-
     critic_input %>%
     common_layer_dense_1 %>%
     critic_layer_dense_v_1 %>%
     critic_layer_dense_v_2 %>%
     critic_layer_dense_v_3 %>%
     v_normalize_layer

critic_model_a <-
     critic_input %>%
     common_layer_dense_1 %>%
     critic_layer_dense_a_1 %>%
     #critic_layer_dense_a_2 %>%
     noise_add_layer(output_dim = 5) %>%
     critic_layer_dense_a_3 %>%
     a_normalize_layer

critic_output <-
     layer_add(
          list(
               critic_model_v
               , critic_model_a
          )
          , name = 'critic_output'
     )

critic_model_1  <- keras_model(
     inputs = critic_input
     , outputs = critic_output
)

critic_optimizer = optimizer_adam(lr = learning_rate)

keras::compile(
     critic_model_1
     , optimizer = critic_optimizer
     , loss = 'mse'
     , metrics = 'mse'
)

# quick smoke test on random inputs
train.x <- rnorm(state_names_length * 10)
train.x <- array(train.x, dim = c(10, state_names_length))

predict(critic_model_1, train.x)

# note: plain assignment does not copy a Keras model; both names refer to the same object
critic_model_2 <- critic_model_1

I used this source to adapt the Python code for the noisy part of the network: https://github.com/jakegrigsby/keras-rl
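For readers who want to see the noisy layer arithmetic without Keras, here is a rough base R sketch of a dense layer with factorized Gaussian noise. The names `scale_noise` and `noisy_dense` are hypothetical, and the sketch draws one noise value per input unit and one per output unit (as described in Section 3 of Fortunato et al.), whereas the Keras layer above draws a full input-by-output matrix for `e_i`. The initial values mirror the initializers used in the layer's `build` method.

```r
# f(e) = sign(e) * sqrt(|e|), the scaling applied to each noise draw
scale_noise <- function(e) sign(e) * sqrt(abs(e))

# Forward pass of a noisy dense layer (sketch).
# x: [batch, n_in]; mu_w, sigma_w: [n_in, n_out]; mu_b, sigma_b: [n_out]
noisy_dense <- function(x, mu_w, sigma_w, mu_b, sigma_b) {
  n_in <- nrow(mu_w)
  n_out <- ncol(mu_w)
  e_in <- scale_noise(rnorm(n_in))    # one noise value per input unit
  e_out <- scale_noise(rnorm(n_out))  # one noise value per output unit
  e_w <- outer(e_in, e_out)           # [n_in, n_out] weight noise
  e_b <- e_out                        # bias noise
  w <- mu_w + sigma_w * e_w
  b <- mu_b + sigma_b * e_b
  sweep(x %*% w, 2, b, `+`)           # add the bias to every row
}

set.seed(1)
n_in <- 4
n_out <- 3
bound <- 1 / sqrt(n_in)

# same initial values as in the Keras layer: uniform mu, constant sigma
mu_w <- matrix(runif(n_in * n_out, -bound, bound), n_in, n_out)
sigma_w <- matrix(0.5 / sqrt(n_in), n_in, n_out)
mu_b <- runif(n_out, -bound, bound)
sigma_b <- rep(0.5 / sqrt(n_in), n_out)

x <- matrix(rnorm(2 * n_in), nrow = 2)
y <- noisy_dense(x, mu_w, sigma_w, mu_b, sigma_b)

# with sigma set to zero the layer reduces to an ordinary dense layer
y_zero <- noisy_dense(x, mu_w, 0 * sigma_w, mu_b, 0 * sigma_b)
```

Because the sigmas are trainable, the network can shrink them over time and so reduce its own exploration noise.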

This neural network looks like this:

Recall that in dueling architecture we employ the equality (eq.1):

Q = A’ + V, where

A’ = A - avg(A);

Q = state-action value;

V = state value;

A = advantage.
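As a small numeric illustration of eq. 1, the following base R sketch combines a state value and a vector of advantages into Q values. The function name `dueling_q` and the numbers are made up for the example; in the network above the same arithmetic is done by `a_normalize_layer`, `v_normalize_layer`, and `layer_add`.

```r
# Dueling aggregation (sketch): Q = V + (A - avg(A)), computed per row.
# v: state values, one per sample; a: advantages, matrix [batch, n_actions]
dueling_q <- function(v, a) {
  a_centered <- a - rowMeans(a)  # A' = A - avg(A)
  a_centered + v                 # V is added to every action of the same row
}

v <- c(1, -2)
a <- matrix(c(1, 2, 3,
              0, 0, 3), nrow = 2, byrow = TRUE)

q <- dueling_q(v, a)
```

Since the centered advantages average to zero in each row, the mean of the Q values over actions equals V for that state.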

Other variables in the code are quite self-explanatory. Besides, this architecture suits this particular task only, so do not treat it as a general recipe.

The rest of the code is too much boilerplate to publish here, and writing it is left as a challenge for the programmer.

Phase I

We run our agent against a synthetic dataset. Our transaction cost equals 0.5:

The result is great. The maximum average reward should be 1.5 in this setting.

We see: critic loss, average reward per episode, cumulative reward, sample of last rewards.

Phase II

We train our agent on an arbitrarily chosen stock symbol that showed interesting behaviour: a flat beginning, rapid growth in the middle, and a dreary ending. There are about 4300 days in our training set. The transaction cost is set to $0.1 (purposefully low); each reward is the USD profit/loss after buying/selling 1.0 share.

Source: https://finance.yahoo.com/quote/algn?ltr=1
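The reward definition can be sketched in a few lines of R. This is only one guess at the bookkeeping, not the experiment code: the function name `trade_reward`, the `direction` argument, and the assumption that the cost is charged once per closed trade are all hypothetical.

```r
# Reward of one closed trade in USD (sketch under the assumptions above).
# direction: +1 for a long trade, -1 for a short trade
trade_reward <- function(entry_price, exit_price, direction,
                         shares = 1.0, transaction_cost = 0.1) {
  direction * (exit_price - entry_price) * shares - transaction_cost
}

long_reward <- trade_reward(entry_price = 100, exit_price = 101.5, direction = 1)
short_reward <- trade_reward(entry_price = 100, exit_price = 101.5, direction = -1)
```

With these numbers the long trade earns the price move minus the cost, and the short trade loses the price move plus the cost.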

After tweaking some parameters (and leaving the NN architecture the same), we arrived at this result:

It is not bad: after all, the agent learned how to make a profit by pushing the three buttons on its console.

Note that at its apex the average reward per episode has beaten the realistic transaction cost that one may face in real trading.

It is too bad that stocks crash like crazy on bad news…

Concluding remarks

Trading with the help of RL is not only challenging but also rewarding. When your robot does it better than you do, it is time to spend your personal time on getting educated and healthy.

I hope that was an interesting trip for you. If you enjoyed this story, let me know. If there is enough interest, I can continue and show you how policy gradient methods work using R and the Keras API.

I also want to thank my friends who are passionate about neural networks for their advice.

“I demand that the banquet continue!!!”

You are also welcome to read the next part: https://medium.com/@alexeybnk/improving-q-learning-agent-trading-stock-by-adding-recurrency-and-reward-shaping-b9e0ee095c8b

Written by Alexey Burnakov