Reinforcement Learning: Automated Trading

Sujata Maurya
Published in Grid Solutions · Sep 29, 2023

Introduction:

Reinforcement Learning (RL) is a dynamic approach within the realm of machine learning that has gained significant attention in the world of automated trading. In the context of financial markets and trading, RL offers a powerful methodology for training intelligent algorithms to make informed trading decisions.

RL involves an agent, which represents the trading algorithm, interacting with an environment, symbolizing the financial market. The agent takes actions, such as buying, selling, or holding assets, with the primary objective of maximizing cumulative rewards over time. These rewards are often associated with profits and losses obtained from the actual trading decisions.

What distinguishes RL in automated trading is its capacity to learn through continuous feedback. As the agent executes actions and observes their outcomes, it receives feedback in the form of rewards or penalties. This feedback loop guides the agent’s learning process, enabling it to adjust its strategies over successive trading instances.

Overall, the integration of Reinforcement Learning in automated trading holds the promise of enhancing trading performance, enabling the development of more sophisticated trading algorithms, and contributing to the evolution of AI-powered strategies in the financial domain.

Reinforcement Learning:

Reinforcement Learning is a branch of machine learning that centers around the interaction between agents and their environments. In this paradigm, agents make decisions or take actions in their environments with the aim of obtaining the highest possible rewards. Unlike some other types of machine learning, there are no predefined labels for the data involved in reinforcement learning scenarios.

Q-Learning:

Q-Learning is a model-free, off-policy RL algorithm designed to enable an agent to learn a policy that maximizes the cumulative rewards it receives over time. This approach is particularly well-suited for scenarios where the agent doesn’t have prior knowledge of the environment’s dynamics or explicit guidance in the form of labeled data. Instead, the agent learns through trial and error, continually refining its strategies to achieve better outcomes.

1- The Q-Value Function:

Central to Q-Learning is the concept of the Q-value function. This function represents the expected cumulative reward an agent can achieve by taking a specific action in a particular state and then following a certain policy thereafter. In essence, the Q-value function guides the agent’s decision-making process, helping it determine which actions are most likely to lead to the highest rewards in the long run.

2- The Learning Process:

The heart of Q-learning lies in the iterative learning process. Initially, the agent’s Q-value estimates are initialized arbitrarily. Through interactions with the environment, it refines these estimates using the famous Bellman equation. At each time step, the agent observes the current state, takes an action, observes the resulting reward and the subsequent state, and then updates its Q-values accordingly.

Trading involves three possible actions: Buy, Sell, or Hold.

Q-learning rates each available action, and the action with the maximum Q-value is selected. These values are learned and stored in a Q-table, and the method works without a model of the reward function or the state-transition probabilities, which makes it well suited to market data. A minimal tabular sketch follows.
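To make this concrete, here is a minimal, self-contained sketch of a tabular Q-learning update over the three trading actions. The states, rewards, and step size below are made up purely for illustration; this toy table is separate from the neural-network agent built later in this post.

import numpy as np

actions = ["hold", "buy", "sell"]
n_states = 5                      # toy discretised market states
alpha, gamma = 0.1, 0.95          # learning rate and discount factor

# The Q-table, initialised arbitrarily; rows are states, columns are actions.
Q = np.zeros((n_states, len(actions)))

def update(state, action, reward, next_state):
    """One Bellman-style Q-learning update for a single observed transition."""
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

# A made-up transition: in state 2 the agent buys, earns a reward of 1.5,
# and the market moves on to state 3.
update(state=2, action=actions.index("buy"), reward=1.5, next_state=3)

# The greedy policy simply picks the action with the highest Q-value.
print("best action in state 2:", actions[int(np.argmax(Q[2]))])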

Reinforcement Learning in Stock Trading:

Steps to run an RL agent:

  1. Install Libraries
  2. Collect the Data
  3. Define the Q-Learning Agent
  4. Train the Agent
  5. Test the Agent
  6. Plot the Results

1. Install Libraries

Install and import the required libraries: NumPy, pandas, Matplotlib, seaborn, and TensorFlow.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

from collections import deque
import random
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()  # needed so the v1-style placeholders and sessions below run under TensorFlow 2.x

2. Collect the Data

Use electricity intraday trading market data.

This example uses a DataFrame containing Japanese intraday prices collected over the course of a few months.

trading_price_df = pd.read_csv("trading_price.csv", index_col=0)
print(trading_price_df)
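Assuming the CSV exposes a Close column (the training code further below relies on it), a quick sanity check of the loaded prices might look like this:

# Quick visual check of the loaded prices (assumes a 'Close' column, as the
# training code below does).
trading_price_df["Close"].plot(figsize=(12, 4), title="Intraday trading prices")
plt.show()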

3. Define the Q-Learning Agent

The agent defines functions for the buy and sell actions. Rewards are calculated by adding or subtracting the value generated by executing each trade. The action taken at the next state is influenced by the action taken in the previous state: action 1 is a Buy, action 2 is a Sell, and action 0 is a Hold. In every iteration the current state is computed, and based on it an action is taken that either buys or sells a unit. The overall reward is accumulated in the total profit variable.

class LearningAgent:
    def __init__(self, state_size, window_size, trend, skip, batch_size):
        self.state_size = state_size
        self.window_size = window_size
        self.half_window = window_size // 2
        self.trend = trend                  # list of prices the agent trades on
        self.skip = skip
        self.action_size = 3                # 0 = hold, 1 = buy, 2 = sell
        self.batch_size = batch_size
        self.memory = deque(maxlen=2000)    # replay buffer of past transitions
        self.inventory = []
        self.gamma = 0.95                   # discount factor
        self.epsilon = 0.5                  # exploration rate for epsilon-greedy actions
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.999
        # A small feed-forward network approximates the Q-values.
        tf.reset_default_graph()
        self.sess = tf.InteractiveSession()
        self.X = tf.placeholder(tf.float32, [None, self.state_size])
        self.Y = tf.placeholder(tf.float32, [None, self.action_size])
        feed = tf.layers.dense(self.X, 256, activation=tf.nn.relu)
        self.logits = tf.layers.dense(feed, self.action_size)
        self.cost = tf.reduce_mean(tf.square(self.Y - self.logits))
        self.optimizer = tf.train.GradientDescentOptimizer(1e-5).minimize(
            self.cost
        )
        self.sess.run(tf.global_variables_initializer())

    def act(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise take the greedy action.
        if random.random() <= self.epsilon:
            return random.randrange(self.action_size)
        return np.argmax(
            self.sess.run(self.logits, feed_dict={self.X: state})[0]
        )

    def get_state(self, t):
        # The state is the vector of price differences over the trailing window.
        window_size = self.window_size + 1
        d = t - window_size + 1
        block = self.trend[d: t + 1] if d >= 0 else -d * [self.trend[0]] + self.trend[0: t + 1]
        res = []
        for i in range(window_size - 1):
            res.append(block[i + 1] - block[i])
        return np.array([res])

    def replay(self, batch_size):
        # Fit the network on the most recent transitions in memory.
        mini_batch = []
        l = len(self.memory)
        for i in range(l - batch_size, l):
            mini_batch.append(self.memory[i])
        replay_size = len(mini_batch)
        X = np.empty((replay_size, self.state_size))
        Y = np.empty((replay_size, self.action_size))
        states = np.array([a[0][0] for a in mini_batch])
        new_states = np.array([a[3][0] for a in mini_batch])
        Q = self.sess.run(self.logits, feed_dict={self.X: states})
        Q_new = self.sess.run(self.logits, feed_dict={self.X: new_states})
        for i in range(len(mini_batch)):
            state, action, reward, next_state, done = mini_batch[i]
            # Bellman target: immediate reward plus discounted best future Q-value.
            target = Q[i]
            target[action] = reward
            if not done:
                target[action] += self.gamma * np.amax(Q_new[i])
            X[i] = state
            Y[i] = target
        cost, _ = self.sess.run(
            [self.cost, self.optimizer], feed_dict={self.X: X, self.Y: Y}
        )
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
        return cost

    def buy(self, initial_money):
        # Run the trained policy over the price series and log every trade.
        starting_money = initial_money
        sell_states = []
        buy_states = []
        inventory = []
        state = self.get_state(0)
        for t in range(0, len(self.trend) - 1, self.skip):
            next_state = self.get_state(t + 1)
            action = self.act(state)
            if action == 1 and initial_money >= self.trend[t] and t < (len(self.trend) - self.half_window):
                inventory.append(self.trend[t])
                initial_money -= self.trend[t]
                buy_states.append(t)
                print('day %d: buy 1 unit at price %f, total balance %f' % (t, self.trend[t], initial_money))
            elif action == 2 and len(inventory):
                bought_price = inventory.pop(0)
                initial_money += self.trend[t]
                sell_states.append(t)
                try:
                    invest = ((self.trend[t] - bought_price) / bought_price) * 100
                except ZeroDivisionError:
                    invest = 0
                print('day %d, sell 1 unit at price %f, investment %f %%, total balance %f' % (t, self.trend[t], invest, initial_money))
            state = next_state
        invest_amount = ((initial_money - starting_money) / starting_money) * 100
        total_gain_amount = initial_money - starting_money
        return buy_states, sell_states, total_gain_amount, invest_amount

    def train(self, iterations, checkpoint, initial_price):
        for i in range(iterations):
            total_profit = 0
            inventory = []
            state = self.get_state(0)
            starting_price = initial_price
            for t in range(0, len(self.trend) - 1, self.skip):
                action = self.act(state)
                next_state = self.get_state(t + 1)
                if action == 1 and starting_price >= self.trend[t] and t < (len(self.trend) - self.half_window):
                    inventory.append(self.trend[t])
                    starting_price -= self.trend[t]
                elif action == 2 and len(inventory) > 0:
                    bought_price = inventory.pop(0)
                    total_profit += self.trend[t] - bought_price
                    starting_price += self.trend[t]
                # The reward stored in memory is the running return on the initial capital.
                invest = ((starting_price - initial_price) / initial_price)
                self.memory.append((state, action, invest,
                                    next_state, starting_price < initial_price))
                state = next_state
                batch_size_as_memory = min(self.batch_size, len(self.memory))
                cost = self.replay(batch_size_as_memory)
            if (i + 1) % checkpoint == 0:
                print('epoch: %d, total rewards: %.3f, cost: %f, total money: %f' % (i + 1, total_profit, cost,
                                                                                     starting_price))

4. Train the Agent

After defining the agent, proceed to initialize it. Configure parameters such as the number of iterations, initial capital, and other relevant settings to train the agent in making decisions regarding buying or selling.

initial_money = 100
window_size = 20
skip = 1
batch_size = 30
closing = trading_price_df.Close.values.tolist()
agent = LearningAgent(state_size=window_size,
                      window_size=window_size,
                      trend=closing,
                      skip=skip,
                      batch_size=batch_size)
agent.train(iterations=30, checkpoint=10, initial_price=initial_money)

Output —

epoch: 10, total rewards: 315.190, cost: 1.586462, total money: 180.060000
epoch: 20, total rewards: 281.250, cost: 0.380030, total money: 86.580000
epoch: 30, total rewards: 262.840, cost: 0.725655, total money: 101.310000

5. Test the Agent

The buy function will provide you with the buy, sell, profit, and investment values.

buy_states, sell_states, total_gains, invest = agent.buy(initial_money=initial_money)

6. Plot the Results

Create a plot of the closing prices with every buy and sell decision recommended by the neural network marked on the series; the title reports the total gains and the overall investment return.

fig = plt.figure(figsize=(25, 10))
plt.plot(closing, color='b', lw=3.5)
plt.plot(closing, '^', markersize=10, color='r', label='buy decision', markevery=buy_states)
plt.plot(closing, 'v', markersize=10, color='g', label='sell decision', markevery=sell_states)
plt.title('total gains %f, total investment %f%%' % (total_gains, invest))
plt.legend()
plt.show()

Output — plot of the closing prices with the agent's buy and sell decisions marked.

Conclusion —

Q-learning is a valuable technique for crafting automated trading strategies. It empowers you to explore various buying and selling options. Additionally, there exists a wide array of Reinforcement Learning trading agents that you can explore. Consider experimenting with diverse RL agents across various stocks for a richer trading experience.
