The Python Quants Certificate Program: Training and Deploying a DQL Trading Bot

tsferro · 6 min read · Apr 2, 2022

This is the latest post in a series of updates bringing the reader behind the scenes as I progress through Yves Hilpisch’s excellent Python Quants Certificate Program. Check out my publisher page for prior posts and follow me for future updates.

As the 12-week formal portion of the 16-week TPQ Certificate Program draws to a close, it is quite remarkable to reflect on how much material has been covered throughout the program up to this point. For instance, the “Execution and Deployment” lesson from the AI in Finance module this week combines several fascinating and industry-relevant topics covered throughout the program thus far, such as Deep Q-Learning (DQL)/reinforcement learning (RL), algorithmic trading, trading system backtesting, exchange API functionality and object-oriented programming (OOP).

In what follows, I will walk the reader through the various components of this week’s lesson, beginning with retrieving intraday Bitcoin data from the Oanda API, which is then used to train a DQL-based reinforcement learning trading bot. Note that in order to follow along with this post, you will need to open an Oanda practice account, create a reference API config (.cfg) file and save the oandaenv.py and tradingbot.py class files to your working directory.

Once you’ve opened an account and generated your API keys, open your favorite editor, create a file with the extension .cfg containing the following information, and save it to your working directory:

[oanda]
account_id = <PLACE ACCOUNT ID HERE>
access_token = <PLACE API ACCESS TOKEN HERE>
account_type = practice

You will also need to install Yves Hilpisch’s helpful Oanda API Python wrapper by executing the following in the shell:

pip install --upgrade git+https://github.com/yhilpisch/tpqoa.git

Now that we’ve sorted out our means of connecting to the Oanda API, we can pull data for various instruments depending on our locale, including FX, CFDs, crypto, etc. We start by importing our libraries and fixing some console settings.
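
The notebook cell isn’t reproduced here; a minimal set along the following lines (my own choices, which may differ slightly from the course notebook) is enough to follow the rest of the post:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# console settings: wider output and fixed float formatting
pd.set_option('display.max_columns', 10)
pd.set_option('display.float_format', '{:.5f}'.format)
np.set_printoptions(precision=4, suppress=True)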

Next we import our tpqoa library, and only a few lines of code later we have access to data for all tradeable instruments:

import tpqoa

path = '<YOUR WORKING DIRECTORY HERE>'
api = tpqoa.tpqoa(path + 'ONDdemo.cfg')  # adjust file name
ins = api.get_instruments()
ins[:5]
Out[6]:
[('AUD/CAD', 'AUD_CAD'),
 ('AUD/CHF', 'AUD_CHF'),
 ('AUD/HKD', 'AUD_HKD'),
 ('AUD/JPY', 'AUD_JPY'),
 ('AUD/NZD', 'AUD_NZD')]

A peek at the first 5 instruments shows us several AUD cross pairs, and 68 total instruments are available for my account in the US.

To pull historical data, the tpqoa wrapper bypasses Oanda’s sourcing limitations by piecing together successive API calls, so the user can download extensive histories. In our case we will stick to one month of EUR/USD 1-minute bar data:

raw = api.get_history(instrument='EUR_USD',
                      start='2022-03-01',
                      end='2022-04-01',
                      granularity='M1',
                      price='M')

Which yields over 30,000 rows of OHLC data in just a few seconds:

raw.shape
Out[8]: (32914, 6)

We can also stream live market quotes for any instrument, which is really cool and forms the backbone of any online trading algorithm. In our case we set the API to close the connection after 10 ticks using the stop parameter:

api.stream_data('EUR_USD', stop=10)
2022-04-01T20:58:02.093584955Z 1.1048 1.10508

This is all well and good for general research purposes, but we are here to train a DQL trading agent, so we must first save two scripts, oandaenv.py and tradingbot.py, to our working directory. Q-learning-based algorithms are quite complex, and a detailed explanation of their various methods is beyond the scope of this article. That said, these algorithms alternate between “exploitation” and “exploration”: the agent either acts on what it has learned so far (exploitation) or takes randomized actions (exploration) in an effort to discover new, potentially higher rewards. Also central to Q-learning is the concept of discounted delayed reward: at each step the agent values the immediate reward of an action together with the discounted rewards it expects to collect in later steps of the environment.
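
To make these two ideas concrete, here is a tiny illustration (my own, not code from the course) of an epsilon-greedy action choice and the discounted Q-learning target:

import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    # explore with probability epsilon, otherwise exploit current estimates
    if random.random() < epsilon:
        return random.randint(0, len(q_values) - 1)
    return int(np.argmax(q_values))

def q_target(reward, next_q_values, gamma=0.95):
    # immediate reward plus the discounted best value achievable next step
    return reward + gamma * np.max(next_q_values)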

Save the oandaenv.py script from the course materials to your working directory. Note that I’ve disabled the offline data storage option so the user can simply work with online data pulls.
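
The full script ships with the course, so I won’t paste it here. As a rough sketch of the kind of gym-style interface such an environment exposes (class, method and attribute names below are my own guesses for illustration, not the course code), it wraps a tpqoa data pull behind reset() and step() methods:

import numpy as np
import pandas as pd
import tpqoa

class OandaEnv:
    """Sketch of a gym-style trading environment built on tpqoa data.
    Action convention (illustrative): 1 = predict an up move (long),
    0 = predict a down move (short)."""
    def __init__(self, symbol, start, end, granularity, price,
                 features, window, lags, conf_file='ONDdemo.cfg'):
        self.symbol = symbol
        self.features = features
        self.window = window
        self.lags = lags
        api = tpqoa.tpqoa(conf_file)
        raw = api.get_history(instrument=symbol, start=start, end=end,
                              granularity=granularity, price=price)
        self.data = pd.DataFrame(raw['c']).rename(columns={'c': symbol})
        self._prepare_features()

    def _prepare_features(self):
        # log returns and a simple moving average as example features
        self.data['r'] = np.log(self.data[self.symbol] /
                                self.data[self.symbol].shift(1))
        self.data['sma'] = self.data[self.symbol].rolling(self.window).mean()
        self.data.dropna(inplace=True)

    def reset(self):
        self.bar = self.lags
        return self._get_state()

    def _get_state(self):
        # the state is the last `lags` rows of the feature columns
        return self.data[self.features].iloc[self.bar - self.lags:self.bar].values

    def step(self, action):
        # reward 1 if the agent predicted the direction of the next return
        correct = (action == int(self.data['r'].iloc[self.bar] > 0))
        reward = 1 if correct else 0
        self.bar += 1
        done = self.bar >= len(self.data)
        state = None if done else self._get_state()
        return state, reward, done, {}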

Second, save the tradingbot.py script from the course materials to your working directory.
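
Again, the following is only a bare-bones sketch of the kind of DQL agent the course script implements: a small Keras Q-network, an epsilon-greedy policy with decay, and experience replay using the discounted Q-learning target. All names and hyperparameters here are my own illustrative choices:

import random
from collections import deque

import numpy as np
from tensorflow import keras

class TradingBot:
    """Sketch of a DQL trading agent (not the course implementation)."""
    def __init__(self, learn_env, hidden_units=24, lr=0.001, gamma=0.95,
                 epsilon=1.0, epsilon_min=0.1, epsilon_decay=0.99):
        self.learn_env = learn_env
        self.gamma = gamma                   # discount factor for delayed rewards
        self.epsilon = epsilon               # exploration probability
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.memory = deque(maxlen=2000)     # replay buffer
        self.performances = []               # per-episode total reward
        n_features = len(learn_env.features) * learn_env.lags
        self.model = keras.Sequential([
            keras.layers.Dense(hidden_units, activation='relu',
                               input_shape=(n_features,)),
            keras.layers.Dense(hidden_units, activation='relu'),
            keras.layers.Dense(2, activation='linear')])  # Q-values: short/long
        self.model.compile(loss='mse',
                           optimizer=keras.optimizers.Adam(learning_rate=lr))

    def act(self, state):
        if random.random() < self.epsilon:
            return random.randint(0, 1)                   # exploration
        return int(np.argmax(self.model.predict(state, verbose=0)[0]))  # exploitation

    def replay(self, batch_size=32):
        batch = random.sample(list(self.memory), min(batch_size, len(self.memory)))
        for state, action, reward, next_state, done in batch:
            target = reward
            if not done:
                # discounted delayed reward: immediate reward plus the
                # discounted best Q-value of the next state
                target += self.gamma * np.amax(
                    self.model.predict(next_state, verbose=0)[0])
            q = self.model.predict(state, verbose=0)
            q[0, action] = target
            self.model.fit(state, q, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay            # decay exploration over time

    def learn(self, episodes):
        for episode in range(1, episodes + 1):
            state = np.reshape(self.learn_env.reset(), (1, -1))
            done, treward = False, 0
            while not done:
                action = self.act(state)
                next_state, reward, done, _ = self.learn_env.step(action)
                next_state = None if done else np.reshape(next_state, (1, -1))
                self.memory.append((state, action, reward, next_state, done))
                treward += reward
                if not done:
                    state = next_state
            self.performances.append(treward)
            self.replay()
            print(f'episode: {episode}/{episodes} | treward: {treward} '
                  f'| eps: {self.epsilon:.2f}')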

As you can see, the algorithms are quite involved, though everything is now in place for us to train and test our DQL agent. It is worth noting that access to such scripts is one of the reasons the TPQ certificate program has such high value: while the delegate might not be capable of building such an algorithm from scratch, having access to the working code allows one to see the various components in action and to tweak and test different iterations. Additionally, Yves gives detailed explanations of each method in the video lectures, which is an invaluable resource considering the complexity of such algorithms.

To train our trading bot, we declare our features and instantiate three instances of the OandaEnv class using one day of BTC/USD 30-second bars, with separate instances for the learning, validation and testing phases.
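
The setup cell isn’t reproduced here; using the hypothetical OandaEnv signature sketched above, it would look roughly like this (symbol, feature names and time slices are illustrative):

symbol = 'BTC_USD'
features = [symbol, 'r', 'sma']   # price, log return, simple moving average
window = 20
lags = 5

# consecutive slices of one trading day of 30-second bars (granularity 'S30')
learn_env = OandaEnv(symbol, '2022-04-01T00:00:00', '2022-04-01T14:00:00',
                     'S30', 'M', features, window, lags)
valid_env = OandaEnv(symbol, '2022-04-01T14:00:00', '2022-04-01T19:00:00',
                     'S30', 'M', features, window, lags)
test_env = OandaEnv(symbol, '2022-04-01T19:00:00', '2022-04-01T23:59:00',
                    'S30', 'M', features, window, lags)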

We can visualize the price evolution over the downloaded trading day, along with the three learning, validation and testing segments:

# plot the data sections (learn, validate, test)
ax = learn_env.data[learn_env.symbol].plot(figsize=(10, 6))
plt.axvline(learn_env.data.index[-1], ls='--')
valid_env.data[learn_env.symbol].plot(ax=ax, style='-.')
plt.axvline(valid_env.data.index[-1], ls='--')
test_env.data[learn_env.symbol].plot(ax=ax, style='-.');

Which shows we have a decent sampling of up, down and sideways price action.

Now we can begin training our DQL agent.
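
The training cell isn’t reproduced here, but with the hypothetical TradingBot sketched above, kicking things off amounts to something like the following (the course version additionally evaluates against the validation environment every few episodes, which produces the VALIDATION lines shown below):

episodes = 31

agent = TradingBot(learn_env)   # hypothetical class from the sketch above
agent.learn(episodes)           # trains and prints per-episode statistics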

In our case we are employing 31 learning episodes, and after each one completes we output various statistics to the console, including the total reward (treward, a key quantity in RL models), perf (the model’s return) and eps (the share of time spent in exploration versus exploitation mode; this decays at a user-defined rate as the episodes progress, which ensures the agent progressively relies on what it has learned):

episode: 15/31 | VALIDATION | treward: 937 | perf: 1.558 | eps: 0.87

We also plot per-episode performance along with its trend to see whether there is any general improvement over time, which is what we ideally want to see; in our case, the validation performance peaks around episode 15.
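
The plotting cell isn’t shown here; using the performances list tracked by the hypothetical agent above, a quick version might be:

import numpy as np
import matplotlib.pyplot as plt

perf = np.array(agent.performances)
x = np.arange(1, len(perf) + 1)
trend = np.poly1d(np.polyfit(x, perf, deg=1))(x)   # linear trend line

plt.figure(figsize=(10, 6))
plt.plot(x, perf, label='episode performance')
plt.plot(x, trend, 'r--', label='trend')
plt.xlabel('episode')
plt.legend();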

Lastly, we can visualize how our agent performs on a cumulative-returns basis versus the benchmark return of the underlying Bitcoin.
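
The exact cell isn’t reproduced here; a rough sketch (reusing the hypothetical class and attribute names from above) of running the trained policy over the test environment and comparing cumulative returns might look like this:

import numpy as np

# run the trained policy (no exploration) over the test environment
agent.epsilon = 0.0
state = test_env.reset()
positions, done = [], False
while not done:
    action = agent.act(np.reshape(state, (1, -1)))
    positions.append(1 if action == 1 else -1)   # long or short
    state, reward, done, _ = test_env.step(action)

# compare cumulative log returns: passive Bitcoin vs. the agent's strategy
res = test_env.data.iloc[test_env.lags:test_env.lags + len(positions)].copy()
res['strategy'] = np.array(positions) * res['r']
res[['r', 'strategy']].cumsum().apply(np.exp).plot(figsize=(10, 6));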

As the resulting chart shows, after a somewhat soft start our agent outperformed Bitcoin over the test phase. That said, more work needs to be done, including testing on more data across different market regimes. It should also be noted that RL-based research is compute-intensive and takes considerably longer to run than other research methods, which to some extent limits the scope of this type of research.

I hope this post has offered some insights into the fascinating world of RL and DQL trading agents in particular, as well as the high level of quality the TPQ program offers. In 12 short weeks, my programming and research skills have grown significantly, and I look forward to bringing everything together in the final 4-week research phase, so stay tuned to see what I come up with!


tsferro

Sharing knowledge from over 2 decades of experience in research and trading using mostly Python and R