Hi Hernán, I was specifically referring to DeepMind’s success with AlphaZero over AlphaGo: AlphaZero was trained entirely from scratch using RL, whereas AlphaGo was initially trained with supervised learning on human games before being fine-tuned with RL. AlphaZero is the algorithm that eventually outperformed both AlphaGo and human masters at the game of Go.
Hi André, thanks for reading! Since the optimization is only used to find hyper-parameters (i.e., parameters that shape the search over the observation/reward space), not to train the model in any way, I didn’t feel it was important to use a validation set during optimization. However, I realize this was naive thinking on my part, and as such, I have…
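For anyone curious what that fix could look like, here’s a minimal sketch: split the data chronologically, run the hyper-parameter search against one slice, and sanity-check the winning configuration on the held-out slice. The `backtest_score` function and the naive momentum rule inside it are hypothetical stand-ins for the article’s actual training/backtest loop, and the random search stands in for whatever optimizer you prefer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the price series the agent trades on.
prices = np.cumsum(rng.normal(size=1000)) + 100.0

# Chronological split: search on the first 80%, validate on the rest.
split = int(len(prices) * 0.8)
train, valid = prices[:split], prices[split:]

def backtest_score(params, series):
    """Hypothetical stand-in for training an agent with `params` on
    `series` and returning its profit (here, a naive momentum rule)."""
    k = params["lookback"]
    signal = np.sign(series[k:-1] - series[:-k - 1])  # momentum direction
    returns = np.diff(series[k:])                     # next-step price change
    return float(signal @ returns)

# Random search: pick the candidate that scores best on the slice it
# was tuned on, then check it on the held-out validation slice. A big
# gap between the two scores means the search overfit.
candidates = [{"lookback": int(k)} for k in rng.integers(2, 60, size=50)]
best = max(candidates, key=lambda p: backtest_score(p, train))
print("train score:     ", backtest_score(best, train))
print("validation score:", backtest_score(best, valid))
```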
Thanks for the response, I will have to experiment with squashing functions such as tanh within the reward function! As for simultaneous asset trading, I already plan to add that to the next article, so stay tuned for that!
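For reference, a tanh-squashed reward could look something like the sketch below. The `reward` function, its `scale` parameter, and the net-worth-based profit are illustrative assumptions on my part, not the article’s actual reward.

```python
import numpy as np

def reward(net_worth, prev_net_worth, scale=100.0):
    """Hypothetical reward: squash the step's profit through tanh so a
    single outsized gain or loss can't dominate learning. `scale`
    controls how quickly the reward saturates toward +/-1."""
    profit = net_worth - prev_net_worth
    return float(np.tanh(profit / scale))

# A $50 gain maps to ~0.46 while a $5,000 gain saturates near 1.0,
# keeping the reward signal in a bounded, well-conditioned range.
print(reward(10_050, 10_000), reward(15_000, 10_000))
```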
IIRC, it was a pretty simple process. I believe I just kept a list of the observations for an arbitrary number of time steps, then applied a ColorMap using OpenCV. If there’s enough interest, I could possibly write a more in-depth article about this in the future.
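Until then, here’s a rough sketch of the idea as I remember it, using OpenCV’s `applyColorMap`; the buffer shape and the `COLORMAP_JET` choice are illustrative, not necessarily what I used in the article.

```python
import cv2
import numpy as np

# Hypothetical rolling buffer of observations: one row per time step,
# one column per feature (here, the last 120 steps of 10 features).
history = np.random.rand(120, 10).astype(np.float32)

# Normalize to 0-255 so applyColorMap can consume it as an 8-bit image.
norm = cv2.normalize(history, None, 0, 255, cv2.NORM_MINMAX)
frame = cv2.applyColorMap(norm.astype(np.uint8), cv2.COLORMAP_JET)

# Scale up for visibility and display, as a gym render() method might.
frame = cv2.resize(frame, (400, 400), interpolation=cv2.INTER_NEAREST)
cv2.imshow("observation history", frame)
cv2.waitKey(0)
```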