From Ballerina to AI Writer (Part Two): Embracing Deep Reinforcement Learning

Sophia Aryan
Published in BuzzRobot
6 min read · Feb 10, 2018

Have you ever heard of the Strugatsky brothers, famous Russian sci-fi novelists, and their stories like Roadside Picnic or Hard to Be a God? If you are a sci-fi fan, you should read their novels. I grew up on their stories along with movies like Star Wars, Alien (love most of Ridley Scott’s movies), and, of course, Stanley Kubrick’s “2001: A Space Odyssey”. These things cultivated my deep love of everything related to AI, robots, space, and astrophysics. I could never imagine that someday AI would become my profession…

I’m coming to this field from an unusual background — I was a professional ballet dancer turned AI writer. I’m diving into the research side of the technology because I see it as the key to understanding what AI is about, its potential and its limitations. To that end, I have conducted a data science project and built a neural net classifier, and now I’m developing my skills in deep reinforcement learning, an approach especially favored by OpenAI and DeepMind, who believe it could contribute to the creation of a ‘super intelligence’ (pending some breakthroughs in AI hardware that would increase the available computational power).

What is deep RL and why humanity needs it

For those who haven’t hung out in the machine learning field for the last 5 years, “Deep Reinforcement Learning” is an AI technique based on rewards for the agent that learns how to behave in a given environment. Although the RL technique has been known for a while, it has been revitalized thanks to some breakthroughs in neural network applications that have been able to solve previously unsolvable problems in RL (neural nets serve as approximators of a policy and a value function — you will learn below what those things are).

RL embraces sequential tasks, and as the technique advances further, it could lead to everything from autonomous robotic surgeons to chatbots that can hold meaningful conversations to self-driving cars, among many other applications.

Credit to Jason Martineau (www.martineauarts.com)

Diving into the technical part

Imagine an agent in a given environment (e.g. a computer game). The environment is in a certain state, and the agent can perform specific actions there. Those actions may result in a reward (e.g. a higher score), and they transform the environment, leading to a new state where the agent can perform another action. The rule for how the agent chooses its actions in a given state is called a policy.
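
To make this loop concrete, here is a minimal toy sketch in Python (not code from the article): a hypothetical five-state environment with a step function, a random policy, and one episode of interaction.

```python
import random

# Toy illustration of the agent-environment loop (not the article's code).
# The environment has 5 states; an action of +1 or -1 moves the agent;
# reaching state 4 ends the episode with a reward.

def step(state, action):
    """Environment transition: returns (next_state, reward, done)."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

def policy(state):
    """A random policy: the rule for choosing an action in a given state."""
    return random.choice([-1, +1])

state, total_reward, done = 0, 0.0, False
while not done:                                  # one episode
    action = policy(state)                       # agent acts according to its policy
    state, reward, done = step(state, action)    # environment returns a new state and a reward
    total_reward += reward

print("episode return:", total_reward)
```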

The environment in general is stochastic, which means the next state may be somewhat random (which looks like real life with an unpredictable future).

The set of states and actions, along with the rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

s_0, a_0, r_1, s_1, a_1, r_2, …, s_{n−1}, a_{n−1}, r_n, s_n

(s_i is the state, a_i is the action and r_{i+1} is the reward after performing the action. The episode ends with the terminal state s_n, for example, “game over” on the screen.)

A Markov decision process relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and action a_i, not on preceding states or actions.

Because our environment is stochastic, we can never be sure we will get the same rewards the next time we perform the same actions. The further into the future we go, the more the outcomes may diverge. For this reason, it is common to use the discounted future reward instead of the plain sum of expected future rewards. Here is the equation:

R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{n−t} r_n

where the discount factor γ is a number between 0 and 1: the further a reward lies in the future, the less it contributes to the total.
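
To make the discounting concrete, here is a small, generic Python helper (not from the project) that computes the discounted return of a list of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where each later reward is weighted by an extra factor of gamma."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# With gamma < 1, rewards far in the future contribute less to the total:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
```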

Note that to develop RL algorithms, we need convenient tools with which we can model algorithms, study them, and create new ones. In robotics, for example, the number of interactions a robot can have with the external world is very limited, so it’s more convenient to use a simulator. That’s why computer games are a convenient environment for testing algorithms: one can experiment freely and easily write a reward function in a simulated world. This is a great approach, but in some cases it’s not obvious how to write a reward function that makes RL learn efficiently (with less interaction with the real environment). These are among the main open problems that don’t have a solution yet.

What is Q-learning?

The whole idea of Q-learning revolves around the Q function, which takes a state-action pair and returns the maximum discounted future reward the agent can expect to receive after executing that action in that state. Always taking the action with the maximum discounted future reward corresponds to the most efficient strategy.

The optimal Q function satisfies the Bellman equation, Q(s, a) = r + γ max_{a′} Q(s′, a′), which also gives a hint about how we can iteratively improve our Q function approximation.

The approach is very logical: the maximum future reward for this state and action is the immediate reward plus the maximum future reward achievable from the next state.

So the Q-learning algorithm (obtaining a good Q function approximation) looks like the following.
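
The original post showed the algorithm as an image; as a stand-in, here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration. The table, the action set, and the hyperparameters are illustrative assumptions, and the agent described later in this post replaces the table with a neural network.

```python
import random
from collections import defaultdict

# Illustrative tabular Q-learning (the actual agent below uses a ConvNet, not a table).
Q = defaultdict(float)            # Q[(state, action)] -> estimated discounted future reward
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = [0, 1]                  # e.g. "do nothing" / "flap"

def choose_action(state):
    """Epsilon-greedy: mostly exploit the current Q estimates, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, done):
    """One Bellman-style update: target = r + gamma * max_a' Q(s', a')."""
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```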

The actual experience: The Flappy Bird game as a Deep RL environment

I used Pygame as an emulation of the game Flappy Bird. Pygame is a set of Python libraries for writing computer games. A wrapper module, wrapped_flappy_bird, exposes the game interface and a reward function, which allows Flappy Bird to be used as a reinforcement learning environment.
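
If you want to poke at the environment yourself, the wrapper is typically driven through a small game-state object. The sketch below assumes an interface in the spirit of wrapped_flappy_bird, where frame_step takes a one-hot action vector and returns the next screen image, the reward, and a terminal flag; the exact names and action ordering may differ in your copy of the code.

```python
import numpy as np
import wrapped_flappy_bird as game   # the Pygame wrapper mentioned above

# Assumed interface: GameState.frame_step() takes a one-hot action vector and
# returns (screen image, reward, terminal flag). Check your copy of the wrapper.
env = game.GameState()

do_nothing = np.array([1, 0])        # assumed ordering: [no-op, flap]
image, reward, terminal = env.frame_step(do_nothing)

# From here the image is typically converted to grayscale, resized, and stacked
# with previous frames before being fed to the Q network.
```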

How a neural net is applied in Q-learning

As there are far too many states, we can’t store a Q value for every state-action pair in a table. A Q function approximator, a ConvNet, is here to help us solve the problem. It accepts a state as input (an image from the game) and outputs a Q value for every possible action.

Below is the ConvNet architecture, implemented in TensorFlow, that serves as an approximator for the Q function values.

Convolutional neural network architecture
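
Since the architecture figure isn’t reproduced here, below is a sketch of the kind of DQN-style ConvNet commonly used for this game, assuming an input of four stacked 80×80 grayscale frames and two possible actions; the exact layer sizes in the original figure may differ. It is written with tf.keras for brevity rather than the low-level TensorFlow graph API.

```python
import tensorflow as tf

def build_q_network(num_actions=2):
    """DQN-style ConvNet: stacked screen frames in, one Q value per action out."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(80, 80, 4)),                             # 4 stacked grayscale frames
        tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(num_actions),                            # Q(s, a) for each action
    ])

q_network = build_q_network()
q_network.summary()
```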

Experience replay

Because Q-learning has a tendency to fall into a local minimum and very often gets stuck in suboptimal strategies, previous experiences (trajectories) were accumulated in a buffer, and learning was done on random samples drawn from that buffer. By sampling at random we try to avoid local minima, on the assumption that a better strategy for the agent may exist.
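
A replay buffer itself is only a few lines of Python. Here is a generic sketch (not the project’s exact code) that stores (state, action, reward, next_state, done) transitions and hands back random minibatches for training:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and returns random minibatches for training."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive frames.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```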

Voila, the result of the experiment.


Former ballerina turned AI writer & communicator. OpenAI alumni. Fan of astrophysics and deep conversations. Founder of BuzzRobot.