Photo by @andyjh07

Evaluating Artificial Intelligence

When is progress real and when is it an illusion?

Alex Moltzau
Aug 19, 2019 · 7 min read

Jack Clark is the Policy Director at OpenAI, and if you are lucky enough to follow his newsletter you may have come across bsuite. Then again, if you follow current developments at DeepMind you may already have heard about it directly from the source. Behaviour Suite for Reinforcement Learning (bsuite) is apparently the new paper worth noting, so I decided it was worth looking into. Synced has also written a helpful review.

What is Reinforcement Learning?

“Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.”
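To make "maximize some notion of cumulative reward" a little more concrete, here is a minimal sketch of an agent interacting with an environment. Everything in it is my own illustration rather than anything from the paper: the two-action environment `pull`, its reward values and the epsilon-greedy rule are assumptions chosen only to keep the loop short.

```python
import random


def pull(action):
    """Hypothetical environment: action 1 pays 1.0, action 0 pays 0.2."""
    return 1.0 if action == 1 else 0.2


def run_agent(num_steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent that tracks the average reward of each action."""
    rng = random.Random(seed)
    estimates = [0.0, 0.0]  # running average reward per action
    counts = [0, 0]         # how often each action was taken
    cumulative_reward = 0.0

    for t in range(num_steps):
        if t < 2:
            action = t                      # try each action once first
        elif rng.random() < epsilon:
            action = rng.randrange(2)       # explore: pick a random action
        else:
            action = max((0, 1), key=lambda a: estimates[a])  # exploit

        reward = pull(action)
        counts[action] += 1
        # Incremental update of the running average for the chosen action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        cumulative_reward += reward

    return cumulative_reward, estimates


total, estimates = run_agent()
print(f"cumulative reward: {total:.1f}, value estimates: {estimates}")
```

The agent is never told which action is better. It only sees rewards, and the quantity it is judged on is the sum of those rewards over the whole interaction.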

If you want to learn more about supervised, unsupervised and reinforcement learning, one of my previous articles covers advancements in semi-supervised learning and briefly runs through all three concepts.

What is the BSuite?

The bsuite library facilitates reproducible and accessible research on the core issues in reinforcement learning. The code is written in Python and is apparently easy to use within existing projects. The authors include examples with OpenAI Baselines and Dopamine, as well as new reference implementations. These are mentioned only in passing in the paper, so since you (like me) may have no prior knowledge of them, I have included a short summary of each.

“OpenAI Baselines is a set of high-quality implementations of reinforcement learning algorithms. These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.”

“Dopamine is a research framework from Google for fast prototyping of reinforcement learning algorithms. It aims to fill the need for a small, easily grokked codebase in which users can freely experiment with wild ideas (speculative research).”

It may be worth noting that in this paper DeepMind outlines artificial general intelligence (AGI) as: an agent that can perform at or above human level across a wide variety of tasks.

In addition to being a library, bsuite is a collection of experiments designed to highlight key aspects of agent scalability. DeepMind’s stated aim is that these experiments can help bridge theory and practice, with benefits to both sides.

Early in their paper DeepMind outlines three major challenges for RL:

  1. Generalization: be able to learn efficiently from data it collects.
  2. Exploration: prioritize the right experience to learn from.
  3. Long-term consequences: consider effects beyond a single timestep.

The paper argues that we need a better understanding of the systems we develop. Early on there is a call for theory: “As the psychologist Kurt Lewin said, ‘there is nothing as practical as good theory’. If we hope to use RL to tackle important problems, then we will need to continue to solidify these foundations […] theory often lags practice, particularly in difficult problems.”

In a sense it plays the role of a benchmark dataset for experimentation: “Just like the MNIST dataset offers a clean, sanitised, test of image recognition as a stepping stone to advanced computer vision; so too bsuite aims to instantiate targeted experiments for the development of key RL capabilities.”

bsuite Experiments

A central part of the DeepMind paper is a series of experiments that map agent performance on shared benchmarks, together with a framework that others can reuse. In the context of bsuite, an experiment consists of three parts:

  1. Environments: a fixed set of environments determined by some parameters.
  2. Interaction: a fixed regime of agent/environment interaction (e.g. 100 episodes).
  3. Analysis: a fixed procedure that maps agent behaviour to results and plots.

“BSuite analysis defines a ‘score’ that maps agent performance on the task to [0, 1]. This score allows for agent comparison ‘at a glance’.”
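The three parts and the [0, 1] score can be sketched in a few lines. This is my own toy illustration and not the bsuite API: the `Experiment` class, the `toy_agent` function and the "mean normalised return" score are all assumptions, and I assume each experiment in the real suite defines its own analysis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class Experiment:
    # 1. Environments: a fixed set of environment settings (here, a difficulty level).
    settings: Sequence[int]
    # 2. Interaction: a fixed number of episodes per setting.
    num_episodes: int
    # Hypothetical hook: (setting, episode index) -> episode return in [0, 1].
    episode_return: Callable[[int, int], float]

    def run(self) -> Dict[int, float]:
        """Run the fixed interaction regime and return the mean return per setting."""
        results = {}
        for setting in self.settings:
            returns = [self.episode_return(setting, ep) for ep in range(self.num_episodes)]
            results[setting] = sum(returns) / len(returns)
        return results

    def score(self) -> float:
        """3. Analysis: map agent behaviour to a single score in [0, 1]."""
        results = self.run()
        raw = sum(results.values()) / len(results)
        return min(1.0, max(0.0, raw))


def toy_agent(setting: int, episode: int) -> float:
    """Hypothetical agent that solves the task only up to difficulty 5."""
    return 1.0 if setting <= 5 else 0.0


experiment = Experiment(settings=[1, 5, 10, 20], num_episodes=100, episode_return=toy_agent)
print(experiment.score())
```

Because the settings and the number of episodes are fixed in advance, two different agents run through the same `Experiment` produce scores that can be compared directly, which is the point of the "at a glance" comparison.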

For an experiment to be included in bsuite it should embody five key qualities:

  • Targeted: performance in this task corresponds to a key issue in RL.
  • Simple: strips away confounding/confusing factors in research.
  • Challenging: pushes agents beyond the normal range.
  • Scalable: provides insight on scalability, not performance on one environment.
  • Fast: iteration from launch to results in under 30min on standard CPU.

A footnote in the paper adds: “At August 2019 pricing, a full bsuite evaluation for our DQN implementation cost us under $6.”

It is therefore important to note that each experiment is meant to be much smaller in scale than training on a massive dataset. With small experiments that run on a standard CPU, this kind of evaluation may be within reach of a variety of different actors (organisations, businesses, governments etc.). The authors demonstrate this through a series of examples.

Example experiment: memory length. This experiment is designed to test the number of sequential steps over which an agent can remember a single bit, and it gives quick insight into the scaling properties of a memory architecture.
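To get a feel for what such a test measures, here is a toy version I put together. It is not the bsuite environment: the class names, the +1/-1 reward and the "sliding window" agent are my own assumptions. The agent is shown a bit once and has to repeat it after `memory_length` steps, and the agent here can only recall its last few observations.

```python
import random
from collections import deque


class MemoryLengthEnv:
    """Toy task: show a bit at the start, ask for it back `memory_length` steps later."""

    def __init__(self, memory_length, rng):
        self.memory_length = memory_length  # must be at least 1
        self.rng = rng

    def reset(self):
        self.t = 0
        self.bit = self.rng.choice([0, 1])
        return self.bit  # the only informative observation

    def step(self, action):
        """Return (observation, reward, done). Only the final action is judged."""
        self.t += 1
        if self.t < self.memory_length:
            return None, 0.0, False  # blank observation, no reward yet
        return None, (1.0 if action == self.bit else -1.0), True


class WindowAgent:
    """Hypothetical agent that remembers only its last `window` observations."""

    def __init__(self, window, rng):
        self.window = deque(maxlen=window)
        self.rng = rng

    def observe(self, obs):
        self.window.append(obs)

    def act(self):
        remembered = [o for o in self.window if o is not None]
        # Answer with the bit if it is still in memory, otherwise guess.
        return remembered[0] if remembered else self.rng.choice([0, 1])


def average_return(window, memory_length, episodes=200, seed=0):
    rng = random.Random(seed)
    env = MemoryLengthEnv(memory_length, rng)
    total = 0.0
    for _ in range(episodes):
        agent = WindowAgent(window, rng)
        obs, done = env.reset(), False
        while not done:
            agent.observe(obs)
            obs, reward, done = env.step(agent.act())
        total += reward  # reward of the final, judged step
    return total / episodes


for length in (1, 5, 6, 20):
    print(length, average_return(window=5, memory_length=length))
```

With a window of 5, the agent answers perfectly up to a memory length of 5 and falls back to guessing beyond that. Sweeping the length like this is what makes the scaling behaviour of a memory mechanism visible.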

Example experiment: deep sea. “Reinforcement learning calls for a sophisticated form of exploration called deep exploration. Just as an agent seeking to ‘exploit’ must consider the long term consequences of its actions towards cumulative rewards, an agent seeking to ‘explore’ must consider how its actions can position it to learn more effectively in future timesteps.” The paper has an amusing illustration of this that I thought would be great to share:

Screenshot of page six in the paper by DeepMind referred to in this article, retrieved the 19th of August 2019
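Why random exploration is not enough can be shown with a small sketch in the spirit of deep sea. The details are my assumptions and not taken from the paper's code: the agent takes `size` steps, each "right" move costs a little, and the reward of 1 is only paid if it moves right on every single step.

```python
import random


def deep_sea_episode(policy, size):
    """Play one episode; `policy(row, column)` returns 0 (left) or 1 (right)."""
    column = 0
    total = 0.0
    for row in range(size):
        if policy(row, column) == 1:
            column += 1
            total -= 0.01 / size         # small cost for every move to the right
        else:
            column = max(column - 1, 0)  # moving left, bounded by the wall
    if column == size:                   # only reachable by going right every step
        total += 1.0
    return total


def random_success_rate(size, episodes=1000, seed=0):
    """Fraction of episodes in which a coin-flipping agent finds the reward."""
    rng = random.Random(seed)
    coin_flip = lambda row, column: rng.randrange(2)
    wins = sum(deep_sea_episode(coin_flip, size) > 0 for _ in range(episodes))
    return wins / episodes


for size in (1, 5, 10, 30):
    print(size, random_success_rate(size))
```

A coin-flipping agent finds the reward with probability 0.5 to the power of `size`, so the success rate collapses as the problem gets deeper, and the cost of moving right pushes a purely greedy agent away from the reward. An agent has to explore deliberately over many steps, which is how I read "deep exploration" in the quote above.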

How to use bsuite?

The paper then runs through how bsuite can be used. One way is to aggregate experiment performance into a snapshot of 7 core capabilities.

Screenshot of the model shared by DeepMind on GitHub, retrieved 19th of August 2019
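The aggregation idea can be sketched as averaging experiment scores per capability tag. The experiment names, tags and scores below are made up for illustration, and the plain mean is my assumption about how such a snapshot could be computed, not a description of the library's analysis code.

```python
from collections import defaultdict

# Hypothetical per-experiment scores in [0, 1] for one agent.
scores = {
    "bandit": 1.0,
    "bandit_noise": 0.5,
    "deep_sea": 0.0,
    "deep_sea_stochastic": 0.0,
    "memory_len": 0.25,
}

# Hypothetical mapping from experiment to the capabilities it probes.
tags = {
    "bandit": ["basic"],
    "bandit_noise": ["noise"],
    "deep_sea": ["exploration"],
    "deep_sea_stochastic": ["exploration", "noise"],
    "memory_len": ["memory"],
}


def capability_snapshot(scores, tags):
    """Average the scores of all experiments that share a capability tag."""
    grouped = defaultdict(list)
    for experiment, score in scores.items():
        for tag in tags[experiment]:
            grouped[tag].append(score)
    return {tag: sum(values) / len(values) for tag, values in grouped.items()}


snapshot = capability_snapshot(scores, tags)
for capability, value in sorted(snapshot.items()):
    print(f"{capability:12s} {value:.2f}")
```

One number per capability is what ends up on each axis of a chart like the one in the screenshot: this made-up agent handles the basic task but shows no deep exploration at all.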

They argue that one of the most valuable uses of bsuite is as a diagnostic ‘unit-test’ for large-scale algorithm development.

They also mention that it is useful for measuring and reporting performance in research papers:

“Another benefit of bsuite is to disseminate your results more easily and engage with the research community. For example, if you write a conference paper targeting some improvement to hierarchical reinforcement learning, you will likely provide some justification for your results in terms of theorems or experiments targeted to this setting […] If you run on bsuite, you can automatically generate a one-page Appendix, with a link to a notebook report hosted online. This can help provide a scientific evaluation of your algorithmic changes, and help you to share your results in an easily-digestible format, compatible with ICML, ICLR and NeurIPS formatting”

With bsuite, DeepMind hopes to “leverage large-scale computation for improved understanding.” By collecting clear, informative and scalable experiments, and by providing accessible tools for reproducible evaluation, they hope to facilitate progress in reinforcement learning research.

What does Jack say?

Jack Clark is the policy director for arguably one of the fastest growing AI companies, one with great influence in the AI community, so I thought it would be worth having a look at what he says in the newsletter. I believe he is speaking in a personal capacity here, although it is hard to dissociate the person from the company in question.

  • DeepMind’s testing framework is designed to let scientists know when progress is real and when it is an illusion… When is progress real and when is it an illusion?

On measuring sophisticated reinforcement learning agents specifically, he mentions the examples of how to plug BSuite into other codebases like ‘OpenAI Gym’, as well as the scripts to automate running large-scale experiments on Google Cloud (referring, it seems, to OpenAI Baselines and Dopamine). He also mentions the LaTeX needed for conference submissions. What resonates with me is the following statement from Jack:

“…BSuite is a symptom of a larger trend in AI research — we’re beginning to develop systems with such sophistication that we need to study them along multiple dimensions, while carefully curating the increasingly sophisticated environments we train them in.”

This way there can be increased transparency, or at least better communication, about which capabilities a given reinforcement learning agent has.

Let us be critical for a brief moment

When an actor as powerful as Google (they own DeepMind) sets the criteria and standard for evaluation, it seems like a repetition of everything that has been bad or horrible in a series of other industries such as oil or tobacco: large actors self-policing. Admittedly, this does seem like a move towards increased transparency and openness, and I would of course like to believe so. Transparency (a new word for freedom?) sounds awfully nice.

However, we can question whether this type of transparency overlooks the relationships of power that technology companies have. I believe OpenAI was a move towards a more responsible development of artificial general intelligence (often RL) and towards ensuring AI safety. So if evaluation is to happen, it may be important that it is done by an entity that operates slightly outside those boundaries. Yet even entities that sit outside the companies they rate (as with banks, oil or tobacco) can be influenced in the ratings they give. I asked this question three days ago.

Perhaps I am too influenced by Stuart Kirsch and by reading Mining Capitalism, in particular its discussion of corporate science, which examines how corporations strategically produce and deploy science. Building on critiques of tobacco industry sponsored science and the research practices of the pharmaceutical industry, it draws on long-term ethnography of the mining industry to argue that the problems associated with corporate science are intrinsic to contemporary capitalism rather than restricted to particular firms or industries.

I want DeepMind, Google and OpenAI to be good actors; however, we can ask who or what will evaluate the evaluators.

This is day 78 of #500daysofAI, writing every day about artificial intelligence.


Collecting all of the best open data science articles, tutorials, advice, and code to share with the greater open data science community!

Written by Alex Moltzau

Student at University of Oslo in Social Anthropology with a Minor in Computer Science. Associate at KPMG. All views are my own.

