Beyond interpretability: developing a language to shape our relationships with AI

Been Kim
10 min read · Apr 24, 2022

By Been Kim, Google Research, Brain Team

This post is based on the 2022 ICLR Keynote.

We don’t yet understand everything AI can do

AI can be found in many applications as varied as protein structure and function prediction from sequence (e.g., AlphaFold and ProtENN), language understanding and generation (e.g., GPT-3 and PaLM), and many more. AI is also making important decisions. For example, it’s being used to assist doctors with patient triage, and helping with diagnoses when doctors are not available, as in Google Health’s Automated Retinal Disease Assessment (ARDA), which uses artificial intelligence to help healthcare workers detect diabetic retinopathy. It’s evident that we are going to have to partner more closely with AI in many tasks. Many brilliant minds are joining forces to realize this, with a great deal of success.

However, AI sometimes does amazing things that we don’t quite understand yet. For example, AlphaGo beat world Go champion Lee Sedol in 2016 with the famous move 37. The nine-dan commentator called it a “very strange move” that “turned the course of the game”. Many Go players still talk about it today, describing it as “beyond what any of us could fathom.” How did AlphaGo decide to make this seemingly strange but destiny-defining move? AI will continue to become more complex, bigger, and smarter. Wouldn’t it be nice if we could ask it questions to learn how and why it makes its predictions? Unfortunately, we don’t have a good language to communicate with AI yet.

A language to communicate with AI

AIs will make increasingly complex and important decisions, but they may make these decisions based on different criteria that could potentially go against our values. Therefore, we need a language to talk to AI for better alignment. A good discussion on this can be found in Brian Christian’s book, The Alignment Problem, which frames the essence of the alignment problem as the disconnect between intention (the purpose we desire) and results (the purpose we actually put into the machine). I like this book, not only because it references some of my work, but also because Brian hits the nail on the head: the ultimate goal of this language is to align AI with our values. When getting to know a new co-worker in order to work with them better towards a goal, we use human languages to learn how they work, as well as their strengths and weaknesses. Our working relationship with AI should look like this, too.

This alignment problem stems from differences in representational space. Intuitively speaking, the space of things that humans know (i.e., our representational space, what makes sense to humans) is different from that of what machines know. There might be some overlap, but it’s likely that there are things that only we, or only machines, have a representation for. For example, every sentence in a language could be seen as a point in this space. I could say “the cat is big” and you would understand what I meant. However, Go move 37 likely lies in a region inhabited only by machines. Ideally, these two representational spaces would have a one-to-one correspondence (perfect overlap), but this is not going to happen; it doesn’t even happen between two humans. Misalignment of representational spaces makes expressing and communicating our values challenging.

The goal of this human-AI language would be to increase this overlap by expanding what we know through dialogue and collaboration with machines. Learning more about machines will enable us to build machines that can better align with our goals.

This language may not look like human language at all. It may not have nouns, verbs or sentences, but it may have some elements or media that we exchange with machines, such as images, pixels or concepts.

Since our goal changes the aspect of the language that needs to be precise, we would need to develop many languages, each specific to the goal to be achieved. For example, if we are working together to build a bridge, then getting metrics right is important. If we are working together to write a diplomatic document, then getting the precise meaning of a word in the international context is crucial. Naturally, how good the language is will be evaluated by how well it achieves the goal. Humanity has developed different languages for different goals many times. For example, we invented math to communicate precise and complex mechanisms, and we are still inventing new computer programming languages to communicate instructions for computers to execute.

Two key aspects of this language

While we don’t know yet what this language would look like, we do know that it should:

  1. Reflect the nature of the machines, just like human languages.
  2. Expand what we know, such as understanding AlphaGo’s move 37.

Reflecting the nature of machines

AlphaGo’s move 37 is one of many examples of machine decisions that go beyond our representational space, indicating that we are past the point where we can fully dissect machines into pieces and completely understand each piece. As well articulated in Machine behavior, machines have now become objects of scientific study: we have to study their behaviors both in isolation and with humans.

Some examples of my work on this, which I cover in my ICLR keynote, include the following:

  • Studying machines in isolation
    - Gestalt perception:
    Studying machines’ perceptual differences from humans: do machines exhibit Gestalt phenomena in perception?
    TL;DR: Yes, but only when they have learned how to generalize [Computational Brain & Behavior 2021].
    - Human Machine perception: What information in current explanation methods can only machines see? What about information that only humans can see?
    TL;DR: some information that is impossible for humans to see is easy for machines to see, and vice versa. This work has implications beyond scientific study: it informs how we should use these explanations [workshop@ICLR 2022].
  • Studying machines with humans
    - Noisy Explanations:
    What type of explanations remain beneficial even when their quality deteriorates?
    TL;DR: sub-goal based explanations (showing the forest and the trees) help people better perform complex planning tasks even if the explanations are not perfect at deployment time. If these explanations are used for training humans only (not at deployment time), they are as good as the perfect explanations [paper].
    - Debugging: What are the concrete tasks that the current set of explanations can help with?
    TL;DR: The current explanations don’t seem to help with most of the common debugging problems in ML (e.g., out-of-distribution inputs at test time, label errors at training time). They might help us see spurious correlations, but only if we already suspect such correlations and actively test for them [Neurips2020, ICLR2022].

Expanding what humans know

I think this is where AI’s next biggest breakthrough will come from: learning new representations and concepts that humans didn’t know before.

New insights resulting from the expansion of our representational space will not only enable greater performance gains, but also help us see the problem from a different angle, whether it is a science problem or a complex prediction problem. Naturally, learning something new will be complex, and we will need a dialogue that allows humans and machines to go back and forth.

Examples of my work on this topic include the following:

  • TCAV and friends: General tools for interpretation using high-level human concepts [ICML 2018, ICML 2020], and expanding what we know by discovering new concepts [Neurips 2019, Neurips 2020, ICLR 2022].
  • AlphaZero: Studying what a superhuman chess-playing machine has learned.
    TL;DR: AlphaZero does contain human chess concepts, but there are many differences in the way they emerge and evolve throughout the training process [paper, visualization].
  • Concept camera: Expanding our creativity by using machines to inspire us.
    TL;DR: use the differences in ways we and machines see the world to create art from a different perspective [Concept Camera app, Mood board search].
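As a concrete illustration of the TCAV idea above (testing how sensitive a model's predictions are to a human concept, represented as a direction in activation space), here is a minimal numpy sketch. The toy activations, the mean-difference CAV, and the linear scoring head are all hypothetical simplifications; the actual method trains a linear classifier to find the CAV and uses the network's real gradients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden-layer activations: concept examples (e.g., "striped" images)
# versus random counterexamples, in a 2-D activation space for simplicity.
concept_acts = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(50, 2))
random_acts = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))

# 1. A Concept Activation Vector: here just the normalized mean difference
#    (TCAV proper trains a linear classifier and takes its normal vector).
cav = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
cav /= np.linalg.norm(cav)

# 2. Directional derivative of the class score along the CAV. For a toy
#    linear head (score = w @ h), the gradient w.r.t. h is simply w, so
#    the derivative is the same for every example.
w = np.array([1.0, 0.3])
n_examples = 200
directional_derivs = np.full(n_examples, w @ cav)

# 3. TCAV score: the fraction of examples whose class score increases when
#    the activations move in the concept direction.
tcav_score = float((directional_derivs > 0).mean())
```

In this toy setup every directional derivative is positive, so the TCAV score is 1.0: the class is uniformly sensitive to the concept direction.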

Relationships between this language and existing efforts

There have been many efforts towards communicating our intentions to AI or understanding AI. In this section, I summarize how these efforts differ from those towards this language.

Effort 1: Objective functions and tests at deployment

Crafting objective functions that machines optimize or building tests at deployment time is our current way of communicating with machines. This is important, but not enough.

The full impact of our objective function or set of tests is unknown to us because we can only understand the part that we can represent. There could be a bigger part that we simply can’t see. In other words, we don’t know what to test. Let’s say that we’ve added a fairness metric to our model. We do this with good intentions, but this metric could end up discriminating against some other group without us realizing it. Even worse, we wouldn’t become aware of this discrimination until something went wrong. We can’t just collect and optimize all these metrics either, as some fairness metrics are mathematically proven to be incompatible. So you can’t have it all; you have to choose.
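The tension between fairness metrics can be made concrete with a toy calculation (all counts below are made up). Suppose a risk scorer is perfectly calibrated within each of two groups, meaning the fraction of true positives in each score bucket equals the score, but the groups have different base rates. Thresholding the same calibrated scores then produces different false-positive rates, so calibration and equal false-positive rates cannot both hold:

```python
# Group A: 100 people scored 0.8 (80 truly positive) and
#          100 people scored 0.2 (20 truly positive)  -> base rate 0.50
# Group B:  20 people scored 0.8 (16 truly positive) and
#          180 people scored 0.2 (36 truly positive)  -> base rate 0.26
# Both groups are calibrated: 80% of each 0.8 bucket and 20% of each
# 0.2 bucket are truly positive. Predict positive iff score > 0.5.

def false_positive_rate(false_pos, true_neg):
    return false_pos / (false_pos + true_neg)

fp_a = 100 - 80   # negatives in A's 0.8 bucket, all predicted positive
tn_a = 100 - 20   # negatives in A's 0.2 bucket, all predicted negative
fp_b = 20 - 16
tn_b = 180 - 36

fpr_a = false_positive_rate(fp_a, tn_a)   # 0.2
fpr_b = false_positive_rate(fp_b, tn_b)   # ~0.027
```

Equalizing the two false-positive rates would require changing the scores or thresholds per group, which in turn breaks calibration, unless the base rates happen to be equal or the classifier is perfect.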

Testing alone is sometimes good enough when we know what to test. We get on airplanes without understanding them because there has been enough empirical evidence that we are likely to survive a flight. But with AI, we don’t have as much evidence yet, and we don’t even know what to test. And even if we knew what to test, testing to perfection would be hard, if not impossible, similarly to 100% test coverage in software engineering.

Until then, we have to do something.

Effort 2: Interpretability

This “doing something” is called interpretability.

Interpretability is a subfield of ML that aims to engineer our relationships with AI for either observability (e.g., “why did the machine predict X?”) or control (e.g., “how should the input change to change the prediction from X to Y?”). Tools developed for interpretability are useful, but they require knowledge of the nature of these machines.
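For the control side, a counterfactual explanation answers exactly that kind of question. Here is a minimal sketch for a linear classifier (the weights and input are hypothetical): move the input along the weight vector just far enough to flip the sign of the score, and hence the prediction:

```python
import numpy as np

# Hypothetical linear classifier: predict positive iff w @ x + b > 0.
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([0.0, 1.0])

margin = w @ x + b          # -1.5: x is predicted negative

# The smallest change to x that flips the prediction lies along w;
# overshoot the boundary by 1% so the new score is strictly positive.
x_cf = x - 1.01 * (margin / (w @ w)) * w

new_margin = w @ x_cf + b   # slightly positive: prediction flipped
```

For nonlinear models the same idea is usually posed as an optimization: find the closest input (under some distance) whose prediction differs.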

Let’s look at an example that illustrates the impact that missing the full picture can have on interpretability.

In the figure, the middle and rightmost images are examples of saliency maps, a popular explanation method. For an image classification model, each pixel is assigned a number indicating how important it is to the prediction. Both images make sense: they both seem to highlight where the bird is in the image.

However, it turns out that one of these two images is an explanation from a completely untrained network. Can you guess which one? The fact that this question is hard to answer suggests that explanations meant to explain a prediction may have little to do with it: the trained and untrained networks make fundamentally different prediction decisions based on different logic, yet their explanations look alike. (The answer is the rightmost image.)
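The per-pixel importance computation behind saliency maps can be sketched in a few lines. Real methods use autodiff to take the gradient of the class score with respect to the input in a single backward pass; this illustrative version approximates it with finite differences, and the tiny linear "model" is purely hypothetical:

```python
import numpy as np

def saliency_map(score_fn, image, eps=1e-4):
    """Approximate |d score / d pixel| for each pixel via finite differences.

    score_fn maps a flat image array to a scalar class score. Real saliency
    methods compute this gradient with autodiff instead of looping.
    """
    image = image.astype(float).ravel()
    base = score_fn(image)
    saliency = np.zeros_like(image)
    for i in range(image.size):
        bumped = image.copy()
        bumped[i] += eps
        saliency[i] = abs(score_fn(bumped) - base) / eps
    return saliency

# Toy linear "classifier": score = w @ x, so the saliency should equal |w|.
w = np.array([0.5, -2.0, 0.0, 1.5])
sal = saliency_map(lambda x: float(w @ x), np.ones(4))
```

On a real network the same recipe yields one importance value per pixel; the untrained-network result above is a warning that such values can look plausible regardless of what the model has learned.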

Our community (myself included) was blindsided by this phenomenon, partly because of confirmation bias. The explanations made sense to us! It took many years until we stumbled upon this phenomenon.

Despite much follow-up work, we still don’t fully understand this phenomenon. Is it happening because the information is there, but we just can’t see it (perhaps because we are using the wrong medium: pixels)? Or are these methods simply wrong? And yet these methods have been shown to be useful for certain tasks…

What this work points out is the large gap in our fundamental understanding of what these explanations are showing and what they can and cannot be used for.

Note that this doesn’t mean that we have to stop all the engineering efforts and focus on fundamentals (science) first. Humanity has always pursued them jointly (e.g., biology). Sometimes one happens before the other, often with resulting synergies. We should take a similar approach to developing a language with AI.

Effort (?) 3: We just focus on achieving higher accuracy.

This might seem surprising, but I’ve heard some say “we just let AI do what it does.”

They say that we don’t need interpretability, and unconditionally criticize it without suggesting alternatives. I put this in the category of overconfidence and ignorance: some believe that AIs can be controlled because we made them (e.g., we know all the learned weights of the neural networks). It should be clear that this overconfidence is dangerous: it gives us false relief and, more importantly, an excuse not to look deeply into what these machines are “really” doing. In the real world, what these machines really do can be surprising. For example, spurious correlations are often something humans are unaware of a priori. This overconfidence and ignorance can also give people an excuse to say “I had no tools to investigate that catastrophic failure case”.


AI is more than just a tool; we will be influenced by it, which will in turn influence the next generation of AI. Without a language to meaningfully communicate with it, we don’t understand its decisions and, therefore, won’t know what we are creating. Building a language to communicate with AI isn’t going to be easy, but quite frankly, it’s the only way to gain control over the way we want to live. Languages shape the way we think. We have an opportunity to shape our own thinking and future machines.


Thanks to Maysam Moussalem and Mike Mozer for giving feedback and helping edit this post. Also thanks to Samy Bengio, Michael Littman, and Yejin Choi for feedback on the talk that this post is largely based on. All the beauty of my slides should be attributed to the amazing designers at Selman Design, especially Anne Di Lillo, who did the magic. Last but not least, thank you to all my amazing collaborators, without whom the pursuit of building a language between AI and humans would not have been possible.

To cite this article:

@misc{kim2022beyond,
  title={Beyond interpretability: developing a language to shape our relationships with AI},
  author={Kim, Been},
  year={2022}
}