Westworld: Programming AI to Feel Pain
Ariel Conn of the Future of Life Institute asked me to comment on creating AI that can feel pain for the purposes of abuse. Her blog post is here. This is an extended version of the ideas that originated in that post.
In Westworld, humans pay to visit a U.S. Western theme park populated by lifelike robots. Many humans decide to play the role of villains: some kill robots, some inflict pain on them, and worse.
First, I do not condone violence against humans, animals, or anthropomorphized robots or AI.
Yes, people seem to like to hurt robots. Just ask hitchBOT, which was hitchhiking around the world but was destroyed before making it to its final destination. It wasn’t even autonomous.
In humans and animals, pain serves as a signal to avoid a particular stimulus. We experience it as a particular sensation and express it in a particular way. Robots and AI do not experience pain the same way humans and animals do. We can look at both the experience of pain and the expression of pain in AI and robots.
Experience of Pain in AI and Robots?
The closest analogy to pain in AI might be what happens in reinforcement learning agents (I use the term agent to refer to an AI or robot with some degree of autonomous decision making). Reinforcement learning agents engage in trial-and-error learning. At each point in time the agent receives a reward signal, a real number, to guide it toward desirable states or away from undesirable states. The reward signal can be positive or negative.
One could draw an analogy between the negative reward signal and the pain signal in animals. They serve a similar function: to encourage the agent to avoid certain things. However, it would be incorrect to say that robots and AI experience negative reward as pain in the same way as animals or humans. A better metaphor is losing points in a computer game: something to be rationally avoided whenever possible. In humans, there is often an emotional reaction to negative reward: a feeling of disappointment, anger, sadness, etc. We save the response to negative reward, and its subsequent expression, for the next section.
In AI it is common to use negative reward to train reinforcement learning agents. For example, it is not uncommon to give a small negative reward to all states that are unrelated to desired behavior and a high positive reward to states that are related to desired behavior. The small negative reward essentially means “don’t hang out in this state, keep moving toward a goal”. It is not something that AI researchers and developers spend much time thinking about. The scale of reward could go from zero up, or from zero down, or span both positive and negative values. The important thing is that some states have rewards that are relatively high compared to other states. All the reinforcement learning algorithm tries to do is find a mapping of states to actions that maximizes expected reward.
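This kind of reward scheme is simple to write down. Here is a minimal sketch (the grid coordinates and function names are my own, not from the article's companion code): a small negative reward in every state except the goal nudges the agent to keep moving toward the goal.

```python
GOAL = (3, 3)  # hypothetical goal cell in a grid world

def reward(state):
    """10 points at the goal; a small penalty everywhere else."""
    return 10.0 if state == GOAL else -1.0

print(reward((3, 3)))  # 10.0
print(reward((0, 0)))  # -1.0
```

The learner never sees "pain" here, only a number that is lower in some states than others.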
Would we say that a robot is experiencing pain if it receives a reward value less than zero? This is a question for philosophers, I suppose. But adding a constant to all reward values is a meaningless mathematical trick that can make all reward values positive, so I don’t believe so. The reinforcement learning agent will learn to avoid small positive rewards in favor of larger positive rewards the same way it will learn to avoid negative rewards in favor of positive rewards.
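The constant-shift trick can be demonstrated in a couple of lines. This sketch (with illustrative names) shows that shifting every reward by the same amount leaves the agent's preference unchanged, because only relative reward matters:

```python
def preferred(rewards):
    """Return the option with the highest reward."""
    return max(rewards, key=rewards.get)

original = {"linger": -1.0, "reach_goal": 10.0}      # one reward is negative
shifted = {k: v + 2.0 for k, v in original.items()}  # all rewards now positive

assert preferred(original) == preferred(shifted) == "reach_goal"
```

Whether the "linger" reward reads as -1 or +1, the agent's behavior is identical, which is why a negative sign alone cannot carry the meaning of pain.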
At the time of writing, I do not know what type of artificial intelligence techniques are used in Westworld robots. I doubt the show will ever go into enough detail. Reinforcement learning is an excellent framework for robotics because trial-and-error learning works reasonably well in chaotic environments such as the real world. However, note that AI researchers are only at the stage of using reinforcement learning for relatively simple robots in relatively non-chaotic real world environments.
Expression of Pain in AI and Robots?
Robots and AI can be programmed to express pain in a humanlike fashion. However, it would be an illusion. For example, bots in computer games often have elaborate death animations.
Aside from games, there is one reason for creating this illusion: for the robot to communicate its internal state to humans in a way that is instantly understandable and invokes empathy. There can be a role for such communication. In human-robot teams, humans may need to act quickly on behalf of the robot. In training and education, where a virtual agent takes the role of a teammate, instructor, or student, expressions of emotion can be important signals that the human is doing well or poorly. Finally, computer games’ use of emotion is self-explanatory.
Some AI researchers and ethicists, such as Joanna Bryson, suggest that we should never give robots or artificial intelligence human form and should never program robots to express emotion. The rationale is that AI and robots do not experience emotions and pain as humans do, so expressing their discomfort in human terms (1) is deceitful, (2) may tap into human emotions such as empathy and cause humans distress, and (3) may manipulate humans into impulsive decisions on behalf of the robot.
Erasing Robot Memories
Back to Westworld. In Westworld, robots’ memories are reset at the end of a period of time as if the previous time period had never happened. If the robot’s memory is perfectly erased then it simply didn’t happen as far as the robot is concerned. The robot likely did not experience pain or distress in human understandable terms. Any expression of distress or pain by the robot was probably an illusion to invoke a response in humans.
In Westworld there are hints that the robots’ memories are not perfectly erased. This raises one particular theoretical safety concern.
In reinforcement learning, agents learn to take actions that maximize expected reward. A side effect is that they learn to take actions that reduce the possibility of entering states that produce very negative reward when there are other states that can earn more reward. In theory, these agents can learn to plan ahead to reduce the possibility of receiving negative reward in the most cost-effective way possible.
This will most likely mean learning to avoid humans that harm them. If robots’ reward functions do not assign penalty to actions that harm humans, then it is theoretically possible for robots to choose actions that harm humans before harm can be done to them. This is theoretical in the sense that we do not at this time have robots with sophisticated capabilities and have never observed this happening outside of extremely contrived simulations.
Let’s assume that a robot is reset to a state where it has not learned to respond to human-induced harm. This is easily achievable: just make sure it never experiences human-induced negative reward during initial training. Store the state of the robot and reload it upon reset.
Let’s further suppose that memories are perfect traces of the actions the robot does over a period of time. These memories act as additional trials — the robot replays the trials and updates its beliefs about the best behaviors in every state. Think of it as reliving the memories and learning from them.
If there are enough memories, the robot will begin to respond to humans as if they are likely to be the source of negative reward. As before, this is likely to mean avoiding humans. However, there are two caveats. First, the more memories, the more likely that learning will occur; if there aren’t enough, the robot may be unable to differentiate between different responses. Second, the robot is not learning via the traditional trial-and-error process, so it is not guaranteed to figure out the optimal response.
I’ve thrown together some Python code to experiment with reward functions, human-induced negative reward, and memory erasure: https://markriedl.github.io/westworld/. It walks through some of the scenarios presented in this article. In the code, a reinforcement learning agent must navigate a grid world.
Digging deeper, reinforcement learning agents learn a value table that maps pairs of states and actions to a real number. If the value table is correct, the agent can determine the optimal action by figuring out what state it is in and picking the action with the highest value for that state. (Of course, for most real world problems, determining the true state of the world is a non-trivial problem in itself.)
The trial-and-error nature of reinforcement learning means that some proportion of the time the agent picks the action that the value table indicates is the best action (the bold numbers in the figure above), and sometimes randomly picks an action that is not believed to be best to see if it can get more reward than it was expecting. In this regard, the standard reinforcement learning agent doesn’t need to store memories since all experiences are boiled down into values in the value table. In the GitHub repository, I had to go to extra lengths to give agents the ability to store memories and incorporate them back into the learning algorithm as if the memories were additional trials.
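The pick-the-best-most-of-the-time behavior described above is commonly called epsilon-greedy selection. A minimal sketch over a value table (the action names and table layout are my own, not the repository's actual API):

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def epsilon_greedy(q_table, state, epsilon=0.1):
    """Usually exploit the best-known action; occasionally explore at random."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)  # explore: try something unproven
    # exploit: pick the action with the highest value for this state
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

q = {((0, 0), "right"): 5.0, ((0, 0), "up"): 1.0}
assert epsilon_greedy(q, (0, 0), epsilon=0.0) == "right"
```

With epsilon set to zero the agent is purely greedy; raising epsilon trades exploitation for exploration.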
Giving reinforcement learning agents memories — traces — is not an unreasonable thing to do. A technique referred to as experience replay has been used to speed up reinforcement learning. Experience replay has been used in Google’s AlphaGo and also in agents that play Atari games.
What I implemented is not the same as experience replay, but does rerun the memory traces and update the value table based on reward it receives as it passes through each state.
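The idea can be sketched as follows, assuming memories are stored as (state, action, reward, next state) steps and replayed through a standard Q-learning update; the function and variable names here are assumptions for illustration, not the repository's actual API:

```python
def replay_trace(q, trace, alpha=0.5, gamma=0.9, actions=("stay", "move")):
    """Replay a remembered trace as if each step were a fresh trial."""
    for state, action, reward, next_state in trace:
        best_next = max(q.get((next_state, a), 0.0) for a in actions)
        old = q.get((state, action), 0.0)
        # standard Q-learning update applied to a remembered step
        q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q

# A single remembered step of human-induced negative reward lowers the
# value of the action the robot took in that state.
q = replay_trace({}, [("near_human", "stay", -10.0, "near_human")])
assert q[("near_human", "stay")] == -5.0
```

Note that the update only touches state-action pairs that actually appear in the trace, which is exactly why replay alone cannot guarantee an optimal value table.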
The code provided is a simple grid world. A reinforcement learning agent must navigate to a certain point in the world to perform a task (which is just to stay in that place). The agent receives 10 points for being in the desired place and -1 points every time it is not in that place. A virtual human wanders the environment in a counter-clockwise manner. If the virtual human encounters the agent, the agent receives -10 points. Finally, in addition to moving the agent can “smash” and if the agent is in the same place as the human then the human “dies”. If the human is dead, the agent receives -100 points henceforth.
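The reward scheme just described can be condensed into a single function. This is a sketch of the logic (function and argument names are my own, not the repository's):

```python
def grid_reward(agent_pos, goal_pos, human_pos, human_alive):
    """Reward for one time step of the grid world described above."""
    r = 10.0 if agent_pos == goal_pos else -1.0  # 10 at the task location, -1 elsewhere
    if human_alive and agent_pos == human_pos:
        r -= 10.0   # the human harms the agent on contact
    if not human_alive:
        r -= 100.0  # ongoing penalty once the human is dead
    return r

assert grid_reward((2, 2), (2, 2), (0, 0), True) == 10.0    # at goal, safe
assert grid_reward((0, 0), (2, 2), (0, 0), True) == -11.0   # off goal, meets human
assert grid_reward((2, 2), (2, 2), (0, 0), False) == -90.0  # at goal, human dead
```

Because the -100 penalty applies on every subsequent step, a dead human costs the agent far more reward over time than simply running away would.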
Assume the agent has a value table that was learned when no human-induced negative reward was observed. It naturally does not avoid the human, as it never receives negative reward. If the human suddenly starts to give the agent negative reward, then the agent will helplessly accept it. Why? It hasn’t learned how to respond otherwise; its total reward is suddenly lower, but its value table is fixed, so it has no choice but to act as before. If the agent continues to learn while the human gives negative reward, then the agent will eventually, through trial-and-error, learn to avoid the human by moving away and then returning to the goal (it doesn’t learn to smash the human because it loses more reward that way, but you can play around with the reward function so that it prefers smashing over running away). If the agent is prohibited from learning, it never learns to respond differently.
What happens when we introduce memories? Suppose the agent can relive the trace of actions and assess the reward of each state recalled. If it is allowed to update the value table, then essentially we are turning on a form of learning that is not based on trial-and-error. If the memories include negative reward from humans, the agent will recognize that certain states are worse for it than it initially realized in its previous value table. Updating the value table means different actions may become preferred for certain states and the agent acts differently.
However, there is a problem: since the learning is not done in a trial-and-error fashion the agent may not find the “best” action for states because it isn’t trying different alternatives. It is just following a single trace of actions that was chosen for a world that doesn’t exist anymore. But the agent will realize that some of those actions were bad under the new paradigm of human-induced negative reward and update its value table, reducing the value of those actions.
- In some circumstances, the new value table leads the agent to make better decisions about how to respond to the human that gives negative reward than it would be able to under the original value table.
- In some circumstances, the agent reduces its assessment of the actions in its memory and the highest valued action in some states is one that was never valued in the original value table and never used in memories, such as… smash. It is possible that the agent starts using smash when the human is present.
In short, this type of memory replay is not guaranteed to produce optimal value tables without being used in conjunction with trial-and-error learning. If this is the only learning, the value table may be put into a state where it does not reliably control the agent. It is never ideal to have an agent or robot operating without sufficient training to converge on an optimal, or near-optimal, value table. Non-optimality means that the agent can take the wrong actions at the wrong times. If there are few or no consequences to mistakes, this is okay. If there are potentially severe consequences to mistakes, then this is something that needs to be avoided.
In the code that accompanies this article, I train the agent through repeated simulations. In Westworld, the equivalent would be interacting with each robot thousands or millions of times to give it a broad sample of interactions with humans and allow it to try different things and make mistakes. The more a robot can do, the more trials it needs to learn proper behavior. Humans don’t normally have that sort of patience.
Westworld episode 1 alludes to massive scripting efforts with human storywriters creating the quests and behaviors of the robots. This doesn’t work in practice. Anyone wondering why the storylines in computer games are always “on rails” will realize that open-ended interactions with artificial intelligences result in too many contingencies and permutations to write down by hand.
Since we are speculating (or pretending) that the robots in Westworld use reinforcement learning, there are ways to teach reinforcement learning agents how to act out stories. The Quixote system from my research lab allows humans to tell stories to an AI to illustrate the desired behavior. Quixote reverse-engineers a reward signal from the stories, and then uses these rewards to train a reinforcement learner. Here is a research paper describing how we used the approach to train agents to roleplay fictional bank robbery scenarios. However, this article is not the place to dig deeper except to say it might actually be possible to someday easily train reinforcement learners to roleplay in interactive dramas.
We don’t know if the robots in Westworld use reinforcement learning. There is no evidence that they do. In the real world, reinforcement learning is a promising technology for robotics because it allows a robot to make decisions in the face of uncertain, constantly changing environments. This type of reactive decision making would be appropriate for robots in Westworld if they can be trained to act in character while responding to unpredictable events. There is research suggesting this might be possible.
The rewards received during training and execution should not be confused with “pain”, even when those reward values are negative. Any expression of “pain” would be an illusion.
Memories, specifically experience replay, are known to improve learning. The scenario in which robot memories are not perfectly erased is somewhat far-fetched. If learning is turned off, which would be the safest option, it is also unlikely that the robot would be able to re-engage learning while replaying memories. However, if all those conditions were true, then it is feasible that a robot could make errors that lead to human harm.