Aligning Human and AI Objectives

Even if you don’t actively work in AI, chances are, in the past few months you’ve heard something about the “existential threat of AI” or the “alignment problem of AI”.

Two of the three so-called godfathers of AI are sounding the alarm, some people are signing petitions for a 6-month “pause” on the development of “large models like ChatGPT”, and others debate whether AI is sentient and has a soul, whether all of us should be scared, or whether there’s nothing to worry about.

An interview with Geoffrey Hinton in the New York Times after he left Google.

In all that, the alignment problem becomes implicitly connected either to sci-fi scenarios or to large, generative, complex models like the ones behind ChatGPT or Midjourney. And it’s treated like a research problem. But it’s neither solely a research issue, nor does it occur only when we have complex models.

Even relatively simple AI models have alignment problems. There’s no sentience associated with them. Yet, they can create real harm. Fast.

So, instead of going on a tangent discussing the probability of having a sentient system that has a soul (which potentially we might need to ban), let’s focus on the problem of alignment.

What is the alignment problem?

In simple terms, the alignment problem is the notion that the behaviour of an AI model should at all times be aligned with the values and objectives of humans.

To illustrate the difficulty of ensuring this, here’s the oversimplified example I was given years ago (similar to the paperclip problem).

Imagine we’ve built an AI model and given it a single objective: learn from data and interact with humans to find ways to increase the happiness of humanity. The more data we give the model (i.e., the more it interacts with humans), the more it learns the patterns of how to achieve its objective: increase happiness. Until one day, it infers that smiling is directly and highly correlated with happiness: happy people smile.

An efficient way to ensure everyone smiles is to kill all people who don’t smile. Strictly speaking, there’s no flaw in the logic the model employs: if only people who smile are left, it has increased the happiness of humanity. But as we all know, just because someone smiles doesn’t mean they’re happy. Happiness is much more complicated than that. Also, killing people is bad.

Moral of the highly over-simplified story: we need to make sure we state the objectives of the system very clearly, in a way that cannot be “interpreted” in harmful ways. Yet, this is extremely difficult. And that difficulty has little to do with the complexity of the model.

The alignment problem of simple models

It’s 2016 and Microsoft has decided they want to build a new Twitter chatbot called Tay, which is supposed to have a casual and jokey tone (so that it can sound like a millennial). For reference, the model they were using is infinitely simpler than the one ChatGPT uses.

Microsoft took precautions, of course. The bot was a version of another bot that had been used for a very successful pilot in China, the team worked with writers and stand-up comedians to ensure Tay truly sounded like a millennial, and they also followed Microsoft’s inclusive design guidelines. All that led to the bot’s first tweet: hellooooooo w🌍rld!!!

Microsoft’s bot, Tay, on Twitter

After just a few hours of interacting with humans, the well-meaning hello world turned into this:

An overview of some of the tweets generated by Tay

Microsoft promptly took the bot down and apologised. Still, at the end of the day, Tay did what it was supposed to do: interact with people on Twitter and sound like a millennial. No one at Microsoft was happy about the way it went about achieving its objective, though.

Let’s travel further back in time to 2012, when Knight Capital (a top-tier Wall Street firm) decided to use a model to do robo-trading.

After months of extensive work and testing, Knight Capital released the model into the real world. The model was live for around 45 minutes. In that time, it started overbidding on stocks and executed tons of trades. Once Knight Capital realised what had happened, they tried to reverse the trades and sell off the stocks, without much success. The firm lost approximately $440 million. It was also fined an additional $12 million by the Securities and Exchange Commission in the US.

Background image source: El Confidencial

After investigation, it turned out that the trading model interacted with a practically defunct legacy system no one had used since 2003. As a result, the model “assumed” it was in a test environment.

Goldman Sachs bought the unwanted stock positions off Knight Capital at a steep discount, and the firm itself was eventually acquired by a competitor to stave off bankruptcy.

Neither Microsoft nor Knight Capital used large, generative AI trained on massive amounts of data. The models they used were relatively simple and explainable. Both companies took good measures to ensure their models would perform as intended. Microsoft had a successful pilot, worked with writers and followed inclusive design guidelines. Knight Capital carried out extensive tests. Yet, the moment their models became part of a larger system, they started behaving in unintended, harmful ways.

This is not to say that putting effort into aligning the model upfront is useless. On the contrary. However, we need to acknowledge the reality that once a model gets deployed, it will, inevitably, become misaligned. And when it does, the impact on the business and society could be significant.

So, then, the question is not simply “How do we align human and AI objectives”, but, more importantly:

How do we mitigate misalignment of human and AI objectives?

Mitigating misalignment

One way to mitigate misalignment in time is by actively identifying and addressing the feedback loops that influence the performance of the model.

These feedback loops come from the way humans use the model (as was the case with Tay), from its interaction with other models and, as we saw with Knight Capital, from legacy systems.

To see how we can do that, let’s use a fictional example.

Imagine we’ve just started a fintech company that will “revolutionise” the credit card application process. In the spirit of simple examples, our startup uses a classification model to decide whether someone is eligible for a credit card or not.
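To make the setup concrete, here’s a minimal sketch of what such an eligibility classifier could look like. Everything in it (feature names, data, model choice) is hypothetical and only serves to ground the rest of the example.

```python
# A minimal, hypothetical sketch of the eligibility classifier.
# Feature names, data, and coefficients are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical applicant features: income, credit history length, existing debt
X = rng.normal(size=(5_000, 3))
# Hypothetical "eligible" label: a noisy linear score that happens to be positive
y = (X @ np.array([1.5, 1.0, -2.0]) + rng.normal(scale=0.5, size=5_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```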

The first thing we need to do is monitor what the model is doing and whether the data we’re retraining it with is the type of data we want to use.

An overview of model performance metrics, shown in real-time

In this specific case, we can track the model’s F1 score (the harmonic mean of its precision and recall), as well as the drift in its input features (e.g., via the p-values of statistical tests comparing incoming data to the training data). Tracking even something as simple as that can give us an accurate depiction of what is going on with our model, at all times.
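As a rough illustration of what that tracking could look like, here’s a sketch assuming we log the model’s predictions, the (possibly delayed) true labels, and the incoming feature values for each time window. The function name and the 0.01 threshold are arbitrary choices, not an established standard.

```python
# A minimal monitoring sketch: F1 per time window, plus a two-sample
# Kolmogorov-Smirnov test per feature to flag drift away from the training data.
from scipy.stats import ks_2samp
from sklearn.metrics import f1_score

def window_report(y_true, y_pred, X_train, X_window, alpha=0.01):
    """Return the window's F1 score plus a p-value and drift flag per feature."""
    f1 = f1_score(y_true, y_pred)
    p_values = [ks_2samp(X_train[:, i], X_window[:, i]).pvalue
                for i in range(X_train.shape[1])]
    drift_flags = [p < alpha for p in p_values]  # low p-value: distributions likely differ
    return f1, p_values, drift_flags
```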

Had Microsoft monitored their model’s performance, they could’ve avoided the Tay situation, because they would’ve seen that the data the model was being retrained on was different from the initial training data.

But when we’re talking about a complex system of humans, models, and legacy systems (albeit a fictional one), looking only at model performance is like looking at a Van Gogh painting from up close. Sure, you can see colours and brushstrokes, but you have no clue what the hell you’re looking at.

Van Gogh’s Starry Night Over the Rhône

Only when you take a few steps back are you able to really understand how Van Gogh used colours and brushstrokes to create his paintings.

If we’re to identify the feedback loops for deployed models, we need to figure out ways to take a few steps back.

One way to do that is to take a systemic view: track not only what the model is doing, but also how the rest of the system responds, including other models, the humans who are impacted, the business, and so on.

In our fictional case, we can track, for example, the revenue our model creates.

An overview of the revenue and model performance of a deployed classification model
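Here’s a sketch of what such a combined overview could look like in practice, assuming model metrics and revenue are both aggregated per day. The file names and column names are made up for the example.

```python
# A sketch of a combined model/business overview. The CSV files and the
# columns (date, f1, n_drifted_features, revenue) are hypothetical.
import pandas as pd

model_log = pd.read_csv("model_metrics.csv", parse_dates=["date"])        # date, f1, n_drifted_features
business_log = pd.read_csv("business_metrics.csv", parse_dates=["date"])  # date, revenue

overview = model_log.merge(business_log, on="date").sort_values("date")

# A first, crude systemic signal: do model and business performance move together?
print(overview[["f1", "revenue"]].corr())
```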

The moment we include a systemic overview (albeit a simple one), we can notice 2 important things.

First, when the model’s F1 score goes up, the revenue goes up, too.

An overview of business and model performance indicators

In this case, we can go to the model overview and check what might’ve happened. For instance, we can see that the distribution of incoming data has changed compared to our training data set, which improved the model’s performance, which in turn boosted the revenue. We can use this information to improve our model and target customers more effectively.

An overview of drifted features at the identified time interval
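One simple way to produce such a view, assuming we’ve kept a reference sample of the training data, is to rank features by the KS statistic (an effect size for the distribution shift) rather than only flagging p-values. The feature names in the usage example are hypothetical.

```python
# A sketch of ranking features by how much they drifted in a given time interval.
from scipy.stats import ks_2samp

def rank_drifted_features(X_train, X_interval, feature_names):
    """Rank features by the KS statistic between training and interval data."""
    scores = {name: ks_2samp(X_train[:, i], X_interval[:, i]).statistic
              for i, name in enumerate(feature_names)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# e.g. rank_drifted_features(X_train, X_june, ["income", "credit_history", "existing_debt"])
```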

The second thing we can notice is a bit less straightforward than the first one. There is a point in the lifecycle of our credit card solution when the revenue suddenly drops, even though the model is performing as well as, or slightly better than, before.

If we repeat what we did before and check the incoming data distributions, we can see that nothing has changed. No features are drifting. There seems to be no reason for the revenue to drop. Yet, it has taken a significant plunge.

The revenue drop is not associated with the current performance of the model

This is actually a relatively common problem, often referred to as delayed ground truth: the model drifted weeks or months ago, but at the time there were no visible implications for the system (in our case, the revenue). We can only see the impact of that drift now.

So, then, it becomes incredibly important to ensure that we’re able to identify the intervals in which the model has impacted the system. It’s also very important to know why this happened.
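As a rough intuition for how one might find such intervals (a naive sketch, not how the platform described next works), one could look for the lag at which past drift correlates most strongly with today’s revenue, reusing the hypothetical daily overview table from earlier.

```python
# A naive sketch: find the lag (in days) at which past drift is most strongly
# (negatively) correlated with current revenue. Assumes the hypothetical
# `overview` DataFrame from earlier, with daily `n_drifted_features` and `revenue`.
import pandas as pd

def most_likely_impact_lag(overview: pd.DataFrame, max_lag_days: int = 90) -> int:
    correlations = {
        lag: overview["n_drifted_features"].shift(lag).corr(overview["revenue"])
        for lag in range(1, max_lag_days + 1)
    }
    return min(correlations, key=correlations.get)  # most negative correlation

# The returned lag points to the interval in which drift most plausibly started to
# hurt revenue; a real system would need a far more careful (causal) analysis.
```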

The platform my team and I have been developing does exactly that: it identifies the most likely, likely, and less likely time intervals in which the model has impacted our revenue, over its entire lifecycle.

Identifying the most likely time interval in which the model has impacted the system

It then provides a snapshot of the system at the time of impact, plus some examples to put things into context.

A snapshot of the system at the time of impact.

All of this is, of course, only the starting point. At the end of the day, alignment is contextual: it depends on what one considers alignment, to whom, by whom, and in what way.

As of today, there’s only one thing that has evolved to deal well with complex contextual predicaments like this one — humans. So, the most important step after that is to ensure we have ways for humans to provide additional feedback and identify other feedback loops we might need to be aware of. We’re currently working on that.

If you’re interested in exploring the potential for expert feedback in identifying and addressing feedback loops, let’s discuss.

--

Niya Stoimenova
DEUS: human(ity)-centered Artificial Intelligence

Ensuring reliable AI systems in any context, with any user | Reliable AI lead at DEUS.ai | PhD in anticipating the unintended consequences of deployed AI models