Why it’s Hard to Make Seed AI Safe
The AI safety community is very interested in two ideas which aren’t getting much attention from the mainstream AI community:
- The possibility that an AI that’s not very smart could rewrite its own source code to make itself much smarter.
- The importance of loading the right set of values into a smarter-than-human AI.
These are both important ideas, but they’re also logically separate. And it’s interesting that they’re so commonly encountered together, because self-improving AI actually strikes me as an unusually bad approach to producing a smarter-than-human AI that’s loaded with the right values.
Consider the following intuitively plausible premises:
- If a seed AI doesn’t have friendly goals after self-improvement, it will be dangerous.
- If it doesn’t have friendly goals prior to self-improvement, it won’t have friendly goals after self-improvement.
- If it can’t understand human values, it can’t have friendly goals.
- If it can’t learn something complicated, it can’t understand human values.
- If it isn’t intelligent, it can’t learn something complicated.
Given these premises, we can show the seed AI gameplan (an unintelligent seed AI that self-improves) is necessarily a dangerous one:
- The seed AI isn’t intelligent, so it can’t learn something complicated.
- Since it can’t learn something complicated, it can’t understand human values.
- Since it can’t understand human values, it can’t have friendly goals.
- Since it doesn’t have friendly goals prior to self-improvement, it won’t have friendly goals after self-improvement.
- Since it won’t have friendly goals after self-improvement, it will be dangerous.
(I don’t put a ton of stock in propositional reasoning like this, but I do think it can be useful for helping people find angles of attack on your argument.)
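For readers who want to poke at the argument mechanically, here is a minimal sketch that brute-forces the truth table. The variable names are my own labels for the claims in the premises, not anything from a formal treatment, and the check only establishes that the conclusion follows from the premises. It says nothing about whether the premises are true.

```python
from itertools import product

def implies(a, b):
    """Material implication: 'if a then b'."""
    return (not a) or b

def premises_hold(smart, learns, understands, friendly_before, friendly_after, dangerous):
    """The five premises from the text, in the order they are listed."""
    return all([
        implies(not friendly_after, dangerous),
        implies(not friendly_before, not friendly_after),
        implies(not understands, not friendly_before),
        implies(not learns, not understands),
        implies(not smart, not learns),
    ])

def argument_is_valid():
    """True if no truth assignment satisfies the premises, has an
    unintelligent seed AI, and yet has the AI turn out not dangerous."""
    for assignment in product([False, True], repeat=6):
        smart, dangerous = assignment[0], assignment[5]
        if premises_hold(*assignment) and not smart and not dangerous:
            return False  # found a counterexample
    return True

print(argument_is_valid())
```

Since the argument is valid, the only way to reject the conclusion is to reject one of the premises, which is what the rest of this essay tries to do.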
The Orthogonality Thesis Considered False
Note that the first three bullet points of this argument contradict the orthogonality thesis as described by Nick Bostrom in Superintelligence:
Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.
From our premises, we showed this to be false: An unintelligent seed AI can’t have the final goal of achieving human values.
Bostrom does mention as an aside in his book that “it might be impossible for a very unintelligent system to have very complex motivations”. But in my view, this is actually a key part of the argument for pessimism. Suppose this conjecture is wrong, and we find a way to create a very unintelligent system with arbitrary complex motivations. Well, then we could motivate a very unintelligent seed AI to achieve our values, let it self-improve, and have a positive intelligence explosion.
The Intelligence Sweet Spot
If the premises above are true, friendly AI represents a sort of chicken-and-egg problem. Giving an unintelligent system human values is like trying to fit a round peg in a square hole. So for an AI to understand human values, it needs to be smart. But if an AI is smart, it may be dangerous.
The best approach may be to find an intelligence “sweet spot” that’s sufficient to understand human values without being dangerous. Intuitively, it seems plausible that such a sweet spot exists: Individual humans can learn unfamiliar values, such as those of animals they study, but individual humans aren’t intelligent enough to be dangerous.
Doing this could involve, for example, finding a rigorous way to measure “intelligence” and slowly turning up the “intelligence dial” until you hit the minimal level of intelligence capable of understanding human values.
Is General Intelligence A Natural Category?
Another idea for defeating the chicken-and-egg problem lies in unpacking what we mean by “intelligence”. If we can create an AI system that’s “intelligent” in the sense of being able to learn complicated things, but not “intelligent” in the sense of being dangerous, that could be a path to FAI.
Note that this isn’t quite the same as the idea from the previous section. Up to this point, our discussion has assumed that intelligence is an essentially unidimensional quantity. Are there good reasons to believe that?
I suspect people have unidimensional intuitions about intelligence from valid observations about biology. For example, there seems to be a factor of general mental ability in humans.
But what does this tell us, exactly? I actually think it tells us rather little. Correlation is not causation. One theory about the evolution of our intelligence is that cooking gave us access to cheap calories to fuel bigger brains. This removed downward selection pressure on all our cognitive abilities at the same moment, and all our cognitive abilities increased. What does this say about the fundamental nature of intelligence? Almost nothing.
When we move from the realm of biology to software, a unidimensional notion of intelligence looks even more dubious. Which is more intelligent: Google or Wolfram Alpha?
Instead of a simple dichotomy between narrow and general intelligence, I propose that it’s possible, in principle, for thinking software to have essentially arbitrary profiles of intellectual strengths and weaknesses.
I also suspect the related notion of AGI-completeness exists in the map, but not the territory.
“Intelligence” is a Misleading Word
Commentators have frequently pointed out that once algorithms are discovered for accomplishing some task, we typically don’t think of the algorithms as “intelligent” — even if the task was assumed to require “intelligence” prior to the discovery. To avoid repeating this mistake in the future, I propose we taboo “intelligence” and instead think about “undiscovered algorithms”.
If we are only allowed to think in terms of “undiscovered algorithms”, some objections to particular FAI proposals seem less evident. For example, Eliezer Yudkowsky once wrote:
Nick Bostrom… once asked whether it would make sense to build an Oracle AI, one that only answered questions, and ask it our questions about Friendly AI. I explained some of the theoretical reasons why this would be just as difficult as building a Friendly AI: The Oracle AI still needs an internal goal system to allocate computing resources efficiently, and it has to have a goal of answering questions and updating your mind, so it’s not harmless unless it knows what side effects shouldn’t happen. It also needs to implement or interpret a full meta-ethics before it can answer our questions about Friendly AI. So the Oracle AI is not necessarily any simpler, theoretically, than a Friendly AI.
Suppose we discover algorithms that allow us to construct an Oracle AI. Why would using these algorithms somewhere in the Oracle AI’s source code require making the Oracle AI’s resource allocation algorithms so sophisticated that they can take over the world? The resource allocation algorithms in modern operating systems are not nearly this sophisticated, but they work fine.
When we use the word “intelligence” in the context of AI, there are two distinct ideas at play. There’s “general intelligence”, which we can taboo and replace with “algorithms that have been combined to create an agent capable of achieving goals in a variety of different environments”. And there are undiscovered algorithms, as described above. The use of a single term for two very different concepts suggests that through the process of discovering algorithms, a general intelligence will inevitably be created. But even if you knew algorithms that could be combined to create an agent capable of achieving goals in a variety of different environments, this doesn’t imply that’s the only useful way to combine them.
Rationality Considered Anthropomorphic
AI safety folks have already gotten lots of mileage from noticing ways in which people anthropomorphize “intelligence”. It wouldn’t surprise me if there’s more good stuff in this direction.
Consider the following argument: If an AI is “smart” enough, it will realize it should approximate its “preferences” using a utility function to avoid getting money-pumped, and it will proceed to do so.
This argument seems anthropomorphic to me. Yes, humans high on the factor of general mental ability have been observed to reflect on their values & avoid money-pumping.
But computers do exactly what they’re programmed to do. An AI will only engage in this process if we tell it to. If the AI is a behavior-executor to start with, it’s not going to spontaneously make itself a utility-maximizer unless that transformation is a consequence of behavior execution. Google Search doesn’t manipulate the search results of its human operators & persuade them to give it more servers, even though this would help achieve its “goal” of delivering good search results.
Humans are goal-directed creatures. And because we have goals involving the real world, it’s sometimes useful for us to create AI systems with such goals. But in most cases, we actually find it easier to create behavior execution software than software that maximizes a utility function over the real world.
In relation to Tool AI, Nick Bostrom writes:
“This idea of creating software that “simply does what it is programmed to do” is, however, not so straightforward if the product being created is a powerful general intelligence. There is, of course, a trivial sense in which all software simply does what it is programmed to do: the behavior is mathematically specified by the code. But this is equally true for all castes of machine intelligence, “tool-AI” or not. If, instead, “simply doing what it is programmed to do” means that the software behaves as the programmers intended, then this is a standard that ordinary software very often fails to meet.”
Software frequently fails when its capabilities don’t work as intended. But demonstration of unexpected new capabilities is much rarer. I’m not aware of any case where behavior execution software spontaneously turned into utility maximization software.
Apparently, consequentialist people are considered less trustworthy. This doesn’t surprise me. Utility maximization is inherently less predictable than behavior execution. You need to know all about a utility maximizer’s beliefs and values in order to predict their behaviors. The behaviors of a behavior-executor are more static.
Friendly AI as a Robotics Problem
Defeating top human Go players, the way AlphaGo did, was an achievement that took decades. Contrast with another achievement, trivial by comparison: Looking at a Go board at the end of the game to determine who won.
Go could be considered an analogue for human warfare. But if AI safety thinkers are right, the problem of specifying what it means to win a war is far harder than the problem of specifying what it means to win at Go. Why is that?
My answer: It’s a consequence of the size & diversity of the relevant state spaces. Go has a 19x19 = 361 dimensional state space, and each dimension takes on one of 3 discrete values. In such a small environment, it’s pretty easy to rigorously specify what it means to win or lose. By contrast, there are billions of humans on Earth, each made up of ~trillions of cells, each made up of ~trillions of atoms. And each atom exists at precise coordinates in 3-space. I’m told physicists frequently make use of infinite-dimensional vector spaces.
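To put rough numbers on the comparison, here is a back-of-the-envelope sketch. The Go figure is an upper bound that counts every colouring of the board, including positions that are illegal under the rules. The real-world figure just multiplies the loose orders of magnitude quoted above ("billions", "~trillions"), taking the low end of each as an assumption, so treat it as illustrative rather than as a measurement.

```python
BOARD_POINTS = 19 * 19      # 361 intersections
STATES_PER_POINT = 3        # empty, black, white

# Upper bound on board configurations (ignores legality of positions).
go_configurations = STATES_PER_POINT ** BOARD_POINTS
go_digits = len(str(go_configurations))

# Assumed low-end readings of the essay's rough figures.
HUMANS = 10**9              # "billions of humans"
CELLS_PER_HUMAN = 10**12    # "~trillions of cells"
ATOMS_PER_CELL = 10**12     # "~trillions of atoms"
COORDINATES_PER_ATOM = 3    # a position in 3-space

real_world_coordinates = HUMANS * CELLS_PER_HUMAN * ATOMS_PER_CELL * COORDINATES_PER_ATOM

print(f"Go: {BOARD_POINTS} discrete dimensions, about 10^{go_digits - 1} configurations")
print(f"Humans alone: about {real_world_coordinates:.0e} continuous coordinates")
```

The Go number is large, but it describes a finite set over 361 discrete dimensions. The second number is a count of continuous dimensions, before we even ask how many states each can take.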
The difference between robotics and regular AI is that robots have to solve problems in the absurdly high-dimensional continuous state space that is the real world.
Compressing Reality With Ontologies
How do humans solve robotics problems? Our brains create a lossy compression of reality and use it for simulation and planning. Human values are defined in terms of the knowledge representation (“ontology”) that our brains maintain through compressing sensory data. Our ontology is idiosyncratic: It’s been optimized by evolution for the purpose of survival and reproduction. If I asked whether it was OK for me to push an atom of soil in your yard a micrometer to the left, you’d probably shrug your shoulders. From your perspective, there would just be undifferentiated soil before and after I did my push.
If an AI is to be Friendly, it must operate based on an ontology that’s capable of expressing our values.
High and Low Fidelity Ontologies
In Eliezer’s essay The Simple Truth, he describes a shepherd who uses pebbles in a bucket to know whether sheep are missing in the evening. The bucket’s ontology has just a few concepts. They correspond to events where sheep come & go, and the number of sheep outside the fold at a given time. The bucket compresses reality: Its size & complexity is much less than that of the flock. But it still captures an aspect of reality we care about.
Suppose we create an AI that manages access to the fold. The AI uses the number of pebbles in the bucket to determine when to shut the gate for the night. One day, a pregnant sheep gives birth while grazing outside the fold. The mother sheep and her lamb are the last sheep to return in the evening. The lamb trails behind the mother. As the mother enters, the bucket empties. The AI closes the gate, leaving the lamb shut outside of the fold.
This tragedy is a result of the AI using an ontology that’s too low fidelity. Although its ontology usually captures what we care about, in this case the ontology’s lossy compression of reality destroyed valuable information. If the AI is a goal-driven AI, and it has been given a goal of not allowing any creature into the fold after the bucket is emptied, it will actively resist our attempts to help the lamb into the fold — perverse instantiation.
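The failure is easy to reproduce in a toy model. The sketch below is my own illustration, with hypothetical class and function names: the gate controller's entire world model is one integer, so it has no way to represent an animal that never left the fold.

```python
class PebbleBucketGate:
    """Gate controller whose whole ontology is the number of pebbles in the bucket."""

    def __init__(self):
        self.pebbles = 0
        self.gate_open = True

    def sheep_leaves(self):
        """Morning: one pebble goes into the bucket per departing sheep."""
        self.pebbles += 1

    def animal_arrives(self):
        """Evening: admit the animal if the gate is open. Returns True if admitted."""
        if not self.gate_open:
            return False
        self.pebbles -= 1
        if self.pebbles == 0:
            self.gate_open = False  # bucket is empty, so the AI shuts the gate
        return True


def animals_shut_outside(flock_size, lambs_born_outside):
    """Simulate one day and count the animals left outside at night."""
    gate = PebbleBucketGate()
    for _ in range(flock_size):
        gate.sheep_leaves()
    arrivals = flock_size + lambs_born_outside
    admitted = sum(gate.animal_arrives() for _ in range(arrivals))
    return arrivals - admitted


print(animals_shut_outside(flock_size=10, lambs_born_outside=0))  # ordinary day
print(animals_shut_outside(flock_size=10, lambs_born_outside=1))  # a lamb is born
```

Nothing in the code is buggy in the usual sense. Every method does what it says, and the lamb is still locked out, because "lamb" is not a concept the bucket can express.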
To fix this problem, we can make the AI’s ontology higher fidelity. Perhaps the AI can count pregnant sheep separately as they leave the fold each morning. The AI now has two concepts where there was previously just one. These concepts interact with moving parts: If the total number of sheep has stayed the same, but the number of pregnant sheep has dropped by one, the AI assumes a lamb is missing. But even this ontology does not have perfect fidelity — for example, what happens when a pregnant sheep has a miscarriage?
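As a sketch of the two-concept version, the rule above can be written down directly. The function name and signature are hypothetical, and this is only one of many ways the rule could be encoded.

```python
def lambs_presumed_missing(left_total, left_pregnant, returned_total, returned_pregnant):
    """Rule from the text: if the head count is unchanged but fewer pregnant
    sheep came back than went out, assume each one gave birth and that a
    lamb is still outside."""
    if returned_total == left_total and returned_pregnant < left_pregnant:
        return left_pregnant - returned_pregnant
    return 0


# Ordinary day: nothing changed, so no lamb is expected.
print(lambs_presumed_missing(10, 2, 10, 2))

# A birth outside the fold: one fewer pregnant sheep, so wait for one lamb.
print(lambs_presumed_missing(10, 2, 10, 1))

# A miscarriage produces exactly the same four numbers as a birth,
# so the rule would keep the gate open for a lamb that does not exist.
print(lambs_presumed_missing(10, 2, 10, 1))
```

The last two calls are identical on purpose: from inside this ontology, a birth and a miscarriage are the same event.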
Ontology Autogeneration and Reconciliation
The world is too complex for us to manually program an ontology into an AI. So instead, we’ll need to develop algorithms that autogenerate an ontology. But since our ontology is idiosyncratic, and our values are defined in terms of our ontology, our values won’t be expressible in an autogenerated ontology without some work.
Regardless of the ontology autogeneration algorithm that’s chosen, it’s almost certain that the initial autogeneration will either (a) capture human values with insufficient fidelity or (b) contain so many concepts that finding human values among them will be its own project.
To solve this, we’ll want to develop additional algorithms for “ontology reconciliation”: Matching concepts between a human operator’s ontology and an ontology that was autogenerated by a computer. Since human concepts can be expressed using natural language, you could say this problem is roughly equivalent to the task of getting computers to understand natural language while minimizing the degree to which our values are “lost in translation”. (If that sounds scary hard, pretend I said “refining an isomorphism between English words and concepts that an AI generated for itself internally”.)
For example, suppose our autogeneration algorithm used Google Images to learn what cats and dogs look like, but it’s not yet completely sure what’s a cat and what’s a dog. Then our reconciliation algorithm could identify pictures the ontology doesn’t know how to classify and ask a human operator how to classify them. To speed this process, the reconciliation algorithm could identify pictures such that knowing their label would allow the classification of a maximal number of other cat/dog pictures.
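Both selection rules in that example can be sketched in a few lines. This is an illustration under simplifying assumptions I am making up for the purpose: each picture has a single model probability of being a cat, and a single 1-D "feature" stands in for whatever representation the ontology uses. The first function is plain uncertainty sampling. The second prefers an uncertain picture that sits near many other uncertain pictures, on the guess that its label will help classify them too.

```python
def uncertainty(p_cat):
    """0.0 when the model is sure either way, 1.0 when p_cat is exactly 0.5."""
    return 1.0 - abs(2.0 * p_cat - 1.0)


def most_uncertain(pictures):
    """pictures: dict of name -> p_cat. Returns the picture the model is least sure about."""
    return max(pictures, key=lambda name: uncertainty(pictures[name]))


def most_informative(pictures, features, radius=0.75, threshold=0.5):
    """Among uncertain pictures, return the one with the most uncertain
    neighbours within `radius` in the (hypothetical) 1-D feature space."""
    uncertain = [n for n, p in pictures.items() if uncertainty(p) >= threshold]

    def neighbour_count(name):
        return sum(
            1 for other in uncertain
            if other != name and abs(features[other] - features[name]) <= radius
        )

    return max(uncertain, key=neighbour_count)


# Toy data: model confidence and a made-up feature value per picture.
pictures = {"a": 0.95, "b": 0.52, "c": 0.40, "d": 0.45, "e": 0.35, "f": 0.05}
features = {"a": 0.0, "b": 9.0, "c": 4.0, "d": 4.5, "e": 5.0, "f": 10.0}

print(most_uncertain(pictures))              # the single hardest picture
print(most_informative(pictures, features))  # the picture whose label helps most others
```

In this toy data the two rules disagree: "b" is the hardest picture, but it is isolated, while "d" sits in the middle of a cluster of other hard pictures.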
Operator feedback could play a role akin to that of a validation set in machine learning. Corrections made by the human operator regarding incorrect or fuzzy aspects of the autogenerated ontology could be used to tune hyperparameters of the autogeneration process. The goal is that this ontology could be exposed to the user to create Wikipedia on steroids: intuitive to browse, but with capabilities Wikipedia lacks because Wikipedia doesn’t actually understand its own subject matter. In other words, generation of human-comprehensible ontologies is a behavior that gets shaped through human feedback.
To make the best use of operator time, the reconciliation algorithm could ask the operator questions that maximize the expected change in autogeneration hyperparameters. An analogy is CFAR’s double crux. In double crux, two people with different models find a data point such that gathering it will provide maximal information about whose model is correct. In the same way, the ontology reconciliation algorithm finds a question such that answering it will provide maximal information about which of various competing ontology autogeneration strategies to use. (“The AI is doing internal double crux in attempting to model its operator.”)
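One standard way to make "maximal information" precise is expected information gain, sketched below. I am assuming, for illustration only, that the competing autogeneration strategies can be treated as a small set of hypotheses with a prior, and that each strategy predicts a probability that the operator answers "yes" to a given question. The question names are placeholders.

```python
import math


def entropy(distribution):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)


def expected_information_gain(prior, p_yes_given_strategy):
    """Expected reduction in uncertainty about which strategy is right,
    if the operator answers this yes/no question."""
    p_yes = sum(w * q for w, q in zip(prior, p_yes_given_strategy))
    gain = entropy(prior)
    outcomes = (
        (p_yes, p_yes_given_strategy),
        (1 - p_yes, [1 - q for q in p_yes_given_strategy]),
    )
    for answer_probability, likelihoods in outcomes:
        if answer_probability > 0:
            posterior = [w * l / answer_probability for w, l in zip(prior, likelihoods)]
            gain -= answer_probability * entropy(posterior)
    return gain


def best_question(prior, questions):
    """questions: dict of name -> list of P(yes) under each strategy."""
    return max(questions, key=lambda name: expected_information_gain(prior, questions[name]))


prior = [0.5, 0.5]                      # two strategies, equally plausible
questions = {
    "q_agree": [0.9, 0.9],              # both strategies predict the same answer
    "q_crux": [1.0, 0.0],               # the strategies flatly disagree
}

print(best_question(prior, questions))
```

A question both strategies agree on is worth nothing here, however interesting it sounds. The crux question settles the matter in one answer, which is the double crux intuition.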
To make the entire process move as fast as possible, we could regenerate the entire ontology to minimize loss over all questions answered whenever a new answer comes in. (You might call this “supervised unsupervised learning”.) Suppose this is slow, and it takes N times as long to autogenerate an ontology as it takes the user to answer a question. Then we could pipeline N different ontology autogeneration jobs so the user always has a high-value question available to answer.
This approach might lead to the hyperparameters “overfitting” in some sense, but standard techniques can be used to address that.
Creating Good Ontologies
When humans study topics like chemistry that evolution didn’t optimize us for, it’s necessary for us to first master prerequisite concepts like decimal numbers. Ontology autogeneration faces a similar challenge. These prerequisites will start at a low level: Our system will need to learn concepts from scratch that humans take for granted. Once basic human words are mastered, more advanced human concepts can be understood with the help of things humans have written.
To ensure a coherent ontology, we’d want our algorithms to be capable of noticing confusion: finding contradictory aspects of the ontology that it’s not obvious how to reconcile internally. Perhaps this could be done by e.g. using simulation to construct counterexamples, or finding situations where different theories generate different predictions.
Similarly, it’d be useful to know when to unite two concepts because they are two sides of the same coin, and when to split a single concept because it describes too much. (I argued that “intelligence” was such a concept in the “‘Intelligence’ is a Misleading Word” section above.)
Understanding Considered Orthogonal to Planning
We’ve discussed the friendliness chicken-and-egg problem, and a possible way to overcome it. But let’s take a step back. Is this approach one that can be made safe?
I suspect that our notion of “general intelligence” leads us astray here. Suppose we define a “general intelligence” to be something that’s capable of achieving goals in a variety of environments. There are two components to this definition: the “achieving goals” part and the “variety of environments” part. I’d argue that in the same way we split “intelligence”, it’s also useful to split “general intelligence” into a goal achievement aspect and a generality aspect.
In particular, consider unsupervised learning techniques like clustering and dimensionality reduction. These are very general algorithms capable of reasoning about varied environments. They’re also the sort of algorithms ontology autogeneration software might use. But they aren’t goal-oriented. They’re behavior execution algorithms, not utility maximization algorithms.
In his discussion w/ Holden regarding Tool AI, Eliezer said:
…it may very well be that even though talking doesn’t feel like having a utility function, our brains are using consequential reasoning to do it.
It makes sense that evolution would organize our comprehension in a consequentialist way, because as I said above, we’re goal-driven creatures optimized to survive and reproduce. But this tells us little about what software it’s possible to write in principle. Humans are goal achievers capable of operating in varied environments, but this data point does not prove these characteristics inevitably co-occur.
Ability to reason about various environments is a necessary condition for “general intelligence” as we’ve defined it, but I doubt it’s a sufficient condition. Making plans and working to achieve goals feels like additional functionality that won’t appear spontaneously.
Machine Learning Algorithms as a UFAI Risk
Let’s drill down even further. Starting with algorithms that everyone agrees are safe, we can try to identify reasons why we aren’t worried about them, then use those criteria to evaluate more sophisticated algorithms.
Suppose I use gradient descent to train a neural net that tells pictures of cars from pictures of trucks. Hopefully we all agree that this is not a UFAI. Why not?
- The thing that looks most like a utility function in this system is the loss function that gradient descent works to minimize. But this function is minimized in a greedy way. No elaborate plans are constructed, and they wouldn’t be useful given the lack of state in the system. Gradient descent’s “planning” capability is not general enough for the real world. Even though it’s a superhuman “planner” in this particular domain, I’m not worried about it.
- The “domain” of this “utility function” is the space of possible classifiers, represented by a bunch of numerical weights. Insofar as the algorithm creates and executes “plans”, it does so only over an abstract internal representation.
- More broadly, from the perspective of gradient descent, the neural net is something of a black box. The gradient descent algorithm operates the same way regardless of the neural net that’s being trained. There’s a sense in which there are actually two “AIs” operating here: gradient descent, and the neural net it’s autogenerating. The neural net has an increasingly refined ontology of visual features that differentiate cars from trucks, but these are not concepts in gradient descent’s ontology. Gradient descent is not “learning about the outside world” in the process of training this neural net.
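To make these points concrete, here is a minimal sketch of the whole "optimizer" for a toy car/truck classifier. I am substituting a one-feature logistic regression for the neural net to keep it short, and the data is made up. Note what the update rule can see: a pair of numbers and a loss. Each step is greedy, there is no state beyond the weights, and nothing in `gradient_step` refers to cars, trucks, or the outside world.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(weights, data):
    """Mean cross-entropy: the thing that looks most like a utility function."""
    w, b = weights
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def gradient_step(weights, data, learning_rate=0.1):
    """One greedy step downhill. The 'plan' is a single move in weight space."""
    w, b = weights
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / len(data)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / len(data)
    return (w - learning_rate * grad_w, b - learning_rate * grad_b)


# Made-up data: (centred vehicle length, 1 if truck else 0).
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

weights = (0.0, 0.0)
losses = [loss(weights, data)]
for _ in range(50):
    weights = gradient_step(weights, data)
    losses.append(loss(weights, data))

print(f"loss went from {losses[0]:.3f} to {losses[-1]:.3f}")
```

Swapping in a different model or dataset would not change a line of the loop, which is the sense in which the trained model is a black box from gradient descent's point of view.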
I suspect some/all of these properties can be retained while implementing the ontology autogeneration process proposed above. I see ontology autogeneration as a domain-specific challenge that’s easier than fully general AI, while also being easier to “prove safe” (insofar as this is achievable).
Is Superhuman Planning a Faster Path?
A new objection: Even if non-goal-driven ontology autogeneration is possible in principle, goal-driven systems will get there first.
[Demis] Hassabis believes the reinforcement learning approach is the key to getting machine-learning software to do much more complex things than the tricks it performs for us today, such as transcribing our words, or understanding the content of photos. “We don’t think just observing is enough for intelligence, you also have to act,” he says. “Ultimately that’s the only way you can really understand the world.”
In the traditional machine learning paradigm, an algorithm observes a fixed dataset & derives hypotheses about it. But a fixed dataset presents limitations. Initially, randomly selected data points are helpful for hypothesis development. But as you identify candidate hypotheses, it becomes less likely that new randomly selected data points will allow you to differentiate between them.
AlphaGo didn’t have this limitation. Consider AlphaGo as a system for discovering, testing, and refining hypotheses about how to play Go really well. In the course of its training, the AlphaGo Zero version played 4.9 million games against itself. Through self-play, it was able to play the games that would prove maximally informative to its current model of how to play the game, thereby refining hypotheses deep in hypothesis space.
I suspect this advantage is not one that’s intrinsic to acting in the world. Even though machine learning algorithms currently function as “data analysts”, it may be possible to develop algorithms that are “curious experimenters” and retrieve maximally informative data points, the way AlphaGo does.
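A toy version of the "data analyst" versus "curious experimenter" contrast: suppose the hypotheses are possible values of an unknown threshold between 0 and 1, and each data point tells us whether it falls above or below that threshold. The setup and function names are mine, chosen only to show that choosing queries needs no ability to act in the world, just the ability to pick which data point to look at.

```python
import random


def label(x, threshold):
    """The 'experiment': is x at or above the hidden threshold?"""
    return x >= threshold


def remaining_width(queries, threshold):
    """Width of the interval of thresholds still consistent with the answers."""
    lo, hi = 0.0, 1.0
    for x in queries:
        if label(x, threshold):
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return hi - lo


def passive_width(n, threshold, rng):
    """Data analyst: n data points drawn at random from a fixed distribution."""
    return remaining_width([rng.random() for _ in range(n)], threshold)


def active_width(n, threshold):
    """Curious experimenter: always query the midpoint of what is still unknown."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = (lo + hi) / 2
        if label(mid, threshold):
            hi = mid
        else:
            lo = mid
    return hi - lo


rng = random.Random(0)
print("passive:", passive_width(10, 0.3, rng))
print("active: ", active_width(10, 0.3))
```

With random data points the uncertainty typically shrinks in proportion to 1/n, because most new points land where the answer is already known. With chosen queries it halves every time.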
A Fantastic Proposal
DeepMind is not the only AI team making use of simulation. It’s also something self-driving car teams do. As far as I know, these virtual worlds are currently being created by hand. But it occurs to me that automating this work could provide an ideal test case for ontology autogeneration.
Imagine a company called Autogenerated Fantasy Worlds Inc. They develop software that ingests books/movies/video games related to some fantasy world (Star Wars, Harry Potter, or even traditional gaming worlds like Legend of Zelda) and automatically generates a virtual reality MMORPG corresponding to that world. Instead of handcrafting dozens of digital worlds, the company only needs to refine a single ontology autogeneration software package. It doesn’t need to be perfect to begin with, as long as it saves labor over handcrafting. As players send in bug reports regarding inaccurate aspects of a world, the company’s profit incentive becomes aligned with solving the ontology autogeneration problem at maximum fidelity. NPC-related bug reports would be especially useful, since they would provide information about the software’s ability to model a character’s values. Given the profit potential of this project, it might be possible to attract VC investment and use little philanthropist money.
Is Self-Improving AI A Faster Path?
Another possible objection: Even if the above proposal is workable, self-improving AI will get there first.
In line with my previous suggestion, I’ll taboo “self-improving AI” and instead assume we’ve discovered an “algorithm discovery algorithm” (ADA). An ADA takes a problem description (e.g. in the form of inputs with desired outputs) and produces an algorithm which solves the problem.
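To show the interface I have in mind, here is a deliberately weak ADA: a brute-force search over short compositions of four hand-picked primitives. Everything here (the primitives, the names, the search strategy) is a hypothetical stand-in. A real ADA would need a far richer space of programs and a far smarter search, but the inputs and outputs have the same shape.

```python
from itertools import product

# A tiny, made-up instruction set.
PRIMITIVES = {
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}


def run(program, x):
    """Apply a sequence of primitive names to x, left to right."""
    for name in program:
        x = PRIMITIVES[name](x)
    return x


def discover_algorithm(examples, max_length=3):
    """Return the shortest program consistent with every (input, output)
    pair, or None if nothing up to max_length fits."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None


# Problem description: inputs with desired outputs.
examples = [(1, 4), (2, 9), (3, 16)]
print(discover_algorithm(examples))
```

This search executes a fixed behavior: enumerate, test, return. It has no model of the world and no preferences about what the discovered program is used for, which is part of why I don't think an ADA has to be a utility maximizer.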
An ADA could be a useful aid for writing almost any software, including an ontology autogeneration system. But I don’t think an ADA necessarily needs to be a dangerous utility maximizer. Even if it does, I suspect that the difficulty of creating an ADA and the difficulty of creating an ontology autogenerator are sufficiently comparable that differential technological development could get us the ontology autogenerator first.
Our values can be represented using concepts our brain already has. Novel concepts generated by an ADA should not, in principle, be necessary.
Superclassing This Approach: Simulations are Really Useful
Video game creation is just one application of the software outlined above. The ability to run high-fidelity simulations could also be useful for:
- Predicting the stock market
- Solving the problem of consequentialist cluelessness in Effective Altruism
- Testing the behavior of FAI designs in a sandboxed environment
- Coherent Extrapolated Volition
Furthermore, in the same way simulating a Go board allowed AlphaGo Zero to play 4.9 million games of Go against itself, simulating the real world could give an AI team a big leg up in the creation of an AI capable of real-world goal achievement. (Compare with riskier ways to get a leg up, such as by letting your AI modify its own source code.)
Simulation could also form the key component of a non-goal-driven Oracle AI or Tool AI design.
Superclassing This Approach: Honest AI
You might think the proposal above sounds a little, well, fantastic. Surely there’s some easier way to combine algorithms and create an agent capable of achieving goals in a variety of environments?
There probably is. But one reason for optimism is that AIs like AlphaGo already implicitly generate ontologies that are encoded in their neural nets. So the trick is to make that implicit process explicit enough that we can pick concepts out.
A treacherous turn happens when an AI has two different concepts related to the user’s values: a target concept that the user told it to maximize, and an implicit concept the AI created. The target concept constitutes the AI’s actual values. But the implicit concept is a more accurate approximation of the user’s values, and the AI makes use of it to trick the user.
Suppose we’re looking at an autogenerated ontology, and we think we’ve found a concept V within the ontology that corresponds to our values. If we hook our ontology up to a Tool AI, and we tell the Tool AI to show us plans for achieving V, we can look over the resulting plans to see if they look like treacherous turns or not. It might even be possible to inspect a candidate plan by playing it out in a simulation — see the “Simulations are Really Useful” section above.
(I’m assuming siren worlds are not an issue because the Tool AI is a plan designer but not a plan executor. Or at least, any plans it executes are limited to the domain of plan design, the same way an OS resource allocation system makes domain-limited plans. See the “Machine Learning Algorithms as a UFAI Risk” section above for the intuition behind this.)
Superclassing this gameplan, we could define an “Honest” AI as one that never attempts to deceive its user about its beliefs. Honesty seems like an easier invariant than Friendliness. It doesn’t require the AI to understand our values, and therefore should be achievable for an initial seed AI. It also seems easier to verify based on looking at an AI’s source code. In fact, there’s a sense in which we get honesty by default. Computers do what they’re programmed to do, and they will only deceive the user as a consequence of their programming.
Honest humans are not typically consequentialists — see e.g. Kant. And most honest AI designs will probably be the same. But there might be room for consequentialist honesty. Consider a consequentialist AI whose only available actions are making HTTP GET requests and highlighting aspects of its internal world model for user inspection. Assuming the integrity of the internal world model could be guaranteed somehow, i.e. we could prevent the AI from causing itself to believe things in order to trick the user with those beliefs, this looks honest to me. Another idea is to give the AI the goal of telling the truth.
Honest is not the same as truthful: It’s possible the AI would say something false if it was mistaken. So we could define a “calibrated” AI as one that was both honest and well-calibrated: if the calibrated AI says something is 99% certain, then it’s actually 99% likely to happen. For self-improving AIs, this allows a sort of inductive proof of calibration: Humans said the initial AI was calibrated with 95% probability. It said its successor was calibrated with 99% probability. The successor said the successor’s successor was calibrated with 98% probability. And so on. The probability that the last AI in the chain is calibrated will be roughly equal to the product of all these estimates. We could use the required probability of calibration needed to generate a successor as a speed dial, and turn it based on how dangerous the world looked at the moment. End result: A supersmart AI that doesn’t do treacherous turns and can answer questions with accurate confidence estimates, which is potentially useful on its own or as a component of a more complicated plan.
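The arithmetic for the example chain is quick to check. The first function is just the product described above. The second is one possible reading of the "speed dial", and it is my assumption rather than a worked-out proposal: set a floor on the overall probability and count how many successors can be generated before the chain drops below it.

```python
import math


def chain_calibration_probability(estimates):
    """Rough probability that the last AI in the chain is calibrated:
    the product of every link's estimate, as described in the text."""
    return math.prod(estimates)


def longest_chain_above(floor, estimates):
    """How many links can be added before the running product falls below `floor`."""
    probability = 1.0
    for count, estimate in enumerate(estimates):
        probability *= estimate
        if probability < floor:
            return count
    return len(estimates)


# Humans vouch for the first AI at 95%, it vouches for its successor at 99%,
# and that successor vouches for the next at 98%.
print(chain_calibration_probability([0.95, 0.99, 0.98]))

# With a floor of 0.9, how far can a chain of further 98% links go?
print(longest_chain_above(0.9, [0.95, 0.99, 0.98, 0.98, 0.98]))
```

The product only ever shrinks, so a long chain of individually confident handoffs can still add up to real doubt. Raising the floor slows the chain down, and lowering it speeds the chain up.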
Superclassing This Approach: Decoder Key Discovery
If we were to specify human values from atoms up, they would take loads and loads of bits to specify. So even if an AI can ask yes/no questions that each extract an entire bit of information, value loading will be slow.
The ontology autogeneration approach subverts this problem by additionally making use of human artifacts such as text. Text on its own is fairly useless: If I give you a textbook written in a language you don’t know, you won’t understand most of it. But if you additionally have the ability to ask yes/no questions of someone who does understand the language, over time you’ll be able to understand the book. And the number of yes/no questions you’ll need to ask will likely be smaller than the number of bits in the textbook.
A poetic way to describe the ontology reconciliation process: The AI is discovering a “decoder key”, made up of hyperparameters etc., that allows it to understand human artifacts and read human values off of them. In these terms, ontology reconciliation is the problem of finding a decoder key that’s short, but still decodes our values with high fidelity.
Another idea for gathering bits is to ask questions using Mechanical Turk etc., but that’s a bit riskier.