Towards a shared AI safety portfolio prioritization framework

Karl Koch · Published in BuzzRobot · Feb 4, 2018

Summary

I propose a basic two-axis framework to help actors in the AI safety scene transparently show why they are working on what, enabling cooperation and structured discussion among them, in line with a portfolio approach to AI safety. The framework can also help meta-orgs and the wider community communicate and cluster gaps to close in AI safety, and it shows what the core tasks of each actor group are. The two axes are “Problem priority” and “Marginal impact”. Work still needs to be done to ensure this framework transparently includes scenarios and that shared levels of abstraction are found. A template for the framework as well as its limitations are below.

Intro and need for a framework

While becoming more familiar with the field of AI safety research, I noticed a recurring theme: whether in practical (i.e. technical) research, strategic or policy research, or work on the near versus far future, I could not find a framework that could easily sort any undertaken research into an order highlighting the value levers attached. In my view, such a framework should rank different failure modes and scenarios according to their impact and, combined with our ability to affect the risks, guide the community’s work. In addition, it should improve transparency within the safety research community on ongoing efforts and enable improved coordination. It may also be used as a guideline for finding double-cruxes and uncovering disagreement on research prioritization more quickly.

The absence of such a framework may be due to a) the large uncertainty and complexity surrounding differing scenarios, rendering any shared sorting mechanism useless, b) the fact that it is generally considered trivial (and coordination is therefore already happening), or c) the fact that nobody has published one yet. In fact, most of the thoughts in this post are already present or implied in existing research. For example, Allan Dafoe and Owen Cotton-Barratt both touched upon the core dimensions of this model in their talks on EA policy (check them out if you haven’t), i.e. severity, probability, and leverage. I believe, however, that they were never integrated into one framework that could help researchers transparently show how their research helps reduce risks from AI and how they would place their research in comparison to others’. Either way, I will try to outline the dimensions along which I believe AI safety risk cases should be prioritized.

When I talk about risk cases, I am referring to any AI behaving sub-optimally as compared to the goals an altruistic designer would have in mind, including value alignment problems, problems arising from AI malfunctions, and non-optimal goal setting. “Sub-optimal” could be considered euphemistic, as it also covers full transformative AI failure modes, including catastrophic/ existential outcomes. Superintelligence is not required for this framework to hold. As examples of the large number of near-future technical problems still waiting to be solved, see this paper and this (non-exhaustive) collection of recent AI failures.

The framework

The framework follows the importance-neglectedness-solvability approach, adapted by integrating neglectedness and solvability into a single “marginal impact” factor. Instead of “importance”, a priority score is used that also includes urgency and counterfactual development.

This framework is scope insensitive in the sense that it can be used to evaluate complete problem fields, individual problems, and individual solutions, but only on a shared level of abstraction; it would not make sense to plot technical solutions against high-level problem fields. I propose to start by mapping problem fields, then move down to individual problems, then to individual solutions. The framework relates only to AI safety actors. Actors focused on AI performance are represented only insofar as their concern for safety is instrumental to their performance goal, i.e. it underlies the projected ability to control risk in (2a).

I believe three dimensions drive the importance of a failure mode. To get a “Priority score” for the scale of the problem, multiply (1), (2), and (3); a minimal numeric sketch follows below the list.

  1. Importance of risk: Scale of suffering caused (without intervention) as compared to the optimal route, weighted by the probability of problem existence. Important note: The scale of suffering includes a judgment of how maximally bad it would be if the problem materialized. The probability, however, does not refer to the probability of us failing to solve the problem and suffering resulting; it refers to the probability of the problem actually being one, based on how likely we believe certain problems to be. For example, you may believe goal retention is not a likely problem because there is no incentive for an AI to change its utility function. You may still assign some non-zero probability (1) to that problem actually being one, regardless of how likely it is that we will be able to control that problem (2), should it materialize.
  2. Probability of risk manifestation: The projected probability, upon arrival, that we will not be able to control the risk, i.e. that some suffering will materialize, multiplied by the expected percentage of maximum suffering that would then occur. All of this assumes no intervention developed by the AI safety community. Types of interventions reducing risk:
    a) Prevention of failure mode
    b) Intervention during failure mode
    Note: This includes a counterfactual: If the AI safety community did not get active, how high do you think the chances are that a performance-oriented developer might solve it because the solution is instrumental?
  3. Urgency factor
    a) Time/ resources needed to reduce the risk to a level where other options become the priority, minus the time/ resources available until the failure materializes
    Note: This is currently linear, but of course does not have to be. It also includes a judgement about the best options to solve the problem.
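
To make the multiplication concrete, here is a minimal sketch in Python. The helper names, the way the linear urgency term is rescaled into a multiplier, and every number below are hypothetical placeholders chosen for illustration; they are not estimates or definitions from this post.

```python
# Minimal sketch of the priority score: (1) importance of risk x (2) probability of
# risk manifestation x (3) urgency factor.  All values below are invented placeholders.

def importance_of_risk(scale_of_suffering, p_problem_exists):
    # (1) How bad it would maximally be, weighted by how likely the problem is real at all.
    return scale_of_suffering * p_problem_exists

def manifestation(p_uncontrolled, expected_share_of_max_suffering):
    # (2) Probability we fail to control the risk (absent AI safety work),
    #     times the expected share of maximum suffering that would then occur.
    return p_uncontrolled * expected_share_of_max_suffering

def urgency(resources_needed, resources_available):
    # (3) The post defines this linearly (needed minus available); here it is
    #     rescaled into a positive multiplier so that scores remain comparable.
    return max(0.1, 1.0 + (resources_needed - resources_available))

def priority_score(p):
    return (importance_of_risk(p["scale"], p["p_exists"])
            * manifestation(p["p_uncontrolled"], p["severity_share"])
            * urgency(p["needed"], p["available"]))

# Hypothetical example problem (all numbers invented):
goal_retention = {"scale": 0.9, "p_exists": 0.3, "p_uncontrolled": 0.5,
                  "severity_share": 0.7, "needed": 0.8, "available": 0.6}
print(round(priority_score(goal_retention), 3))  # -> 0.113
```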

Secondly, I propose applying a “Marginal impact factor”, consisting of:

  1. Solution value
    a. Increase in ability to control risk → reduction in failure mode probability and/ or intervention effectiveness
    b. Spill-over effects on other desiderata (see Flynn et al., 2017)
  2. Marginal ability to achieve solution value as compared to next best actor
    a. Probability of success
    b. Resource intensity
    c. Coordination capability

Combined, these factors give both individual actors (researchers, strategists, developers, policy-makers, etc.) and the AI safety community (coordinators, meta-orgs) a way to prioritize their work. The “marginal impact score” within the community decision mechanism refers to the accumulated capability of all AI safety actors.
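
To show how the two scores might be combined into the two-axis view, here is another hedged sketch, reusing the hypothetical priority scores above. How the marginal-impact sub-scores are aggregated is my own assumption for illustration; the post does not prescribe a formula.

```python
# Toy sketch of the "marginal impact" axis and of placing problems on the chart.
# The aggregation below (value x ability) is an assumption made for illustration.

def marginal_impact(risk_reduction, spillover, p_success, resource_intensity, coordination):
    # (1) Solution value: reduction in failure probability / intervention effectiveness,
    #     plus spill-over effects on other desiderata.
    value = risk_reduction + spillover
    # (2) Marginal ability vs. the next best actor: likelier success, lower resource
    #     needs and better coordination all increase marginal impact.
    ability = p_success * coordination / resource_intensity
    return value * ability

# Hypothetical portfolio: (priority score, marginal impact) per problem.
portfolio = {
    "reward hacking": (0.11, marginal_impact(0.6, 0.1, 0.5, 1.0, 0.8)),
    "goal retention": (0.05, marginal_impact(0.3, 0.0, 0.7, 2.0, 0.6)),
}
for name, (prio, impact) in sorted(portfolio.items(), key=lambda kv: -kv[1][0] * kv[1][1]):
    print(f"{name:15s} priority={prio:.2f} marginal impact={impact:.2f}")
```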

Inclusion of scenario thinking

Note that AGI arrival scenarios underlie all of these judgements, not only in the urgency factor but also, for example, in the projected ability to control risks without any intervention by the AI safety community and in the evaluation of projected solutions. For example, an option to improve intervention ability in a value alignment failure mode may include extending the “time to react”, e.g. by boxing any AGI for prolonged periods of time for security testing. An AGI performance race or a hard take-off scenario (or both) would strongly reduce the value of that option.

An alternative way to include scenarios is to spell them out on n additional axes, turning this framework from a two-dimensional into a 2+n-dimensional one. Such a choice, however, comes with technical difficulties:

If adding such an axis removes the probability of each scenario occurring from the priority score/ marginal impact score, one could nicely show how certain problems/ solutions move in their combined importance as the scenario changes. On its own, that is obviously not ideal, as extremely low-probability scenarios could then determine the overall best possible actions. The other extreme would be to keep the probability of each scenario in the scores. That, however, would defeat the purpose of the scenarios, as no change would be visible in the scores when moving from scenario to scenario. The remaining option I see is, as first stated, to treat each scenario as given, attach a probability to it on the axis, and not integrate that probability into the scores; when eventually deciding on the most important actions to take, this probability is then used to weigh each best-in-class option. While that effectively just excludes the additional axes again, including them in the first place can be a good tool for visualizing core mechanisms and enabling discussions about core disagreements.
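
As a sketch of that remaining option: per-scenario scores stay free of the scenario probability, and the probability only enters at the final decision step. The scenario names, probabilities, and scores below are all hypothetical.

```python
# Scenario probabilities are NOT folded into the per-scenario scores; they are only
# applied when comparing best-in-class options at the end.  All numbers are invented.

scenario_probs = {"slow take-off": 0.6, "hard take-off": 0.3, "performance race": 0.1}

# Combined (priority x marginal impact) score per option, judged within each scenario.
option_scores = {
    "extend time to react via boxing": {"slow take-off": 0.30, "hard take-off": 0.05, "performance race": 0.02},
    "value alignment research":        {"slow take-off": 0.20, "hard take-off": 0.25, "performance race": 0.15},
}

def expected_score(option):
    # The scenario probability enters only here, at the decision step.
    return sum(scenario_probs[s] * score for s, score in option_scores[option].items())

for option in option_scores:
    print(option, round(expected_score(option), 3))
# Boxing looks strong under a slow take-off but loses most of its value once
# hard take-off or race scenarios carry weight, matching the point made above.
```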

The right dimensions along which to include scenarios would, as far as I can see, be those that maximise the distance between different problems/ solutions. For example, if you believe future systems will still be reinforcement-learning based, you will, unsurprisingly, weigh research into technical improvements to RL models far more heavily than if you think they will most certainly be very different.

Such dimensions could be continuous (time to arrival), discrete (level of race intensity), or categorical, covering completely different (arrival) scenarios. Victoria Krakovna wrote about core dimensions of assumptions along which judgements can differ heavily, and about the needed portfolio approach to AI safety, here.

Practical implications

This framework has implications both for individuals (researchers) and for actors aiming to coordinate AI safety efforts. I am aware that these insights are hardly news, but they might be valuable as a write-up.

Individual level

[Figure: Individual decision mechanism]

Firstly, individuals working on AI safety must exchange ideas with peers, both to discover their comparative (research) advantage and to coordinate on optimal research strategies. The primary aim in this sector is to move problems down the priority axis, either by reducing a problem’s priority through gaining clarity on which problems need to be solved (1), by building solutions or enabling counterfactual development (2), or by increasing the time/ resources available to solve the problems (3). (3), however, might be a task that can only be undertaken by the community. This framework may help to map disagreements.

Secondly, they should aim to move problems from the top-left to the top-right sector by building relevant skills as prioritized by the community (which they may well also be a part of). This could also involve doing foundational research to find angles of attack for possible solutions, which would move problems to the right.

Community level

[Figure: Community decision mechanism]

Actors trying to assume a coordinating role within the AI safety scene have two top priorities:

1. Enable cooperation: Establish a common understanding in the community about priority scores; support researchers in coordinating/ cooperating in their work (Think BERI)

2. Safety field management: Prevent important problems from being ignored, e.g. by identifying right scenario dimensions; Identify capital (human and financial) gaps and aim to close them; Influence the time available for closing research gaps (think FHI, CSER)

Generally, coordinators should try to move as many problems as possible away from the top-left sector. They can do so by building more accurate priority scores (meaning some problems will become less relevant and others more) or by actively influencing their priority. Actively influencing them would include increasing counterfactual development, e.g. by having commercial researchers work on them, or improving the impact of the community (moving problems to the right). Again, the scope insensitivity of the framework allows for plotting solutions to “top-left” problems along the axes and identifying the best actions (top-right) for them.

Problems of this framework

Two main problems exist with this framework. For one, finding shared levels of abstraction might be very complicated or unfeasible. As a shared level of abstraction is required for the framework to be workable, this could be a large problem. I am currently not able to come up with descriptors for how to cluster certain problems on shared levels, so I would be very happy for any ideas here.

Secondly, differing scenarios might require contrasting solutions. If they are equally probable, unintuitive portfolios will arise. This can only be addressed by including the additional “scenario” dimensions described above.

Last words

I hope this write-up creates some clarity on how to think about clustering and prioritizing AI safety research. Either way, I enjoyed writing this and hope to post some more thoughts in the near future.

IMPORTANT: Any FEEDBACK is highly appreciated, especially since this is my first post! Thanks so much for reading!

Notes regarding priority score:

“Projected ability” is a counterfactual: If the AI safety community did not get active, how high do you think the chances are that a performance-oriented developer might solve it because the solution is instrumental?
Necessarily, any current assessment of failure controls entails a “probability of a failure control failure”: this control failure in itself constitutes a next-level problem that can also be mapped onto the matrix.

Notes regarding marginal impact score:

1a) might include methods like:

1. Increase control over AI:

a. AI directly (value alignment through reward hacking prevention, stop-button problems, etc.)

b. Goal setting agent

2. Increase time to react before or after action is taken

3. Increase transparency of reasoning/ action

4. Reduce probability of control failure (malfunction)

5. Only for interventions: Increase corrigibility, i.e. recover from damage after a “bad” action
