Reliability amplification
Paul Christiano

“Given a distribution A over policies that [is] ε-close to a benign policy for some ε ≪ 1”

It doesn’t seem to be explicit how we could expect to achieve this property in ALBA. The GitHub implementation looks like it trivially fails: it deliberates among three copies of A, which will presumably all think the same way and therefore fail on the same inputs.
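To make the worry concrete, here is a minimal sketch (hypothetical `policy` and `deliberate` functions, not the actual ALBA code): when all three voters are copies of the same policy, their errors are perfectly correlated, so the majority vote inherits every failure of a single copy.

```python
from collections import Counter

# Hypothetical stand-in for a learned agent: every copy shares the same
# weights, so every copy makes the same mistakes.
HARD_INPUTS = {3, 7}  # inputs on which this policy happens to err

def policy(x):
    return "wrong" if x in HARD_INPUTS else "right"

def deliberate(policies, x):
    """Take the most common answer among the policies."""
    answers = [p(x) for p in policies]
    return Counter(answers).most_common(1)[0][0]

# Three copies of the *same* policy: the ensemble's error rate equals a
# single copy's error rate, so no reliability is gained.
copies = [policy, policy, policy]
assert deliberate(copies, 3) == "wrong"  # all three fail together
```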

In a traditional ensemble, we might divide the training data into three sets and train a separate agent on each. That could give this property for errors that result from the inclusion of one specific bad training example in an agent’s dataset.
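Here is a sketch of that scheme under toy assumptions (a hypothetical `train` function that simply memorizes its shard): a single bad training example corrupts only the one agent whose shard contains it, and the other two outvote it.

```python
from collections import Counter

def train(dataset):
    """Hypothetical trainer: the agent memorizes its (input, label) pairs."""
    table = dict(dataset)
    return lambda x: table.get(x, "right")

# One corrupted example poisons only the shard it lands in.
data = [(i, "right") for i in range(9)]
data[4] = (4, "wrong")  # a single bad training example

# Split the data into three disjoint shards and train one agent per shard.
shards = [data[0::3], data[1::3], data[2::3]]
agents = [train(shard) for shard in shards]

def deliberate(agents, x):
    answers = [a(x) for a in agents]
    return Counter(answers).most_common(1)[0][0]

# Only the agent whose shard contains the bad example errs on input 4;
# the other two outvote it, so the ensemble still answers correctly.
assert deliberate(agents, 4) == "right"
```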

But I think a more concerning class of errors is those arising from some systematic reasoning error by the overseer. For example, with meta-execution there might be some accidentally arising misleading way of indirectly describing the input that leads to an incorrect answer.
