“Given a distribution A over policies that [is] ε-close to a benign policy for some ε ≪ 1”
It isn’t explicit how we could expect to achieve this property in ALBA. The GitHub implementation looks like it trivially fails: it deliberates among 3 copies of A, which will presumably all think the same way and so fail on the same inputs.
In a traditional ensemble, we might divide the training data into 3 sets and train a separate agent on each. This might give the property for errors that result from one specific bad training example, since that example would end up in only one agent’s dataset.
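To make the ensemble idea concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration (the parity task, the memorizing `train` function, the `majority` helper); it is not how ALBA or its GitHub implementation works. It only shows how a single corrupted training example lands in one shard and is outvoted by the other two agents.

```python
from collections import Counter

def train(examples):
    """Toy 'agent' (hypothetical): memorizes its shard, otherwise uses the true rule."""
    table = dict(examples)
    return lambda x: table.get(x, x % 2)

def majority(agents, x):
    """Return the answer given by most agents on input x."""
    votes = Counter(agent(x) for agent in agents)
    return votes.most_common(1)[0][0]

# The correct label is x % 2; exactly one training example is corrupted.
data = [(x, x % 2) for x in range(9)]
data[4] = (4, 1)  # the single bad example

# Three disjoint shards, one agent per shard.
shards = [data[i::3] for i in range(3)]
agents = [train(shard) for shard in shards]

# A single agent trained on all the data inherits the bad example.
single = train(data)

print("single agent on 4:", single(4))           # wrong answer
print("ensemble vote on 4:", majority(agents, 4))  # bad example is outvoted
```

The bad example only reaches the agent whose shard contains it, so the vote recovers the right answer as long as the other agents are correct on that input.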
But I think a more concerning class of errors is those arising from some systematic reasoning error of the overseer. For example, with meta-execution there might be some misleading way of indirectly describing the input, arising by accident, that leads to an incorrect answer. Every agent trained from that overseer would inherit the same error, so splitting the data would not help.
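A toy Python sketch of why this case worries me more, again with everything assumed for illustration: the `overseer` function, the "indirect" description style, and the table-learning `train` function are hypothetical stand-ins, not meta-execution itself. Each agent gets a disjoint shard of inputs, but all labels come from the same overseer, so the agents share the overseer's blind spot.

```python
from collections import Counter

def truth(style, n):
    """Ground truth: the parity of n, however the input is described."""
    return n % 2

def overseer(style, n):
    """Hypothetical overseer with a systematic error on 'indirect' descriptions."""
    return 1 - n % 2 if style == "indirect" else n % 2

def train(shard):
    """Toy agent: learns one label per (style, parity) feature from overseer labels."""
    table = {(s, n % 2): overseer(s, n) for s, n in shard}
    return lambda s, n: table[(s, n % 2)]

def majority(agents, s, n):
    """Return the answer given by most agents on input (s, n)."""
    votes = Counter(agent(s, n) for agent in agents)
    return votes.most_common(1)[0][0]

# Disjoint shards of inputs, but every label comes from the same overseer.
inputs = [(s, n) for n in range(12) for s in ("direct", "indirect")]
shards = [inputs[i::3] for i in range(3)]
agents = [train(shard) for shard in shards]

for style in ("direct", "indirect"):
    vote = majority(agents, style, 100)
    print(style, "vote:", vote, "truth:", truth(style, 100))
```

Here the three agents agree with each other and are all wrong on the indirectly described input, so the vote gives no protection.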