Judgmental bootstrapping seems to presume that one has access either to trusted expert forecasts, or else to some way of determining retrospectively which forecasts were accurate; without these we can’t get data to train our model.
In this case, we can resolve the arguing problem by rewarding agreement with expert forecasts (perhaps only actually checking against those forecasts with small probability), or else by checking after the fact which predictions turned out to be best.
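To make the “check with small probability” idea concrete, here is a minimal sketch in Python of a reward that scores agreement with an expert forecast but only queries the expert occasionally. The names (`get_expert_forecast`, `check_prob`) are hypothetical, not from any particular system; the point is just that dividing by the checking probability keeps the expected reward the same as if we checked every time.

```python
import random

def reward(model_forecast, get_expert_forecast, check_prob=0.1):
    """Reward agreement with an expert forecast, querying the expert
    only with small probability so expert effort stays cheap.

    `get_expert_forecast` is a hypothetical callable returning the
    expert's probability for the same event; names are illustrative.
    """
    if random.random() < check_prob:
        expert = get_expert_forecast()
        # Penalize squared disagreement with the expert; dividing by
        # check_prob makes the expected reward equal to what it would
        # be if we checked on every round (an unbiased estimate).
        return -((model_forecast - expert) ** 2) / check_prob
    # No check this round: neutral reward.
    return 0.0
```

The division by `check_prob` is the standard importance-weighting trick: the model faces the same incentives in expectation, while the expert is consulted only rarely.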
If we have access to expert forecasts (or retrospective reliability tests) in one domain, then we could try to transfer the learned models to a new domain in which forecasts aren’t available. I have been trying to make minimal assumptions about this kind of transfer, and moreover I would want to elicit expert-level views about “what are the important differences between the old domain and this new domain.” If we don’t have expert forecasts in the new domain, then it can’t be too similar to the old domain: there must be some difference that is responsible for the unavailability of expert feedback in the new domain. In the AI case, these tend to be just the kinds of transfer that are most likely to be problematic.
I am most interested in the case where we lack trusted experts, and where we can’t use simple rules to evaluate performance after the fact (typically because the problem is not a simple prediction problem: it is a design problem, or a very long-range prediction, or some other kind of question altogether). These are the hard cases for AI control: if we have trusted experts, we can use imitation learning, which is basically the same as judgmental bootstrapping; if we have ex post evaluations that capture what we care about, then we can use traditional reinforcement learning.
I’m also interested in trying to reduce our reliance on this kind of information, even in domains where it is available, so that our techniques will better generalize to future domains where it isn’t available.
The hope, then, is for evaluations-of-explanations to serve as an adequate substitute for feedback from nature. This is what allows us to replace “optimizing an explicit reward signal in nature” with “optimizing the values implicit in the user’s head.”