I think we need to distinguish between damage caused by a lack of philosophical understanding in the year 2030 (like your bargaining example) and damage caused by failing to ever converge to reasonable philosophical views.
I am skeptical that philosophical ignorance in 2030 is going to cause much damage.
I agree that philosophical ignorance forever is going to cause damage: I don’t think we are going to “get lucky” and happen to do something that is good according to a correct understanding of what is good. I’m happy to basically write off all worlds where people don’t eventually converge to a correct understanding of what is good.
But I am skeptical of the plausibility of philosophical ignorance forever. In the comment you are responding to, I explained why I didn’t understand how global coordination would lead to such a failure to converge to a correct understanding of what is good (as far as I can tell, it should increase rather than decrease the probability).
“Making AI defer to future human reflection” seems like a necessary part of successful AI control, assuming that people want to preserve the ability to engage in future reflection. If I want to continue existing, and to continue having the ability to think about what is good and to have my conclusions influence what the AI does, then that’s what an aligned AI will do. I don’t really know what it would mean for an AI to be “aligned” yet kill me and never make any effort to figure out what the outcome of my deliberation would have been, or to allow me to deliberate but ignore the results. (Again, unless I don’t have any preference to defer to the outcome of my own deliberation; but that would be a problem independently of whether we build AI, and it’s the problem I was expressing skepticism about above.)
So I guess this all probably comes down to disagreement about the difficulty of AI control (under my definition of AI control).
I still don’t really understand why you are so pessimistic regarding philosophical deliberation in particular. This seems basically orthogonal to the legitimate concerns about the kind of AI control that I’ve been working on.
To refresh my memory: do you think deferring to human philosophical deliberation would be a problem for the kind of indirect normativity described in this post? Do you think it would be a problem for a similar proposal in which we replaced the human with an alien who lacked the knowledge needed to infer human values / philosophical conclusions (but retained the motive to be helpful, the ability to recognize humans, being universal in some sense, etc.)? While it’s clear that those aren’t workable AI control proposals, the additional challenges of real AI control proposals seem kind of orthogonal to deferring to human philosophical deliberation.
I do remember concerns about e.g. predictable value divergences between the person in the thought experiment and the real people. In an actually workable AI control proposal, it seems to me that those will have to be the same people (since you have to actually be collecting the training data), so this isn’t a concern that I’ve really thought about.