I don’t totally understand the claim here. You seem to be saying: “We want our AI to take actions which will have certain consequences, i.e. to produce code which will have certain consequences when run. And that problem is essentially about reasoning from consequences, *so it is harder to do within the approval-directed framework*.” Of course I don’t agree with the italicized implication.
This isn’t specific to self-improvement. If this kind of goal-directed behavior can’t be captured by approval-directed systems, then approval-directed systems are obviously unsuitable as an alternative.
I agree that it is intuitively plausible that going through approval-direction or imitation would introduce additional costs. But that is precisely what I am trying to argue against.
As an example, consider a simple goal-directed agent which searches for an action that will have a high estimated value. For an approval-directed system to do the same thing, its approval ratings would need to reflect those estimated values. But given that the overseer can consult the exact same algorithms that the goal-directed agent uses to estimate values, that seems totally realistic.
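To make the point concrete, here is a minimal sketch in Python of the argument above; the action names, values, and helper functions are all hypothetical stand-ins, not anyone’s actual proposal. If the overseer assigns approval by consulting the same value estimator the goal-directed agent would use, the two agents select the same action.

```python
def estimated_value(action):
    # Stand-in for whatever value-estimation algorithm the goal-directed
    # agent uses (hypothetical toy values).
    return {"a": 0.2, "b": 0.9, "c": 0.5}[action]

def goal_directed_choice(actions):
    # A simple goal-directed agent: directly maximize estimated value.
    return max(actions, key=estimated_value)

def overseer_approval(action):
    # The overseer consults the same value estimates when rating actions.
    return estimated_value(action)

def approval_directed_choice(actions):
    # An approval-directed agent: maximize the overseer's approval rating.
    return max(actions, key=overseer_approval)

actions = ["a", "b", "c"]
# Because approval tracks the same estimates, both agents pick the same action.
assert goal_directed_choice(actions) == approval_directed_choice(actions) == "b"
```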
Of course that’s not how a goal-directed agent will actually work; a goal-directed agent has other internal dynamics, and some of those might turn out to be hard to “port” to the approval-directed setting. But I am interested in actually seeing what those hard cases are (so far I haven’t seen any apparent deal-breakers), rather than relying on the loose argument that goal-directed reasoning requires goal-directed reasoning.
There do seem to be serious problems with implementing act-based agents; those are the problems these research directions are aimed at, and the ones I want to work on.