A box which can imitate a human is weaker (and easier to construct) than a box which can maximize human approval, at least when the action space is large.
But a box that can imitate any physical system (including other copies of itself) is equivalent to a box that can maximize the approval of any physical system (including copies of itself), in terms of both power and feasibility.
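To make the contrast concrete, here is a minimal, purely illustrative sketch of the two objectives. Nothing in it corresponds to a real system: `policy`, `demonstrations`, and `approval_model` are hypothetical stand-ins, and the action space is a toy discrete one.

```python
# Illustrative only: a toy contrast between imitation and approval maximization.

ACTIONS = ["a1", "a2", "a3"]  # toy discrete action space (hypothetical)

def imitation_update(policy, demonstrations, lr=0.1):
    """Imitation: move the policy's action distribution toward whatever the
    demonstrator actually did in each observed state."""
    for state, demonstrated_action in demonstrations:
        probs = policy.setdefault(state, {a: 1 / len(ACTIONS) for a in ACTIONS})
        for a in ACTIONS:
            target = 1.0 if a == demonstrated_action else 0.0
            probs[a] += lr * (target - probs[a])  # convex step keeps probs normalized

def approval_maximizing_action(state, approval_model):
    """Approval maximization: search the action space for the action that a
    (learned) overseer model is predicted to rate most highly."""
    return max(ACTIONS, key=lambda a: approval_model(state, a))

# Stand-in usage: a couple of demonstrations and a made-up approval model.
policy = {}
imitation_update(policy, [("s0", "a2"), ("s0", "a2"), ("s1", "a1")])
best = approval_maximizing_action("s0", lambda s, a: {"a1": 0.2, "a2": 0.9, "a3": 0.1}[a])
```

The search over the action space in the second function is what becomes demanding as the action space grows, which is one way of seeing why imitating a human is the weaker and easier-to-construct capability of the two.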
Act-based agents are intended to receive training data produced by a system involving both humans and other act-based agents. So the situation is more like the second case, and the important distinctions depend on the characteristics of the particular learning system.
In this setting a learning system can neither imitate nor maximize approval very well, though doing either perfectly would yield good results. The question is what happens when the system does a mediocre job of imitation versus a mediocre job of approval-maximization.
For now I do see a number of problems with using imitation, but I think that the situation is not nearly so clear, and the obstacles not nearly so fundamental, as you suggest.