I think that this is essentially the same issue as our learner suspecting that it is living in a simulation (which may be run either in the far future or by an adversary in the modern world).
This is a very distinctive kind of inference that is quite unlike the reasoning of contemporary systems and seems like it would only be possible with major algorithmic advances. I think that there is a good chance that the actual learning systems we build won’t make these kinds of inferences (maybe 50–50? I discuss the issue here). Moreover, I think that there is a much better chance that if we try hard we will be able to build systems that don’t take the possibility seriously. I discuss some possible remedies here.
The main reason not to focus on the problem now is that it seems (1) very difficult to think about without more knowledge of future learning systems, (2) quite likely that with such knowledge in hand we will be able to resolve the issue in a modular way, and (3) quite likely that we will get a lot of warning, i.e. we will build systems that exhibit or potentially exhibit this behavior long before it becomes a serious problem. [ETA: Also, I think that most mainstream AI researchers consider it a wacky and relatively unpromising direction, and I think that this both constitutes some evidence and is relevant to AI control as a field.]
I will probably spend some of my time thinking about the question, probably by further probing the distinctions described here. And I think that if the AI control community were bigger it would be reasonable for people to be thinking about this issue. But for now I think (a) there are higher research priorities, and (b) this concern is too speculative, and our understanding too preliminary, to have a big effect on what directions we explore.
I think the issue is less of a problem for internal systems for the same reason that it seems like less of a concern for weak systems. A quick heuristic for allocating memory simply isn’t going to freak out about simulations, even if you use gradient descent to train it. And at the scale where this becomes a problem, it seems likely that we will already have mostly made the switch to unsupervised learning / an abstract definition of approval.