The unsatisfactory answer is that it’s arbitrary: CartPole v0 terminates after 200 timesteps, and v1 was chosen to terminate after 500. The limit exists so that a single episode doesn’t run forever (say we had a policy that could balance the pole perfectly at the center). So we say that if a policy can balance the pole for 500 timesteps (and collect a reward of 500), it’s probably good enough.
That said, you will sometimes find a policy that almost fails to balance the pole, or that slowly drifts towards one edge as it nears 500 timesteps, and that would fail if the time horizon were longer. You can experiment with this and make the task harder by raising the maximum score to 1k or 10k, though that increases the time it takes to find a good policy (the game is harder, and a full episode takes 2x to 20x as long). But if we average the score at 500 timesteps over many runs, bad policies will typically have a low average, even if they manage to cheat the game once or twice.
You can change the episode limit as described here: https://github.com/openai/gym/issues/463#issuecomment-389873434
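To make the horizon argument concrete, here is a minimal sketch in plain Python. It does not use Gym or the real CartPole dynamics: `ToyBalanceEnv`, `steady_policy`, `creeping_policy`, `run_episode` and `average_return` are all hypothetical names made up for illustration, and the environment only mimics the older Gym `reset()`/`step()` shape. The step cap is applied by the evaluation loop itself, playing the role of the 500-step limit.

```python
import random


class ToyBalanceEnv:
    """Hypothetical stand-in for CartPole (NOT the real dynamics).

    The state is a position x that is nudged left or right with some noise.
    The episode ends when x leaves [-1, 1]. There is no built-in time limit.
    """

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.x = 0.0
        self.t = 0

    def reset(self):
        self.x = 0.0
        self.t = 0
        return (self.x, self.t)

    def step(self, action):
        # Action 1 pushes right, action 0 pushes left; noise is added on top.
        push = 0.05 if action == 1 else -0.05
        self.x += push + self.rng.uniform(-0.03, 0.03)
        self.t += 1
        done = abs(self.x) > 1.0
        # Reward of 1 per step survived, as in CartPole.
        return (self.x, self.t), 1.0, done, {}


def steady_policy(obs):
    """Always pushes back towards the center."""
    x, _t = obs
    return 1 if x < 0 else 0


def creeping_policy(obs):
    """Tracks a target that slowly creeps towards the right edge."""
    x, t = obs
    return 1 if x < 0.0015 * t else 0


def run_episode(env, policy, max_steps):
    """Runs one episode, cutting it off after max_steps (the time limit)."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total


def average_return(env, policy, episodes, max_steps):
    """Averages the episode return over several runs."""
    returns = [run_episode(env, policy, max_steps) for _ in range(episodes)]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    env = ToyBalanceEnv(seed=0)
    for cap in (500, 1000):
        for policy in (steady_policy, creeping_policy):
            avg = average_return(env, policy, episodes=20, max_steps=cap)
            print(f"cap={cap:5d}  {policy.__name__:16s} average return = {avg:.1f}")
```

In this toy setup the creeping policy scores a perfect 500 under the 500-step cap, exactly like the steady one, and only falls short once the cap is raised to 1000. That is the "slowly shifting towards one edge" case: the shorter horizon cannot tell the two policies apart, while the longer one can, at the cost of longer episodes.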