Mike Shi
Sep 19, 2018

The unsatisfactory answer is that it’s arbitrary: CartPole v0 terminates an episode after 200 timesteps, and v1 was chosen to terminate after 500. The limit exists so that one episode doesn’t take forever (say we had a policy that could keep the pole perfectly balanced at the center). So we say that if a policy can balance the pole for 500 timesteps (and collect 500 reward, one per step), it’s probably good enough.

That said, you will sometimes find a policy that is barely keeping the pole up, or slowly drifting toward one edge as it nears 500 timesteps, and that would fail if the time horizon were longer. You can experiment with this and make the task harder by raising the max score to 1k or 10k, though that increases the time it takes to find a good policy (the game is harder, and each full-length episode takes 2x to 20x longer). Still, if we average the score over a lot of runs at 500 timesteps, bad policies will typically have a low average, even if they manage to cheat the game once or twice.
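To make the averaging argument concrete, here is a minimal sketch that does not use Gym at all. `ToyBalanceEnv` is a hypothetical stand-in I made up for illustration (a noisy point that must stay inside [-1, 1], with a time limit), and the two policies are likewise invented. It only shows the shape of the evaluation: run many episodes at a fixed horizon and compare the mean scores.

```python
import random
from statistics import mean


class ToyBalanceEnv:
    """Hypothetical stand-in for CartPole: keep a point inside [-1, 1]."""

    def __init__(self, horizon=500, seed=0):
        self.horizon = horizon          # time limit, playing the role of v1's 500
        self.rng = random.Random(seed)  # private RNG so runs are repeatable

    def reset(self):
        self.x = self.rng.uniform(-0.05, 0.05)
        self.t = 0
        return self.x

    def step(self, action):
        # Action 1 pushes right, anything else pushes left; noise disturbs the state.
        self.x += (0.05 if action == 1 else -0.05) + self.rng.gauss(0.0, 0.05)
        self.t += 1
        fell = abs(self.x) > 1.0            # failure: ran off an edge
        timed_out = self.t >= self.horizon  # survived until the time limit
        return self.x, 1.0, fell or timed_out


def run_episode(env, policy):
    """Play one episode and return the total reward (one point per step)."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


def average_score(policy, episodes=100, horizon=500, seed=0):
    """Mean episode score over many runs at a fixed horizon."""
    env = ToyBalanceEnv(horizon=horizon, seed=seed)
    return mean(run_episode(env, policy) for _ in range(episodes))


def centering_policy(x):
    return 0 if x > 0 else 1  # always push back toward the middle


_coin = random.Random(1)


def random_policy(x):
    return _coin.randrange(2)  # ignores the state entirely


print("centering:", average_score(centering_policy))
print("random:   ", average_score(random_policy))
```

In this toy setup the random policy occasionally survives a whole episode by luck, but its average over 100 runs stays well below the limit, while the centering policy hits the limit every time. Passing `horizon=1000` to `average_score` makes each full-length episode twice as long, which is the cost mentioned above.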

You can change the episode limit using the approach described here: https://github.com/openai/gym/issues/463#issuecomment-389873434
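The linked comment isn’t reproduced here. As a rough sketch of one way this can be done, assuming the classic `gym` package is installed, you can register a CartPole variant with a larger `max_episode_steps`. The id `CartPoleLong-v0` and the helper `make_long_cartpole` are names I made up for this example, and the details may differ between Gym versions.

```python
# Settings for a longer CartPole variant. The id is a made-up name;
# the entry point is Gym's own CartPole environment class.
LONG_CARTPOLE = dict(
    id="CartPoleLong-v0",
    entry_point="gym.envs.classic_control:CartPoleEnv",
    max_episode_steps=10_000,  # 20x the 500-step limit of CartPole v1
)


def make_long_cartpole():
    """Register the variant and build it (call once per process)."""
    # Imported lazily so this file can be loaded without Gym installed.
    import gym
    from gym.envs.registration import register

    register(**LONG_CARTPOLE)
    return gym.make(LONG_CARTPOLE["id"])
```

Calling `make_long_cartpole()` should give you an environment that behaves like CartPole but only cuts the episode off after 10,000 steps, so a perfect policy would score 10k instead of 500.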
