You are correct to observe that a simple Q-learning algorithm will fail on CartPole. Because CartPole's state space is continuous, it is very difficult for basic Q-learning to solve. In fact, the Q-learning algorithm described here is almost never used for large or continuous state/action spaces. Instead, DQN, with its augmentations to improve robustness, is used, or a policy gradient method as you mentioned.
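
To make the state-space problem concrete, here is a minimal sketch. It assumes the "simple" algorithm is tabular Q-learning and that you try to apply it to CartPole by binning the four observation values (cart position, cart velocity, pole angle, pole angular velocity). The bounds, the bin counts, and the helper names `discretize`, `q_update` and `table_size` are my own illustrative choices, not part of any library.

```python
from collections import defaultdict

# Assumed clipping bounds for (cart position, cart velocity,
# pole angle in radians, pole angular velocity).
# The two velocity bounds are arbitrary choices for illustration.
BOUNDS = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.5, 3.5)]
N_ACTIONS = 2  # push left, push right


def discretize(obs, n_bins):
    """Map a continuous 4-value observation to a tuple of bin indices."""
    key = []
    for x, (lo, hi) in zip(obs, BOUNDS):
        x = min(max(x, lo), hi)          # clip into the assumed range
        frac = (x - lo) / (hi - lo)      # position within the range, 0..1
        key.append(min(int(frac * n_bins), n_bins - 1))
    return tuple(key)


def q_update(q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s,a) += alpha * (target - Q(s,a))."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def table_size(n_bins):
    """Number of Q-values the table needs for a given bin count."""
    return n_bins ** len(BOUNDS) * N_ACTIONS


# The Q-table: one list of action values per discretized state.
q = defaultdict(lambda: [0.0] * N_ACTIONS)

s = discretize([0.0, 0.1, 0.02, -0.3], n_bins=10)
s_next = discretize([0.002, 0.3, 0.014, -0.6], n_bins=10)
q_update(q, s, a=1, r=1.0, s_next=s_next, done=False)

print(table_size(10))   # entries needed with 10 bins per dimension
print(table_size(50))   # entries needed with 50 bins per dimension
```

The table grows as `n_bins ** 4`, and each entry is learned independently, so nothing carries over between neighbouring states. Coarse bins blur together states that need different actions, while fine bins leave most of the table unvisited. DQN avoids this by replacing the table with a function approximator that generalizes across nearby states.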