Aman Agarwal
Aug 28, 2017 · 1 min read

Hi Manuel. You’re completely right: they keep all 4 frames. Can you point out where you felt the ambiguity in my text?

This was the first time I mentioned this, in the Background section:

In Atari games the state of the game doesn’t change so much in every millisecond, nor is a human being capable of making decisions in every millisecond. So when we take video input at 60 frames per second, and treat each frame as a separate state, then most of the states in our training data will look exactly the same! It’s better to keep a longer horizon for what a “state” looks like, which has, say, at least 4 to 5 frames (say). We call this a sequence of a few consecutive frames, and use one sequence as a state.

And here’s the second mention, below paragraph 4.1:

The state S is basically preprocessed to include 4 different frames, all preprocessed into grayscale and resized and cropped to 84x84 squares. I think this is because given that the game runs at over 24 frames per second, and humans can’t react so fast as to make a move in each single frame, it makes sense to consider 4 consecutive frames as being in the same state.

Then the third, in the third paragraph of Section 5:

More detail about why they use a stack of 4 video frames instead of using a single frame for each state.
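
In case it helps pin down what I mean by "keeping all 4 frames", here is a minimal sketch of the idea as I understand it. The `FrameStack` class, its method names, and the choice to pad with copies of the first frame at the start of an episode are my own assumptions for illustration, not something taken from the paper. Each frame is assumed to be already preprocessed (grayscale, 84x84) before it is pushed.

```python
from collections import deque


class FrameStack:
    """Hypothetical helper: holds the k most recent preprocessed frames.

    The state handed to the agent is the whole stack of k frames,
    not just the newest one.
    """

    def __init__(self, k=4):
        self.k = k
        # A deque with maxlen drops the oldest frame automatically.
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Assumption: at episode start, repeat the first frame k times
        # so that the state always contains exactly k frames.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        # Add the newest frame; the oldest one falls off the other end.
        self.frames.append(frame)
        return self.state()

    def state(self):
        # All k frames are kept, ordered oldest to newest.
        return tuple(self.frames)
```

Calling `reset` with the first frame of an episode and then `push` with each later frame returns a state that always contains 4 frames, which is the sense in which nothing is thrown away.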

Thanks!
