In my last project I used a Q-table to store the value of state-action pairs. This works fine for a small state space such as the taxi game, but it's impractical to use the same strategy to play Atari games, because our state space is huge. Therefore, I used a neural network to approximate the value of state-action pairs.
Basically, the neural network receives a state and outputs an estimated value for each possible action; the agent then takes the action with the highest predicted value.
For breakout, the state is a preprocessed image of the screen. The original images are 210 x 160 x 3 (RGB colours). They are converted to grayscale, and cropped to an 84 x 84 box.
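The preprocessing step can be sketched roughly like this. This is a minimal version using plain NumPy; the luminance weights and the nearest-neighbour resize are my assumptions, not necessarily what the project used:

```python
import numpy as np

def preprocess(frame):
    # frame: (210, 160, 3) uint8 RGB screen from the emulator
    gray = frame @ np.array([0.299, 0.587, 0.114])  # luminance grayscale
    # shrink to 84 x 84 with nearest-neighbour sampling (assumed method)
    rows = np.linspace(0, gray.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, 84).astype(int)
    return gray[np.ix_(rows, cols)].astype(np.uint8)
```

The output is a single 84 x 84 grayscale image, which is much cheaper to feed through a convolutional network than the raw RGB screen.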
A single image isn't enough to determine the current state, because it carries no motion information: you can't tell which direction the ball is travelling, or how fast. That is why the neural network is fed a stack of the 4 most recent frames.
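Frame stacking can be sketched with a bounded deque. The `FrameStack` class and its method names are my own illustration, not code from the project:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the 4 most recent preprocessed frames as one state."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # old frames fall off automatically

    def reset(self, frame):
        # at episode start, repeat the first frame k times
        for _ in range(self.k):
            self.frames.append(frame)
        return np.stack(self.frames, axis=-1)  # shape (84, 84, 4)

    def step(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames, axis=-1)
```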
Breakout has 4 possible actions.
- 0: do nothing
- 1: fire ball to start game
- 2: move right
- 3: move left
The training process starts off by having the agent choose actions at random, then observe the reward and next state. The data from each transition is collected in a tuple: (state, action, reward, next state, terminal).
Each tuple is stored in a replay memory, which keeps only a fixed number of the most recent transitions (in our case 350 000, as that's how much RAM Google Colab gives us). Once the agent has collected enough experience (50 000 transitions, as laid out in DeepMind's paper), we start fitting our model.
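The replay memory described above can be sketched as a deque with a maximum length, so the oldest transitions are dropped automatically once the capacity is hit. The class name and interface here are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state, terminal)."""

    def __init__(self, capacity=350_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop off

    def store(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size=32):
        # uniform random batch for training
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```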
Every time step, the agent takes a random action with probability epsilon. Otherwise the state is given to the neural network, and the agent takes the action predicted to have the highest value.
Epsilon decays linearly from 1.0 to 0.1 over a million time steps, then remains at 0.1. This means at the beginning of the training process, the agent explores a lot, but as training continues it exploits more.
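The epsilon schedule and the epsilon-greedy choice can be sketched as follows; the function names are my own, and the Q-values are assumed to come from the network:

```python
import random

EPS_START, EPS_END, DECAY_STEPS = 1.0, 0.1, 1_000_000

def epsilon_at(step):
    # linear decay from 1.0 to 0.1 over the first million steps, then flat
    if step >= DECAY_STEPS:
        return EPS_END
    return EPS_START - (EPS_START - EPS_END) * step / DECAY_STEPS

def choose_action(q_values, step, n_actions=4):
    # explore with probability epsilon, otherwise act greedily
    if random.random() < epsilon_at(step):
        return random.randrange(n_actions)
    return int(max(range(n_actions), key=lambda a: q_values[a]))
```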
Training the Network
Every time step, the agent chooses an action based on epsilon, takes a step in the environment, stores the transition, then samples a random batch of 32 transitions from memory and uses them to train the neural network.
For every training item (s, a, r, s') in the mini-batch of 32 transitions, the network is given the state s (a stack of 4 frames). Using the next state s' and the Bellman equation, we compute the target for the neural network, and adjust its estimate of the value of taking action a in state s towards that target.
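In symbols, this is the standard Q-learning target, where γ is the discount factor:

```latex
y =
\begin{cases}
r & \text{if } s' \text{ is terminal} \\
r + \gamma \max_{a'} Q(s', a') & \text{otherwise}
\end{cases}
```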
Basically, this says that if the next state is a terminal state, meaning the episode has ended, then the target is just the immediate reward. Otherwise, the state-action pair should map to the immediate reward, plus the discount factor multiplied by the value of the next state's highest-value action.
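Computing the targets for a whole batch can be vectorized. This is a sketch, assuming γ = 0.99 (the value in DeepMind's paper; the post doesn't state it) and that `next_q_values` comes from a forward pass on the batch of next states:

```python
import numpy as np

GAMMA = 0.99  # discount factor (assumed; not stated in the post)

def td_targets(rewards, next_q_values, terminals):
    # rewards: (batch,), next_q_values: (batch, n_actions), terminals: (batch,)
    max_next = next_q_values.max(axis=1)        # value of best next action
    not_done = 1.0 - terminals.astype(np.float32)
    # terminal transitions get just the reward; others get r + gamma * max Q
    return rewards + GAMMA * max_next * not_done
```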
Traditionally, the value of the next state's highest-value action is obtained by running the next state s' through the same neural network we're training. But this can lead to oscillations and divergence of the policy.
So instead, we clone the original network, and use that to compute our targets. This gives the network we’re training a fixed target, which helps mitigate oscillations and divergence. The target network’s weights are updated to the weights of the training network every 10 000 time steps.
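The periodic sync can be sketched in a few lines. I'm assuming a Keras-style `get_weights()`/`set_weights()` interface here; the helper name is my own:

```python
TARGET_UPDATE_FREQ = 10_000  # time steps between target-network syncs

def maybe_sync_target(step, online_net, target_net):
    # Every 10 000 steps, copy the training network's weights into the
    # frozen target network (assumes Keras-style get/set_weights).
    if step % TARGET_UPDATE_FREQ == 0:
        target_net.set_weights(online_net.get_weights())
```

Between syncs the target network is frozen, so the training network chases a fixed target rather than a moving one.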
Here’s the entire algorithm: