Let’s have a look at deep Q-learning, the algorithm employed in the DeepMind system to play Atari 2600 games at expert human level. A basic understanding of how Q-learning works is a prerequisite and won’t be covered here.
This article is an excerpt from the book Deep Reinforcement Learning Hands-On written by Maxim Lapan. This book is a practical, developer-oriented introduction to deep reinforcement learning (RL). Explore the theoretical concepts of RL, before discovering how deep learning (DL) methods and tools are making it possible to solve more complex and challenging problems than ever before.
While Q-learning removes the need to iterate over the full set of states, it can still struggle when the set of observable states is very large. For example, Atari games can show a large variety of different screens, so if we decide to use raw pixels as individual states, we’ll quickly realize that we have too many states to track and approximate values for.
In some environments, the count of different observable states could be almost infinite. For example, in CartPole the state given to us by the environment is four floating point numbers. The number of combinations of values is finite (they’re represented as bits), but this number is extremely large. We could create some bins to discretize those values, but this often creates more problems than it solves: we would need to decide what ranges of parameters are important to distinguish as different states and what ranges could be clustered together.
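To make the binning idea concrete, here is a minimal sketch of how CartPole's four floating-point values might be discretized into bins. The bin count and the low/high ranges below are illustrative guesses, not values from the text; choosing them well is exactly the difficulty described above.

```python
def discretize(obs, bins_per_dim=10,
               low=(-2.4, -3.0, -0.21, -3.0),
               high=(2.4, 3.0, 0.21, 3.0)):
    """Map a continuous CartPole observation to a tuple of bin indices.

    The low/high ranges here are illustrative, not official bounds.
    """
    state = []
    for value, lo, hi in zip(obs, low, high):
        # Clip into range, scale to [0, bins_per_dim), and truncate
        frac = (min(max(value, lo), hi) - lo) / (hi - lo)
        state.append(min(int(frac * bins_per_dim), bins_per_dim - 1))
    return tuple(state)
```

Even this coarse 10-bin scheme produces 10⁴ distinct states, and there is no guarantee the chosen ranges separate the states that actually matter.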
In the case of Atari, a single-pixel change rarely makes much difference, so it’s efficient to treat two such images as the same state. However, we still need to distinguish some states from one another. The following image shows two different situations in a game of Pong. We’re playing against the AI opponent by controlling a paddle (our paddle is on the right and is green, whereas our opponent’s is light brown and on the left). The objective of the game is to get the bouncing ball past our opponent’s paddle, while preventing it from getting past ours. The two situations can be considered completely different: in the right-hand situation the ball is close to the opponent, so we can relax and watch. The situation on the left is more demanding, however: assuming the ball is moving from left to right, it is heading towards our side, so we need to move our paddle quickly to avoid losing a point. These are just two situations out of 10^70802 possible ones, but we want our agent to act on them differently.
As a solution to this problem, we can use a nonlinear representation that maps both state and action onto a value. In machine learning this is called a regression problem. The concrete way to represent and train such a representation can vary, but using a deep neural network is one of the most popular options, especially when dealing with observations represented as screen images. With this in mind, let’s make modifications to the Q-learning algorithm:
- Initialize Q(s, a) with some initial approximation
- By interacting with the environment, obtain the tuple (s, a, r, s′)
- Calculate loss: L = (Q(s, a) − r)² if the episode has ended, or L = (Q(s, a) − (r + γ max_{a′∈A} Q(s′, a′)))² otherwise
- Update Q(s, a) using the stochastic gradient descent (SGD) algorithm, minimizing the loss with respect to the model parameters
- Repeat from step 2 until converged
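The steps above can be sketched in code. For brevity, this toy version uses a table of learnable parameters as a stand-in for the deep network’s weights (the table sizes and learning rate are illustrative), but the loss and the SGD update follow the steps literally:

```python
N_STATES, N_ACTIONS = 4, 2
GAMMA, LR = 0.9, 0.1

# Q(s, a) as a table of learnable parameters (a stand-in for network weights)
q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def sgd_step(s, a, r, s_next, done):
    """One SGD step on the squared TD error for a single transition."""
    target = r if done else r + GAMMA * max(q[s_next])
    error = q[s][a] - target     # dL/dQ(s,a) = 2 * error
    q[s][a] -= LR * 2 * error    # gradient descent on (Q(s,a) - target)^2
    return error ** 2            # the loss L
```

Repeatedly applying `sgd_step` to the same transition drives the loss towards zero, which is exactly the supervised-style regression described above.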
The above algorithm looks simple but, unfortunately, it won’t work very well. Let’s discuss what could go wrong.
First of all, we need to interact with the environment somehow to receive data to train on. In simple environments, like FrozenLake, we can act randomly, but is this the best strategy to use? Imagine the game of Pong. What’s the probability of winning a single point by randomly moving the paddle? It’s not zero but it’s extremely small, which just means that we’ll need to wait for a very long time for such a rare situation. As an alternative, we can use our Q function approximation as a source of behavior.
If our representation of Q is good, then the experience that we get from the environment will show the agent relevant data to train on. However, we’re in trouble when our approximation is not perfect (at the beginning of training, for example). In such a case, our agent can be stuck with bad actions in some states without ever trying to behave differently. On the one hand, our agent needs to explore the environment to build a complete picture of transitions and action outcomes. On the other hand, we should use our interaction with the environment efficiently: we shouldn’t waste time randomly trying actions whose outcomes we’ve already learned. As you can see, random behavior is better at the beginning of training, when our Q approximation is bad, as it gives us more uniformly distributed information about the environment’s states. As training progresses, random behavior becomes inefficient and we want to fall back on our Q approximation to decide how to act.
A method that mixes these two extreme behaviors is known as the epsilon-greedy method, which just means switching between the random and the Q policies using the probability hyperparameter ε. By varying ε we can select the ratio of random actions. The usual practice is to start with ε = 1.0 (100% random actions) and slowly decrease it to some small value, such as 5% or 2% random actions. Using an epsilon-greedy method helps us both to explore the environment at the beginning and to stick to a good policy at the end of training.
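A minimal sketch of epsilon-greedy action selection with a linear decay schedule (the `decay_steps` value and the decay shape are illustrative; implementations vary):

```python
import random

def select_action(q_values, epsilon):
    """Epsilon-greedy: a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_at(step, eps_start=1.0, eps_final=0.02, decay_steps=100_000):
    """Linear decay from eps_start to eps_final over decay_steps (illustrative schedule)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_final - eps_start)
```

Early in training `epsilon_at` returns values near 1.0, so almost every action is random; by the end, only 2% of actions are.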
The core of the Q-learning procedure is borrowed from supervised learning. Indeed, we are trying to approximate a complex, nonlinear function Q(s, a) with a neural network. To do this, we calculate targets for this function using the Bellman equation and then pretend that we have a supervised learning problem at hand. That’s okay, but one of the fundamental requirements for SGD optimization is that the training data be independent and identically distributed (i.i.d.).
In some cases, the data that we’ll be using for the SGD update won’t fulfill these criteria:
- Our samples might not be independent. Even if we accumulate a large batch of data samples, they will all be very close to each other, as they belong to the same episode.
- The distribution of our training data might not match that of the optimal policy we want to learn. The data we have is the result of some other policy (our current policy, a random one, or both in the case of ε-greedy), but we don’t want to learn how to play randomly: we want the optimal policy with the best reward.
To deal with this nuisance, we usually use a large buffer of our past experience and sample training data from it, instead of using only our latest experience. This method is called a “replay buffer”. The simplest implementation is a buffer of fixed size, with new data appended to the end of the buffer so that it pushes the oldest experience out. The replay buffer allows us to train on more-or-less independent data, while the data remains fresh enough to have been generated by our recent policy.
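A fixed-size replay buffer of this kind can be sketched in a few lines, for example using a deque whose `maxlen` handles the eviction of the oldest experience:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer: appending past capacity evicts the oldest transition."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def append(self, transition):
        # transition is typically a tuple (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The capacity is a hyperparameter; it should be large enough to decorrelate samples, but not so large that most of the buffer was generated by a long-outdated policy.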
Another practical issue with the default training procedure is also related to the lack of i.i.d. data, but in a slightly different way. The Bellman equation provides us with the value of Q(s, a) via Q(s′, a′) (a procedure called bootstrapping). However, the states s and s′ are only one step apart. This makes them very similar, and it’s really hard for a neural network to distinguish between them. When we update our network’s parameters to bring Q(s, a) closer to the desired result, we can indirectly alter the value produced for Q(s′, a′) and for other nearby states. This can make our training really unstable, like chasing our own tail: when we update Q for state s, we then discover on subsequent states that Q(s′, a′) has become worse, but attempts to update it can spoil our Q(s, a) approximation, and so on.
To make training more stable, there is a trick called the “target network”, in which we keep a copy of our network and use it for the Q(s′, a′) value in the Bellman equation. This network is synchronized with our main network only periodically, for example once every N steps (where N is usually quite a large hyperparameter, such as 1k or 10k training iterations).
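Representing the networks’ parameters as plain dicts for illustration (a real implementation would copy neural-network weights), the periodic synchronization might look like this:

```python
import copy

SYNC_EVERY = 1000  # N: usually a large value, e.g. 1k-10k training iterations

def maybe_sync(step, main_params, target_params):
    """Copy the main network's parameters into the target network every N steps."""
    if step % SYNC_EVERY == 0:
        target_params.clear()
        # Deep copy so later updates to the main network leave the target frozen
        target_params.update(copy.deepcopy(main_params))
```

Between syncs, the targets in the Bellman update stay fixed, which stops the “chasing our own tail” feedback loop described above.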
RL methods use the MDP formalism as their basis, which assumes that the environment obeys the Markov property: the observation from the environment is all we need to act optimally (in other words, our observations allow us to distinguish states from one another). As we saw in the preceding Pong screenshot, one single image from an Atari game is not enough to capture all the important information (from a single image we have no idea about the speed and direction of objects, such as the ball and our opponent’s paddle). This obviously violates the Markov property and moves our single-frame Pong environment into the area of partially observable MDPs (POMDPs). A POMDP is basically an MDP without the Markov property, and POMDPs are very important in practice. For example, most card games in which you don’t see your opponents’ cards are POMDPs, because the current observation (your cards and the cards on the table) could correspond to many different cards in your opponents’ hands.
The solution is to maintain several observations from the past and use them together as the state. In the case of Atari games, we usually stack k subsequent frames together and use them as the observation at every state. This allows our agent to deduce the dynamics of the current state, for instance, the speed and direction of the ball. The usual “classical” value of k for Atari is four. Of course, it’s just a hack, as there can be longer dependencies in the environment, but for most games it works well.
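A sketch of such frame stacking, repeating the first frame at episode start (a common convention, assumed here rather than taken from the text):

```python
from collections import deque

class FrameStack:
    """Keep the last k frames and expose them together as one observation."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # At episode start there is no history yet, so repeat the first frame
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame):
        # Appending past maxlen drops the oldest frame automatically
        self.frames.append(frame)
        return list(self.frames)
```

From two or more consecutive frames the network can infer velocities; a single frame, as discussed above, cannot provide them.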
The algorithm for DQN is summarized in the following steps:
- Initialize parameters for Q(s, a) and Qt(s, a) with random weights, set ε ← 1.0, and start with an empty replay buffer
- With probability ε select a random action a; otherwise, a = argmax_a Q(s, a)
- Execute action a in an emulator and observe reward r and the next state s′
- Store transition (s, a, r, s′) in the replay buffer
- Sample a random minibatch of transitions from the replay buffer
- For every transition in the minibatch, calculate the target y = r if the episode has ended at this step, or y = r + γ max_{a′∈A} Qt(s′, a′) otherwise
- Calculate loss: L = (Q(s, a) − y)²
- Update Q(s, a) using the SGD algorithm, minimizing the loss with respect to the model parameters
- Every N steps copy weights from Q to Qt
- Repeat from step 2 until converged
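Put together on a hypothetical toy problem, the steps above might look like the following sketch. A real DQN would replace the Q table with a deep network and the toy chain environment with an Atari emulator; all sizes and hyperparameters here are illustrative:

```python
import copy
import random
from collections import deque

random.seed(0)  # for reproducibility of this sketch

N_STATES, N_ACTIONS = 5, 2                 # toy chain: action 1 moves right
GAMMA, LR, BATCH, SYNC_EVERY = 0.9, 0.1, 8, 50

q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # main network Q
q_target = copy.deepcopy(q)                        # target network Qt
buffer = deque(maxlen=1000)                        # replay buffer

def env_step(state, action):
    """Hypothetical environment: reaching the last state ends the episode with reward 1."""
    nxt = min(state + action, N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

state, epsilon = 0, 1.0
for step in range(1, 2001):
    # Step 2: epsilon-greedy action selection
    if random.random() < epsilon:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: q[state][a])
    # Steps 3-4: act in the environment and store the transition
    nxt, reward, done = env_step(state, action)
    buffer.append((state, action, reward, nxt, done))
    state = 0 if done else nxt
    epsilon = max(0.02, epsilon - 0.001)           # linear decay to 2%
    # Steps 5-8: sample a minibatch, compute targets, and do an SGD update
    if len(buffer) >= BATCH:
        for s, a, r, s2, d in random.sample(buffer, BATCH):
            y = r if d else r + GAMMA * max(q_target[s2])
            q[s][a] -= LR * 2 * (q[s][a] - y)      # gradient of (Q(s,a) - y)^2
    # Step 9: sync the target network every N steps
    if step % SYNC_EVERY == 0:
        q_target = copy.deepcopy(q)
```

After training, the greedy policy at every state prefers the action that moves towards the rewarded terminal state, showing all the pieces (epsilon-greedy exploration, replay buffer, target network) working together.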
We briefly touched on the topic of deep Q-learning. We first looked at some of the issues you may face and worked through solving each of them, and finally provided a short step-by-step explanation of the DQN training process. Apply deep RL methods to training your agent to beat arcade games and board games, and navigate real-world environments including the stock market with Maxim Lapan’s book Deep Reinforcement Learning Hands-On.
Maxim Lapan is a deep learning enthusiast and independent researcher. His background and 15 years’ work experience as a software developer and systems architect span from low-level Linux kernel driver development to performance optimization and the design of distributed applications running on thousands of servers. With vast experience in big data, machine learning, and large parallel distributed HPC and non-HPC systems, he has a talent for explaining the gist of complicated things in simple words and vivid examples. His current areas of interest lie in practical applications of deep learning, such as deep natural language processing and deep reinforcement learning. Maxim lives in Moscow, Russian Federation, with his family, and he works for an Israeli start-up as a senior NLP developer.