Now let's discuss the update process. Q-Learning uses the Bellman equation to update the Q-table. It is as follows:

Q(s, a) = r(s') + γ · max(Q(s', a'))
In the above equation,
Q(s, a) : is the value in the Q-Table corresponding to action a of state s.
r(s') : is the reward received on entering the new state s'. For example, if the new state s' is the goal, the reward received might be 1, and if s' is a wall, the reward might be -1.
Q(s', a') : is, likewise, the value in the Q-Table corresponding to action a' of state s'.
max() : returns the maximum of all values given as input. Since there are four possible actions in s', we retrieve all four action values of state s' and use the maximum of them.
γ : is the discount factor. It ensures we update using only a fraction of the value rather than the whole value. The advantage of doing this is to soften any naive decisions. For example, if state s' has its maximum action value at the action right, we don't yet know whether that estimate is correct, so updating with only a fraction of the value is safer than taking it at full weight.
Look at the above equation this way:
Q(performing action a at state s) = (reward received in new state s') + (maximum of all possible action values in new state s', at a discounted rate).
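As a rough sketch, the right-hand side of the equation can be computed from a Q-table stored as a 2-D array. The state and action indices, the grid size, and the reward value here are all hypothetical, just to make the idea concrete:

```python
import numpy as np

# Hypothetical 4-state world with 4 actions (up, down, left, right).
n_states, n_actions = 4, 4
Q = np.zeros((n_states, n_actions))  # Q-table, initialised to zero

gamma = 0.9  # discount factor

def bellman_target(reward_s_prime, s_prime):
    """Compute r(s') + gamma * max over a' of Q(s', a')."""
    return reward_s_prime + gamma * np.max(Q[s_prime])

# Example: entering state 2 (say, the goal) yields reward 1.
print(bellman_target(1.0, 2))  # 1.0 while the Q-table is still all zeros
```

Note that while the Q-table is all zeros, the target is just the immediate reward; the discounted max term only starts contributing once later states have non-zero values.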
Well, the above equation doesn't look like it is updating a value; it is actually overwriting it. We use a slightly different version of this equation to update:

Q(s, a) ← Q(s, a) + α · (r(s') + γ · max(Q(s', a')) - Q(s, a))
Here, we update the value by adding the newly calculated value, multiplied by some constant, to the current value itself. Here,
α is the learning rate. It ensures that only a fraction of the new estimate is applied in each update, which makes the learning process smoother.
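A minimal sketch of this update rule, assuming the same NumPy Q-table layout as above (indices and reward values are again hypothetical):

```python
import numpy as np

Q = np.zeros((4, 4))        # hypothetical 4-state, 4-action Q-table
alpha, gamma = 0.1, 0.9     # learning rate and discount factor

def update(s, a, reward, s_prime):
    # Q(s, a) <- Q(s, a) + alpha * (r(s') + gamma * max_a' Q(s', a') - Q(s, a))
    target = reward + gamma * np.max(Q[s_prime])
    Q[s, a] += alpha * (target - Q[s, a])

update(0, 1, 1.0, 2)        # one update after receiving reward 1 in state 2
print(Q[0, 1])              # 0.1, i.e. alpha * (1.0 - 0.0)
```

Because α is small, a single surprising reward only nudges the stored value; repeated visits are needed for Q(s, a) to converge towards the target.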
Ergo, the story here is this: our agent in state s picks a random action a, receives a reward r(s'), and updates the value of action a in state s according to the above equation.
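Putting the story together, one pass of the agent's loop might look like the sketch below. The environment's `env_step` function is entirely hypothetical (a toy transition and reward), standing in for whatever grid world the agent actually lives in:

```python
import random
import numpy as np

n_states, n_actions = 4, 4
Q = np.zeros((n_states, n_actions))  # Q-table
alpha, gamma = 0.1, 0.9              # learning rate, discount factor

def env_step(s, a):
    """Hypothetical environment: returns (next_state, reward)."""
    s_prime = (s + 1) % n_states                  # toy transition
    reward = 1.0 if s_prime == n_states - 1 else 0.0
    return s_prime, reward

s = 0
for _ in range(10):
    a = random.randrange(n_actions)               # pick a random action a
    s_prime, r = env_step(s, a)                   # receive reward r(s')
    # update Q(s, a) using the Bellman-based rule
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_prime]) - Q[s, a])
    s = s_prime                                   # move into the new state
```

In a full implementation the purely random action choice would usually be replaced by an exploration strategy such as epsilon-greedy, but the update line itself stays the same.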