Now let’s discuss the update process. Q-Learning uses the **Bellman equation** to update the Q-Table. It is as follows,

Q(s, a) = r(s’) + γ · max(Q(s’, a’))

In the above equation,

**Q(s, a)** : is the value in the Q-Table corresponding to action **a** of state **s**.

**r(s’)** : is the reward received by entering the new state **s’**. Imagine that if the new state (**s’**) is the goal, then the reward received is **1** (say), and if **s’** is a wall, then the reward is **-1**.

**Q(s’, a’)** : It, too, is a value in the Q-Table, corresponding to action **a’** of state **s’**.

**max()** : It gives the maximum of all the values given as input. Since there are four actions in **s’**, we retrieve all four action values of state **s’** and use the maximum of them.

**γ** : It is the **discount factor**. Rather than updating with the full value of the next state, it lets us use only a fraction of it. The advantage of doing this is to temper naive decisions. For example, if state **s’** has its maximum action value in the action *right*, we don’t yet know whether that estimate is correct; updating with only a fraction of it is safer than using the full value.

Look at the above equation this way:

Q(performing action **a** at state **s**) = (reward received in new state **s’**) + (maximum of all possible action values in new state **s’**, at a discounted rate).
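As a concrete illustration, here is a minimal sketch of how that quantity could be computed with a NumPy Q-Table. The grid size, the four actions, and the value of γ are assumptions made for this example, not values fixed by the article:

```python
import numpy as np

# Hypothetical setup (assumed for illustration): 16 states for a
# 4x4 grid maze, and 4 actions (up, down, left, right).
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
gamma = 0.9  # discount factor, illustrative value

def bellman_value(reward_s_prime, s_prime):
    # Right-hand side of the equation:
    # r(s') + gamma * max over a' of Q(s', a')
    return reward_s_prime + gamma * np.max(Q[s_prime])
```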

Well, the above equation doesn’t so much update a value as overwrite it. So we use a slightly different version of this equation to update:

Q(s, a) = Q(s, a) + α · (r(s’) + γ · max(Q(s’, a’)) - Q(s, a))

Here, we are updating the value by adding the newly calculated difference, multiplied by some constant, to the initial value itself. That constant, **α**, is the **learning rate**: it applies only a fraction of the full update at each step, which makes for a smoother learning process.
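Continuing the sketch above, the update could look like this, with α as another illustrative value:

```python
alpha = 0.1  # learning rate, illustrative value

def q_update(s, a, reward_s_prime, s_prime):
    # Move Q(s, a) a fraction alpha toward the Bellman value
    # instead of overwriting it outright.
    target = reward_s_prime + gamma * np.max(Q[s_prime])
    Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
```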

Ergo, the story here is this: our agent in state **s** will pick a random action **a**, receive a reward **r(s’)**, and update the value of action **a** in state **s** according to the above equation.
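Putting the pieces together, one step of that loop might look like the following sketch. Here `step` stands in for an environment transition function; it is a hypothetical interface, not something defined in this article:

```python
import random

def training_step(s, step):
    # Pick a random action, observe the new state and its reward,
    # then update the Q-Table as described above.
    a = random.randrange(n_actions)
    s_prime, reward = step(s, a)  # hypothetical environment call
    q_update(s, a, reward, s_prime)
    return s_prime
```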

Credit: BecomingHuman. By: Sai Kumar Basaveswara