Let’s get started with the code. The corresponding notebook can be found here.
I started by importing the libraries and dependencies.
Next I initialized some constants to use later in the algorithm.
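As a concrete illustration, hyperparameters along these lines are typical for DDPG; the exact values below are assumptions, not necessarily the constants used in the original notebook.

```python
import random

# Assumed hyperparameters -- typical DDPG values, not necessarily
# the exact constants from the original notebook.
BUFFER_SIZE = int(1e6)   # replay memory capacity
BATCH_SIZE = 128         # minibatch size for learning
GAMMA = 0.99             # discount factor
TAU = 1e-3               # soft-update interpolation factor
LR_ACTOR = 1e-4          # actor learning rate
LR_CRITIC = 1e-3         # critic learning rate
SEED = 10                # random seed for reproducibility

random.seed(SEED)
```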
I created a class named Agent and initialized it with the state and action sizes, the exploration noise, the replay memory, and the actor and critic networks.
Next I created a function named step that lets the agent learn from the experience tuples (state, action, reward, next-state).
I also made a function act that returns an action for a given state according to the current policy, using noise as an additional parameter.
After that I made two more functions: reset, for resetting the noise value, and learn, which makes sure the agent learns a policy by minimizing the actor and critic losses.
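In symbols, the learn step follows the standard DDPG update. With $\mu$ the actor, $Q$ the critic, and primes marking the target networks, the critic is trained toward the target

$$y_i = r_i + \gamma\, Q'\big(s'_i, \mu'(s'_i)\big),$$

minimizing the critic loss $L = \tfrac{1}{N}\sum_i \big(y_i - Q(s_i, a_i)\big)^2$, while the actor is trained to maximize the critic's estimate, i.e. to minimize $J = -\tfrac{1}{N}\sum_i Q\big(s_i, \mu(s_i)\big)$.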
I continued by writing a function for soft-updating the target networks.
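The soft update blends each local parameter into its target counterpart: target ← τ·local + (1 − τ)·target. A minimal framework-free sketch, with plain Python lists standing in for network parameters:

```python
def soft_update(local_params, target_params, tau):
    """Blend local parameters into target parameters in place:
    target <- tau * local + (1 - tau) * target."""
    for i, (local, target) in enumerate(zip(local_params, target_params)):
        target_params[i] = tau * local + (1 - tau) * target

# Example: with tau = 0.1, each target value moves 10% of the way
# toward the corresponding local value.
local = [1.0, 2.0]
target = [0.0, 0.0]
soft_update(local, target, tau=0.1)
# target is now [0.1, 0.2]
```

Keeping τ small makes the target networks change slowly, which stabilizes the bootstrapped targets used in the learn step.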
The second class I made was for the exploration noise, which keeps the agent from settling too early on a narrow set of actions. It has functions for resetting the internal state to the mean value and for updating the state and returning it as a noise sample.
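DDPG implementations typically use an Ornstein-Uhlenbeck process for this noise; a self-contained sketch follows, where the mu/theta/sigma defaults are common choices rather than the notebook's exact values.

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise that
    drifts back toward the mean, commonly added to DDPG actions."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        """Reset the internal state to the mean value."""
        self.state = list(self.mu)

    def sample(self):
        """Update the internal state and return it as a noise sample."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * random.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)
```

A sample would be added to the actor's output inside act, and reset would be called at the start of each episode.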
The third class was for storing experience tuples (state, action, reward, next-state) in memory. It supports functions for adding a new experience to memory, for randomly sampling a batch of experiences from memory and for returning the current size of memory.
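A minimal sketch of such a buffer using only the standard library; the class and method names are illustrative, and I also carry a done flag, which the target computation in learn needs.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)

class ReplayBuffer:
    """Fixed-size buffer of experience tuples."""

    def __init__(self, buffer_size, batch_size):
        self.memory = deque(maxlen=buffer_size)  # old entries fall off the end
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        return random.sample(self.memory, k=self.batch_size)

    def __len__(self):
        """Return the current size of memory."""
        return len(self.memory)
```

Sampling uniformly at random breaks the temporal correlation between consecutive environment steps, which is what makes off-policy learning from this memory stable.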
The fourth class was for defining the actor network. It has functions for resetting parameters and for building a model that maps states to actions.
I continued with making a critic class. It also has functions for resetting parameters and for building a model that maps (state, action) pairs to Q-values.
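A PyTorch sketch of what these two networks typically look like in DDPG; the layer sizes, the initialization scheme, and PyTorch itself are assumptions about the original code. The critic concatenates the action into the second layer, a common DDPG design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps states to actions bounded in [-1, 1]."""

    def __init__(self, state_size, action_size, fc1=256, fc2=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.fc2 = nn.Linear(fc1, fc2)
        self.fc3 = nn.Linear(fc2, action_size)
        self.reset_parameters()

    def reset_parameters(self):
        # Small uniform init on the output layer keeps initial actions near 0.
        for layer in (self.fc1, self.fc2):
            nn.init.kaiming_uniform_(layer.weight)
        nn.init.uniform_(self.fc3.weight, -3e-3, 3e-3)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # bound actions to [-1, 1]

class Critic(nn.Module):
    """Maps (state, action) pairs to Q-values."""

    def __init__(self, state_size, action_size, fc1=256, fc2=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1)
        self.fc2 = nn.Linear(fc1 + action_size, fc2)  # action joins here
        self.fc3 = nn.Linear(fc2, 1)
        self.reset_parameters()

    def reset_parameters(self):
        for layer in (self.fc1, self.fc2):
            nn.init.kaiming_uniform_(layer.weight)
        nn.init.uniform_(self.fc3.weight, -3e-3, 3e-3)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.fc3(x)
```

The tanh output matches BipedalWalker's action space, whose four torque values each lie in [-1, 1].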
Then I initialized the Bipedal Walker environment.
Next I implemented the DDPG algorithm as shown below.
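The loop's usual shape can be sketched as follows. It accepts any Gym-style environment and an agent exposing the act/step/reset methods described above; the episode counts are illustrative, and I assume the agent's step also receives the done flag.

```python
from collections import deque

def ddpg(env, agent, n_episodes=1000, max_t=700):
    """Generic DDPG training loop for a Gym-style environment."""
    scores = []                 # total reward per episode
    recent = deque(maxlen=100)  # rolling window for a moving average
    for episode in range(1, n_episodes + 1):
        state = env.reset()
        agent.reset()           # reset the exploration noise each episode
        score = 0.0
        for _ in range(max_t):
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        scores.append(score)
        recent.append(score)
    return scores
```

With gym and Box2D installed, the environment would be created along the lines of `env = gym.make("BipedalWalker-v3")` (the exact version id depends on the installed gym release) before calling `ddpg(env, agent)`.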
Let’s see the results.