This blog is the first in a three-part series covering the basics of reinforcement learning (RL) and how to formulate a given problem as a reinforcement learning problem.

The blog is based on my teaching at the University of Oxford and on insights from our book. I also wish to thank my co-authors, Phil Osborne and Dr Matt Taylor, for their feedback on my work.

In this blog, we introduce reinforcement learning and the idea of an autonomous agent.

In the next blog, we will discuss the RL problem in the context of other similar techniques, specifically multi-armed bandits and contextual bandits.

Finally, we will look at various applications of RL in the context of an autonomous agent.

Thus, in these three blogs, we consider RL not as an algorithm in itself but rather as a mechanism for creating autonomous agents (and their applications).

This series will help you understand the core concepts of reinforcement learning and encourage you to formulate your own problem as an RL problem.

**What is Reinforcement Learning?** – *“It is a field of Artificial Intelligence in which the machine learns in an environment setup by trial and error methods. Here the machine is referred to as an agent that performs certain actions and for each valuable action, a reward is given. Reinforcement learning algorithm’s focus is based on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).”*
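One simple way to picture the balance between exploration and exploitation is an epsilon-greedy rule. The sketch below is only an illustration: the function name `epsilon_greedy`, the parameter `epsilon`, and the reward estimates are assumptions made for this example, not something defined in the definition above.

```python
import random

def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Explore: try any action uniformly at random (uncharted territory).
        return rng.randrange(len(action_values))
    # Exploit: take the action with the highest current estimate (current knowledge).
    return max(range(len(action_values)), key=lambda a: action_values[a])

# Hypothetical reward estimates for three actions.
estimates = [0.2, 0.8, 0.5]

greedy_choice = epsilon_greedy(estimates, epsilon=0.0)  # never explores
random_choice = epsilon_greedy(estimates, epsilon=1.0)  # always explores
```

With `epsilon=0.0` the agent always picks the action it currently believes is best, and with `epsilon=1.0` it picks at random. Values in between trade one off against the other.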

**Understanding with an example . . .**

Let’s go with the most common yet easy example to understand the basic concept of reinforcement learning. Think of a new dog and the ways in which you train it. Here, the dog is the **agent** and its surroundings are the **environment.** Now, when you throw a frisbee, you expect your dog to run after it and bring it back to you. Here, the thrown frisbee is the **state**, and whether or not the dog runs after the frisbee is its **action.** If the dog chooses to run after the frisbee (an **action**) and bring it back, you **reward** it with a cookie/biscuit to indicate a positive response. Otherwise, some punishment can be given to indicate a negative response. That’s exactly what happens in reinforcement learning.
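The dog-and-frisbee story can be written as a tiny interaction loop. This is a minimal sketch under assumptions of my own: the state name `"frisbee_thrown"`, the two actions, the +1/-1 rewards, and the averaging rule the "dog" uses to learn are all hypothetical choices for illustration.

```python
import random

ACTIONS = ["fetch", "ignore"]

def environment_step(state, action):
    """Hypothetical owner: +1 (a cookie) for fetching a thrown frisbee, -1 otherwise."""
    if state == "frisbee_thrown" and action == "fetch":
        return 1
    return -1

def train_dog(episodes=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    totals = {a: 0.0 for a in ACTIONS}  # sum of rewards seen per action
    counts = {a: 0 for a in ACTIONS}    # number of times each action was tried
    for _ in range(episodes):
        state = "frisbee_thrown"
        untried = [a for a in ACTIONS if counts[a] == 0]
        if untried:
            action = untried[0]              # try every action at least once
        elif rng.random() < epsilon:
            action = rng.choice(ACTIONS)     # occasional exploration
        else:
            # Exploit: choose the action with the best average reward so far.
            action = max(ACTIONS, key=lambda a: totals[a] / counts[a])
        reward = environment_step(state, action)
        totals[action] += reward
        counts[action] += 1
    # Average reward per action is what the agent has "learned".
    return {a: totals[a] / counts[a] for a in ACTIONS}

learned = train_dog()
best_action = max(learned, key=learned.get)
```

Because the rewards here are deterministic, the agent quickly settles on `"fetch"` as the action with the higher average reward.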

This interactive method of learning stands on four pillars, also called “The Elements of Reinforcement Learning” –

**Policy**– A policy defines how the agent behaves at a given time: it maps the state the agent perceives to the action it takes. More generally, it is the strategy the agent uses in pursuit of its end goal.

**Reward**– In RL, training the agent is like luring it with reward points. Every right decision the agent makes earns positive points, whereas every wrong decision earns a punishment, i.e. negative points.

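**Value**– The value function estimates the total reward the agent can expect to accumulate in the long run, starting from a given state (or from a given state-action pair). Whereas the reward says what is good immediately, the value indicates whether the current action in a given state will yield, or help yield, the best reward over time.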

**Model (optional)**– RL can be either model-free or model-based. A model captures prior knowledge of how the environment behaves, i.e. which next state and reward to expect after an action. Model-based reinforcement learning uses this knowledge to plan ahead when determining the agent’s policy, whereas model-free reinforcement learning relies on trial and error alone.
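The four elements can be made concrete with a few lines of Python. This is an illustrative sketch, not a standard API: the state and action names, the lookup tables, and the discount factor `gamma` (which weights later rewards less than earlier ones) are assumptions introduced for this example.

```python
# Policy: maps a state to the action the agent takes there.
policy = {"frisbee_thrown": "fetch", "frisbee_returned": "sit"}

# Reward: immediate feedback for a (state, action) pair.
def reward(state, action):
    return 1.0 if (state, action) == ("frisbee_thrown", "fetch") else 0.0

# Model (optional): the predicted next state for a (state, action) pair.
model = {
    ("frisbee_thrown", "fetch"): "frisbee_returned",
    ("frisbee_returned", "sit"): "frisbee_thrown",
}

def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Value: long-run (discounted) reward from a state when following the policy.
def rollout_value(state, steps, gamma):
    rewards = []
    for _ in range(steps):
        action = policy[state]
        rewards.append(reward(state, action))
        state = model[(state, action)]  # plan ahead using the model
    return discounted_return(rewards, gamma)

value = rollout_value("frisbee_thrown", steps=4, gamma=0.5)
```

Here the reward is a one-step signal, while the value adds up what the policy is expected to collect over several steps.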

**Formulating an RL problem . . .**

Reinforcement learning is a general paradigm for interacting, learning, predicting, and decision-making. It can be applied wherever the problem can be treated as a sequential decision-making problem. To do so, we first formulate the problem by defining the environment, the agent, states, actions, and rewards.

The steps involved, from formulating an RL problem through modelling it to finally deploying the system, are summarised below –

- *Define the RL problem* – Define environment, agent, states, actions, and rewards.
- *Collect data* – Prepare data from interactions with the environment and/or a model/simulator.
- *Feature engineering* – This is likely to be a manual task that draws on domain knowledge.
- *Choose modelling method* – Decide the best representation and model/algorithm. It can be online/offline, on-/off-policy, model-free/model-based, etc.
- *Backtrack and refine* – Iterate and refine the previous steps based on experiments.
- *Deploy and Monitor* – Monitor the deployed system.
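The first two steps, defining the problem and collecting data, might look like this for a toy task. Everything in the sketch is hypothetical: the `CorridorEnv` class, its `reset`/`step` methods, and the reward values are assumptions chosen for illustration rather than the interface of any particular library.

```python
from dataclasses import dataclass

@dataclass
class CorridorEnv:
    """Hypothetical toy problem: walk along a corridor to reach the last cell."""
    length: int = 5    # states are the cell indices 0 .. length - 1
    position: int = 0  # the current state

    ACTIONS = ("left", "right")  # the actions available to the agent

    def reset(self):
        """Start a new episode and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action}")
        move = 1 if action == "right" else -1
        # Walls at both ends keep the agent inside the corridor.
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        # Reward: +1 for reaching the goal, a small cost for every other step.
        reward = 1.0 if done else -0.1
        return self.position, reward, done

# Collect data: record (state, action, reward, next_state) from one episode.
env = CorridorEnv()
state = env.reset()
trajectory = []
done = False
while not done:
    next_state, reward, done = env.step("right")
    trajectory.append((state, "right", reward, next_state))
    state = next_state
```

The recorded `trajectory` is the kind of interaction data the later steps (feature engineering, choosing a modelling method) would work from.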

**RL framework – Markov Decision Processes (MDPs)**

Typical reinforcement learning problems are formalized as Markov Decision Processes (MDPs), which act as a framework for modelling a decision-making situation. They follow the Markov property, i.e. any future state depends only on the current state (and the action taken in it) and is independent of past states, hence the name Markov decision process. Mathematically, MDPs are described using the following elements –

- Actions ‘A’,
- States ‘S’,
- Reward function ‘R’,
- Value ‘V’,
- Policy ‘π’

where the end goal is to estimate the value of a state, V(s), or the value of a state-action pair, Q(s,a), through the continuous interaction of the agent with its environment.
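To show how V(s) and Q(s,a) relate, here is a minimal sketch that computes both for a three-state MDP using value iteration, one standard way of solving a small MDP whose model is known. The states, the deterministic `transition` function, the rewards, and the discount factor `GAMMA` are all assumptions made for this example.

```python
GAMMA = 0.9                  # assumed discount factor
STATES = [0, 1, 2]           # state 2 is terminal (the goal)
ACTIONS = ["left", "right"]

def transition(state, action):
    """Deterministic toy model: returns (next_state, reward)."""
    next_state = min(state + 1, 2) if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward

def q_value(V, state, action):
    """Q(s,a): immediate reward plus the discounted value of the next state."""
    next_state, reward = transition(state, action)
    return reward + GAMMA * V[next_state]

def value_iteration(tolerance=1e-9):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES[:-1]:  # the terminal state keeps value 0
            best = max(q_value(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best        # V(s) is the best Q(s,a) over all actions
        if delta < tolerance:
            return V

V = value_iteration()
Q = {(s, a): q_value(V, s, a) for s in STATES[:-1] for a in ACTIONS}
# Policy: in each state, take the action with the highest Q(s,a).
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES[:-1]}
```

In this toy problem the resulting policy moves right in both non-terminal states, and the state next to the goal ends up with a higher value than the one further away.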

In the next blog, we will discuss the RL problem in the context of other similar techniques, specifically multi-armed bandits and contextual bandits. This will expand on the problem of using RL to create autonomous agents. In the final part, we will talk about real-world reinforcement learning applications and how they can be applied across multiple sectors.

About Me (Kajal Singh)

Kajal Singh is a Data Scientist and a Tutor on the Artificial Intelligence – Cloud and Edge implementations course at the University of Oxford. She is also the co-author of the book “*Applications of Reinforcement Learning to Real-World Data: An educational introduction to the fundamentals of Reinforcement Learning with practical examples on real data* (2021)”.
