I wanted to name this Adventures in Reinforcement Learning. Then I realized that it was probably the lamest name I could ever create.
What is reinforcement learning?
Wikipedia can explain it better.
Then why do you even have a blog post?
Well, I had to take a graduate-level AI class to understand reinforcement learning well enough to play around with examples I found online and tweak them to my interests, ultimately creating something in an hour while listening to RetroWave. This post is an introduction to what the whole thing is about, at least at a very high level.
Let’s say it’s Spring Break, and you and your buddies finally decide to go to Las Vegas (or Monaco, if you’re very rich). On the way you pick up the son of a Mafia Boss / Casino Owner whose car has broken down, and as a token of gratitude (while still protecting his business interests) he says you can do any number of “dry runs” on his slot machines, where you neither lose money nor gain anything after each run. Once you’ve made up your mind, you can choose a slot machine to play a real round on. Win-win, right? However, the Casino is heavily influenced by Costco, and your only option is to either run a slot machine a set number of times each time you wish to play, or not run it at all. Seeing your indecisiveness, the Mafia Boss also offers a cash payout (which is less than what you’d win if you actually won).
What do you do?
Each time you play, choose the machine you feel is best suited to your cause: one where both the reward probability and the reward amount are high enough!
But of course, you being you (or at least me being me), you wouldn’t want to do something hundreds of times, right? So you write a program that does it for you.
Let’s dive in.
NumPy is a must-have for automating anything math-related, so you import it. Matplotlib just helps you visualize the performance of the system over time, but it’s really cool to see the law of large numbers at work!
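Something like this, with the usual aliases:

```python
import numpy as np               # array math and the pseudo-RNG
import matplotlib.pyplot as plt  # to plot the running mean over time
```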
numIterations is the number of times you’re supposed to run the machine each time you play.
numMachines is the number of machines (Hello Captain Obvious).
eps is the probability of performing the exploratory action. There are two kinds of steps you can take while learning: exploration, where you choose an action at random, and exploitation, where you act on what you already know. Larger values mean a higher probability of choosing random actions.
gamma is the discount factor for previous iterations. Large values (but always < 1) can imply more “conservative” agents that take a lot of their previous experience into account, while small values (always > 0) make the agent more “radical”. This is to ensure that newer experiences are worth more than older ones.
playCost is the cost of performing one action (in this case, the money you must put in the slot machine to play).
tempDelta is the decay factor for the exploratory action probability. As you gather more data, it no longer makes sense to keep up the brashness and randomness of the initial steps. Instead, you lean on the knowledge you already have (which becomes more and more reliable as time passes). I’ve called it tempDelta because it’s similar to the temperature drop associated with hill-climbing methods like Simulated Annealing.
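Put together, the knobs might look roughly like this. The concrete values, and the numPlays variable I use later for the total number of plays, are placeholders of my own choosing:

```python
numIterations = 10   # runs of the slot machine per play
numMachines = 10     # number of slot machines
numPlays = 500       # total number of plays (a variable of my own)
eps = 1.0            # probability of an exploratory (random) action
gamma = 0.9          # discount factor, 0 < gamma < 1
playCost = 2         # money you put into the machine each play
tempDelta = 0.99     # per-play decay applied to eps
```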
This is pretty straightforward. Given a success probability, it generates a reward amount based on the number of times a pseudo-RNG draws a value less than that probability.
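Here’s a minimal sketch of what that function could look like; the name getReward is my own, going off the description above:

```python
def getReward(prob, n):
    # Run the machine n times; every pseudo-random draw that lands
    # below the success probability adds one unit of reward.
    return np.sum(np.random.random(n) < prob)
```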
This is another straightforward piece of code. It chooses the most rewarding slot machine of them all, based on a rewards array with one entry per slot machine.
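Again, a sketch based on the description (getBestMachine is my own name for it):

```python
def getBestMachine(rewards):
    # Index of the machine with the highest estimated reward so far.
    return np.argmax(rewards)
```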
The rewards, by tradition and in order to eliminate any bias, are initialized to 0. You don’t know anything, so you don’t assume anything, right?
The probabilities array is very interesting. These are random values that stay constant throughout the execution. In your hypothetical Las Vegas Casino, assume that the “ever-so-altruistic” Mafia Boss has given you these values. Each entry gives the probability of getting a reward from the corresponding slot machine.
While we were at it, we also defined the labels for the axes of the graph.
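Roughly, and with axis labels of my own invention:

```python
rewards = np.zeros(numMachines)  # know nothing, assume nothing

# Fixed success probabilities, courtesy of the "Mafia Boss".
probabilities = np.random.random(numMachines)

plt.xlabel("Plays")
plt.ylabel("Running Mean Reward")
```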
This is the heart of our operation. We calculate a running mean to track how well things are going so far, and update the reward associated with each slot machine.
If the step is an exploitation step, choose the best slot machine for your task. If not, choose a random one. Calculate the reward, but remember, you have to pay the piper!
Then it’s time for the belle of the ball. The running mean is updated, and the rewards are updated as well.
gamma comes into play as the old knowledge is decayed a little and the newer reward is added to it. And the probability of choosing a random action lowers itself too!
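Putting it all together, here’s a rough sketch of what the loop might look like. The incremental running-mean update and the numPlays count are my own rendering of the idea, not necessarily the original code:

```python
runningMean = 0.0
meanHistory = []

for i in range(1, numPlays + 1):
    # Exploit with probability (1 - eps), explore otherwise.
    if np.random.random() > eps:
        machine = getBestMachine(rewards)
    else:
        machine = np.random.randint(numMachines)

    # Pay the piper, then collect whatever the machine pays out.
    reward = getReward(probabilities[machine], numIterations) - playCost

    # Decay the old knowledge a little and fold in the new reward.
    rewards[machine] = gamma * rewards[machine] + reward

    # Incremental running mean of the rewards seen so far.
    runningMean += (reward - runningMean) / i
    meanHistory.append(runningMean)

    # Cool down: explore less as the estimates become more reliable.
    eps *= tempDelta

plt.plot(meanHistory)
plt.show()
```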