Gradient descent is the most common method for optimization.
This is the second part of the series Optimization, In this blog post, we’ll continue to discuss the Rod Balancing problem and try to solve using Gradient descent method. In Part-1 we understood what is an optimization and tried to solve the same problem using Exhaustive Search(a gradient-free method). If you haven’t read that here is the link.
Gradient-Based Algorithms are usually much faster than gradient-free methods, the whole objective here is to always try to improve over time i.e., somehow try to make the next step which results in a better solution than previous one. The Gradient Descent Algorithm is one of the most well-known gradients based algorithm. The Decision Variables used here are continuous ones since it gives more accurate gradients(slope) at any given point on the curve.
So, to solve our problem with gradient descent we’ll reframe our Objective function to a minimization problem. To do this we’ll make an assumption and define our cost function(it is also sometimes known as loss function or error function). We will assume that the best solution would be the one which can balance the rod for at least 10 seconds (let’s state this assumption as ‘y’). The cost function at its base is the function which returns the difference of the actual output and desired output. For our problem, the cost function would become:
Note: we squared the difference to avoid negative values or you can just take absolute value, either will work.
Now for every test result [f(x)] i.e., the time in seconds, the rod stayed on the finger, we can calculate our cost [C(x)]. So our Objective function would now change to minimizing C(x) instead of maximizing f(x), we can state the modified Objective function as,
Since Objective function is changed now, our curve also gets inverse, i.e., on the y-axis instead of time we plot cost and will try to minimize it.
Any Gradient descent based algorithm follows 3 step procedure:
1. Search direction.
2. Step size
3. Convergence check.
Once we know the error, we have to find the direction of where we should move our finger on the rod for a better solution. The direction is decided by taking the derivative of the cost function with respect to the decision variable(s). This simply means calculating slope(‘dC/dx’ ) on the curve for a specific value of the decision variable, this slope is known as the gradient. The greater the slope, the further we are from the minima(i.e., the lowest point on the curve).
For Gradient descent we apply a simple rule,
“If the slope is negative, we increase decision variable(s) and if the slope is positive, we decrease decision variable(s) with some value.”
Once we know the direction of where we want to take our variables for the next step we update them, The above rule can be easily given in mathematical term as,
But using this update rule may overshoot the value, resulting in skipping the minima and jump to the other side of the curve. So, instead of reaching to the centre of the rod the variable may jump and go to the other corner and may introduce greater error. To avoid this we decide the step size by multiplying a very small value(usually 0.001 or 0.0001) to the gradient, this value prevents overshooting as we are not taking a very huge step. This is known as the learning rate(α), unlocks the key principle where Gradient descent shines,
“Big steps when away, small steps when closer.”
What above statement says is when the slope is greater(i.e., when the steepness is high) the variables will update with larger values and when the slope starts getting smaller(i.e., the steepness is low, reaching to the bottom) the variables will update with a very small value, This is the behaviour what we actually follow in the real world while solving this kind of problems. So our update rule now changes to,
We perform this operation of updating variable for a certain number of epochs(one complete pass to the training examples) until it converges.
Below is the video of me trying to solve the rod balancing problem using the gradient descent method, you‘ll notice how fast, compare to exhaustive search we discussed in part-1, we get the point ‘x’ on the rod where it balances perfectly.
2. Using Artificial Intelligence to detect COVID-19
3. Real vs Fake Tweet Detection using a BERT Transformer Model in few lines of code
4. Machine Learning System Design
solving Rod balancing problem by Gradient descent method
How the value of x converges to the minima, can be shown by the graph below.