For me, intelligence boils down to two things:
1. Acting when certain/necessary.
2. Not acting/staying pensive when uncertain.
Point 2 is what we are going to dive into!
Uncertainty is inherent everywhere; nothing is error-free. So it is frankly quite surprising that most Machine Learning projects don’t even try to gauge uncertainty!
As a not-so-real-world example, consider a treatment recommendation algorithm. We feed a patient’s medical data into our network. A plain NN (Neural Network) would just output a single class, say treatment type ‘C’. A BNN (Bayesian Neural Network), on the other hand, lets you see the whole distribution of the output and gauge the confidence of your output label from it. If the standard deviation of the output distribution is low, we are good to go with that label. Otherwise, chuck it; we need human intervention.
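The "low standard deviation, go; otherwise, ask a human" rule can be sketched in a few lines. This is only an illustration: the function name `decide`, the threshold `max_std=0.1`, and the sampled probabilities are all made up, and the samples stand in for whatever a real BNN would produce over several forward passes.

```python
import numpy as np

def decide(prob_samples, max_std=0.1):
    """prob_samples has shape (n_samples, n_classes): one row of class
    probabilities per sampled model. max_std is an arbitrary threshold."""
    prob_samples = np.asarray(prob_samples, dtype=float)
    mean_probs = prob_samples.mean(axis=0)   # average prediction per class
    std_probs = prob_samples.std(axis=0)     # spread of the predictions per class
    label = int(mean_probs.argmax())         # most likely class on average
    accept = bool(std_probs[label] <= max_std)
    return label, float(mean_probs[label]), float(std_probs[label]), accept

# Hypothetical samples for treatments A, B, C (columns 0, 1, 2).
confident = [[0.05, 0.05, 0.90], [0.04, 0.08, 0.88], [0.06, 0.03, 0.91]]
unsure = [[0.10, 0.10, 0.80], [0.45, 0.35, 0.20], [0.05, 0.15, 0.80]]

for name, samples in [("confident", confident), ("unsure", unsure)]:
    label, mean_p, std_p, accept = decide(samples)
    action = "go with it" if accept else "ask a human"
    print(name, "ABC"[label], round(mean_p, 3), round(std_p, 3), action)
```

Both patients get treatment ‘C’ as the top label, but only the first one gets it with a tight spread. A plain NN would have reported ‘C’ in both cases and left it at that.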
So, BNNs differ from plain neural networks in that each weight is assigned a probability distribution instead of a single value or point estimate. Hence, we can use the uncertainty in the weights to estimate the uncertainty in the predictions. If your parameters are not stable, how can you expect your output to be?! Makes sense, eh?
Say you have a parameter. You try to estimate its distribution, and you use the distribution’s high-probability points to estimate the output value of your neural network. By high probability, I mean the more probable values. (In a Normal distribution, the mean is the most probable point.)
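Here is a minimal sketch of the difference, assuming a tiny one-layer "network" with two weights. The means and standard deviations are invented for illustration; in a real BNN they would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Plain layer: one fixed number per weight.
w_point = np.array([0.8, -0.3])

# Bayesian layer: a Normal(mean, std) per weight (hypothetical values).
w_mean = np.array([0.8, -0.3])
w_std = np.array([0.05, 0.60])  # the second weight is very uncertain

def plain_forward(x):
    return float(x @ w_point)

def bayesian_forward(x, n_samples=1000):
    # Draw a fresh set of weights for every forward pass.
    w = rng.normal(w_mean, w_std, size=(n_samples, w_mean.size))
    return w @ x  # one output per sampled set of weights

x_a = np.array([1.0, 0.0])  # only touches the well-determined weight
x_b = np.array([0.0, 1.0])  # only touches the uncertain weight

out_a = bayesian_forward(x_a)
out_b = bayesian_forward(x_b)
print("plain:", plain_forward(x_a), plain_forward(x_b))
print("bayes a: mean", out_a.mean(), "std", out_a.std())
print("bayes b: mean", out_b.mean(), "std", out_b.std())
```

The plain layer gives one number per input and nothing else. The Bayesian layer gives a spread of outputs, and that spread is much wider for the input that leans on the shaky weight.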
Looking at Equation #1,
p(y | x, D) = ∫ p(y | x, w) p(w | D) dw    (1)
On the left-hand side, we have the probability distribution over the output classes, which we get after feeding our data ‘x’ into a model that has been trained on dataset ‘D’.
On the right-hand side, we have an integral. The middle term inside the integral is the posterior: the probability distribution of the weights GIVEN that we have seen the data.
Think of this integral as an ensemble, where each model is identified by a unique set of weights (different weights mean a different model). The greater the posterior probability of a set of weights (and therefore of that model), the more weight its prediction is given. Hence, each model’s prediction is weighted by the posterior probability of its set of weights!
Sounds good, eh? Just search the whole weight space and weigh in the good (high-probability) parts. Wondering why people don’t do this? Because even a simple 5–7 layer NN has around a million weights, so the integral runs over roughly a million dimensions, and it is just not computationally feasible to construct an analytical solution for the posterior (p(w|D)) in NNs.
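To make the ensemble picture concrete, here is a toy sketch where the "network" has a single weight, so we really can sweep the whole weight space on a grid. Everything here is an assumption made for illustration: the tiny dataset, the logistic model, the standard Normal prior, and the grid range.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy dataset D: one feature, binary label.
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
Y = np.array([0, 0, 1, 0, 1, 1])

# Candidate models: each value of w is one member of the ensemble.
w_grid = np.linspace(-5.0, 5.0, 201)

# Posterior p(w|D) on the grid: prior times likelihood, then normalise.
log_prior = -0.5 * w_grid ** 2            # standard Normal prior, up to a constant
p = sigmoid(np.outer(w_grid, X))          # p(y=1|x,w) for every w and data point
log_lik = (Y * np.log(p) + (1 - Y) * np.log(1 - p)).sum(axis=1)
log_post = log_prior + log_lik
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()

def predict(x_new):
    # Equation #1 as a weighted vote: sum over w of p(y=1|x,w) * p(w|D).
    return float((sigmoid(w_grid * x_new) * posterior).sum())

print(predict(-2.0), predict(0.0), predict(2.0))
```

With one weight, 201 grid points are plenty. With a million weights, the same grid would need 201 to the power of a million points, which is why nobody does it this way for real networks.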
So the next step is, we need to approximate the posterior distribution. We cannot get the exact posterior distribution, but we surely can choose another distribution that replicates it to a good extent.
We can do this using a variational distribution whose functional form is known! By ‘functional form is known’, I mean it is one of the standard statistical distributions that can be described using just a few parameters, like the Normal distribution (which needs only two: the mean and the variance). So, we are essentially trying to build a Normal-distribution replica of the posterior. Even though the actual posterior might not be Normal, we still replicate it as well as we can with the variational distribution, which is Normal in our case. You can pick any standard statistical distribution for the variational distribution.
How do we go about replicating it? I’ll cover that and more in the next article. Hope this works up an appetite for the world of Bayesian Machine Learning! It’s a beautiful topic, and one with a lot left to explore.
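As a rough picture of what a "Normal replica" looks like, the sketch below builds a lopsided one-weight density as a stand-in for a posterior and describes it with just a mean and a standard deviation. Matching the mean and variance is only one simple way to pick those two numbers and is used here purely for illustration; it is not necessarily how the fitting is done in practice. The stand-in density and the grid are made up.

```python
import numpy as np

# Stand-in for a posterior we cannot write in closed form:
# a skewed, unnormalised density over a single weight (hypothetical).
w = np.linspace(-4.0, 8.0, 1201)
dw = w[1] - w[0]
unnorm = np.exp(-0.5 * (w - 1.0) ** 2) + 0.5 * np.exp(-0.5 * ((w - 3.0) / 1.5) ** 2)
posterior = unnorm / (unnorm.sum() * dw)  # normalise numerically on the grid

# The variational distribution q(w) is a Normal: two numbers describe it.
mu = (w * posterior).sum() * dw
sigma = np.sqrt((((w - mu) ** 2) * posterior).sum() * dw)
q = np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# How far off is the replica? Largest pointwise gap between the two curves.
max_gap = np.abs(q - posterior).max()
print("mu:", mu, "sigma:", sigma, "max gap:", max_gap)
```

The replica is not perfect (the gap is not zero, because the stand-in is skewed and the Normal is not), but the whole curve is now summarised by two numbers instead of 1201 grid values.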
So yeah, see you!