Image classification is the Hello World of deep learning. For me, that project was Pneumonia Detection using Chest X-rays. Since this was a relatively small dataset, I could train my model in about 50 minutes. But what if I told you that, with just one additional line of code, we can reduce the training time (by 50% in theory) without any significant decrease in accuracy? But first…
Why is this important?
The dataset I worked with involved only around 4,500 images. However, if we scale the same project to a real-world application, there are probably going to be a lot more images. Take a look at Stanford's CheXpert dataset.
Also, I trained a model to predict just one disease, but in reality we will have to predict a lot more than just two classes. In such scenarios, a reduction in training time would really be a plus, since it eases the load on our resources. How do we do it?
We become less precise. Jeremy Howard says,
It turns out that sometimes, making things less precise in deep learning causes it to generalize a bit better.
In neural nets, all the floats (i.e. our inputs, weights and activations) are stored using 32 bits. Using 32 bits gives us a high amount of precision, but higher precision also means more computation time and more memory to store these variables.
An idea to reduce memory usage is to do all of the same things using half precision (16 bits).
By definition, this would take half the space in RAM, and in theory could let us double our batch size. A larger batch size means more operations performed in parallel, which reduces training time.
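As a quick sanity check of the halving claim, here is a small PyTorch sketch (the tensor shape is arbitrary; element_size() reports bytes per element):

```python
import torch

x32 = torch.randn(1000, 1000)                # default dtype is torch.float32
x16 = x32.half()                             # the same values stored as torch.float16

print(x32.element_size() * x32.nelement())   # 4000000 bytes
print(x16.element_size() * x16.nelement())   # 2000000 bytes, exactly half
```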
However, there are some problems associated with this.
1. Imprecise weight updates: In the article on learning rates, we saw how weight updates happen. We basically do w = w - lr * w.grad for every weight in our network. The problem with performing this operation in half precision is that w.grad is usually really small, and so is our lr, which can make the second term of the equation so small that no update happens at all (a short numerical demonstration follows this list).
2. Gradient underflow: If our gradients are too small, they get replaced by 0. More generally, gradients tend to be so small that much of the range representable by 16 bits goes unused, while anything below the smallest representable FP16 value simply becomes 0.
3. Activations or loss overflow: The opposite of Problem #2: our activations or loss can easily overflow to infinity (and then become nan) when we use half precision, since FP16 maxes out at 65,504.
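Here is a quick numerical illustration of all three problems, assuming only that PyTorch is installed (the thresholds, roughly 6e-8 at the bottom and 65,504 at the top, are properties of the FP16 format itself):

```python
import torch

# 1. Imprecise weight updates: a small update vanishes entirely in FP16
w  = torch.tensor(1.0,  dtype=torch.float16)
dw = torch.tensor(1e-4, dtype=torch.float16)       # think of this as lr * w.grad
print(w - dw)                                      # tensor(1., dtype=torch.float16) -- no change

# 2. Gradient underflow: values below the smallest FP16 number round to 0
print(torch.tensor(1e-8, dtype=torch.float16))     # tensor(0., dtype=torch.float16)

# 3. Activation or loss overflow: FP16 maxes out at 65504
print(torch.tensor(70000.0, dtype=torch.float16))  # tensor(inf, dtype=torch.float16)
```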
The way to solve these problems is to use mixed precision training.
Mixed Precision Training
Full Jupyter notebook.
As the name suggests, we don't do everything in half precision. We perform some operations in FP16 and others in FP32. More specifically, we do our weight updates in 32-bit precision. This takes care of Problem #1.
To overcome gradient underflow, we use something called gradient (or loss) scaling: we multiply the loss by a scaling factor before backpropagation, which keeps the gradients from falling below the range FP16 can represent and hence from being replaced by 0. The gradients are then unscaled before the weight updates happen. The scaling factor should be large, but not so large that it causes the loss itself to overflow.
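The linked notebook does all of this through fastai, but the same recipe can be sketched in plain PyTorch with torch.cuda.amp, which handles both the FP32 weight updates and the loss scaling. A minimal sketch, assuming a CUDA-capable GPU, PyTorch 1.6 or newer, and a toy model and data invented purely for illustration:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

device = "cuda"
model = torch.nn.Linear(10, 2).to(device)               # weights stay in FP32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = GradScaler()                                    # maintains the loss scaling factor

inputs = torch.randn(64, 10, device=device)             # toy batch
targets = torch.randint(0, 2, (64,), device=device)

for _ in range(10):
    optimizer.zero_grad()
    with autocast():                                     # forward pass runs eligible ops in FP16
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()                        # scale the loss so small gradients don't underflow
    scaler.step(optimizer)                               # gradients are unscaled, then weights updated in FP32
    scaler.update()                                      # shrink the scale factor if an overflow was detected
```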
The idea of mixed precision training has only been around for a couple of years, and not all GPUs support it. But it's an idea worth knowing, and one that will be used a lot more in the future.
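And for completeness, the "one additional line of code" mentioned at the start: in fastai (the library used in the notebook), mixed precision is typically enabled by calling to_fp16() on the learner. A minimal sketch, assuming the fastai v1 API and a hypothetical folder of chest X-ray images; the path, image size and batch size are placeholders, not the notebook's actual values:

```python
from fastai.vision import *

# hypothetical dataset location with train/valid subfolders
path = Path('data/chest_xray')
data = ImageDataBunch.from_folder(path, size=224, bs=64).normalize(imagenet_stats)

learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn = learn.to_fp16()          # the one additional line: train in mixed precision
learn.fit_one_cycle(4)
```

Since the activations now take half the memory, you can often double the batch size at the same time.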
If you liked this article, give it at least 50 claps :p