Whether you are building a machine learning model for research or for a business function, the whole point of creating a model is to perform inference. Currently, TensorRT provides the most performant way to achieve just that, and TensorRT 8 takes it to the next level. In this article you will discover the latest capabilities of TensorRT 8.
When you create and train a machine learning model, it can technically perform inference, but chances are it is far from optimized for it. Depending on the framework you use, this gap can be hidden from you for simplicity, or it can be more explicit.
TensorRT 8 brings significant advancements over the existing TensorRT 7. Let’s unpack three major ones and what they mean for you.
Quantization Aware Training
Quantization in machine learning is not a new concept. It is generally good practice, and in many cases it is necessary to some extent to achieve significant speed gains as well as a much lower memory footprint.
The common approach to quantization, also called Post-Training Quantization, or PTQ for short, can mean a significant loss in accuracy depending on your machine learning model. Depending on your use case this loss can be acceptable, if it stays below a critical level and keeping a small memory footprint or a faster execution time is more important for your application.
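To see where that accuracy loss comes from, here is a minimal, illustrative sketch of symmetric INT8 quantization (not TensorRT's actual calibration code): FP32 values are mapped onto 256 integer levels, and the rounding error you see here is what PTQ accumulates layer after layer.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric post-training quantization of an FP32 tensor to INT8."""
    scale = np.abs(x).max() / 127.0                    # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale)).mean()
print(f"mean rounding error: {error:.6f}")             # non-zero: the source of PTQ accuracy loss
```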
With Quantization Aware Training (QAT), you partially include the quantization step within training. I say partially, because what you actually insert is a separate Quantize-Dequantize (QDQ) step in the training graph. You do not have to train a model from scratch: if you already have a trained model, you can fine-tune it one more time with the Quantization Toolkit. Your model initially keeps its FP32 precision, but when you later perform the quantization, the weights adapt much better to it because they have already been “fake quantized” during training. This way, in many cases you can reach the original level of accuracy at INT8 precision rather than FP32, which is basically getting the best of both worlds: all the memory and speed benefits of quantization with almost none of the accuracy loss.
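As a rough sketch of what that workflow can look like with NVIDIA's pytorch-quantization toolkit (the ResNet-50 model, learning rate, and `train_loader` below are placeholders I picked for illustration, not part of the article):

```python
import torch
import torchvision
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Replace torch.nn layers (Conv2d, Linear, ...) with quantized versions
# that insert fake Quantize-Dequantize (QDQ) nodes into the graph.
quant_modules.initialize()

# Build the model *after* initialize() so its layers carry QDQ nodes.
model = torchvision.models.resnet50(pretrained=True).cuda()
model.train()

# Fine-tune for a short while: the weights stay FP32, but the forward pass
# simulates INT8 rounding, so the weights adapt to quantization.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()
for images, labels in train_loader:  # train_loader is assumed, not defined here
    optimizer.zero_grad()
    loss = criterion(model(images.cuda()), labels.cuda())
    loss.backward()
    optimizer.step()

# Export with the QDQ nodes in place; TensorRT 8 reads them when building an INT8 engine.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy, "resnet50_qat.onnx", opset_version=13)
```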
Sparsity Support for Ampere GPUs
This advancement does not apply to all GPUs, and for good reason: Ampere generation GPUs have something that others don’t, called sparse kernels. These sparse kernels target a very common problem in machine learning: a significant percentage of your operations don’t actually need to be calculated.
Imagine that you have to multiply thousands of numbers with pen and paper, but when you look at the page, you see that many of those long numbers have to be multiplied by zero. Your first instinct would be not to work out the zero multiplications, but simply to write zero right next to them.
This is almost exactly the problem we want to solve, but for a machine learning model. How do you tell a GPU, in a systematic way, to skip the right calculations, the ones that multiply by zeros or numbers very close to zero?
Sparse kernels to the rescue, provided you have software that can actually leverage them. With TensorRT 8, you can use the sparse kernels in Ampere generation GPUs such as the A100.
What happens behind the scenes is that, out of every four consecutive weights in your machine learning model, the two that are closest to zero are turned into zeros. The location information of which weights are still used is stored with a relatively small memory footprint. Because the process of picking which weights to keep is so systematic, this sparsity works with a variety of machine learning models and architectures.
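Here is an illustrative sketch of that 2:4 pattern in plain PyTorch (not NVIDIA's actual pruning tooling): in every group of four weights, the two smallest in magnitude are zeroed out.

```python
import torch

def prune_2_to_4(weights: torch.Tensor) -> torch.Tensor:
    """Apply a 2:4 sparsity pattern along the last dimension (size divisible by 4)."""
    groups = weights.reshape(-1, 4)
    # Indices of the two largest-magnitude weights in each group of four.
    keep = groups.abs().topk(k=2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(1, keep, torch.ones_like(keep, dtype=torch.bool))
    return (groups * mask).reshape(weights.shape)

w = torch.randn(8, 16)
w_sparse = prune_2_to_4(w)
print((w_sparse == 0).float().mean())  # -> 0.5: exactly half the weights are zero
```

Once the weights follow this pattern, TensorRT 8 can exploit it when sparse weights are enabled at engine-build time (see the build sketch further below).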
Transformer Optimizations
Transformer based machine learning models took 2020 by storm. At first they hit the headlines with state of the art natural language models such as GPT-2 and GPT-3 by OpenAI, but it was later discovered that they are also really good at a wide variety of other tasks, from generating images from text with OpenAI's DALL-E to simply connecting images and text for many widespread applications, including image recognition.
One major difference between transformer based networks and other types of networks, such as convolutional neural networks, is that transformer based networks have a much wider range of possibilities in terms of the structure of the operations they contain.
TensorRT 8 adapts better to those structures and, as a result, can give you up to double the performance compared to the previous version, TensorRT 7.
Under the hood, TensorRT performs multiple optimizations to achieve this. It splits the transformer network into smaller parts and creates highly optimized kernels to be executed on the GPU. Not only that, the created kernels are also fused together to reduce the total number of operations and the overall execution time.
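All of this happens automatically when you build an engine. Here is a minimal sketch with the TensorRT Python API (the file names and the choice of flags are placeholders; which precision and sparsity flags make sense depends on your model and GPU):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse an ONNX export of the model (for example the QAT model from earlier).
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)            # allow FP16 kernels
config.set_flag(trt.BuilderFlag.INT8)            # allow INT8 kernels (QDQ model)
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)  # use Ampere sparse kernels for 2:4-sparse weights

# TensorRT partitions the graph, picks and fuses kernels, and serializes the engine.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```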
All three of these updates bring a significant performance boost on their own. But have you ever wondered what it would look like if you were to train a large, industry standard transformer based language model and deploy it with sparsity?
Well, this is exactly what HuggingFace did when they integrated a fully optimized BERT-Large model with TensorRT 8 and deployed it on A100 GPUs. The result? Record-breaking performance at 1.2 millisecond inference time. That is roughly a thousandth of a second to get a response to your question!
Overall, these optimizations made HuggingFace's APIs up to 100x faster when fully integrated with TensorRT 8 features and A100 GPUs.
For a query to count as “real time” communication it has to be answered in under 10 milliseconds, which can be a significant challenge for many companies. Not only does the model have to be fairly large to respond with a higher quality answer, but the execution also has to be optimized from many angles to deliver low latency, high quality output.
I hope you got some value out of this article. If you have any questions, leave them as a comment and I will get back to you as fast as I can. If your question requires a more in-depth explanation, I can cover it in another video or article. By the way, if you are a visual learner, I also have this article in video format here:
If you want to learn more about what you can do with TensorRT 8, or if you just want to get started with it, a good place to start is here: