Give yourself a head-start by seeing the big picture first.
TensorRT is an SDK (Software Development Kit) for high-performance inference. Its main focus is to take a trained machine learning model and optimize it for deployment, so you get higher throughput and lower latency.
By leveraging all the TensorRT 8 optimizations on A100 GPUs, it is possible to achieve 100 times the deployment performance, as Hugging Face did with their transformer models. If you wonder what this looks like in real life, here is an example to help you visualize it: with BERT Large, an industry-standard model for production, you can reach 1.2 millisecond inference times for your natural language queries.
TensorRT works with all the top machine learning frameworks, including TensorFlow and PyTorch. Depending on the framework you use, there are multiple ways to leverage TensorRT.
In this article you will discover the 3 main ways to convert your TensorFlow model to TensorRT. You will also see which pathway is better suited for which use cases, which will hopefully help you choose a pathway to get started with TensorRT.
In a nutshell, if you are working with a TensorFlow model, there are 3 main conversion paths you can use. These paths can be listed as:
- TF-TRT (TensorFlow-TensorRT) integration,
- TensorFlow-ONNX-TensorRT workflow
- Manually reconstructing the neural network with the TensorRT API, using Python or C++
1) TF-TRT integration
TF-TRT integration is the simplest one to get started with. If you are relatively new to machine learning, I actually recommend starting here. It does not give you the full optimization benefits of TensorRT, but it gives you a starting point all from within TensorFlow. The conversion takes a couple of lines of TensorFlow code, and you can deploy the resulting optimized network within TensorFlow as well.
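To make that concrete, here is a minimal sketch of the TF-TRT conversion in TensorFlow 2 using TrtGraphConverterV2. The directory names are placeholders, and the exact options you pass (precision mode, workspace size and so on) will depend on your model and GPU.

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Placeholder paths; point these at your own SavedModel directories
converter = trt.TrtGraphConverterV2(input_saved_model_dir="my_saved_model")
converter.convert()                      # replaces TensorRT-compatible subgraphs with TRT ops
converter.save("my_tftrt_saved_model")   # the result is still a TensorFlow SavedModel
```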
One thing you should know about converting models between different frameworks is that, depending on the model layers and operations you use, there is a chance that there will be incompatible operations. This is especially true if you are working with relatively new machine learning models.
What happens with the TF-TRT integration is that only the compatible parts of your model are optimized with TensorRT, and the incompatible parts stay as TensorFlow operations. This conversion path also leaves your model as a TensorFlow SavedModel, so you essentially get to leverage TensorRT optimizations while keeping your model a TensorFlow model. After the conversion is complete you can also use TensorBoard to inspect your model and see which parts of it were optimized: the compatible operations will run as TensorRT operations and the incompatible ones will execute as TensorFlow operations.
The larger the portion of your neural network that is compatible with TensorRT, the more optimization benefit you will see. If all the operations in your TensorFlow model are compatible with TensorRT, then the entirety of your model will be optimized and run with TensorRT.
Another optimization you can make with this pathway: on top of converting your TensorFlow model to TensorRT, you can also train your machine learning model with what is called Quantization Aware Training. This essentially allows your model to maintain FP32-level accuracy while running at INT8 precision. It works by introducing fake quantization into your model during training, so your model has the opportunity to adapt to the effects of quantization before it is actually quantized. Quantization Aware Training is one of the 3 major features released with TensorRT 8; the other two major advancements are Transformer Optimizations and Sparsity Support for Ampere GPUs.
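As an illustration, here is a minimal sketch of quantization aware training in Keras using the TensorFlow Model Optimization Toolkit. The toy model and the commented-out training call are placeholders, and this shows the general fake-quantization idea rather than a TensorRT-specific recipe, which may use a dedicated quantization toolkit.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder Keras model standing in for your own network
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Insert fake-quantization nodes so the model can adapt to INT8 effects during training
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# q_aware_model.fit(train_images, train_labels, epochs=1)  # then train as usual on your data
```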
If you want to learn more about the latest advancements on TensorRT 8, you can find an article on that right here:
If you want to dive right in and start experimenting with TensorFlow-TensorRT integration here is a good place to start:
Colab-TF20-TF-TRT-inference-from-Keras-saved-model.ipynb — Colaboratory (google.com)
2) TensorFlow-ONNX-TensorRT workflow
This workflow provides a more performant way to deploy your models. If you are not going to deploy with TensorFlow and you want to take full advantage of all the major TensorRT 8 features across your entire model, then I highly recommend this workflow. This conversion path creates a single, highly optimized TensorRT engine.
Part of the reason is that conversion to ONNX is all or nothing: for the conversion to succeed, every TensorFlow operation in your model has to be compatible with the ONNX format. For operations that ONNX does not support, you can either replace them with similar supported operations or write custom code for them so that the conversion to ONNX can succeed.
If you have no idea what ONNX is, by the way, that's okay too. For now you should know that ONNX stands for Open Neural Network eXchange, and it is a popular format that many machine learning frameworks, including TensorFlow and PyTorch, can be converted to.
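As a sketch of the first step, the tf2onnx package can export a TensorFlow model to ONNX; it also has a command-line entry point (python -m tf2onnx.convert) if you prefer not to write code. The paths and the opset below are assumptions you may need to adjust for your model.

```python
import tensorflow as tf
import tf2onnx

# Placeholder: load your own trained Keras model here
model = tf.keras.models.load_model("my_saved_model")

# Export the model to ONNX; opset and output path are placeholder choices
model_proto, _ = tf2onnx.convert.from_keras(model, opset=13, output_path="model.onnx")
```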
After the ONNX conversion, the next step is to convert the ONNX model into a TensorRT network, also called a TensorRT engine. The result is a single, unified TensorRT engine that gives you higher performance.
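Here is a minimal sketch of building that engine with the TensorRT 8 Python API and its ONNX parser; the file names are placeholders, and in practice you would also set precision flags such as FP16 or INT8. The trtexec command-line tool that ships with TensorRT can do the same job if you prefer not to write code.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX file produced in the previous step (placeholder file name)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB of builder workspace (TensorRT 8 style)

# Build and save the serialized engine (placeholder file name)
serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```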
If you want to get started with the TensorFlow-ONNX-TensorRT workflow here is a notebook to get you started:
TensorRT/EfficientDet-TensorRT8.ipynb at master · NVIDIA/TensorRT · GitHub
3) TensorRT API using Python or C++
This pathway provides the most customizability and the highest possible performance. I recommend this conversion path if you want to squeeze every last bit of performance out of TensorRT 8.
With the TensorRT API, you essentially recreate the entire neural network using only TensorRT operations. Once you have an identical TensorRT network, the next step is to export just the weights from your TensorFlow model and load them into the newly created TensorRT network.
To create a network using only TensorRT operations you can either use the Python API or the C++ API.
Both of them give you almost identical results. The main benefit of the Python API is convenience: for preprocessing and post-processing you can use popular Python libraries such as SciPy and NumPy.
The main benefit of going with the C++ API is that it provides additional safety, and it should be used for applications in safety-sensitive industries like automotive or aviation.
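To give you a feel for the Python API, here is a minimal sketch of the recreate-and-load-weights idea: a single convolution plus ReLU built from TensorRT 8 operations, with placeholder NumPy arrays standing in for the weights you would export from your trained TensorFlow model.

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Placeholder weights; in practice these come from your trained TensorFlow model
kernel = np.zeros((16, 3, 3, 3), dtype=np.float32)
bias = np.zeros((16,), dtype=np.float32)

# Recreate the network layer by layer with TensorRT operations
input_tensor = network.add_input(name="input", dtype=trt.float32, shape=(1, 3, 224, 224))
conv = network.add_convolution_nd(
    input_tensor,
    num_output_maps=16,
    kernel_shape=(3, 3),
    kernel=trt.Weights(kernel),
    bias=trt.Weights(bias),
)
relu = network.add_activation(conv.get_output(0), trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))

# Build the optimized engine from the hand-built network
config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)
```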
Here is a notebook for your convenience to help you get started with the Python API using BERT:
TensorRT/demo/BERT/notebooks at master · NVIDIA/TensorRT · GitHub
All three conversion paths have different advantages and capabilities. The TensorFlow-TensorRT integration lets you get up and running quickly. The ONNX pathway makes sure all of your model is optimized and further increases your performance. The TensorRT API gives you the most customizability and the highest performance possible.
If you are just getting started with machine learning and want to start experimenting with deployment optimizations, or if you know you want to deploy with TensorFlow, I recommend going with the TF-TRT integration.
If you are not going to deploy with TensorFlow and you want a significant boost in your deployment inference performance, then I recommend going with the TensorFlow-ONNX-TensorRT workflow.
If you are pretty comfortable with machine learning and want to get the most control and performance physically possible, then I recommend going with the TensorRT API.
For your convenience, here is what all three workflows look like side by side:
I hope you got some value out of this article. If you have any questions leave them as a comment and I will get back to you as fast as I can. If your question requires an extensive explanation I can also explain that in another article.
If you want to learn more about TensorRT 8, or just want to get started with it, here is a good place to start: