This is the third part of my series of articles about Computer Vision for mobile and embedded devices. Last time I discussed one of the most vital parts: selecting the right tool for on-device ML model inference.
In this article I am going to talk about the final step: postprocessing.
Let's find out what the output of an ML model looks like in a Computer Vision application. In most cases it is a vector of numeric elements that represents a particular kind of result, such as classification, object detection, or semantic segmentation.
Let's take a deeper look at each case:
Classification: the output is a one-dimensional vector of class probabilities, usually float values in the range from 0 to 1. This is not a large amount of data, and you can find the maximum value with a simple linear scan in O(n) time.
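As a minimal sketch, here is that linear scan in plain Python. The probability values and label names are made-up example data, not output from a real model:

```python
# Linear-scan argmax over a 1-D probability vector: O(n) time,
# visiting each element exactly once.
def argmax(probs):
    best_idx, best_val = 0, probs[0]
    for i, p in enumerate(probs):
        if p > best_val:
            best_idx, best_val = i, p
    return best_idx

# Hypothetical classifier output for three classes.
labels = ["cat", "dog", "car"]
probs = [0.1, 0.7, 0.2]
print(labels[argmax(probs)])  # dog
```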
Object detection: the output is an array of bounding boxes, with two coordinates for each corner, a class index, and a probability. Even though we have to do some postprocessing with anchors and nested loops, the data is still small: about 10 numbers per object.
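A typical part of that postprocessing is confidence filtering plus non-maximum suppression (NMS), which removes duplicate boxes for the same object. The sketch below assumes a box layout of `(x1, y1, x2, y2, class_id, score)` and example threshold values; real models differ in box encoding and may also need anchor decoding first:

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2, ...) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, score_thresh=0.5, iou_thresh=0.5):
    # Drop low-confidence boxes, then greedily keep the best-scoring
    # box and suppress any remaining box that overlaps it too much.
    boxes = [b for b in boxes if b[5] >= score_thresh]
    boxes.sort(key=lambda b: b[5], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept
```

Since the number of boxes is small, these nested loops are cheap enough to stay on the CPU.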
Semantic segmentation: this can be much more complicated. The output can be a three-dimensional array sized input image width by input image height by number of classes, where each element represents the class probability for one pixel. So, for example, if we want segmentation for 12 different classes on a 512 by 512 input, the output is 512 * 512 * 12, more than 3 million elements. And to process it correctly we have to find the highest class probability for each pixel, iterating over every element of the output.
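Conceptually, that per-pixel search is an argmax over the class axis. Here is a sketch for a channels-last (height x width x classes) layout; the toy input below is tiny, not the real 512 x 512 x 12 tensor:

```python
# Per-pixel argmax over an H x W x C segmentation output,
# producing an H x W mask of winning class indices.
def class_mask(output):
    h, w = len(output), len(output[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            probs = output[y][x]
            mask[y][x] = max(range(len(probs)), key=probs.__getitem__)
    return mask
```

On a 512 x 512 x 12 output this loop touches every one of the 3 million elements, which is exactly why it gets expensive on the CPU.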
That sounds like expensive work for the CPU and a waste of the phone's resources. Moreover, from my experience I can say that it can take more time than the ML model execution itself, especially if the model's output layer uses the channels-first format.
The solution is similar to preprocessing: if we need simple operations over a lot of pixels, let's delegate them to the GPU.
An argmax function can easily be split into a batch of parallel operations and executed efficiently on the GPU.
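The key observation is that each pixel's argmax is independent of every other pixel's, so the work can be partitioned freely. The CPU sketch below splits the image into row chunks just to show that structure; on a device you would express the same idea as a Metal or RenderScript kernel with one thread per pixel, and this Python thread pool is only an illustration, not a performance claim:

```python
from concurrent.futures import ThreadPoolExecutor

def row_argmax(row):
    # One independent work item: argmax for every pixel in one row.
    return [max(range(len(px)), key=px.__getitem__) for px in row]

def parallel_class_mask(output, workers=4):
    # Map rows of the H x W x C output onto a pool of workers;
    # results come back in row order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(row_argmax, output))
```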
So we move back to iOS Metal shaders and Android RenderScript.
At this point I should say that on mobile phones, with thorough optimization, you can reach impressive results.
It is absolutely possible to run three different ML models at the same time on a mobile GPU with a 512 by 512 input: get the output of object detection, pass it to classification, and draw the result on top of the segmentation mask, at a speed of 10 to 15 frames per second.
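To make the data flow between the three models concrete, here is a hypothetical per-frame pipeline in pseudocode-style Python. All four callables (`detect`, `classify`, `segment`, `draw`) are placeholders for the real on-device models and renderer, not actual APIs:

```python
# Hypothetical per-frame pipeline: detection feeds classification,
# and the labeled boxes are drawn over the segmentation mask.
def process_frame(frame, detect, classify, segment, draw):
    boxes = detect(frame)                         # object detection
    labels = [classify(frame, b) for b in boxes]  # classify each detected region
    mask = segment(frame)                         # per-pixel segmentation mask
    return draw(frame, mask, boxes, labels)       # composite the final image
```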
With ML technologies and modern hardware, we got AI-powered eyes in our pockets, and in some ways they can see better than we can.
They find the right route in an unknown city and show us the direction overlaid on our reality.
They read text in any language and make it understandable for us.
They remember all the lines of our faces and never mix them up.
And they can even help you drive your car.
Here I want to finish my series of theoretical (high-level) articles and move on to practical low-level coding to make your project work on real devices.