Computer Vision is a popular area of modern Machine Learning and Artificial Intelligence systems. You can find plenty of articles about how to create and train an object recognition or semantic segmentation model using a ready-made model architecture and high-level frameworks like TensorFlow or PyTorch. Most of them use Python as the main language and cloud solutions as the main execution environment.
This approach works well for photo editing, where the user can wait for the server to respond with the ML model inference results. But if we are talking about real-time video processing (a mobile camera video stream, for example), it is not a reliable solution: network latency will not allow us to make smooth video effects, which require a speed of at least 10+ frames per second. This is the point where on-device ML model execution comes in.
In the next several articles, I will describe the main steps and obstacles on the way to creating a real mobile CV application.
We can divide the whole process into the following steps:
- Preprocessing. Converting an image into a matrix.
- Selecting the right mobile ML tool. ML inference.
- Post-processing. Converting the ML model's output vector into application data.
In this article, I am going to talk about the first step.
So, we are close to starting the implementation of our mobile CV application. We have a prepared (trained) ML model from our Data Science/ML engineers. Even before we try to run it for the first time, we have to build a preprocessing system that takes our raw video frames and efficiently converts them into the format the ML model requires.
Which format does the ML model use?
That depends on the training process, but typically the input is a multi-dimensional array of image color data normalized by some mean value. Normalization keeps the data as floating-point values in a range such as 0 to 1, which makes the ML model training process more efficient. This information should be provided in the model's technical specification.
For example, many models use the RGB format with a mean value divider of 127. Assume that our model uses a 300×300×3 input size. That means before you pass a video frame for ML model inference, you should convert it into an RGB byte array with values from 0 to 255, then resize it to a 300×300 frame, and after that convert it to a floating-point array with normalized values (for instance, from 0 to 1).
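To make this concrete, here is a minimal sketch of that last conversion on Android, written as the naive CPU version (the very approach we will question below). The function name and the 0-to-1 normalization scheme are my assumptions for illustration, not any specific model's contract:

```kotlin
import android.graphics.Bitmap
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Naive CPU-side sketch: converts an already resized 300x300 Bitmap into a
// direct float buffer with RGB values normalized to [0, 1].
fun bitmapToFloatBuffer(bitmap: Bitmap, inputSize: Int = 300): ByteBuffer {
    val buffer = ByteBuffer
        .allocateDirect(inputSize * inputSize * 3 * 4) // 3 channels * 4 bytes per float
        .order(ByteOrder.nativeOrder())

    val pixels = IntArray(inputSize * inputSize)
    bitmap.getPixels(pixels, 0, inputSize, 0, 0, inputSize, inputSize)

    for (pixel in pixels) {
        // Unpack the ARGB int and normalize each color channel.
        buffer.putFloat(((pixel shr 16) and 0xFF) / 255f) // R
        buffer.putFloat(((pixel shr 8) and 0xFF) / 255f)  // G
        buffer.putFloat((pixel and 0xFF) / 255f)          // B
    }
    buffer.rewind()
    return buffer
}
```

A buffer in this layout is what frameworks such as TensorFlow Lite typically accept as model input.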
What is an efficient way to do image preprocessing?
As you can see, preprocessing can be a resource-intensive operation, especially when we are talking about a Full HD (1920×1080) video stream and a high-resolution (for instance, 512×512) ML model input.
At first glance, it looks like we should copy our frame from GPU to CPU memory and go through its data several times to convert and normalize it.
And that would be a terrible decision, for the following reasons:
- we will pay a penalty for allocating new memory and copying data from one memory region to another
- our CPU has far fewer parallel compute cores than our GPU and consumes more energy
- we will dramatically increase memory usage and the number of temporary object allocations, which leads to excessive garbage collector work in JVM languages
So a good idea is to apply GPU acceleration to the frame data processing:
- Using Metal shaders on iOS.
- Using RenderScript on Android.
Using the Camera2 API in combination with a RenderScript surface can bring significant benefits for frame preprocessing. With good parallelization across GPU kernels, it performs the conversion from the raw YUV format to an RGB array, or even to a floating-point array with normalized values, much faster.
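As a rough illustration (not the exact code from my repo), a minimal Kotlin sketch of the YUV-to-RGB part using the built-in ScriptIntrinsicYuvToRGB kernel could look like this; the class name and the NV21 buffer sizing are my assumptions:

```kotlin
import android.content.Context
import android.renderscript.*

// Sketch: YUV -> RGBA conversion via the RenderScript intrinsic, so the
// per-pixel work runs in parallel on the accelerator instead of in a Java loop.
class YuvToRgbConverter(context: Context, width: Int, height: Int) {
    private val rs = RenderScript.create(context)
    private val script = ScriptIntrinsicYuvToRGB.create(rs, Element.U8_4(rs))

    // Input allocation holds raw NV21 bytes (width * height * 3/2 for 4:2:0).
    private val inputAlloc = Allocation.createTyped(
        rs,
        Type.Builder(rs, Element.U8(rs)).setX(width * height * 3 / 2).create(),
        Allocation.USAGE_SCRIPT
    )
    // Output allocation holds the converted RGBA pixels.
    private val outputAlloc = Allocation.createTyped(
        rs,
        Type.Builder(rs, Element.RGBA_8888(rs)).setX(width).setY(height).create(),
        Allocation.USAGE_SCRIPT
    )

    fun convert(nv21Bytes: ByteArray, outRgba: ByteArray) {
        inputAlloc.copyFrom(nv21Bytes)
        script.setInput(inputAlloc)
        script.forEach(outputAlloc) // the conversion kernel runs in parallel
        outputAlloc.copyTo(outRgba)
    }
}
```

From there, the normalization step can stay in the same RenderScript pipeline (for instance, in a custom kernel) instead of falling back to the CPU.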
You can find an example in my repo, where I use this approach for video stream processing both from the camera and from MediaExtractor.
This makes it possible to perform all the necessary transformations at the moment a video frame arrives from the mobile device camera (or any other video source), without copying it to CPU memory, using far less memory and battery.
What about embedded devices?
If we are talking about embedded devices without a mobile OS, we have a few options too, but they are quite limited. Their performance and optimization mostly depend on the hardware vendor:
- NVidia GPU based devices. If the device has a more or less recent NVidia GPU, you are on the safe side: CUDA will help you implement all the necessary preprocessing in the most efficient way (it will also help you with ML inference, but we will discuss that in the next article).
- ARM Mali GPU or Qualcomm Adreno based devices. In most cases, we can use OpenCL for Mali as well as for Adreno. On top of that, Qualcomm provides a wide variety of developer tools for using extra resources such as the DSP.
- The rest of the devices, without a GPU or with a low-performance one. To tell the truth, video stream processing on a device that was not built for intensive GPU work is a nontrivial task, but even here we have a couple of options. I am talking about classical computer vision algorithms and libraries like OpenCV (by the way, the latest OpenCV releases already ship with several ML models on board). Besides that, most embedded devices are based on the ARM CPU architecture, so we can use NEON optimizations to speed up image processing, as sketched below.
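As a small sketch of that last option, here is what the resize-and-normalize step could look like with OpenCV's Java/Kotlin bindings; the function name and the 1/255 scaling are my assumptions, and on ARM builds of OpenCV these calls benefit from NEON acceleration inside the library:

```kotlin
import org.opencv.core.CvType
import org.opencv.core.Mat
import org.opencv.core.Size
import org.opencv.imgproc.Imgproc

// Sketch: resize an 8-bit RGB frame to the model's input size and convert it
// to 32-bit floats in [0, 1], letting OpenCV's optimized kernels do the work.
fun preprocessWithOpenCv(rgbFrame: Mat, inputSize: Int = 300): Mat {
    val resized = Mat()
    Imgproc.resize(rgbFrame, resized, Size(inputSize.toDouble(), inputSize.toDouble()))

    val normalized = Mat()
    // Convert 8-bit channels to floats, scaling each value by 1/255.
    resized.convertTo(normalized, CvType.CV_32FC3, 1.0 / 255.0)
    return normalized
}
```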
From all of the above, you can see that preprocessing can be a vital part of your application and can significantly influence the performance and power consumption of your solution. I hope my advice helps you find the right way to implement your mobile ML application.
In the next article, I want to talk about the main factors that can affect the choice of the right tool for on-device ML inference.
Credit: BecomingHuman. By Boris Denisenko.