Now that we have covered the fundamentals, we can take a look at the techniques in more detail.
Image classification
In this section, we will introduce image classification, which is the task of assigning one label from a fixed set of categories to an image. This is one of the core problems in computer vision that, despite its simplicity, has a large variety of practical applications. Many other seemingly distinct computer vision tasks (such as image captioning, object detection, keypoint detection, and segmentation) can be reduced to image classification, while others build on entirely different neural network architectures. The following video clip illustrates a very simple classification example.
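As a minimal, illustrative sketch of the task, the toy classifier below assigns each new image the label of its nearest training image. The dataset, images, and labels are all made up for the example; real classifiers learn far richer representations than raw pixel distances.

```python
import numpy as np

# Hypothetical tiny dataset: each "image" is a flattened 4x4 grayscale array.
# Labels 0 and 1 stand for two made-up categories ("dark" vs "bright").
train_images = np.array([
    np.full(16, 0.1),   # mostly dark image   -> label 0
    np.full(16, 0.2),   # mostly dark image   -> label 0
    np.full(16, 0.8),   # mostly bright image -> label 1
    np.full(16, 0.9),   # mostly bright image -> label 1
])
train_labels = np.array([0, 0, 1, 1])

def classify(image, images, labels):
    """Assign the label of the nearest training image (L2 distance)."""
    distances = np.linalg.norm(images - image, axis=1)
    return labels[np.argmin(distances)]

test_image = np.full(16, 0.85)  # a new, bright-ish image
print(classify(test_image, train_images, train_labels))  # -> 1
```

Even this crude nearest-neighbour rule captures the essence of the task: map an image to one label from a fixed set.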
Image keywording and captioning
These techniques sit at the intersection of two of the most interesting fields of AI: computer vision and natural language processing (NLP). Keywords are words used to describe the elements of a photograph or image. Image captioning refers to the process of generating a textual description of an image or video, based on the objects and actions it contains. An example of this can be seen in the following image.
Object detection
Object detection is a computer vision technique that identifies and locates objects in images or videos, typically by enclosing the objects in labeled bounding boxes. Object detection is a key technology behind self-driving cars, enabling them to recognize other cars or distinguish a pedestrian from a lamppost. It is also useful in a variety of applications such as industrial inspection and robotic vision. Driven by the ImageNet competition, there was a 1.7× reduction in localization error (from 42.5% to 25.3%) between 2010 and 2014 alone. The video clip below shows the results of a real-time implementation of this technique for the detection of cars, people, and other common objects found in a city that are relevant to the vision system of a self-driving car. A further application of this technique is optical character recognition (OCR), which aims to extract printed or handwritten text found in an image or video.
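A quantity that comes up throughout object detection is intersection-over-union (IoU), which measures how well a predicted bounding box matches a ground-truth box. A minimal sketch, assuming the common (x1, y1, x2, y2) corner format for boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping region (if any).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted box partially overlapping a ground-truth box:
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # -> 0.14285714285714285 (1/7)
```

Detectors are typically scored by counting a prediction as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.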
Keypoint detection and pose estimation
A keypoint is a feature that is considered an interesting or important part of an image: a spatial location that defines what stands out in the image. Keypoints are special because the same keypoints can be tracked in a modified image, even when the image, or the objects in it, are subject to rotation, contraction/expansion, or deformation.
Pose estimation is a general problem in computer vision where the aim is to detect the position and orientation of an object, which usually means detecting the object's keypoint locations. This technique can be used to build an accurate 2D/3D model of the object's keypoint positions, which can then drive a digital twin that is updated in real time.
For example, when estimating the pose of common boxy household objects, the corners can be detected to gain insight into the objects' 3D position in the environment.
The same can be done for detecting human poses, where keypoints on the body such as the shoulders, elbows, hands, knees, and feet are detected.
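The claim that the same keypoints can be tracked under rotation can be made concrete: given two sets of matched keypoints, the Kabsch algorithm recovers the least-squares rotation between them. A minimal 2D sketch with made-up points (real pipelines must also handle detection, matching, and noise):

```python
import numpy as np

def recover_rotation(src, dst):
    """Least-squares rotation mapping keypoints `src` onto `dst` (Kabsch)."""
    # Center both point sets, then solve for the rotation via SVD.
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    u, _, vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    return vt.T @ np.diag([1.0, d]) @ u.T

# Three made-up keypoints and the same points rotated by 90 degrees:
pts = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
rot90 = np.array([[0.0, -1.0], [1.0, 0.0]])
moved = pts @ rot90.T
print(np.allclose(recover_rotation(pts, moved), rot90))  # -> True
```

Because the rotation can be recovered from the keypoints alone, tracking a handful of keypoints is enough to follow an object's orientation from frame to frame.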
Semantic segmentation (a.k.a. masking)
The next technique is known as semantic segmentation, and it addresses one of the key problems in computer vision: by assigning a class label to every pixel, it cleanly separates the objects in an image and precisely defines their boundaries. Looking at the big picture, semantic segmentation paves the way towards complete scene understanding. The importance of scene understanding as a core computer vision problem is highlighted by the fact that an increasing number of applications benefit from the knowledge gained with semantic segmentation. In the self-driving car example shown below, it helps the car identify the exact position of the road and other objects.
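A common way segmentation networks produce a mask is to output a score for every class at every pixel and take the per-pixel argmax. A minimal sketch with hypothetical scores for three made-up classes (the score values are invented for illustration; a real network would predict them):

```python
import numpy as np

# Hypothetical per-pixel class scores for a 2x4 image and three made-up
# classes (0 = background, 1 = road, 2 = car); shape is (classes, H, W).
scores = np.array([
    [[5.0, 5.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]],   # background scores
    [[1.0, 1.0, 4.0, 4.0],
     [4.0, 4.0, 1.0, 1.0]],   # road scores
    [[0.0, 0.0, 1.0, 1.0],
     [1.0, 1.0, 6.0, 6.0]],   # car scores
])

# Semantic segmentation output: one class label per pixel.
mask = scores.argmax(axis=0)
print(mask)
# [[0 0 1 1]
#  [1 1 2 2]]
```

The resulting mask is exactly the kind of per-pixel labeling that lets a self-driving car read off where the road ends and another object begins.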