Give yourself a head start by seeing the big picture.
There are more than 10 Computer Vision objectives you can solve with AI, yet most tutorials only cover the first 4 and overlook the rest. Without all of them, many emerging technologies, such as facial recognition, AI-powered security cameras, AI-assisted medical diagnosis, and Tesla’s Full Self-Driving feature, wouldn’t be possible today.
In this article, we will start with the most basic types of computer vision and see, step by step, why we need the other types to unlock more real-life functionality. If this is your first encounter with Computer Vision or Artificial Intelligence, do not worry: I will do my best to keep things simple, and everything will come together as you keep reading. Some of the concepts might feel alien at first, especially at the beginning. That’s why I try to keep my articles narrowed down and as simple as possible, while still capturing the big picture. For starters, let’s first see what we mean by computer vision and why we need it.
Computer Vision, in a nutshell, tries to extract meaningful information from images and videos, using computers, in an automated way. This lets cameras and computers work together on tasks that would otherwise require a person, or an entire team, to handle manually. Even when manual work is possible, there are many situations where a constant human presence would be impractical. By leveraging computer vision, we can enable technologies that would otherwise be impossible, such as self-driving cars, to make our lives better, safer, and happier, while also protecting our privacy.
An image classifier looking at the image above would probably tell you that there is an apple, a cup, a laptop, a chair, and maybe a table in the image. It would also give you a confidence score for each of its predictions. However, its knowledge of the image would stop there. It couldn’t tell you how many cups there are, how big the apple is, or where the items are positioned.
Image classification is the simplest type of computer vision you can perform, so if you are just getting started with machine learning, I actually recommend starting here. With image classification, the main objective is to assign the image to one or more categories.
There are two main kinds of image classification: binary classification and multi-class classification. With binary classification, you check an image for a single class of object and get a yes-or-no answer on whether that object is present. For example, models trained on both images that have skin cancer and images that do not have been reported to detect it at a level comparable to dermatologists.
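To make this concrete, here is a minimal, hypothetical sketch of the inference side of such a binary classifier in PyTorch. The model choice, the two-class head, and the file path are illustrative assumptions, and in practice you would first fine-tune the new head on your own labeled images:

```python
# A minimal binary classifier sketch: adapt a pretrained ResNet to
# predict "positive" vs. "negative" for one class of interest.
# Model choice and file path are placeholders for illustration.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # replace head: 2 classes
model.eval()  # in practice, fine-tune the new head on labeled data first

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("skin_sample.jpg").convert("RGB")  # placeholder path
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    probs = torch.softmax(model(batch), dim=1)[0]
print(f"P(positive) = {probs[1].item():.3f}")  # confidence for positive class
```

Notice that the classifier only returns class probabilities for the whole image, which is exactly the limitation described above: no counts, sizes, or positions.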
If you are interested in learning more about image classification and want to interact with an image classification model yourself, you can actually get a live demo by playing Pac-Man with your webcam with the link below:
Object Detection is the logical next step in computer vision from image classification. With object detection, you can detect which classes are present in the image and where they are. The most common approach is to localize each detected object with a bounding box. If you want a practical demo of object detection, you can download the free mobile app for Android or iOS to see a very popular object detection model called YOLOv5 in action; you can find the app by searching for the name “iDetection”.
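If you prefer code over an app, the ultralytics/yolov5 repository documents a short torch.hub workflow. Here is a minimal sketch of it; the sample image URL is the one used in that repository’s README:

```python
# Object detection with YOLOv5 loaded through torch.hub, following the
# usage documented in the ultralytics/yolov5 repository.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # small pretrained model

results = model("https://ultralytics.com/images/zidane.jpg")
results.print()          # summary: detected classes and confidence scores
boxes = results.xyxy[0]  # one row per detection: [x1, y1, x2, y2, conf, class]
print(boxes)
```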
With semantic segmentation, you don’t just detect which classes are in the image, as with image classification, and you don’t just draw a rough bounding box to say where each object is, as with object detection. Instead, you classify every single pixel in the image to determine which object it belongs to.
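As a hedged sketch of what per-pixel classification looks like in practice, here is torchvision’s pretrained DeepLabV3 run on a placeholder image; the model choice and path are illustrative assumptions:

```python
# Semantic segmentation sketch: every pixel gets a class label.
# Uses torchvision's pretrained DeepLabV3; the input path is a placeholder.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.segmentation.deeplabv3_resnet50(
    weights=models.segmentation.DeepLabV3_ResNet50_Weights.DEFAULT)
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("street.jpg").convert("RGB")  # placeholder path
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)["out"]      # shape: [1, num_classes, H, W]
labels = logits.argmax(dim=1)[0]      # per-pixel class index, shape [H, W]
print(labels.unique())                # which classes appear in the image
```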
Instance Segmentation, in a nutshell, classifies the objects in the image at a pixel level, like Semantic Segmentation does, but it can also differentiate between individual instances of a class. If you have cars parked next to each other, semantic segmentation can only tell you that there is a big blob of cars, whereas instance segmentation can tell you that there are 5 distinct cars, and this will probably change what you can do with that information.
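Here is a minimal sketch of that idea using torchvision’s pretrained Mask R-CNN; the image path and the confidence threshold are illustrative assumptions:

```python
# Instance segmentation sketch with torchvision's pretrained Mask R-CNN:
# unlike semantic segmentation, each detected object gets its own mask,
# so five parked cars come back as five separate instances.
import torch
from torchvision import models
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = models.detection.maskrcnn_resnet50_fpn(
    weights=models.detection.MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

image = convert_image_dtype(read_image("parking_lot.jpg"), torch.float)  # placeholder

with torch.no_grad():
    out = model([image])[0]

keep = out["scores"] > 0.5            # drop low-confidence detections
print(f"{keep.sum().item()} instances found")
masks = out["masks"][keep]            # one [1, H, W] soft mask per instance
```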
Panoptic Segmentation, in a nutshell, is the combination of Semantic Segmentation and Instance Segmentation, which makes it the most powerful type we have seen so far. With Panoptic Segmentation, you get pixel-level classification capabilities combined with the ability to separate the different instances of each class.
If you want a more in-depth understanding of the distinction between Semantic vs Instance vs Panoptic Segmentation, you may find the following article helpful, where you will discover the main differences between them from the point of view of a self-driving car:
Keypoint detection is essentially detecting key points in images to reveal more detail about a class. The two most common applications are body keypoint detection and facial keypoint detection.
Pose estimation, in a nutshell, allows you to detect the pose of the people in a given image, which usually means locating the head, eyes, nose, arms, shoulders, hands, and legs. This can be done for a single person or for multiple people, depending on your needs. You can get a demo of it here:
You can also see an implementation of this in another live demo here:
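If you want to try this in code, here is a hedged sketch using torchvision’s pretrained Keypoint R-CNN, which predicts 17 COCO body keypoints per detected person; the image path and score threshold are illustrative assumptions:

```python
# Pose estimation sketch with torchvision's pretrained Keypoint R-CNN.
# It returns 17 body keypoints (nose, eyes, shoulders, ...) per person.
import torch
from torchvision import models
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = models.detection.keypointrcnn_resnet50_fpn(
    weights=models.detection.KeypointRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

image = convert_image_dtype(read_image("people.jpg"), torch.float)  # placeholder

with torch.no_grad():
    out = model([image])[0]

for person, score in zip(out["keypoints"], out["scores"]):
    if score > 0.8:
        # person is a [17, 3] tensor of (x, y, visibility) keypoints
        print(person[:, :2])
```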
Similar to pose estimation, a facial landmark detector detects key points, but more specifically on your face.
You can also try the live demo with a game:
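For a code-level taste of facial landmarks, here is a minimal sketch using dlib’s classic 68-point predictor. Note the assumptions: the predictor file must be downloaded separately from dlib.net, and the image path is a placeholder:

```python
# Facial landmark sketch with dlib's 68-point shape predictor.
# Requires shape_predictor_68_face_landmarks.dat (downloaded separately).
import dlib
import numpy as np
from PIL import Image

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = np.array(Image.open("face.jpg").convert("RGB"))  # placeholder path

for face in detector(image, 1):  # 1 = upsample once to catch small faces
    shape = predictor(image, face)
    points = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    print(points[:5])            # first few of the 68 landmarks
```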
Person segmentation is the logical next step from Pose Estimation. On top of knowing roughly where the person is, you now have close to pixel-level classification of exactly where the person is, as well as the pose of that person. You can try it yourself with the demo below:
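Continuing the semantic segmentation sketch from earlier, one hedged way to approximate a person mask is to keep only the pixels labeled as the person class; I am assuming here the VOC-style label set used by torchvision’s pretrained DeepLabV3, in which person is index 15:

```python
# Person segmentation sketch: filter the per-pixel labels from the
# earlier DeepLabV3 snippet down to the "person" class (VOC index 15).
person_mask = (labels == 15)   # boolean [H, W] mask: True where a person is
print(f"person pixels: {person_mask.sum().item()}")
```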
You can also check out Detectron2, an open-source project by the Facebook AI Research team. It implements everything we have seen so far, including object detection with bounding boxes, panoptic segmentation, pose estimation, and body segmentation, simultaneously, and you can build AI-based applications on top of it. You can see how it all looks together in the example below:
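As a taste of the API, here is a hedged sketch of running Detectron2’s pretrained panoptic model, following the predictor pattern from its documentation; the image path is a placeholder:

```python
# Panoptic segmentation sketch with Detectron2's model zoo.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml")

predictor = DefaultPredictor(cfg)
image = cv2.imread("street.jpg")  # placeholder path (BGR, as cv2 loads it)
panoptic_seg, segments_info = predictor(image)["panoptic_seg"]
print(segments_info)  # one entry per segment: category plus thing/stuff info
```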
You can also estimate the 3D depth of objects and scenes with a neural network. Check out a machine learning model called MiDaS: you can run the code in your browser and see the results for yourself with the following Google Colab link.
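Outside of Colab, the intel-isl/MiDaS repository also documents a torch.hub workflow. Here is a minimal sketch of it, assuming the lightweight MiDaS_small variant and a placeholder image path:

```python
# Monocular depth estimation sketch with MiDaS via torch.hub, following
# the usage documented in the intel-isl/MiDaS repository.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")  # lightweight variant
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

image = cv2.cvtColor(cv2.imread("room.jpg"), cv2.COLOR_BGR2RGB)  # placeholder
batch = midas_transforms.small_transform(image)

with torch.no_grad():
    depth = midas(batch)   # relative inverse depth, one value per pixel
print(depth.shape)
```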
Image captioning is pretty self-descriptive: when you give the neural network an image, it generates a caption describing that image. One thing I want you to notice is that, unlike everything we have covered so far, this is not just a computer vision task; it is also an NLP (Natural Language Processing) task.
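If you want to try captioning with minimal code, here is a hedged sketch using Hugging Face’s image-to-text pipeline; the checkpoint named below is one public example, not the only option, and the image path is a placeholder:

```python
# Image captioning sketch using Hugging Face's image-to-text pipeline.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")
result = captioner("kitchen.jpg")   # placeholder path or URL
print(result[0]["generated_text"])  # e.g. a one-sentence description
```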
3D object reconstruction is about extracting 3D objects from 2D images. Although this can be done in a variety of ways on various objects, it is very much a developing field. One of the most successful papers on 3D human digitization is called PIFuHD, and you can get a demo of it with your own images with this link:
I hope you got some value out of this article. If you have any questions, do not hesitate to leave them as a comment and I will get back to you with an answer as soon as possible.
If you want to have a deeper dive into the 3 types of image segmentation that we have seen in this article, you can check out my article on Semantic Segmentation, Instance Segmentation, and Panoptic Segmentation below: