Welcome to the final part of this multi-part series where we discuss five pioneering research papers to get you started with 3D object detection. In this article, we will discuss Complex YOLO by Martin Simon et al. This paper extends the very famous YOLO networks for bounding box detection in images to 3D point clouds. If you have been following this series, we have gone through four very diverse algorithms that aim to solve the problem of bounding box detection in 3D point clouds. Complex YOLO belongs to the projection-based methods for this task: the point cloud is projected into a bird's-eye-view image before detection. For an introduction to the different types of techniques that have been applied in this domain, please check out Part 1 of this series.
Applications of 3D bounding box detection
Lidar-based 3D object detection is indispensable for autonomous driving because it directly links to environmental understanding and therefore builds the base for prediction and motion planning. The ability to run inference on highly sparse 3D data in real time is also a challenging, ill-posed problem for many application areas beyond automated vehicles, e.g. augmented reality, personal robotics, or industrial automation.
Bird’s Eye View
As we can see in the pipeline, the first step (preprocessing) is to convert the 3D point cloud into a BEV (Bird's Eye View). Just like an image, the BEV has three channels: this RGB map of the point cloud is composed of height, intensity, and density. The process of generating a BEV from a point cloud is as follows:
- Decide the area we are trying to encode. Since a LiDAR point cloud can cover a very large area, we need to confine our calculations to a smaller region based on the application. For self-driving cars, this area is 80m X 40m.
- Now, this area is divided into a grid at some resolution, in this case 8cm. Since the range of the y-axis is double that of the x-axis, the size of the final grid is 1024 X 512.
- Note that we are yet to encode our data points. After dividing the search space into a grid, each grid cell is encoded with a height, an intensity, and a density value computed from the points that fall inside it. To encode height, the maximum height of the points inside that grid cell is taken. Similarly, the intensity is encoded as the maximum intensity. Finally, the density of points in each grid cell is calculated. A sketch of these calculations is shown after this list.
- This finally results in an image that encodes the point cloud information.
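To make the preprocessing concrete, here is a minimal NumPy sketch of the BEV encoding described above. It is a rough reconstruction rather than the authors' code: the region of interest, the grid size, and the density normalisation min(1, log(N+1)/log(64)) are assumptions based on the values quoted in this article and in the paper.

```python
import numpy as np

def point_cloud_to_bev(points, x_range=(0.0, 40.0), y_range=(-40.0, 40.0),
                       grid_shape=(512, 1024), max_points_per_cell=64):
    """Encode a LiDAR point cloud (N x 4 array: x, y, z, intensity) as a
    3-channel BEV map: density, max height, and max intensity per cell."""
    x, y, z, intensity = points[:, 0], points[:, 1], points[:, 2], points[:, 3]

    # Keep only the points inside the region of interest (assumed 40m x 80m).
    mask = (x >= x_range[0]) & (x < x_range[1]) & \
           (y >= y_range[0]) & (y < y_range[1])
    x, y, z, intensity = x[mask], y[mask], z[mask], intensity[mask]

    # Map metric coordinates to integer grid indices (roughly 8cm cells).
    rows = ((x - x_range[0]) / (x_range[1] - x_range[0]) * grid_shape[0]).astype(int)
    cols = ((y - y_range[0]) / (y_range[1] - y_range[0]) * grid_shape[1]).astype(int)
    rows = np.clip(rows, 0, grid_shape[0] - 1)
    cols = np.clip(cols, 0, grid_shape[1] - 1)

    bev = np.zeros((3, grid_shape[0], grid_shape[1]), dtype=np.float32)
    counts = np.zeros(grid_shape, dtype=np.int32)

    for r, c, h, i in zip(rows, cols, z, intensity):
        counts[r, c] += 1
        bev[1, r, c] = max(bev[1, r, c], h)   # max height (clipped at 0 in this sketch)
        bev[2, r, c] = max(bev[2, r, c], i)   # max intensity
    # Normalised point density per cell: min(1, log(N + 1) / log(64)).
    bev[0] = np.minimum(1.0, np.log(counts + 1) / np.log(max_points_per_cell))
    return bev
```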
The Complex-YOLO network takes a bird's-eye-view RGB map as input. It uses a YOLO CNN architecture to detect the 3D objects in real time. The translation from 2D to 3D is done with a predefined height based on each object class, as illustrated below.
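As a small illustration of that 2D-to-3D translation, the sketch below attaches a fixed height to a BEV detection. The per-class height values are placeholders chosen for illustration, not the figures used in the paper.

```python
# Hypothetical per-class heights in metres (placeholders, not from the paper).
CLASS_HEIGHTS = {"car": 1.5, "pedestrian": 1.8, "cyclist": 1.7}

def lift_to_3d(bev_box, class_name, ground_z=0.0):
    """Turn a BEV detection (x, y, w, l, yaw) into a 3D box by attaching the
    predefined height of the predicted class."""
    x, y, w, l, yaw = bev_box
    h = CLASS_HEIGHTS[class_name]
    # Centre the box vertically on the assumed ground plane.
    return (x, y, ground_z + h / 2.0, w, l, h, yaw)
```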
The YOLO network divides the image into a grid (16 X 32 in this case) and then, for each grid cell, predicts 75 features. Let's see how we arrive at these 75 features.
- 5 boxes per grid cell. YOLO predicts a fixed number of boxes, in this case 5 per grid cell, on the (16 X 32) grid described above.
- For each box, the box dimensions and angle (real and imaginary part, explained later): [Tx, Ty, Tw, Tl, Tim, Tre], where Tx, Ty, Tw, Tl are the x, y, width, and length of the bounding box, and Tim, Tre are the imaginary and real parts of the bounding box orientation angle. Hence, 6 parameters per bounding box.
- An objectness probability, i.e. the probability that the predicted bounding box contains an object and how accurate the box is. 1 parameter.
- 3 parameters for the probabilities of the bounding box belonging to each class (car, pedestrian, cyclist).
- 5 additional parameters (alpha, Cx, Cy, Pw, Pl) used in the calculations shown in the E-RPN section below.
- So based on our calculations, for each grid cell there are 5 bounding boxes, and each bounding box has 6 + 1 + 3 + 5 = 15 parameters. That makes 5 x 15 = 75 parameters per grid cell.
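To see how these numbers fit together, the following sketch reshapes a dummy 16 x 32 x 75 output tensor into per-box parameters following the breakdown above. The channel ordering is an assumption made for illustration, not the exact layout of the official implementation.

```python
import numpy as np

# Dummy network output: one 75-dimensional feature vector per grid cell.
grid_h, grid_w, boxes_per_cell, params_per_box = 16, 32, 5, 15
output = np.random.randn(grid_h, grid_w, boxes_per_cell * params_per_box)

# Split each cell into its 5 boxes of 15 parameters each.
output = output.reshape(grid_h, grid_w, boxes_per_cell, params_per_box)
box_regression = output[..., 0:6]    # Tx, Ty, Tw, Tl, Tim, Tre
objectness     = output[..., 6]      # probability that the box contains an object
class_probs    = output[..., 7:10]   # car, pedestrian, cyclist
extra_params   = output[..., 10:15]  # alpha, Cx, Cy, Pw, Pl (per the count above)

print(box_regression.shape)  # (16, 32, 5, 6)
```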
Euler Region Proposal Network (E-RPN)
The purpose of the E-RPN is to parse the object dimension parameters from the incoming feature map and to estimate accurate object orientations and bounding box boundaries. The orientation angle, which the network regresses as the complex-number pair (Tim, Tre) mentioned above, is recovered as arctan2(Tim, Tre); regressing the two components instead of the angle itself avoids the singularities of a direct angle regression.
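Below is a minimal sketch of how the regressed parameters can be decoded back into an oriented box, assuming the YOLOv2-style transforms that Complex-YOLO builds on: the grid-cell offset (Cx, Cy) and anchor prior (Pw, Pl) shift and scale the raw predictions, and the orientation is the argument of the complex number formed by (Tre, Tim).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cx, cy, pw, pl):
    """Decode one regressed box (tx, ty, tw, tl, t_im, t_re) into an oriented
    BEV box, given the grid-cell offset (cx, cy) and anchor prior (pw, pl)."""
    tx, ty, tw, tl, t_im, t_re = t
    bx = sigmoid(tx) + cx    # box centre x, in grid-cell units
    by = sigmoid(ty) + cy    # box centre y, in grid-cell units
    bw = pw * np.exp(tw)     # width, scaled from the anchor prior
    bl = pl * np.exp(tl)     # length, scaled from the anchor prior
    # The orientation is regressed as a complex number; taking its argument
    # avoids the discontinuity of regressing the angle directly.
    phi = np.arctan2(t_im, t_re)
    return bx, by, bw, bl, phi
```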
In the paper, the performance evaluation of Complex-YOLO is done against comparatively older networks. Still, compared to the latest networks for bounding box detection on 3D point clouds, Complex-YOLO provides a good trade-off between accuracy and inference speed. It is also worth noting that Complex-YOLO achieves surprisingly high precision for the Cyclist class in the KITTI dataset, which has far fewer training examples than the other classes.
That’s it for this post. You can take a look at Part 5 of this series here.
Martin Simon et al., "Complex-YOLO: An Euler-Region-Proposal for Real-time 3D Object Detection on Point Clouds", 2019