Hybrid Attention-Based 3D Object Detection with Differential Point Clouds

Based on this hybrid attention mechanism, we propose a 3D object detection method with differential point clouds. Extensive experiments on the KITTI dataset show that our model significantly outperforms state-of-the-art methods.

To make full use of the crucial information, we introduce a new hybrid sampling module (HS) in the sampling layer which integrates various sampling methods, such as D-FPS, F-FPS, and Random Sampling.

In existing methods, important foreground points are often discarded before the final bounding box regression step. Therefore, the proposed model pays more attention to the foreground points, aiming to achieve high recognition accuracy through hybrid attention mechanisms. Moreover, we propose a novel HA module that generates pointwise features based on the sampled input point cloud. Finally, the original pointwise features are concatenated with the enhanced pointwise features to make them more recognizable.

Early works usually convert raw point clouds into regular intermediate representations, including projecting 3D point cloud data from bird’s-eye or frontal views into 2D images or dense 3D voxels. However, using voxel conversion to improve efficiency can lead to the loss of critical information, resulting in false and missed detections. PointPillars [ 14 ] encodes point clouds with pillar coding, which achieves extremely fast detection speed. However, it simultaneously loses many important foreground points, so its handling of fine details is not ideal and many missed and false detections occur. To address this critical problem, TANet [ 15 ] enhances the local characteristics of the voxels by introducing an attention mechanism. However, due to the information loss during voxel conversion, it cannot avoid false and missed detections. In DA-PointRCNN [ 16 ], the density-based sampling method pays better attention to regions where the point cloud is sparse and reduces missed detections; however, false detections remain because the importance of feature information is ignored. Therefore, we retain foreground points with rich details as much as possible during sampling and remove the large number of background points that do not affect the recognition effect. This makes the entire network architecture more lightweight and reduces the occurrence of missed detections. Furthermore, introducing an attention mechanism over the point-wise features enhances the foreground features and helps avoid false detections. In Figure 1 , we show the missed and false detections produced by PointRCNN due to missing foreground point information.

Currently, existing object detection methods mainly include image-based, point cloud-based, and multi-sensor methods [ 2 ]. Among them, image-based methods lack depth and 3D structure information, making it challenging to accurately identify and locate objects in 3D space. Therefore, approaches based on image information tend to be less effective than those based on point clouds [3–5]. GS [ 6 ] proposed fusing point cloud and image data for object detection. Subsequently, the classic methods MV3D, PC-CNN [ 7 ], AVOD [ 8 ], PointPainting [ 9 ], etc., were proposed. However, although these fusion methods can integrate the characteristics of point clouds and images to a certain extent, the vast amount of computation involved and the complexity of the networks have brought considerable challenges to this field. Thus, point cloud-based methods are the main methods used for autonomous driving. Point cloud-based methods have developed rapidly in the last few years, and many classic methods have been proposed, including PointNet [ 10 ], PointNet++ [ 11 ], VoxelNet [ 12 ], SE-SSD [ 13 ], etc.

3D object detection based on point clouds has many applications in natural scenes, especially in autonomous driving. Point cloud data provide reliable geometric and depth information. However, point clouds are disordered, sparse, and unevenly distributed, increasing the difficulty of object detection [ 1 ].

Point-based detection methods directly process the raw point cloud and effectively utilize the physical information of the point cloud itself. However, the huge amount of data inevitably takes up a lot of time and computing resources. Therefore, improving the efficiency of point-based detection is a bottleneck for this method.

Unlike voxel-based detection methods, point-based methods directly process the disordered and cluttered point cloud. This approach obtains features point-by-point in order to make a prediction for each point. The point cloud itself contains very rich physical structure information. Therefore, a point-wise processing network was first proposed in the form of PointNet. This network directly takes the original point cloud as input, guaranteeing no loss of physical information from the original point cloud. Subsequently, PointNet++ refined PointNet, improving detection efficiency and further optimizing the network structure. Most subsequent point-based methods have used this network and its variants to process point clouds. PointRCNN [ 22 ] utilizes PointNet++ to extract features from raw point clouds and a Region Proposal Network (RPN) to generate prediction boxes. 3DSSD [ 23 ] introduces a 3D single-stage detection network which uses Euclidean space to achieve feature sampling for distant points. PointGNN [ 24 ] adds a graph neural network to the framework of 3D object detection, effectively improving recognition accuracy. ProposalContrast [ 25 ] proposes a new unsupervised point cloud pre-training framework to achieve better detection results. Proficient Teachers [ 26 ] introduces a new 3D SSL framework that provides better results and removes the necessity of using confidence-based thresholds to filter pseudo-labels.

In general, voxel-based detection methods can achieve better detection effects and higher efficiency to a large extent. However, voxelizing the point cloud inevitably causes information loss. Later research has compensated for the loss and distortion introduced in the point cloud processing stage by continuously introducing complex module designs, which makes up for this defect to a certain extent; however, it has a great impact on detection efficiency. Therefore, using voxelization to process point cloud data has certain limitations.

According to their different detection stages, existing voxel detectors can be roughly divided into single-stage detectors and two-stage detectors. While single-stage methods are efficient and straightforward, due to the reduction of spatial resolution and insufficient structural information, their detection performance is significantly affected when the point cloud is relatively sparse. Thus, SA-SSD [ 18 ] supplements the utilization of structural information by adding auxiliary networks. HVNet [ 19 ] offers a hybrid voxel network that refines the projected and aggregated feature maps from multiple scales to improve detection performance. CIA-SSD [ 20 ] introduces a network incorporating IoU-aware confidence correction to extract spatially informative features of detected objects. In comparison, two-stage detectors can achieve higher performance at the cost of higher computation and storage. Part-A2 [ 21 ] proposes a two-stage detector consisting of part-awareness and aggregation modules, which is better able to utilize the location information of detected objects.

In point cloud-based methods, converting the raw point cloud into a regular voxel grid and extracting local features for object detection has attracted much attention. The voxel concept was first proposed with VoxelNet, in which the point cloud is divided into voxel blocks and detection proceeds by extracting local features from each voxel. However, this requires considerable computation. SECOND [ 17 ] adds a sparse convolution operation based on VoxelNet to speed up calculation. PointPillars directly converts point clouds into pseudo-images, avoiding time-consuming 3D convolution calculations.

Because the class distribution is severely imbalanced when the foreground points are segmented, we choose focal loss as the classification loss function, as shown in Formula (4). Finally, the total loss function is

L_total = (1 / N_pos) (β_1 L_cls + β_2 L_loc),

where N_pos is the number of positive anchors, and β_1 and β_2 are the weights balancing the classification and localization terms.

The center error loss function of the bounding box is defined as

L_loc = Σ_{r ∈ (x, y, z, w, l, h, θ)} SmoothL1(Δr),

where SmoothL1 is the smoothing function, and the specific calculation method is

SmoothL1(x) = 0.5 x², if |x| < 1; |x| − 0.5, otherwise.

The loss function of hybrid sampling is designed with reference to [ 29 ], and the specific formula is shown in Formula (1). The overall loss function in this work is designed based on [15,22]. According to the 3D box we designed, the linear regression between the true value of the detected object and the predicted anchor point is defined as

Δx = (x_gt − x_a) / d_a, Δy = (y_gt − y_a) / d_a, Δz = (z_gt − z_a) / h_a,
Δw = log(w_gt / w_a), Δl = log(l_gt / l_a), Δh = log(h_gt / h_a), Δθ = θ_gt − θ_a,
with d_a = sqrt(w_a² + l_a²),

where the subscript gt represents the ground truth of the object, while the subscript a represents the predicted value (anchor box).
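As a worked illustration of this parameterization, the sketch below computes the regression residuals between one ground-truth box and one anchor. The NumPy form and the helper name are ours, not the paper's code, and the (x, y, z, w, l, h, θ) ordering is an assumption.

```python
import numpy as np

def encode_box_targets(gt, anchor):
    """Regression targets between a ground-truth box and an anchor.

    Both boxes are (x, y, z, w, l, h, theta); a minimal sketch following the
    SECOND-style residual parameterization assumed above.
    """
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    da = np.sqrt(wa ** 2 + la ** 2)   # anchor diagonal, normalizes the center offsets
    return np.array([
        (xg - xa) / da,
        (yg - ya) / da,
        (zg - za) / ha,
        np.log(wg / wa),
        np.log(lg / la),
        np.log(hg / ha),
        tg - ta,
    ])
```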

Considering that the objects detected in this work are three-dimensional, the traditional 2D detection frame is no longer applicable; thus, we use an upgraded 3D detection frame here. Therefore, in designing the loss function we use an Intersection over Union (IoU) extended to three-dimensional space, which we call 3D-IoU. The specific form is shown in Figure 5 .

In most scenes, the number of foreground points tends to be much smaller than that of background points. Therefore, we use the focal loss function [ 28 ] to address the classification imbalance issue:

L_focal = −α_t (1 − p_t)^γ log(p_t),

where p_t is the probability of the foreground point. In the process of point cloud segmentation training, we keep the settings α_t = 0.25 and γ = 2 as the original values by default.
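The sketch below shows one way this focal loss could be implemented in PyTorch for the binary foreground/background case; the function name and tensor shapes are assumptions, and α = 0.25, γ = 2 follow the focal loss defaults kept above.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Binary focal loss for foreground/background point classification.

    A minimal sketch; `logits` and `labels` share the same shape, with
    labels in {0, 1} marking background and foreground points.
    """
    prob = torch.sigmoid(logits)                        # foreground probability p
    p_t = torch.where(labels == 1, prob, 1.0 - prob)    # p_t as in the formula above
    alpha_t = torch.where(labels == 1,
                          torch.full_like(prob, alpha),
                          torch.full_like(prob, 1.0 - alpha))
    ce = F.binary_cross_entropy_with_logits(logits, labels.float(), reduction="none")
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```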

After the previous HA module processing, the full-point feature map F_full ∈ R^(N×(d+c)) is generated. On this basis, a foreground segmentation branch composed of two convolution layers is attached to it, and the confidence S_fore ∈ R^N of each point in the input point set P is further estimated. The Sigmoid function is used to normalize S_fore into the foreground mask S_fore^norm ∈ R^N, which serves as an important basis for subsequent segmentation.
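A minimal sketch of such a segmentation branch is shown below, assuming PyTorch and a hidden width of 64 (not specified in the text); two 1×1 convolutions act per point and a sigmoid produces the normalized mask.

```python
import torch
import torch.nn as nn

class ForegroundSegHead(nn.Module):
    """Foreground segmentation branch: two 1x1 convolutions over per-point
    features, followed by a sigmoid that yields the normalized mask."""
    def __init__(self, in_channels, hidden=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, f_full):                      # f_full: (B, d + c, N)
        s_fore = self.head(f_full)                  # per-point confidence, (B, 1, N)
        return torch.sigmoid(s_fore).squeeze(1)     # normalized mask, (B, N)
```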

Previous modules reserve foreground points with rich information for this task. Segmenting the foreground points enables the point cloud network to capture contextual information. This can improve the accuracy of pointwise prediction and benefit the generation of 3D prediction boxes. For this, we use a bottom-up 3D prediction box generation method. Foreground point segmentation and prediction box generation are performed simultaneously, generating prediction boxes directly from the reserved foreground points.

Finally, the original feature F_1 obtained from the sampling process is concatenated with the enhanced feature, and the final feature F_2 is the output. Through the above operations the pointwise features are enhanced, which significantly contributes to the final task of foreground point segmentation while suppressing irrelevant and noisy features. We call this attention mechanism module the HA module.

In the HA module, we adopt two kinds of attention mixed in parallel; the specific architecture is shown in Figure 4 . First, the point-wise features obtained in the sampling process are used as the pointwise and channelwise inputs. To increase the spatial receptive field of each channel feature, two pooling operations (average pooling and maximum pooling) are applied independently, and their outputs are respectively denoted by F_avg and F_max. Then, the sigmoid function is used to perform the final nonlinear activation, generating the required channel attention weight matrix. The specific formula is expressed as follows:

M_c = σ(W_1(W_0(F_avg)) + W_1(W_0(F_max))),

where σ is the sigmoid function and W_0 and W_1 are the channel attention weights learned through the MLP.
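A possible PyTorch realization of this channel-wise branch is sketched below; the reduction ratio of the shared MLP is an assumption, since the paper does not state it.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel-wise branch of the HA module: average- and max-pooled
    descriptors pass through a shared MLP (weights W0, W1) and the sigmoid
    of their sum gives the channel attention weights."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP: W1(W0(.))
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                          # x: (B, C, N) per-point features
        avg = self.mlp(x.mean(dim=2))              # F_avg pooled over all points
        mx = self.mlp(x.max(dim=2).values)         # F_max pooled over all points
        w = torch.sigmoid(avg + mx)                # channel attention weights M_c
        return x * w.unsqueeze(-1)                 # reweight every channel
```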

As shown in Figure 2 , for each input, the coordinate information of all points inside forms an input vector. We extract the point-wise features of each input by learning a mapping; here, we use a three-layer MLP with sizes (64, 128, 128):

F_1 = MLP(P),

allowing us to obtain the point-wise feature representation F_1, which is transformed by the subsequent layer for deeper feature learning.
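The point-wise feature extractor could look like the following sketch, where the three shared layers of sizes (64, 128, 128) are implemented with 1×1 convolutions; the use of BatchNorm and ReLU between layers is an assumption.

```python
import torch.nn as nn

class PointwiseMLP(nn.Module):
    """Three shared MLP layers of sizes (64, 128, 128) that lift each point's
    raw coordinates and reflectance to a point-wise feature, PointNet-style."""
    def __init__(self, in_channels=4, sizes=(64, 128, 128)):
        super().__init__()
        layers, prev = [], in_channels
        for width in sizes:
            layers += [nn.Conv1d(prev, width, kernel_size=1),  # same weights for every point
                       nn.BatchNorm1d(width),
                       nn.ReLU(inplace=True)]
            prev = width
        self.mlp = nn.Sequential(*layers)

    def forward(self, points):        # points: (B, 4, N) -> features F_1: (B, 128, N)
        return self.mlp(points)
```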

Consider the original point clouds, represented as G = (V, D), where V = {p_1, p_2, …, p_n} indicates n points in a D-dimensional metric space. In our approach, D is set to 4; thus, each point in 3D space is defined as v_i = (x_i, y_i, z_i), where x_i, y_i, z_i denote the coordinate values of each point along the axes X, Y, and Z, while the fourth dimension is the laser reflection intensity, denoted as s_i. Each input P contains N points, P = {p_i = [v_i, s_i]^T ∈ R^4 | i = 1, 2, …, N}.

Furthermore, we additionally introduce a branch to exploit the underlying feature semantics. In particular, two MLP layers are attached to the encoding layer in order to further estimate each point’s semantic category. Here, we use the vanilla cross-entropy loss function

L_sem = −Σ_{c=1}^{C} y_c log(ŷ_c),

where C represents the number of categories, y_c is a one-hot label, and ŷ_c represents the predicted logit. During inference, the top k foreground points are kept, regarded as feedback, and sent to the following encoding layer as representative points.
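One plausible form of this auxiliary branch is sketched below; treating class 0 as background, the hidden width, and the default value of k are assumptions introduced for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class SemanticBranch(nn.Module):
    """Auxiliary branch: two MLP layers estimate a per-point semantic category;
    the top-k highest-scoring foreground points are fed back to the next
    encoding layer as representative points."""
    def __init__(self, in_channels, num_classes, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv1d(hidden, num_classes, 1),
        )

    def forward(self, feats, labels=None, k=512):     # feats: (B, C, N)
        logits = self.mlp(feats)                      # (B, num_classes, N)
        loss = None
        if labels is not None:                        # vanilla cross-entropy, labels: (B, N)
            loss = F.cross_entropy(logits, labels)
        fg_score = 1.0 - logits.softmax(dim=1)[:, 0]  # assume class 0 = background
        top_idx = fg_score.topk(k, dim=1).indices     # indices of the kept foreground points
        return logits, loss, top_idx
```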

Most of the sampling methods used in current models are D-FPS (Distance-Farthest Point Sampling) and F-FPS (Feature-Farthest Point Sampling). Because F-FPS retains many foreground points through the SA layer while the total number of representative points is limited, many background points are discarded in the process of downsampling. While this makes regression easier, it is not conducive to classification. The SA layer gathers features from adjacent points, and the background points usually cannot find enough surrounding points. These issues make it challenging to distinguish foreground points from background points, resulting in poor classification performance. To better preserve the foreground points without affecting the later regression, we propose a new hybrid sampling (HS) method. Multiple sampling methods, such as D-FPS, F-FPS, and random sampling, are mixed in parallel to preserve more foreground points for localization along with enough background points for classification.
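A simplified sketch of the idea follows: D-FPS, F-FPS, and random sampling are run in parallel on the same input and their indices are merged. The even three-way split of the sampling budget and the naive FPS loop are assumptions; the paper's actual ratios and implementation may differ.

```python
import torch

def farthest_point_sampling(data, m):
    """Iterative farthest point sampling in the given metric space.
    `data`: (N, D) tensor of coordinates (D-FPS) or features (F-FPS)."""
    n = data.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(0, n, (1,)).item()
    for i in range(m):
        idx[i] = farthest
        d = ((data - data[farthest]) ** 2).sum(dim=1)   # distance to the newest pick
        dist = torch.minimum(dist, d)                   # distance to the picked set
        farthest = torch.argmax(dist).item()            # next farthest point
    return idx

def hybrid_sampling(xyz, feats, m):
    """Hybrid sampling (HS) sketch: D-FPS, F-FPS, and random sampling run in
    parallel and their indices are merged, keeping more foreground points for
    localization and enough background points for classification."""
    k = m // 3
    d_idx = farthest_point_sampling(xyz, k)                # distance-based FPS
    f_idx = farthest_point_sampling(feats, k)              # feature-based FPS
    r_idx = torch.randperm(xyz.shape[0])[: m - 2 * k]      # random sampling
    # duplicates across the three samplers are merged, so slightly fewer
    # than m indices may be returned
    return torch.unique(torch.cat([d_idx, f_idx, r_idx]))
```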

To improve the detection efficiency of 3D objects, especially given the huge volume of point cloud data, progressive downsampling must be used to improve calculation speed and reduce costs. However, aggressive downsampling may result in the loss of foreground points, leaving valuable information about the detected object missing and making missed or false detections more likely. Therefore, we propose a hybrid sampling strategy as our sampling method; the specific sampling rules are shown in Figure 3 .

As shown in Figure 2 a, the proposed model framework mainly consists of three parts: Hybrid Sampling (HS), a Hybrid Attention mechanism (HA), and Foreground Point Segmentation. First, the input original point cloud is processed through hybrid sampling, retaining as many foreground points as possible. Then, the point-wise features are generated and enhanced by the HA module. Subsequently, the foreground segmentation network is used to segment the foreground points and generate prediction boxes. Finally, 3D NMS is used to filter the prediction boxes and the refinement module retains the final boxes. In Figure 2 b, point-wise features are extracted from each sampled point cloud input and then enhanced in the attention layer. Finally, the generated original pointwise features and the pointwise features produced by the attention layer are concatenated.

Unlike voxel-based methods, point-based methods need to perform point-wise detection, and as such need to pay more attention to foreground points (i.e., cars, pedestrians, etc.). However, most current point-based object detection frameworks usually adopt downsampling methods, such as random sampling [ 27 ] or farthest point sampling. Although these sampling methods can improve computational efficiency, the essential foreground points are ignored. Therefore, in this work we aim to train a point-based model to better retain the information of foreground points and efficiently detect multiple types of objects at one time. Based on this, we propose an efficient point cloud-based object detection algorithm.

In addition, this work uses Precision and Recall as measures of the results. To this end, we show the multi-view P-R curves of HA-RCNN for vehicle detection, where each view contains P-R curves at three difficulty levels: Easy, Moderate, and Hard. The specific curves are shown in Figure 7 .

To analyze the effects of the different components of HA-RCNN, we conducted extensive ablation experiments on the car class. We used the initial structure, with the HS and HA modules disconnected, as the baseline of the experiment. As shown in Table 4 , extensive ablation experiments were performed on the proposed HS and HA modules. With only the baseline, the mAP reached 86.43%, 77.39%, and 75.87% on the Easy, Moderate, and Hard levels, respectively. When only the HS module was added, the mAP increased to 87.26%, 78.55%, and 76.91%. With only the HA module, the mAP reached 88.49%, 79.15%, and 77.44%. Finally, when the two modules were added together, the mAP reached 89.23%, 79.88%, and 77.92%. It can be clearly observed that both the HS module and the HA module are of great significance, and that the improvement in detection accuracy is greatest when the two modules are integrated into the model together.

To further validate the superior performance of our model, we evaluated HA-RCNN on the Waymo dataset. This dataset consists of nearly 160k 360-degree LiDAR samples in the training set and 40k samples in the validation set, with panoramic annotated objects. To make a fair comparison, we adjusted our framework during the evaluation, changing the number of input points from 16,384 to 65,536 and increasing the sampling ratio of each sampling layer fourfold. The comparison results are shown in Table 3 . Compared with other methods, our HA-RCNN has obvious advantages in vehicle and cyclist detection. In pedestrian detection, HA-RCNN has only a slight advantage. We speculate that this is because the pedestrian point cloud is relatively sparse, which is not conducive to detection; we aim to address this problem in future work.

We evaluated our model on the 3D detection benchmark on the KITTI test server; the results are shown in Table 2 . The HA-RCNN used in this work was trained for 200 epochs, with the training loss curve shown in Figure 6 . As the number of epochs increases, the value of the loss function decreases rapidly. The decrease slows down after 25 epochs of training, flattens after 50 epochs, and the curve plateaus after 150 epochs. As can be seen from the figure, our loss function converges quickly, the loss value is low, and the final loss value is maintained at around 0.5.

In the foreground point segmentation stage, for more robust segmentation we ignore the background points near the object boundary during training by enlarging each side of the 3D ground-truth box by 0.2 m. The prediction boxes are generated according to the pointwise vectors. For prediction box classification training, a prediction box is considered positive if its maximum 3D IoU with the ground-truth boxes is above 0.6, and negative if the maximum 3D IoU is below 0.45. In the experiments, we use a 3D IoU of 0.55 as the minimum criterion for prediction box regression training.
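The classification assignment rule above can be summarized in a few lines. The sketch below labels each prediction box from its maximum 3D IoU, with -1 marking boxes that are ignored between the two thresholds (an implementation convention we assume, not stated in the text).

```python
import torch

def assign_box_labels(max_ious, pos_thresh=0.6, neg_thresh=0.45):
    """Classification targets for predicted boxes from their maximum 3D IoU
    with the ground truth: >= 0.6 -> positive (1), < 0.45 -> negative (0),
    in between -> ignored (-1). A minimal sketch of the rule stated above."""
    labels = torch.full_like(max_ious, -1.0)
    labels[max_ious >= pos_thresh] = 1.0
    labels[max_ious < neg_thresh] = 0.0
    return labels
```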

During training, we sampled 16,384 points from each scene as input. For scenes with fewer than 16,384 points, we randomly repeated points in the scene to reach 16,384. In the sampling stage, we followed the network structure of PointNet++ in the SA layers and used four set abstraction layers with multi-scale grouping to divide the points into groups of 4096, 1024, 256, and 64. The detailed parameter settings are shown in Table 1 .
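A small sketch of this input preparation step is given below; the NumPy form and the function name are ours, and randomly subsampling scenes that already exceed 16,384 points is an assumption consistent with the fixed input size.

```python
import numpy as np

def sample_scene_points(points, num_points=16384):
    """Fix each scene to exactly `num_points` inputs: randomly subsample larger
    scenes and randomly repeat points in smaller ones, as described above.
    `points` is an (N, 4) array of (x, y, z, intensity)."""
    n = points.shape[0]
    if n >= num_points:
        choice = np.random.choice(n, num_points, replace=False)
    else:
        extra = np.random.choice(n, num_points - n, replace=True)  # repeat random points
        choice = np.concatenate([np.arange(n), extra])
    return points[choice]
```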

Because the training and testing described in this article are based on the KITTI dataset, the metrics officially provided by KITTI are directly used as the metrics in this work. Therefore, we use the Average Precision (AP) here as our evaluation index. The specific calculation formulas are as follows:

P = TP / (TP + FP), R = TP / (TP + FN), AP = ∫₀¹ P(R) dR,

where P is the precision, R is the recall, AP is the average precision, TP is the number of detection frames with IoU > 0.5, FP is the number of detection frames with IoU ≤ 0.5 or the number of redundant detection frames matching the same ground truth, and FN is the number of ground-truth objects not detected.
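For illustration, the sketch below computes precision and recall from these counts and approximates AP as the area under the P-R curve; note that KITTI's official evaluation uses interpolated sampling of the curve, which this simplified version does not reproduce.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from the counts defined above: TP are detections
    with IoU > 0.5, FP are detections with IoU <= 0.5 or redundant matches,
    FN are ground-truth objects that were not detected."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

def average_precision(precisions, recalls):
    """AP approximated as the area under the P-R curve via numerical
    integration over recall (not KITTI's exact interpolation)."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))
```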

In this work, the Velodyne data, Image data, Calib data, and Label data in KITTI are selected for the related experiments, and the object detection results are displayed in the visualized point cloud space [ 30 ]. The KITTI dataset contains detection targets at three difficulty levels, namely, Easy, Moderate, and Hard. To analyze the results of the experiments, we employ KITTI’s official evaluation method, in which the IoU (Intersection over Union) is used as a measure of the positioning accuracy of the object detection frame. If the overlap between the object detection frame and the label frame (IoU) reaches more than 50%, the object is considered to have been correctly detected. The calculation formula of the intersection ratio is

IoU = Area(B_det ∩ B_gt) / Area(B_det ∪ B_gt),

where B_det is the object detection frame and B_gt is the label frame.

We used HA-RCNN to conduct systematic experiments on the KITTI dataset and compared it with state-of-the-art models. Finally, HA-RCNN was analyzed through various representations, such as the loss function, the P-R curve, and visualized detection results.

5. Discussion

After extensive systematic experiments to verify the effectiveness of our proposed hybrid sampling approach, we report the actual recall (i.e., the ratio of instances still retained after sampling) for each layer in Table 5 . At the same time, we report the recall of the random sampling method, the Euclidean distance-based sampling method (D-FPS), and the feature distance-based sampling method (F-FPS) for comparison.

From the analysis in Table 5 , we can reach the following conclusions:

(1) After multiple random sampling operations the recall rate drops significantly, which means that a large number of foreground points are discarded.

(2) D-FPS and F-FPS have reasonable recall rates in the early stages, while the later stages are slightly worse, causing a loss of foreground point information. Therefore, it is challenging to accurately detect objects of interest, especially after multiple rounds of sampling with a limited number of preserved foreground points.

(3) Our hybrid sampling approach has a significant performance advantage over most current methods, achieving higher recall and retaining more foreground points.

From the analysis in Table 2 , it can be concluded that compared with several previous classic methods, our model has apparent advantages in detecting cars and cyclists at the Easy and Moderate difficulty levels, while the advantage is relatively small for Hard samples. Currently, many models take both RGB images and point cloud data as input to improve detection accuracy. However, our model takes only point cloud data as input for object detection, and nonetheless achieves better performance.

For pedestrian detection, our method is not much different from previous ones that only use LiDAR data. However, our detection performance is slightly inferior compared to the multi-sensor approaches. We believe that this is because pedestrians are smaller, making their associated point clouds much sparser than those of cars and cyclists. Although our approach preserves as many foreground points as possible, the effect is insufficient because the objects are too small. Images can capture finer object details, meaning that multi-sensor detection methods have an advantage over our approach with respect to pedestrian detection.

In order to more intuitively observe the detection effect of our HA-RCNN model, we visualize the detection results in Figure 8 . It can be seen from the visualization that our proposed HA-RCNN model has good detection performance for cars and cyclists in various environments.