PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection

In this section, we first introduce our experimental setup and implementation details in Sec. 5.1. Then we present the main results of our PV-RCNN/PV-RCNN++ frameworks and compare with state-of-the-art methods in Sec. 5.2. Finally, we conduct extensive ablation experiments and analysis to investigate the individual components of our proposed frameworks in Sec. 5.3.

5.1

Experimental Setup

Datasets and Evaluation Metrics We evaluate our methods on the Waymo Open Dataset (Sun et al., 2020), which is currently the largest dataset with LiDAR point clouds for 3D object detection in autonomous driving scenarios. In total, there are 798 training sequences with around 160k LiDAR samples, 202 validation sequences with around 40k LiDAR samples, and 150 testing sequences with around 30k LiDAR samples.

The evaluation metrics are calculated with the official evaluation tools, where the mean average precision (mAP) and the mAP weighted by heading (mAPH) are used for evaluation. The 3D IoU threshold is set to 0.7 for vehicle detection and 0.5 for pedestrian/cyclist detection. The comparison is conducted on two difficulty levels: LEVEL 1 denotes ground-truth objects with at least 5 inside points, while LEVEL 2 denotes ground-truth objects with at least 1 inside point. As adopted by the official Waymo evaluation server, the mAPH of LEVEL 2 difficulty is the most important evaluation metric for all experiments.

Network Architecture For the PV-RCNN framework, the 3D voxel CNN has four levels (see Fig. 1) with feature dimensions 16, 32, 64, 64, respectively. The two neighboring radii \(r_k\) of each level in the voxel set abstraction module are set as \((0.4\text {m}, 0.8\text {m})\), \((0.8\text {m},1.2\text {m})\), \((1.2\text {m}, 2.4\text {m})\), \((2.4\text {m}, 4.8\text {m})\), and the neighborhood radii of set abstraction for raw points are \((0.4\text {m}, 0.8\text {m})\). For the proposed RoI-grid pooling operation, we uniformly sample \(6\times 6\times 6\) grid points in each 3D proposal, and the two neighboring radii \({\tilde{r}}\) of each grid point are \((0.8\text {m}, 1.6\text {m})\).
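For concreteness, these hyper-parameters can be summarized in a short configuration sketch (illustrative only; the key names are ours and do not mirror the exact fields of the released OpenPCDet config files):

```python
# Illustrative hyper-parameter summary for the PV-RCNN voxel set abstraction (VSA) and
# RoI-grid pooling modules; the dictionary keys are ours, values are taken from the text.
PV_RCNN_CFG = {
    "voxel_feature_dims": [16, 32, 64, 64],            # per 3D voxel CNN level
    "vsa_radii_per_level": [(0.4, 0.8), (0.8, 1.2),    # two grouping radii (meters) per level
                            (1.2, 2.4), (2.4, 4.8)],
    "raw_point_radii": (0.4, 0.8),                      # set abstraction on raw points
    "roi_grid_size": (6, 6, 6),                         # grid points sampled per 3D proposal
    "roi_grid_radii": (0.8, 1.6),                       # neighboring radii per grid point
}
```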

Table 2 Performance comparison on the test set of the Waymo Open Dataset by submitting to the official test evaluation server. \(*\): re-implemented by (Zhou et al., 2020). \(\ddagger \): performance reported in the official open-source codebase of (Yin et al., 2021). “2f”, “3f”: the performance is achieved by using multiple point cloud frames


For the PV-RCNN++ framework, we set the maximum extended radius \(r^{(s)}=1.6\text{m}\) for proposal-centric filtering, and each scene is split into 6 sectors for parallel keypoint sampling. Two VectorPool aggregation operations are applied to the \(4\times \) and \(8\times \) feature volumes of the voxel set abstraction module with half lengths \(\delta =(1.2\text{m}, 2.4\text{m})\) and \(\delta =(2.4\text{m}, 4.8\text{m})\) respectively, and both of them use local voxels \(n_x=n_y=n_z=3\). The VectorPool aggregation on raw points is set with \(n_x=n_y=n_z=2\). For RoI-grid pooling, we adopt the same number of RoI-grid points (\(6\times 6\times 6\)) as PV-RCNN, and the adopted VectorPool aggregation has local voxels \(n_x=n_y=n_z=3\) and half lengths \(\delta =(0.8\text{m}, 1.6\text{m})\).
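As a quick sanity check on these hyper-parameters, the sketch below (our own illustration, not code from the released implementation) shows how a half length \(\delta\) and the local voxel counts translate into per-sub-voxel sizes:

```python
# Minimal sketch (our own illustration): the cubic neighborhood of half length delta spans
# [-delta, delta] per axis and is split into n sub-voxels, so each sub-voxel covers 2*delta/n meters.
def subvoxel_size(half_length, n_voxels):
    return [2.0 * d / n for d, n in zip(half_length, n_voxels)]

# One of the two neighborhood scales applied to the 4x feature volume: delta = 1.2 m, 3x3x3 sub-voxels
print(subvoxel_size((1.2, 1.2, 1.2), (3, 3, 3)))  # -> [0.8, 0.8, 0.8] meters per sub-voxel
```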

Training and Inference Details Both frameworks are trained from scratch in an end-to-end manner with the ADAM optimizer, a learning rate of 0.01 and cosine annealing learning rate decay. To train the proposal refinement stage, we randomly sample 128 proposals with a 1:1 ratio of positive and negative proposals, where a proposal is considered a positive sample if it has at least 0.55 3D IoU with the ground-truth boxes, and is otherwise treated as a negative sample. Both frameworks are trained with three losses with equal loss weights (i.e., region proposal loss, keypoint segmentation loss and proposal refinement loss), where the region proposal loss is the same as in (Yin et al., 2021) and the proposal refinement loss is the same as in (Shi et al., 2020b).
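The proposal sampling for the refinement stage can be sketched as follows (a minimal sketch assuming the 3D IoUs against ground-truth boxes are precomputed; the function name is ours):

```python
import numpy as np

# Minimal sketch of sampling 128 proposals per scene with a 1:1 positive/negative ratio,
# where a positive proposal has best 3D IoU >= 0.55 with the ground-truth boxes.
def sample_proposals_for_refinement(max_ious, num_samples=128, pos_iou_thresh=0.55):
    """max_ious: (N,) best 3D IoU of each proposal against the ground-truth boxes."""
    pos_idx = np.nonzero(max_ious >= pos_iou_thresh)[0]
    neg_idx = np.nonzero(max_ious < pos_iou_thresh)[0]
    num_pos = min(num_samples // 2, len(pos_idx))          # aim for a 1:1 ratio
    num_neg = min(num_samples - num_pos, len(neg_idx))
    pos_sel = np.random.choice(pos_idx, num_pos, replace=False)
    neg_sel = np.random.choice(neg_idx, num_neg, replace=False)
    return np.concatenate([pos_sel, neg_sel])
```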

During training, we adopt the widely used data augmentation strategies for 3D detection, including random scene flipping, global scaling with a scaling factor sampled from [0.95, 1.05], global rotation around the Z axis with an angle sampled from \([-\frac{\pi }{4}, \frac{\pi }{4}]\), and the ground-truth sampling augmentation (Yan et al., 2018), which randomly “pastes” new objects from other scenes into the current training scene to simulate objects in various environments. The detection range is set as \([-75.2, 75.2]\text{m}\) for the X and Y axes and \([-2, 4]\text{m}\) for the Z axis, while the voxel size is set as \((0.1\text{m}, 0.1\text{m}, 0.15\text{m})\). More training details can be found in our open-source codebase https://github.com/open-mmlab/OpenPCDet.
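The global augmentations above can be sketched as follows (a minimal illustration with our own helper name; ground-truth sampling is omitted, and boxes are assumed to be encoded as (x, y, z, dx, dy, dz, heading)):

```python
import numpy as np

# Minimal sketch of the global augmentations: random flipping, global scaling in
# [0.95, 1.05] and global rotation around the Z axis in [-pi/4, pi/4].
def global_augment(points, boxes):
    if np.random.rand() < 0.5:                        # random flip across the XZ plane
        points[:, 1] *= -1.0
        boxes[:, 1] *= -1.0
        boxes[:, 6] *= -1.0
    scale = np.random.uniform(0.95, 1.05)             # global scaling
    points[:, :3] *= scale
    boxes[:, :6] *= scale
    angle = np.random.uniform(-np.pi / 4, np.pi / 4)  # global rotation around Z
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    rot = np.array([[cos_a, -sin_a], [sin_a, cos_a]])
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += angle
    return points, boxes
```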

In terms of inference speed, our PV-RCNN++ framework achieves state-of-the-art performance at 10 FPS for a \(150\text{m} \times 150\text{m}\) detection range on the Waymo Open Dataset (three times faster than PV-RCNN), where a single TITAN RTX GPU card is used for profiling.

5.2

Main Results

In this section, we present the main results of our proposed PV-RCNN/PV-RCNN++ frameworks and compare them with state-of-the-art methods on the large-scale Waymo Open Dataset (Sun et al., 2020). By default, we adopt the center-based RPN head as in (Yin et al., 2021) to generate 3D proposals in the first stage, and we train a single model in each setting to detect objects of all three categories.

Comparison with State-of-the-Art Methods As shown in Table 1, for the 3D object detection setting that takes a single-frame point cloud as input, our PV-RCNN++ (i.e., “PV-RCNN++”) outperforms previous state-of-the-art works (Yin et al., 2021; Shi et al., 2020b) on all three categories with remarkable performance gains (+1.88% for vehicle, +2.40% for pedestrian and +1.59% for cyclist in terms of mAPH of LEVEL 2 difficulty). Moreover, following (Sun et al., 2021), by simply concatenating an extra neighboring past frame as input, our PV-RCNN++ framework can also be evaluated in the multi-frame setting. Table 1 (i.e., “PV-RCNN++2f”) demonstrates that the performance of our PV-RCNN++ framework can be further boosted by using 2 frames, which outperforms the previous multi-frame method (Sun et al., 2021) with remarkable margins (+2.60% for vehicle and +5.61% for pedestrian in terms of mAPH of LEVEL 2).

Meanwhile, we also evaluate our frameworks on the test set by submitting to the official test server of the Waymo Open Dataset (Sun et al., 2020). As shown in Table 2, without bells and whistles, in both single-frame and multi-frame settings, our PV-RCNN++ framework consistently and significantly outperforms the previous state-of-the-art (Yin et al., 2021) on both the vehicle and pedestrian categories: in the single-frame setting we achieve a performance gain of +1.57% for vehicle and +2.00% for pedestrian in terms of mAPH of LEVEL 2 difficulty, and in the multi-frame setting we achieve a performance gain of +2.93% for vehicle detection and +2.03% for pedestrian detection. We also achieve comparable performance for the cyclist category in both the single-frame and multi-frame settings. Note that we do not use any test-time augmentation or model ensemble tricks in the evaluation process. The significant improvements on the large-scale Waymo Open Dataset manifest the effectiveness of our proposed framework.

Table 3 Performance comparison of PV-RCNN and PV-RCNN++ on the validation set of the Waymo Open Dataset. We adopt two settings for both frameworks by equipping them with different RPN heads for proposal generation, namely the anchor-based RPN head as in (Shi et al., 2020b) and the center-based RPN head as in (Yin et al., 2021). Note that PV-RCNN++ adopts the same backbone network (without residual connections) as PV-RCNN for fair comparison


Table 4 Performance comparison of PV-RCNN and PV-RCNN++ on the test set of the KITTI dataset. The results are evaluated on the moderate difficulty level, the most important level of the KITTI evaluation metric, by submitting to the official KITTI evaluation server


Comparison of PV-RCNN and PV-RCNN++ Table 3 demonstrates that, regardless of which type of RPN head is adopted, our PV-RCNN++ framework consistently outperforms the previous PV-RCNN framework on all three categories at all difficulty levels. Specifically, in the anchor-based setting, PV-RCNN++ surpasses PV-RCNN with a performance gain of +1.54% for vehicle, +3.33% for pedestrian and +4.24% for cyclist in terms of mAPH of LEVEL 2 difficulty. With the center-based head, PV-RCNN++ also outperforms PV-RCNN with a +0.93% mAPH gain for vehicle, a +1.58% mAPH gain for pedestrian and a +1.83% mAPH gain for cyclist in terms of LEVEL 2 difficulty.

The stable and consistent improvements demonstrate the effectiveness of our proposed sectorized proposal-centric sampling algorithm and VectorPool aggregation module. More importantly, our PV-RCNN++ requires much less computation and GPU memory than the PV-RCNN framework, while also increasing the processing speed from 3.3 FPS to 10 FPS for 3D detection over such a large \(150\text{m} \times 150\text{m}\) area, which further validates the efficiency and effectiveness of our PV-RCNN++.

As shown in Table 4, we also provide a performance comparison of PV-RCNN and PV-RCNN++ on the KITTI dataset (Geiger et al., 2012). Compared with the Waymo Open Dataset, the KITTI dataset adopts a different kind of LiDAR sensor, and its scenes are about four times smaller than those in the Waymo Open Dataset. Table 4 shows that PV-RCNN++ outperforms the previous PV-RCNN on all three categories of the KITTI dataset with remarkable average performance margins, demonstrating its effectiveness in handling different kinds of scenes and different LiDAR sensors.

5.3

Ablation Study

In this section, we investigate the individual components of our PV-RCNN++ framework with extensive ablation experiments. We conduct all experiments on the large-scale Waymo Open Dataset (Sun et al., 2020). To conduct the ablation experiments efficiently, we generate a small representative training set by uniformly sampling \(20\%\) of the frames (about 32k frames) from the training set, and all results are evaluated on the full validation set (about 40k frames) with the official evaluation tool. All models are trained for 30 epochs with batch size 16 on 8 GPUs.

We conduct all ablation experiments with the center-based RPN head (Yin et al., 2021) on three categories (vehicle, pedestrian and cyclist) of Waymo Open Dataset (Sun et al., 2020), and the mAPH of LEVEL 2 difficulty is adopted as the evaluation metric.

Table 5 Effects of the voxel set abstraction (VSA) and RoI-grid pooling modules, where the UNet-decoder and RoI-aware pooling are the same as in (Shi et al., 2020b). All experiments are based on the PV-RCNN++ framework with a center-based RPN head


Table 6 Effects of different feature components for voxel set abstraction. “Frame Rate” indicates frames per second in terms of testing speed. All experiments are conducted on the PV-RCNN++ framework with a center-based RPN head. Note that the default setting of PV-RCNN++ does not use the voxel features \(f_i^{(\text {pv}_{1, 2})}\), considering their negligible gain and higher latency


Effects of Voxel-to-Keypoint Scene Encoding In Sec. 3.2, we propose the voxel-to-keypoint scene encoding strategy to encode the global scene features into a small set of keypoints, which serves as a bridge between the backbone network and the proposal refinement network. As shown in the \(2^{nd}\) and \(4^{th}\) rows of Table 5, our proposed voxel-to-keypoint scene encoding strategy achieves better performance than the UNet-based decoder while summarizing the scene into far fewer point-wise features. For instance, our voxel set abstraction module encodes the whole scene into around 4k keypoints for feeding into the RoI-grid pooling module, while the UNet-based decoder network needs to summarize the scene into around 80k point-wise features in most cases, which validates the effectiveness of our proposed voxel-to-keypoint scene encoding strategy. We consider that it might benefit from the fact that the keypoint features are aggregated from multi-scale feature volumes and raw point clouds with large receptive fields, while also keeping accurate point locations. Besides that, we should also note that the feature dimension of the UNet-based decoder is generally smaller than that of our keypoints, since the UNet-based decoder is limited by its large memory consumption on large-scale point clouds, which may degrade its performance.

We also notice that our voxel set abstraction module achieves worse performance (the \(1^{st}\) and \(3^{rd}\) rows of Table 5) than the UNet-decoder when it is combined with RoI-aware pooling (Shi et al., 2020b). This is to be expected, since the RoI-aware pooling module will generate many empty voxels in each proposal when taking only 4k keypoints, which may degrade the performance. In contrast, our voxel set abstraction module can be ideally combined with our RoI-grid pooling module, and they benefit each other by taking a small number of keypoints as the intermediate connection.

Effects of Different Features for Voxel Set Abstraction The voxel set abstraction module incorporates multiple feature components (see Sec. 3.2), and their effects are explored in Table 6. We can summarize the observations as follows: (i) The performance drops a lot if we only aggregate features from the high-level bird-view semantic features (\(f_i^{(\text {bev})}\)) or the accurate point locations (\(f_i^{(\text {raw})}\)), since neither 2D semantics alone nor points alone are enough for proposal refinement. (ii) As shown in the 6\(^{th}\) row of Table 6, \(f_i^{(\text {pv}_3)}\) and \(f_i^{(\text {pv}_4)}\) contain both 3D structure information and high-level semantic features, which greatly improve the performance when combined with the bird-view semantic features \(f_i^{(\text {bev})}\) and the raw point locations \(f_i^{(\text {raw})}\). (iii) The shallow semantic features \(f_i^{(\text {pv}_1)}\) and \(f_i^{(\text {pv}_2)}\) can slightly improve the performance but also greatly increase the training cost. Hence, the proposed PV-RCNN++ framework does not use such shallow semantic features.

Table 7 Effects of Predicted Keypoint Weighting module. All experiments are conducted on our PV-RCNN++ framework with a center-based RPN head


Effects of Predicted Keypoint Weighting The predicted keypoint weighting is proposed in Sec. 3.2 to re-weight the point-wise features of keypoints with extra keypoint segmentation supervision. As shown in Table 7, the performance slightly drops after removing this module, which demonstrates that predicted keypoint weighting enables better multi-scale feature aggregation by focusing more on the foreground keypoints, since they are more important for the succeeding proposal refinement network. Although this module adds only a small additional cost to our frameworks, we should also note that, considering its limited gains, it is optional for our frameworks.
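Conceptually, the module can be sketched as follows (our own simplified illustration; the foreground scores come from a small segmentation head supervised by point-in-box labels):

```python
import numpy as np

# Minimal sketch of predicted keypoint weighting: a segmentation head predicts a
# foreground probability for each keypoint, and the keypoint features are re-weighted
# by it so that background keypoints contribute less to proposal refinement.
def predicted_keypoint_weighting(keypoint_feats, foreground_logits):
    """keypoint_feats: (N, C); foreground_logits: (N,) from a shared MLP head."""
    foreground_prob = 1.0 / (1.0 + np.exp(-foreground_logits))   # sigmoid
    return keypoint_feats * foreground_prob[:, None]             # down-weight background
```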

Table 8 Effects of different keypoint sampling algorithms. The running time is the average running time of the keypoint sampling process on the validation set of the Waymo Open Dataset. The coverage rate is calculated by averaging the coverage rate of each scene on the validation set of the Waymo Open Dataset. “FPS” indicates farthest point sampling and “PC-Filter” indicates our proposal-centric filtering strategy. All experiments are conducted by applying different keypoint sampling algorithms to our PV-RCNN++ framework with a center-based RPN head


Fig. 6

Illustration of the keypoint distributions from different keypoint sampling strategies. Dashed circles highlight the missing parts and the clustered keypoints produced by these keypoint sampling strategies. We find that our Sectorized-FPS generates more uniformly distributed keypoints that cover more input points to better encode the scene features for proposal refinement, while other strategies may miss some important regions or generate clustered keypoints


Effects of RoI-grid Pooling Module The RoI-grid pooling module is proposed in Sec. 3.3 for aggregating RoI features from the very sparse keypoints. Here we investigate the effects of the RoI-grid pooling module by replacing it with RoI-aware pooling (Shi et al., 2020b) while keeping the other modules unchanged. As shown in the \(3^{rd}\) and \(4^{th}\) rows of Table 5, the performance drops significantly when RoI-grid pooling is replaced. This validates that our proposed RoI-grid pooling module can aggregate much richer contextual information to generate more discriminative RoI features.

Compared with the previous RoI-aware pooling module (Shi et al., 2020b), our proposed RoI-grid pooling module can generate a denser grid-wise feature representation by supporting overlapping ball areas among different grid points, while the RoI-aware pooling module may generate many zeros due to the sparsity of points inside RoIs. This means that our proposed RoI-grid pooling module is especially effective for aggregating local features from very sparse point-wise features, such as in our PV-RCNN framework, which aggregates features from a very small number of keypoints.
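For illustration, the grid points used by RoI-grid pooling can be generated as sketched below (our own simplification, assuming boxes are encoded as (x, y, z, dx, dy, dz, heading); feature aggregation around each grid point is omitted):

```python
import numpy as np

# Minimal sketch of generating the 6x6x6 RoI-grid points of a proposal: uniform grid-cell
# centers in the canonical box frame, rotated and translated into the scene frame.
def get_roi_grid_points(box, grid_size=6):
    """box: (x, y, z, dx, dy, dz, heading) -> (grid_size**3, 3) grid points in scene frame."""
    cx, cy, cz, dx, dy, dz, heading = box
    idx = (np.arange(grid_size) + 0.5) / grid_size - 0.5     # normalized cell centers in (-0.5, 0.5)
    gx, gy, gz = np.meshgrid(idx * dx, idx * dy, idx * dz, indexing='ij')
    grid = np.stack([gx.ravel(), gy.ravel(), gz.ravel()], axis=1)   # canonical box frame
    cos_h, sin_h = np.cos(heading), np.sin(heading)
    rot = np.array([[cos_h, -sin_h], [sin_h, cos_h]])
    grid[:, :2] = grid[:, :2] @ rot.T                                # rotate around Z
    return grid + np.array([cx, cy, cz])                             # translate to scene frame
```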

Effects of Proposal-Centric Filtering In the \(1^{st}\) and \(2^{nd}\) rows of Table 8, we investigate the effectiveness of our proposal-centric keypoint filtering (see Sec. 4.1). Compared with the strong baseline of the PV-RCNN++ framework equipped with vanilla farthest point sampling, our proposal-centric keypoint filtering improves the average detection performance by 1.12 mAPH in LEVEL 2 difficulty (65.87% vs. 66.99%). This validates our argument that the proposed proposal-centric keypoint sampling strategy can generate more representative keypoints by concentrating the small number of keypoints on the more informative neighboring regions of the proposals. Moreover, by reducing the number of candidate keypoints, our proposal-centric keypoint filtering makes the keypoint sampling algorithm about five times faster (133ms vs. 27ms) than the vanilla farthest point sampling algorithm.
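A minimal sketch of the filtering step is given below (the center-distance criterion with the extended radius \(r^{(s)}\) is our simplification of the proposal-centric filtering described in Sec. 4.1; the helper names are ours):

```python
import numpy as np

# Minimal sketch of proposal-centric keypoint filtering: only raw points that lie close
# to some 3D proposal are kept as candidates before farthest point sampling.
def proposal_centric_filter(points, proposal_boxes, max_extend_radius=1.6):
    """points: (N, 3); proposal_boxes: (M, 7) as (x, y, z, dx, dy, dz, heading)."""
    centers = proposal_boxes[:, :3]                                        # (M, 3)
    half_diag = 0.5 * np.linalg.norm(proposal_boxes[:, 3:6], axis=1)       # (M,)
    radii = half_diag + max_extend_radius                                   # extended radius r^(s)
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)  # (N, M)
    keep = (dist < radii[None, :]).any(axis=1)
    return points[keep]
```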

Table 9 Effects of different components in our proposed PV-RCNN++ frameworks. All models are trained with \(20\%\) of the frames from the training set and are evaluated on the full validation set of the Waymo Open Dataset, and the evaluation metric is the mAPH in terms of LEVEL_1 (L1) and LEVEL_2 (L2) difficulties as used in (Sun et al., 2020). “FPS” denotes farthest point sampling, “SPC-FPS” denotes our proposed sectorized proposal-centric keypoint sampling strategy, “VSA” denotes the voxel set abstraction module, “SA” denotes the set abstraction operation and “VP” denotes our proposed VectorPool aggregation. All models adopt the center-based RPN head for proposal generation


Effects of Sectorized Keypoint Sampling To investigate the effects of sectorized farthest point sampling (Sec. 4.1), we compare it with four alternative strategies for accelerating the keypoint sampling process: (i) Random Sampling: the keypoints are randomly chosen from the raw points. (ii) Voxelized-FPS-Voxel: the raw points are first voxelized to reduce the number of points (i.e., voxels), then farthest point sampling is applied to sample keypoints from the voxels by taking the voxel centers. (iii) Voxelized-FPS-Point: unlike Voxelized-FPS-Voxel, here a raw point is randomly selected within each selected voxel as the keypoint. (iv) RandomParallel-FPS: the raw points are randomly split into several groups, then farthest point sampling is applied to these groups in parallel for faster keypoint sampling.

As shown in Table 8, compared with the vanilla farthest point sampling algorithm (\(2^{nd}\) row), the detection performance of all four alternative strategies drops considerably. In contrast, the performance of our proposed sectorized farthest point sampling algorithm is on par with vanilla farthest point sampling (66.99% vs. 66.87%) while being three times faster (27ms vs. 9ms).
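The following sketch illustrates the idea of sectorized farthest point sampling (a sequential simplification with our own helper names; the actual implementation runs the sectors in parallel, and proposal-centric filtering is applied beforehand):

```python
import numpy as np

# Minimal sketch of farthest point sampling (FPS).
def farthest_point_sampling(points, num_samples):
    selected = [np.random.randint(len(points))]
    dists = np.full(len(points), np.inf)
    for _ in range(num_samples - 1):
        dists = np.minimum(dists, np.linalg.norm(points - points[selected[-1]], axis=1))
        selected.append(int(np.argmax(dists)))
    return np.asarray(selected)

# Minimal sketch of sectorized FPS: split candidate points into azimuth sectors around the
# scene center and run FPS independently in each sector, with a proportional keypoint budget.
def sectorized_fps(points, num_keypoints=4096, num_sectors=6):
    azimuth = np.arctan2(points[:, 1], points[:, 0])
    sector_id = ((azimuth + np.pi) / (2 * np.pi) * num_sectors).astype(int) % num_sectors
    keypoint_idx = []
    for s in range(num_sectors):
        idx = np.nonzero(sector_id == s)[0]
        if len(idx) == 0:
            continue
        n_s = int(round(num_keypoints * len(idx) / len(points)))   # proportional budget
        n_s = min(max(n_s, 1), len(idx))
        keypoint_idx.append(idx[farthest_point_sampling(points[idx, :3], n_s)])
    return np.concatenate(keypoint_idx)
```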

Analysis of the Coverage Rate of Keypoints We argue that uniformly distributed keypoints are important for proposal refinement, as a better keypoint distribution should cover more input points. Hence, to evaluate the quality of different keypoint sampling strategies, we propose an evaluation metric, the coverage rate, which is defined as the ratio of input points that are within the coverage region of at least one keypoint. Specifically, for a set of input points \({\mathcal {P}}=\{p_i\}_{i=1}^m\) and a set of sampled keypoints \({\mathcal {K}}=\{p'_j\}_{j=1}^n\), the coverage rate \({\textbf{C}}\) can be formulated as:

$$\begin{aligned} {\textbf{C}} = \frac{1}{m}\cdot \sum _{i=1}^{m} \min \Big (1.0,~ \sum _{j=1}^{n}\mathbbm {1}\big (\Vert p_i - p'_j\Vert < R_c\big )\Big ), \end{aligned}$$

(13)

where \(R_c\) is a scalar denoting the coverage radius of each keypoint, and \(\mathbbm {1}\left( \cdot \right) \in \{0, 1\}\) is the indicator function indicating whether \(p_i\) is covered by \(p'_j\).
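Eq. (13) translates directly into the following sketch (our own illustration):

```python
import numpy as np

# Direct implementation of the coverage rate in Eq. (13): the fraction of input points
# that fall within the coverage radius R_c of at least one sampled keypoint.
def coverage_rate(input_points, keypoints, coverage_radius):
    """input_points: (m, 3); keypoints: (n, 3); returns a scalar in [0, 1]."""
    dists = np.linalg.norm(input_points[:, None, :] - keypoints[None, :, :], axis=2)  # (m, n)
    covered = (dists < coverage_radius).any(axis=1)   # min(1, sum of indicators) per input point
    return covered.mean()
```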

As shown in Table 8, we evaluate the coverage rate of different keypoint sampling algorithms under multiple coverage radii. Our sectorized farthest point sampling achieves an average coverage rate (84.76%) similar to that of vanilla farthest point sampling (84.78%), which is much better than the other sampling algorithms. The higher average coverage rate demonstrates that our proposed sectorized farthest point sampling can sample more uniformly distributed keypoints that better cover the input points, which is consistent with the qualitative results of the different sampling strategies in Fig. 6.

In short, our sectorized farthest point sampling can generate uniformly distributed keypoints that better cover the input points, by splitting the raw points into different groups based on the radial distribution of LiDAR points. Although a very small number of clustered keypoints may still exist at the boundaries between sectors, the experiments show that they have a negligible effect on the performance. We consider the reason may be that the clustered keypoints are mostly in the regions around the scene center, where objects are generally easier to detect since the raw points there are much denser than in distant regions.

Effects of VectorPool Aggregation In Sec. 4.2, to tackle the resource-consuming problem of set abstraction, we propose the VectorPool aggregation module to effectively and efficiently summarize structure-preserved local features from point clouds. As shown in Table 9, by adopting VectorPool aggregation in both the voxel set abstraction module and the RoI-grid pooling module, the PV-RCNN++ framework consumes much less computation (i.e., a reduction of 4.679 GFLOPS) and GPU memory (from 10.62GB to 7.58GB) than the original PV-RCNN framework, while the performance is also consistently increased from 65.92% to 66.87% in terms of the average mAPH (LEVEL 2) of the three categories. Note that the batch size is set to only 2 in all of our settings, and the reduction in memory consumption / computation can be more significant with a larger batch size.

The significant reduction of resource consumption demonstrates the effectiveness of our VectorPool aggregation for feature learning from large-scale point clouds, which makes our PV-RCNN++ framework a more practical 3D detector for resource-limited devices. Moreover, the PV-RCNN++ framework also benefits from the structure-preserved spatial features of our VectorPool aggregation, which are critical for the following fine-grained proposal refinement.

We further analyze the effects of VectorPool aggregation by removing the channel reduction (Sun et al., 2018) in our VectorPool aggregation. As shown in Table 10, our VectorPool aggregation is effective in reducing memory consumption regardless of whether channel reduction is incorporated (by comparing the \(1^{st}\) / \(3^{rd}\) rows or the \(2^{nd}\) / \(4^{th}\) rows), since the model activations in our VectorPool aggregation modules consume much less memory than those in set abstraction by adopting a single local vector representation before the multi-layer perceptron networks. Meanwhile, Table 10 also demonstrates that our VectorPool aggregation achieves better performance than set abstraction (Qi et al., 2017b) in both cases (with or without channel reduction). We also notice that VectorPool aggregation slightly increases the number of parameters compared with the previous set abstraction module (e.g., from 13.05M to 14.11M for the setting with channel reduction), which is generally negligible given that VectorPool aggregation consumes less GPU memory.

Table 10 Effects of VectorPool aggregation with and without channel reduction (Sun et al., 2018). “SA” denotes set abstraction, “VP” denotes VectorPool aggregation module and “CR” denotes channel reduction. “#Param.” indicates the number of parameters of the model. All experiments are based on our PV-RCNN++ framework with a center-based RPN head for proposal generation, and only the local feature extraction modules are changed during the ablation experiments


Table 11 Effects of the feature aggregation strategies to generate the local sub-voxel features of VectorPool aggregation. All experiments are based on our PV-RCNN++ framework with a center-based RPN head for proposal generation


Effects of Different Feature Aggregation Strategies for Local Sub-Voxels As mentioned in Sec. 4.2, in addition to our adopted interpolation-based method, there are two alternative strategies (average pooling and random selection) for aggregating the features of local sub-voxels. Table 11 demonstrates that our interpolation-based feature aggregation achieves much better performance than the other two strategies, especially for small objects like pedestrians and cyclists. We consider that our strategy can generate more effective features by interpolating from the three nearest neighbors (even beyond the sub-voxel), while both of the other two methods may generate many zero features for empty sub-voxels, which can degrade the final performance.
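The interpolation-based aggregation can be sketched as follows (the inverse-distance weighting of the three nearest neighbors is our assumption for illustration; the helper name is ours):

```python
import numpy as np

# Minimal sketch of interpolation-based sub-voxel feature aggregation: each sub-voxel
# center gathers features from its three nearest neighbor points, so even empty
# sub-voxels receive non-zero features (unlike average pooling or random selection).
def interpolate_subvoxel_features(subvoxel_centers, neighbor_xyz, neighbor_feats, eps=1e-8):
    """subvoxel_centers: (V, 3); neighbor_xyz: (K, 3), K >= 3; neighbor_feats: (K, C)."""
    dists = np.linalg.norm(subvoxel_centers[:, None, :] - neighbor_xyz[None, :, :], axis=2)
    nn_idx = np.argsort(dists, axis=1)[:, :3]                        # three nearest neighbors
    nn_dists = np.take_along_axis(dists, nn_idx, axis=1)             # (V, 3)
    weights = 1.0 / (nn_dists + eps)
    weights = weights / weights.sum(axis=1, keepdims=True)           # normalized inverse-distance weights
    return (neighbor_feats[nn_idx] * weights[..., None]).sum(axis=1)  # (V, C)
```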

Table 12 Effects of separate local kernel weights and the number of dense voxels in our proposed VectorPool aggregation module. All experiments are based on our PV-RCNN++ framework with a center-based head for proposal generation


Effects of Separate Local Kernel Weights in VectorPool Aggregation We adopt separate local kernel weights (see Eq. (11)) on different local sub-voxels to generate position-sensitive features. The \(1^{st}\) and \(2^{nd}\) rows of Table 12 show that the performance drops slightly if we remove the separate local kernel weights and adopt shared kernel weights for relative position encoding. This validates that separate local kernel weights are better than the previous shared-parameter MLP for local feature encoding, and that they are important in our proposed VectorPool aggregation module.
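The difference between separate and shared local kernel weights can be illustrated as follows (a minimal sketch with our own names, not the exact formulation of Eq. (11)):

```python
import numpy as np

# Minimal sketch of separate local kernel weights: each of the V = n_x * n_y * n_z
# sub-voxels owns its own weight matrix, so features at different relative positions are
# encoded differently before being concatenated into a single local vector.
def encode_local_vector(subvoxel_feats, kernel_weights):
    """subvoxel_feats: (V, C_in); kernel_weights: (V, C_in, C_out), one matrix per sub-voxel."""
    encoded = np.einsum('vc,vco->vo', subvoxel_feats, kernel_weights)   # position-sensitive encoding
    return encoded.reshape(-1)   # concatenated local vector of length V * C_out

# In the ablated variant with shared kernel weights, a single (C_in, C_out) matrix would be
# applied to every sub-voxel, discarding the position sensitivity.
```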

Effects of Dense Voxel Numbers in VectorPool Aggregation Table 12 investigates the number of dense voxels \(n_x\times n_y \times n_z\) in VectorPool aggregation for the voxel set abstraction module and the RoI-grid pooling module, where we can see that VectorPool aggregation with \(3\times 3\times 3\) and \(4 \times 4 \times 4\) voxels achieves similar performance, while the performance of the \(2\times 2\times 2\) setting drops a lot. We consider that our interpolation-based VectorPool aggregation can generate effective voxel-wise features even with many dense voxels, hence the \(4\times 4 \times 4\) setting achieves slightly better performance than the \(3\times 3\times 3\) setting. However, since the \(4 \times 4 \times 4\) setting greatly increases the computation and memory consumption, we finally choose the \(3\times 3\times 3\) dense voxel representation in both the voxel set abstraction module (except the raw-point layer) and the RoI-grid pooling module of our PV-RCNN++ framework.

Table 13 Effects of the number of keypoints for encoding the global scene. All experiments are based on our PV-RCNN++ framework with a center-based head for proposal generation


Effects of the Number of Keypoints In Table 13, we investigate the effects of the number of keypoints used to encode the scene features. Table 13 shows that a larger number of keypoints achieves better performance, and similar performance is observed when using more than 4,096 keypoints. Hence, to balance performance and computation cost, we empirically choose to encode the whole scene into 4,096 keypoints for the Waymo Open Dataset. These experiments show that our method can effectively encode the whole scene into 4,096 keypoints while keeping performance similar to that obtained with a much larger number of keypoints, which demonstrates the effectiveness of the keypoint feature encoding strategy of our proposed PV-RCNN/PV-RCNN++ frameworks.

Table 14 Effects of the grid size in RoI-grid pooling module. All experiments are based on our PV-RCNN++ framework with a center-based head for proposal generation


Table 15 Comparison of PV-RCNN/PV-RCNN++ on different sizes of scenes. “FoV” indicates the field of view of each scene, where for each scene in the Waymo Open Dataset we crop a specific angle (e.g., 90\(^{\circ }\), 180\(^{\circ }\)) of the frontal view for training and testing, and 360\(^{\circ }\) FoV indicates the original scene. “#Points” indicates the average number of points in each scene. “Frame Rate” indicates frames per second in terms of testing speed


Effects of the Grid Size in RoI-grid Pooling. Table 14 shows the performance of adopting different RoI-grid sizes in the RoI-grid pooling module. We can see that the performance increases with the RoI-grid size from \(3\times 3 \times 3\) to \(6\times 6\times 6\), and settings with RoI-grid sizes larger than \(6\times 6\times 6\) achieve similar performance. Hence, we finally adopt the RoI-grid size \(6\times 6\times 6\) for the RoI-grid pooling module. Moreover, from Table 14 and Table 9, we also notice that PV-RCNN++ with a much smaller RoI-grid size of \(4\times 4 \times 4\) (66.45% in terms of mAPH@L2) can still outperform PV-RCNN with the larger RoI-grid size of \(6\times 6\times 6\) (65.24% in terms of mAPH@L2), which further validates the effectiveness of our proposed sectorized proposal-centric sampling strategy and VectorPool aggregation module.

Comparison on Different Sizes of Scenes. To investigate the effects of our proposed PV-RCNN++ in handling large-scale scenes, we further conduct ablation experiments to compare the effectiveness and efficiency of the PV-RCNN and PV-RCNN++ frameworks on different sizes of scenes. As shown in Table 15, we compare the two frameworks on three sizes of scenes by cropping different angles of the frontal view of the scenes in the Waymo Open Dataset for training and testing. The PV-RCNN++ framework consistently outperforms the previous PV-RCNN framework on all three sizes of scenes with large performance gains. Table 15 also demonstrates that as the scale of the scenes gets larger, PV-RCNN++ becomes much more efficient than PV-RCNN. In particular, when the comparison is conducted on the original scenes of the Waymo Open Dataset, the running speed of PV-RCNN++ is about three times faster than that of PV-RCNN, demonstrating the efficiency of PV-RCNN++ in handling large-scale scenes.