Efficient LiDAR Point Cloud Geometry Compression Through Neighborhood Point Attention

Ruixiang Xue, Jianqiang Wang, Zhan Ma
Nanjing University
{xrxee, wangjq}@smail.nju.edu.cn, mazhan@nju.edu.cn
Abstract

Although convolutional representation of multiscale sparse tensor demonstrated its superior efficiency to accurately model the occupancy probability for the compression of geometry component of dense object point clouds, its capacity for representing sparse LiDAR point cloud geometry (PCG) was largely limited. This is because 1) fixed receptive field of the convolution cannot characterize extremely and unevenly distributed sparse LiDAR points very well; and 2) pretrained convolutions with fixed weights are insufficient to dynamically capture information conditioned on the input. This work therefore suggests the neighborhood point attention (NPA) to tackle them, where we first use nearest neighbors (NN) to construct adaptive local neighborhood; and then leverage the self-attention mechanism to dynamically aggregate information within this neighborhood. Such NPA is devised as a NPAFormer to best exploit cross-scale and same-scale correlations for geometric occupancy probability estimation. Compared with the anchor using standardized G-PCC, our method provides 17% BD-rate gains for lossy compression, and 14% bitrate reduction for lossless scenario using popular LiDAR point clouds in SemanticKITTI and Ford datasets. Compared with the state-of-the-art (SOTA) solution using attention optimized octree coding method, our approach requires much less decoding runtime with about 640 speedup on average, while still presenting better compression efficiency.

1 Introduction

LiDAR sensor becomes one of many commodity devices equipped on vehicles, robots and drones for autonomous machinery through the enabling of high-precision geometry perception using dynamically-generated 3D LiDAR point clouds Schwarz et al. (2019). However, efficient compression of massive LiDAR points that is vital for the storage and networked exchange in vast applications Chen et al. (2021) is still challenging because it is hard to exploit inter-correlations among unstructured sparse points. To tackle it, numerous 3D representation models like uniform voxel Wang et al. (2021); Guarda et al. (2021); Quach et al. (2020), octree Meagher (1982), multiscale sparse tensor Wang et al. (2021b, a), etc, are developed to specify explicit neighborhood connections upon which rules or learning based approaches Cao et al. (2021); Quach et al. (2022) exploit inter-dependency among points in close proximity.

Among them, convolutional representation of multiscale sparse tensor, noted as SparsePCGC Wang et al. (2021a), has demonstrated encouraging performance on the compression of geometry component for both dense object and sparse LiDAR point clouds. This is mainly because of the utilization of multiscale and multistage processing to effectively exploit cross-scale and same-scale correlations for accurate geometric occupancy probability approximation. To this aim, SparsePCGC stacks 3D sparse convolutions (SConv) for neighborhood information characterization and embedding. However, the compression gain of LiDAR point cloud geometry (PCG) is relatively marginal when compared with that of dense object PCG. We believe this is because, the neighborhood correlation among extremely and unevenly distributed sparse LiDAR points is hard to model by stacked SConvs that come with the fixed receptive field and fixed weights (after training).

We therefore suggest the Neighborhood Point Attention (NPA) to tackle aforementioned issues. As in Fig. (a)a, we follow Wang et al. (2021a) to use multiscale sparse tensors to process the collection of positively occupied voxels (POVs) from one scale to another, where dyadic downscaling at all axes (e.g., -, - and -axis) in Cartesian coordinate system is used to generate multiscale representations. Such dyadic downscaling is the same as the depth adaptation used in Huang et al. (2020); Que et al. (2021); Fu et al. (2022) to squeeze and build parent-child octree structure for correlation exploration.

Different from the SparsePCGC Wang et al. (2021a) that thoroughly relies on stacked SConvs to exploit correlations for occupancy probability estimation. This work however devises a novel MOPA (Multistage Occupancy Probability Approximation) to fulfill this purpose, where the MOPA accepts the sparse tensor of POVs from preceding scale (e.g., -th), and outputs the probability of each element of upscaled sparse tenor at current scale (e.g., -th) for encoding or decoding as in Fig. (a)a.

As illustrated in Fig. (b)b, for each POV, we first build a local neighborhood using its nearest neighbors (NN), and then execute the self-attention computation in this neighborhood. Such NPA is implemented as a NPAFormer for local information aggregation and embedding. The POVs with aggregated neighborhood features are then upscaled and added a set of offsets to generate eight child nodes (octants) in parallel. After then, the PGM (Probability Generation Module) that stacks NPAFormer and SConv layers stagewisely processes the octant to produce occupancy probability for compression at current scale.

As seen, the use of NN neighborhood in NPAFormer relaxes the receptive field restriction of stacked SConvs used in SparsePCGC Wang et al. (2021a) by flexibly including more valid and effective POVs nearby; while in contrast, fixed convolutional kernel presets like sometimes may only contain very few valid POVs (e.g., 1 or 2) especially in regions having relatively-low point density Zhu et al. (2021), which largely lacks sufficient information for accurate probability estimation. Subsequently neighbors are aggregated using self-attention to weigh the contribution dynamically conditioned on the input neighborhood. Note that NN neighborhood varies accordingly for each specific point, well reflecting the content dynamics, which makes the information embedding more robust and efficient Lu et al. (2022).

Extensive experiments have reported the superior efficiency of the proposed method for compressing LiDAR PCGs like SemanticKITTI and Ford sequences in various settings. Having the anchor using standard compliant G-PCC (Geometry based Point Cloud Compression) WG7 (2021); Cao et al. (2021), multiscale sparse tensor representation using the proposed NPA shows 17% BD-rate (Bjøntegaard Delta rate) gains Bjøntegaard (2001) for lossy compression, and 14% compression ratio improvement in lossless modes, which can be translated to about 10 absolute percentage points improvement over the respective gains obtained by the SparsePCGC Wang et al. (2021a) that applies SConvs upon the same multiscale sparse tensor. In the meantime, compared with the SOTA solution using attention optimized octree coding model, a.k.a., OctAttention Fu et al. (2022), our method not only provides compression improvement, particularly at high bitrates that is more preferred by high-precision tasks in autonomous driving (e.g., 7% BD-rate gains from 6 to 16 bpp), but also enormously reduces the decoding complexity to several orders of magnitude, e.g., less than 6 seconds of the proposed method vs. 1 hour of the OctAttention when decoding SemanticKITTI sequences (about 640 speedup on average in Table 2).

Contributions of this work include: 1) We show that neighborhood point attention that dynamically characterizes and embeds information among nearest neighbors can better aggregate effective information than the fixed-weights and fixed-receptive-field convolutions; 2) Together the multiscale sparse tensor representation, the proposed NPA mechanism can better exploit cross-scale and same-scale correlations with the SOTA efficiency on the compression of sparse LiDAR PCGs; 3) Having the computation of NPA on nearest neighbors significantly reduces the complexity compared with global self-attention, promising attractive prospects for practical applications.

 (a) Multiscale Sparse Tensor Representation exploits cross-scale and same-scale correlations exhaustively using MOPA (Multistage Occupancy Probability Approximation). Estimated probability is used to perform the entropy coding and decoding of geometric occupancy. (b) Each MOPA inputs the sparse tensor of positively occupied voxels (POVs) from preceding scale, e.g., (
(a)
 (a) Multiscale Sparse Tensor Representation exploits cross-scale and same-scale correlations exhaustively using MOPA (Multistage Occupancy Probability Approximation). Estimated probability is used to perform the entropy coding and decoding of geometric occupancy. (b) Each MOPA inputs the sparse tensor of positively occupied voxels (POVs) from preceding scale, e.g., (
(b)
Figure 1: Framework. (a) Multiscale Sparse Tensor Representation exploits cross-scale and same-scale correlations exhaustively using MOPA (Multistage Occupancy Probability Approximation). Estimated probability is used to perform the entropy coding and decoding of geometric occupancy. (b) Each MOPA inputs the sparse tensor of positively occupied voxels (POVs) from preceding scale, e.g., ()-th, to produce occupancy probabilities for encoding/decoding at current scale, e.g., -th, where NPAFormer first aggregates neighborhood information at ()-th scale for each POV, then TSConv upscales each POV (w/ aggregated neighborhood features) to eight child nodes (octant) in parallel and then PGM (Probability Generation Module) stagewisely outputs occupancy probability and determines the occupancy status of each octant. AE and AD are for arithmetic encoding and decoding. SConv is for sparse convolution, and TSConv is transposed SConv.

2 Related Work

Explicit 3D Representation Models of LiDAR PCG. Uniform voxel model is the most straightforward way for PCG representation, but it is more suitable for dense object PCG Wang et al. (2021); Guarda et al. (2021); Quach et al. (2020). Octree model is a lightweight and effective approach through the use of parent-child tree decomposition, with which the compression efficiency is improved by carefully exploiting parent-child dependency. Notable coding solutions using octree include standardized G-PCC that applies rules based contexts, and OctSqueeze Huang et al. (2020), MuSCLE Biswas et al. (2020), VoxelContextNet Que et al. (2021), OctAttention Fu et al. (2022), etc that utilizes neural network based solutuions like CNN (Convolutional Neural Network), MLP (Multi-Layer Perceptron), and Transformers. Recently, a promising multiscale sparse tensor representation using stacked SConvs has emerged Wang et al. (2021b, a) with great flexibility to exploit cross-scale and same-scale correlations. Although the compression gain of multiscale sparse tensor on LiDAR PCGs is relatively marginal Wang et al. (2021a), we believe it is mainly because of the inefficient characterization and embedding of neighborhood points for redundancy removal.

Sparse Tensor. LiDAR point cloud is comprised of a collection of sparsely and unevenly distributed points which inherently can be represented using a sparse tensor consisting of a set of POV coordinates and associated features . In practice, sparse tensor can be easily implemented using Minkowski Engine Choy et al. (2019), where a hash map is applied to index POV coordinates and an auxiliary data structure is also included to manage their connections efficiently. Aforementioned multiscale sparse tensor can be fulfilled by spatially resampling Wang et al. (2021b). Note that only POVs are involved in computations like convolution, leading to attractive lightweight solution.

Sparse Convolution. Upon the sparse tensor, it is naturally to apply the SConv as specified in Choy et al. (2019):

(1)

having and as the coordinate sets for input and output POVs. and are input and output feature vectors at coordinate . defines a 3D convolutional kernel with a predefined size like or , covering a set of locations centered at with offset in . denotes the corresponding kernel weight at offset centered at . As seen, the neighborhood coverage for information aggregation is basically constrained by the kernel size of convolutions; and as in (1), convolutional weights s are fixed after training, which clearly is incapable of effectively characterizing the dynamics unseen in training.

Self-attention & Transformer. Recently, self-attention mechanism and Transformers are migrated quickly to process point clouds, showing encouraging results for different task including segmentation, classification, compression, etc Guo et al. (2021); Zhao et al. (2021); Engel et al. (2021); Fu et al. (2022). However, existing solutions often demand huge computational cost and memory consumption that grow quadratically with the sequence length of underlying tokens. This complexity issue is of particular importance for practical applications. This work thus examines the use of neighborhood point attention to pursue the advantages from both local processing (e.g., lightweight complexity) and self-attention weighting (e.g., dynamic information embedding).

3 Method

3.1 Multiscale Sparse Tensor with MOPA

Figure (a)a illustrates the LiDAR PCG compression framework using multiscale sparse tensor with the proposed MOPA to accurately estimate the occupancy probability for encoding and decoding. As aforementioned, dyadic downscaling is enforced to derive multiscale sparse tensors where scale-wise sparse tensor can be easily mapped to depth-wise octree-structured representation as well. Different from those octree coding approaches Huang et al. (2020); Fu et al. (2022) that exploit correlations following the parent-child tree structure, we leverage cross-scale and same-scale correlations through the use of MOPA for compression.

3.1.1 MOPA - Multistage Occupancy Probability Approximation

Figure (b)b details the MOPA used in this work. In MOPA, it combines the NPA and SConv layers (but not just SConvs as in SparsePCGC Wang et al. (2021a)). Note that NPA Transformer is devised using the popular Transformer architecture for processing, dubbed as the NPAFormer for simplicity.

Assuming the input sparse tensor at -th scale as , the propose MOPA processes all POVs in it to generate the occupancy probability of each element of upscaled sparse tensor at -th scale. Note that because the elements in include both POVs and non-POVs. Whether this element is a POV or non-POV is fully known in encoder for encoding, while the status of POV or non-POV in decoder is determined by parsing the bitstream, through the use of occupancy probability respectively.

The MOPA includes three major steps. As for the processing from -th to -th scale,

  • First, it stacks the ResNet He et al. (2016) and NPAFormer to aggregate neighborhood information at -th scale where SConvs are used in ResNet, and NPAFormer applies the NPA upon nearest neighbors chosen by the Euclidean distance on-the-fly; Note that all POVs in can be processed concurrently.

  • Then, each POV with aggregated neighborhood features in is upscaled using transposed SConv, e.g., “TSConv K1, S2”, with convolutional kernel at a size of 111 and upscaling stride of 2 at three axis dimensions, and added corresponding offset to directly generate eight child nodes (yellow cubic with number mark); Such operation can be executed in parallel for all POVs from scale.

  • Third, a PGM that combines the NPAFormer and SConv layers is devised to stagewisely estimate the occupancy probability for each child node, where the probability approximation of succeeding child node (e.g., node “3” or “2” in yellow cubic) also includes the occupancy status of proceeding child node (e.g., node “1” as POV in dark grey cubic). Note that ReLU (Rectified linear unit) is coupled with SConv for non-linearity, and Sigmoid layer is used for probability derivation in the range of [0,1]. And it is also worth to point out that stage-wise PGM is made concurrently for same-group child nodes at -th scale that are labelled with the same number but upscaled from different POVs from -th scale.

3.1.2 NPAFormer - Transformer using Neighborhood Point Attention

Earlier Transformers often conducted self-attention computation globally Guo et al. (2021), inevitably incurring unbearable complexity for applications using large-scale point clouds. Recently, a number of explorations on 2D images Liu et al. (2021); Lu et al. (2022) proved that examining self-attention in local neighborhood could retain almost the same performance in various tasks but with significant reduction of complexity. However, unlike 2D images with structured topology, point cloud is essentially an unordered set of points in 3D space that is difficult to model using a regular size window. This motivates us to develop the NPAFormer that constructs local neighborhood dynamically conditioned on individual POV input.

As aforementioned, we embed NPAFormer in MOPA to best exploit the correlations from nearest neighbors for effective information aggregation. As seen in Fig. (b)b, the structure of NPAFormer is similar to standard Transformer architecture, consisting a NPA layer for neighborhood point attention computation, normalization (Norm) and linear layers as well as the residual links He et al. (2016).

Assuming the input of a NPA layer is a sparse tensor consisting of -dimensional features and -dimensional coordinates . We then conduct the NN search for each element in input sparse tensor, yielding dynamic NN tensor .

Position Embedding. Unconstrained displacement of points allows us to accurately and flexibly represent 3D space. But apparently it makes the embedding of geometry position more difficult by using the absolute value. Instead, we propose to concatenate the relative position with corresponding features, e.g.,

(2)

with which we can best retain the spatial coherency for information characterization in local neighborhood.

Neighborhood Point Attention Architecture.
Figure 2: Neighborhood Point Attention Architecture.

Neighborhood Point Attention (NPA). As revealed by its definition, the proposed NPA applies the attention based dynamic weighting among nearest neighbors of each input POV (as the central point). The overall architecture of NPA is shown in Fig. 2.

Let , and be the , , and vectors respectively. is computed through a linear transformation of ; and are computed by two separate linear transformations of :

(3)

having linear transformations and . Here, is the dimension of input feature and is the dimension of , and . Then the NPA can be calculated using:

(4)

The dot product of and is divided by and then passed through the function to derive the Attention Map, then the dot product of Attention Map and is denoted as . The output of NPA comprises of and .

Multihead NPA. To best capture correlations from different representation subspaces, we extend aforementioned Singlehead NPA to Multihead NPA following the rules defined for standard Transformers Vaswani et al. (2017). We use several different linear projection upon the , , and vectors, and then feed them into multiple NPA module in parallel. The ouputs of parallel NPAs are concatenated together and passed through a linear transformation. We generally adapt the number of attention heads (#ah) in Multihead NPA, e.g., #ah at , and enforce the product of #ah and associated channels per head #cph fixed to 32 to have the fixed total dimensions for lightweight processing. As reported in Table 3, the result shows that having #ah = 4 and #cph = 8 is a balanced option.

Complexity Analysis. Having NPA computation on local neighborhood with elements avoids the quadratic increase of computation and memory cost with respect to the number of points contrast with methods using Global Self-Attention. Here, for simplicity, we use the single-head NPA as an example for discussion. Multihead NPA can be easily extended. Assuming the total amount of points is , the number of neighbors in local neighborhood is , and the dimensional size of feature is . Note that we often have and in practical settings (e.g., , in our examples but is typically hundreds or tens of thousand). Thus, the computation complexity of NPA and Global Self-Attention can be approximated as:

(5)
(6)

As we can see, our NPA greatly reduces the complexity for attention computation, making our solution attractive in practices.

4 Results

4.1 Dataset

Prevalent SemanticKITTI Behley et al. (2019) and Ford sequences 20 provided by the MPEG committee are evaluated.

SemanticKITTI is a publicly-accessible large-scale LiDAR point clouds dataset used for semantic scene understanding. It includes 43,552 raw scans (frames) with 4.5 billions points collected from a Velodyne HDL-64E sensor. We use sequences from #00 to #10 (23,201 scans in total) for training, and the remaining sequences from #11 to #21 (20,351 scans) for testing, following the official training/testing split as suggested.

Ford is dynamically acquired for MPEG point clouds compression standardization. It contains 3 sequences, e.g., Ford01, Ford02 and Ford03, each of which includes 1500 frames at 1mm precision (or 18 bits per geometry component). We take all of them as our testing sequence.

4.2 Experiments

Baselines include both rules- and learning-based PCG coding solutions for comparative studies:

  • G-PCC 21 uses the latest reference model TMC13v14 21 with rules-based octree codec to generate anchor following the common test conditions (CTC) defined in WG7 (2021).

  • SparsePCGC Wang et al. (2021a) represents the use of multiscale sparse tensor but with stacked SConvs;

  • OctAttention Fu et al. (2022), VoxelContextNet Que et al. (2021) & OctSqueeze Huang et al. (2020) are learned octree codecs.

We faithfully reproduce the results of SparsePCGC and OctAttention from their source codes. Since the source codes of VoxelContextNet and OctSqueeze are not publicly accessible, we directly cite the results from their papers. Following the suggestions in OctAttention Fu et al. (2022) for fair comparison, we remove the Coordinate Refinement Module (CRM) in both VoxelContextNet and SparsePCGC methods. Such CRM is a post-processing module that can be augmented for all approaches.

Training. In the training procedure, we quantize native floating-point LiDAR geometry in SemanticKITTI training frames to 1mm unit precision by (7) first and use them to generate our training samples, e.g.,

(7)

As aforementioned, our MOPA works across the adjacent scales. Having the native resolution of training samples, e.g., scaling factor , we downscale them to lower scales to form a set of pairs, e.g., (1/2, 1/4), (1/8, 1/16) etc, for model training.

This work trains two models. One is for the relatively-low bitrate range mainly used by OctAttention, VoxelContextNet and OctSqueeze. To produces these low bitrates, input PCGs are downscaled with from to . Note that the scaling factors can be mapped to the octree depth as in Fu et al. (2022); Huang et al. (2020); Que et al. (2021). The other model is trained for high bitrates with from to . Having an alternative high-bitrate model is because of the desire of (ultra) high-precision geometry precision in autonomous machinery where reconstruction of LiDAR PCGs at high bitrates is more preferable for application purpose. Finetuning the high-bitrate model is due to very different characteristics (e.g., point density) at different scales. Learned model is trained with binary cross entropy loss in supervised end-to-end means.

Testing is enforced with the fair and best-reproducible principles in this study. We strictly follow the test conditions in other approaches or directly quote results from their publications. Specifically, when comparing with the anchor like G-PCC or SparsePCGC Wang et al. (2021a), we scale input LiDAR PCGs with from 1 to to obtain proper bitrates that match the anchor settings. When comparing with prevalent learned octree coding approaches, e.g., OctAttention Fu et al. (2022), VoxelContextNet Que et al. (2021), and OctSqueeze  Huang et al. (2020), testing PCGs are quantized from 8 to 12 bits per geometry component through the use of  (8) defined in Fu et al. (2022), e.g.,

(8)

is the normalized PCGs in , is the max depth of octree from 8 to 12. , and represent coordinate level in a given PCG.

Evaluation Metrics. We closely follow the G-PCC CTC WG7 (2021) to measure the bit rate using bpp (bit per input point), and quantitatively measure the distortion using PSNR (Peak Signal-to-Noise Ratio) of both point to point error (D1) and point to plane error (D2). Note that compression ratio gain measured by bpp is for lossless mode while the BD-rate Bjontegaard (2001) evaluates the lossy compression.

As different approaches may use different normalization for evaluation. For fair comparison, we replicate the settings used by them accordingly. For example, when comparing with G-PCC and SparsePCGC Wang et al. (2021a), we set PSNR peak value to following the G-PCC CTC WG7 (2021); When comparing with the OctAttention Fu et al. (2022), VoxelContextNet Que et al. (2021) and OctSqueeze Huang et al. (2020), we normalize the points to the range of and set PSNR peak value to 1 following their rules.

For complexity measurement, we report the encoding and decoding runtime for reference. Tests are examined on a computer with an Intel Xeon 6226R CPU and an Nvidia GeForce RTX 3090 GPU.

LiDAR PCGs Lossless (bpp) Lossy
D1-BD-rate D2-BD-rate
G-PCC SparsePCGC Wang et al. (2021a) Ours
Gain over
G-PCC
SparsePCGC Wang et al. (2021a) Ours SparsePCGC Wang et al. (2021a) Ours
KITTI_vox2cm 7.62 7.06 6.18 -18.90% -10.22% -18.01% -8.40% -17.96%
Ford_vox2cm 9.95 9.69 8.53 -14.27% -10.18% -18.53% -9.18% -18.53%
KITTI_vox1mm 20.17 19.73 16.62 -17.60% -6.80% -18.14% -6.24% -18.11%
Ford_vox1mm 22.31 22.27 19.86 -10.98% -6.12% -16.08% -5.51% -16.07%
Average 15.01 14.69 12.80 -14.72% -8.33% -17.69% -7.33% -17.67%
Table 1: Compression Performance Evaluation Using LiDAR PCGs for both Lossless and Lossy Scenarios. Anchor is standardized G-PCC using TMC13v14, and SparsePCGC is also provided.
R-D comparison on SemanticKITTI and Ford samples across a wide range of bitrates. R-D comparison on SemanticKITTI and Ford samples across a wide range of bitrates. R-D comparison on SemanticKITTI and Ford samples across a wide range of bitrates. R-D comparison on SemanticKITTI and Ford samples across a wide range of bitrates.
Figure 3: R-D comparison on SemanticKITTI and Ford samples across a wide range of bitrates.

Quantitative Efficiency. As shown in Table 1, the proposed method offers more than 14% and 17% gains (on average) over the G-PCC anchor for respective lossless and lossy scenarios across a variety of test sequences at different precisions (1mm and 2cm). Performance improvements for lossy mode are also confirmed by sample rate-distortion (R-D) plots in Fig. 3 where we can observe consistent improvement across a wide range of bitrates.

Having the same G-PCC anchor, our method offers 10 absolute percentage points increase for both lossy and lossless mode in comparison to the SparsePCGC (see Table 1 and Fig. 3). Given that our method uses the same multiscale sparse tensor as the SparsePCGC, the performance improvement reports the superior efficiency of NPA mechanism by resolving the limitations of stacked SConvs used in Wang et al. (2021a) to aggregate neighborhood information for better occupancy probability approximation.

Comparisons are also conducted with other octree coding approaches as shown in Fig. 5 where we apply the same bitrate range used by OctAttention, VoxelContextNet and OctSqueeze. These bitrates belong to a relatively lower range. Connecting with the experimental illustrations in SOTA OctAttention, our method shows almost the same R-D behavior with overlapped R-D curves in Fig. 5. As seen, our method and OctAttention report BD-rate gains over the VoxelContextNet at the point with the highest bpp, and consistently outperform the OctSqueeze and G-PCC for all bitrates.

Our method shows even larger gains over the SOTA OctAttention, e.g., BD-rate improvement on average, when we extend the bitrates to higher level as shown in Fig. 5. This further evidences the superior efficiency of our method where the proposed NPA just using local neighbors can offer better performance than the default settings in OctAttention with context window size at 1024 to gather the information of sibling and ancestor nodes as much as possible.

Note that we produce the high birate results of OctAttention by re-adapting its model to enable the max depth of octree from 13 to 16. Unfortunately, we could not provide R-D performance at higher bitrates for VoxelContextNet and OctSqueeze because of the lack of their source codes for high bitrate model training. However, since the OctAttention offers the SOTA efficiency, it is indeed a convincing representative of learned octree coding approaches.

Due to the page limitation, qualitative visualization is given in supplemental material where the improvement of the proposed method is clearly presented.

R-D comparison at lower bitratesR-D comparison at lower bitrates
Figure 4: R-D comparison at lower bitrates
R-D comparison at lower bitratesR-D comparison at lower bitrates
Figure 5: R-D comparison at higher bitrates
Encoding Time Methods D=8 D=9 D=10 D=11 D=12 D=13 D=14 D=15 D=16 Ave. G-PCC 0.50 0.62 0.82 1.07 1.61 2.05 2.53 2.78 3.02 1.67 OctAttention Fu et al. (2022) 0.40 0.49 0.54 0.44 0.66 1.51 1.91 2.22 2.58 1.19 SparsePCGC Wang et al. (2021a) 2.08 2.91 4.36 7.11 10.58 14.70 18.68 22.79 20.88 11.57 Ours 1.18 1.38 1.79 2.62 4.11 5.35 8.05 10.65 13.34 5.39
Decoding Time Methods D=8 D=9 D=10 D=11 D=12 D=13 D=14 D=15 D=16 Ave. G-PCC 0.05 0.09 0.16 0.31 0.54 0.72 1.02 1.19 1.29 0.60 OctAttention Fu et al. (2022) 79.17 227.79 633.66 782.11 1569.79 6066.57 6865.55 6628.52 8914.32 3529.72 SparsePCGC Wang et al. (2021a) 2.88 3.54 4.76 6.97 11.19 14.51 18.28 22.43 21.39 11.77 Ours 1.56 1.89 2.21 2.93 3.85 5.87 7.47 11.68 12.48 5.55
Table 2: Comparisons of encoding and decoding time measured in seconds. Various octree depths s are evaluated and SemanticKITTI sequences are exemplified. Time is measured for each sequence.
\captionof tableModel sizes Method Mbytes OctAttention Fu et al. (2022) 16.13 SparsePCGC Wang et al. (2021a) 3.95 Ours 8.96
Table 3: Impact of num. of attention heads on Bpp at various s.
#ah S=1/256 S=1/128 S=1/64 S=1/32 S=1/16 S=1/8 S=1/4 S=1/2 S=1
1 0.52 1.21 2.48 4.34 6.71 9.23 11.92 14.74 17.57
2 0.52 1.22 2.51 4.37 6.68 9.06 11.55 14.24 16.78
4 0.53 1.23 2.52 4.38 6.70 9.04 11.47 14.12 16.62

Complexity is critical for the application of LiDAR PCG compression in real-life scenarios. Comparisons are performed among representative methods including the G-PCC using rules-based octree coding, OctAttention using learned octree coding, SparsePCGC using multiscale sparse tensor with learned SConvs, and our method. We first compare the encoding and decoding time respectively at different bitrates with results shown in Table 2. As seen, conventional G-PCC demonstrates lightweight and balanced complexity consumption for encoding and decoding. Note that G-PCC runs mainly on CPU while other learned methods heavily rely on the GPU and CPU. Thus the measures of G-PCC are just used as the reference to understand the computational complexity in general.

Though OctAttention presents even less encoding time (on average) than the G-PCC, its decoding complexity is unbearable. This is because at encoder, massive parallelism can be fulfilled due to the full availability of information, while in decoder, causal dependency strictly limits the sequential processing.

Our method shows comparable complexity for encoding and decoding, e.g., less than 6 seconds on average. It is about 640 speedup of decoding when compared with the OctAttention, and about 2 speedup of SparsePCGC. It is worth to point out that the runtime of our method can be further greatly reduced, since our current implementation is just an evaluation prototype. For instance, NN search is not optimized, requiring frequent interruptions for data transfer between GPU and CPU, which apparently increases the processing time. A speedup of the encoding and decoding by another 5 is possible according to our pilot exploration.

Model size measures the space complexity. Table 3 lists the consumption in Mbytes for OctAttention, SparePCGC and ours. As seen, our method requires about a half of model size to the OctAttention, while it doubles the size as compared to that of the SparsePCGC. Lately, we have noticed that the increase of model size of the proposed method mainly comes from the ResNet modules in NPAFormer which can be removed without noticeable performance loss according to our preliminary study. This is deferred as our future work.

k S=1/256 S=1/128 S=1/64 S=1/32 S=1/16 S=1/8 S=1/4 S=1/2 S=1
8 0.54 1.25 2.59 4.56 6.95 9.36 11.85 14.55 17.05
16 0.53 1.23 2.52 4.38 6.70 9.04 11.47 14.12 16.62
24 0.53 1.23 2.53 4.41 6.73 9.09 11.56 14.24 16.67
32 0.53 1.24 2.55 4.45 6.76 9.14 11.63 14.32 16.89
Table 4: Impact of number of nearest neighbors on Bpp at various s

4.3 Ablation Studies

This section examines the impact of neighborhood size and number of attention heads #ah to get more insights on the proposed NPA. Extra examinations are given in supplemental material.

Number of Attention Heads #ah. We adapt #ah at different bitrates to understand its impact on compression efficiency. To this aim, we adjust the scaling factor from 1 (lossless) to 1/256 to reach different bitrate points. Since the distortion is the same for different #ah at the same , we only present the bpp in Table 3. to report the performance, i.e., the smaller bpp the better compression efficiency. In the end, we choose #ah = 4 in this work. Although the setting of #ah = 4 provides marginal loss at lower bitrates (e.g., BD-rate loss when is from 1/256 to 1/32), it improves the coding efficiency noticeably at higher bitrates, e.g., 5.41% bpp reduction when = 1 (lossless mode), and 2.5% BD-rate improvement when is from 1/16 to 1/2.

Number of Nearest Neighbors . Table 4 gives averaged Bpps at different s when adapting the in proposed NPAFormer for attention. Similarly, the smaller Bpp the better coding efficiency (for the same distortion at a given ). As seen, = 16 gives the best compression performance. Having a smaller than 16, performance is deteriorated because of insufficient neighbors used for information aggregation; while having a larger may include more irrelevant (or uncorrelated) neighbors, making the attentive weighting compromised. Also, larger expects extra computation overhead potentially. Recalling the comparison with the OctAttention in Fig. 5 and 5, a small number of local neighbors may be already sufficient for constructing accurate contexts used in compression.

5 Discussion

Efficient compression of LiDAR point clouds is increasingly demanded for the enabling of networked autonomous machinery where high-precision LiDAR PCGs can be shared for various tasks. However, the inter-correlation among dynamically acquired and unevenly distributed sparse LiDAR points is hard to characterize by convolutions with fixed receptive field and fixed weights (after training). This work therefore suggested the neighborhood point attention (NPA) to adaptively weigh the contributions from nearest neighbors that are constructed dynamically conditioned on the input point for effective information aggregation and embedding. We devised this NPA together with the multiscale sparse tensor representation, showing the SOTA efficiency on the compression of typical LiDAR PCGs in both lossy and lossless modes. More importantly, the proposed method also demonstrated encouraging complexity-performance tradeoff even under current prototype implementation, e.g, few seconds for both encoding and decoding as exemplified.

6 Supplemental Material

This companion document supplements the main text with additional experiments and discussions, further evidencing the superior efficiency and robust generalization of the proposed method.

6.1 Qualitative Visualization

We mainly give the quantitative evaluations in the main text (see Sec. 4.2) due to the page limitation. Here, we offer additional qualitative visualizations in Fig. 6 to further demonstrate the efficiency of our method. As we can clearly observe, our method provides the least reconstruction error at close bitrates for both SemanticKITTI and Ford test samples, e.g., as for the #000000 frame in SemanticKITTI #11 sequence, the Bpp of our method is 4.78 and the corresponding PSNR of the point-to-point (D1) error is 73.45 dB, while SparsePCGC Wang et al. (2021a) and G-PCC present lower PSNR at respective 71.99 dB & 70.22 dB and cost more bits, e.g., 4.90 Bpp & 4.86 Bpp. The similar observation happens for other testing frames.

Qualitative visualization of LiDAR PCG reconstructions at typical bit rates. Color map is placed to reflect the reconstruction error distribution.
Figure 6: Qualitative visualization of LiDAR PCG reconstructions at typical bit rates. Color map is placed to reflect the reconstruction error distribution.

6.2 Model Performance on Object Point Clouds

Here we report the performance using extra object point clouds that are recommended by MPEG standardization committee to present the efficiency evaluations as well as the robust generalization of the proposed method. We divide the experiment into two parts, one is for the solid object point clouds and the other is for the dense object point cloud. The difference between this two types of point clouds is that solid object point clouds have a higher density and lower precision compared with dense object point clouds.

6.2.1 Solid Object Point Clouds

In this subsection, we evaluate our method using solid object point clouds whose geometry distribution characteristic is totally different from the LiDAR PCGs studied in the main text. Baseline methods consist of MPEG G-PCC, SparsePCGC Wang et al. (2021a) and OctAttention Fu et al. (2022). The anchor result of G-PCC is generated using the latest TMC13v14 21, and we reproduce the result of SparsePCGC Wang et al. (2021a) and OctAttention Fu et al. (2022) faithfully following their methods described in the papers.

Dataset. Following the setting of OctAttention Fu et al. (2022), we use Soldier10 and Longdress10 sequences in MPEG 8i d’Eon et al. (May 2016), Andrew10, David10, and Sarah10 sequences in MVUB Charles et al. (May 2016) for training. Other point clouds in MPEG 8i d’Eon et al. (May 2016) including Thaidancer, Boxer, Redandblack and Loot are selected for testing. Note that the PCGs mentioned above are presented with 9 or 10 bit precision.

Training and Testing. In the training procedure, we train one single model with scaling factor from 1 to . In the testing phase, being aligned with the OctAttention Fu et al. (2022) that merely lossless compression results are offered, we only evaluate and compare the performance in lossless mode. Other settings including evaluate metrics and loss function are the same as described in Sec. 4.2.

Results. The result of lossless compression for solid object PCGs is detailed in Table 5. It shows that our method achieves superior performance compared with previous methods including the G-PCC, SparsePCGC Wang et al. (2021a) and OctAttention Fu et al. (2022). Note that the number of neighbors is increased to 64 to have more neighbors for information aggregation, and the performance improvement shown in Table 5 has proved its efficiency compared with = 16 and = 32. Even though, = 64 is still much smaller than the 1024 elements used in OctAttention. Our method achieves 37% and 40% gains over G-PCC on average in terms of compression ratio measured by Bpp for lossless compression.

Solid Object PCGs G-PCC SparsePCGC Wang et al. (2021a) OctAttention Fu et al. (2022) Ours
Bpp Bpp
Gain over
G-PCC
Bpp
Gain over
G-PCC
=16 =32 =64
Bpp
Gain over
G-PCC
Bpp
Gain over
G-PCC
Bpp
Gain over
G-PCC
10 bit Thaidancerviewdepvox10 0.99 0.65 -34.34% 0.65 -34.34% 0.67 -32.32% 0.62 -37.37% 0.62 -37.37%
boxerviewdepvox10 0.94 0.60 -36.17% 0.59 -37.23% 0.63 -32.98% 0.57 -39.36% 0.58 -38.30%
lootvox101200 0.97 0.63 -35.05% 0.63 -35.06% 0.66 -31.96% 0.62 -36.08% 0.61 -37.11%
redandblackvox101550 1.10 0.72 -34.55% 0.74 -32.73% 0.76 -30.91% 0.71 -35.45% 0.70 -36.36%
Average 1.00 0.65 -35.00% 0.65 -35.00% 0.68 -32.00% 0.63 -37.00% 0.63 -37.00%
9 bit Thaidancerviewdepvox9 0.99 0.64 -35.35% 0.64 -35.35% 0.67 -32.32% 0.63 -36.36% 0.60 -39.39%
boxerviewdepvox9 0.96 0.60 -37.50% 0.59 -38.54% 0.62 -35.42% 0.59 -38.54% 0.56 -41.67%
Average 0.98 0.62 -36.73% 0.62 -36.73% 0.65 -33.67% 0.61 -37.76% 0.58 -40.82%
Table 5: Lossless Compression Gains for Solid Object PCGs Measured by Bpp

6.2.2 Dense Object Point Clouds

Besides, we also test PCGs with higher geometry bit-depths and levels of sparsity which are referred to as the dense object point clouds. Since the SOTA OctAttention Fu et al. (2022) does not include this type of point clouds into the performance evaluation, we exclude it for comparative study below. We faithfully reproduce the result of SparsePCGC Wang et al. (2021a) and the anchor result of G-PCC is generated by the latest TMC13v14 21 as well.

Dataset. We select loot_view_vox12, redandblack_viewdep_vox12, longdress_viewdep_vox12, head_00039_vox12 and frog_00067_vox12 in G-PCC CTC WG7 (2021) following the recommendation of  WG7 (2022) for training. The testing PCGs include soldier_viewdep_vox12, boxer_viewdep_vox12 and Thaidancer_viewdep_vox12. Note that PCGs mentioned above are all with 12-bit precision.

Training and Testing. In the training procedure, we train one model with scaling factor from 1 to , and the number of neighbors is set as 16. In the testing procedure, we scale input PCGs with from 1 to to perform rate-distortion control when evaluating lossy compression performance. Other settings including evaluate metrics and loss function are the same as described in Sec. 4.2 as well.

Results. The proposed method not only shows great performance on the compression of LiDAR PGCs described in the main text, but also for these dense object PCGs. As shown in Table 6, for lossless mode, our method achieves 50.18% Bpp reduction against the G-PCC and 23.24% Bpp saving over the SparsePCGC Wang et al. (2021a). In lossy mode, our method offers more than 16% and 17% gains over the G-PCC in terms of D1 BD-rate and D2 BD-rate, which is around another 3% BD-rate improvement when comparing with the SparsePCGC Wang et al. (2021a).

Dense Object PCGs Lossless (Bpp) Lossy
G-PCC SparsePCGC Wang et al. (2021a) Ours
Gains over
G-PCC
D1-BD-rate D2-BD-rate
SparsePCGC Wang et al. (2021a) Ours SparsePCGC Wang et al. (2021a) Ours
12 bit soldierviewdepvox12 3.86 2.62 2.00 -48.19% -16.07% -19.79% -17.07% -20.07%
boxerviewdepvox12 3.58 2.15 1.48 -58.66% -15.71% -19.15% -16.60% -19.99%
Thaidancerviewdepvox12 1.10 0.78 0.78 -29.09% -6.90% -9.76% -8.21% -11.21%
Average 2.85 1.85 1.42 -50.18% -12.89% -16.23% -13.96% -17.09%
Table 6: Compression Performance Evaluation Using Dense Object PCGs for both Lossless and Lossy Scenarios. Anchor is standardized G-PCC using TMC13v14, and SparsePCGC is also provided.

References

  • [1] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall (2019-10) SemanticKITTI: a dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §4.1.
  • [2] S. Biswas, J. Liu, K. Wong, S. Wang, and R. Urtasun (2020) Muscle: multi sweep compression of lidar using deep entropy models. Advances in Neural Information Processing Systems 33, pp. 22170–22181. Cited by: §2.
  • [3] G. Bjøntegaard (2001-04) Calculation of average PSNR differences between rd-curves. In ITU-T SG 16/Q6, 13th VCEG Meeting, Cited by: §1.
  • [4] G. Bjontegaard (2001) Calculation of average psnr differences between rd-curves. VCEG-M33. Cited by: §4.2.
  • [5] C. Cao, M. Preda, V. Zakharchenko, E. S. Jang, and T. Zaharia (2021) Compression of sparse and dense dynamic point clouds—methods and standards. Proceedings of the IEEE 109 (9), pp. 1537–1558. External Links: Document Cited by: §1, §1.
  • [6] L. Charles, C. Qin, O. Sergio, and A. C. Philip (May 2016) Microsoft voxelized upper bodies - a voxelized point cloud dataset. ISO/IEC JTC1/SC29 Joint WG11/WG1 (MPEG/JPEG) m38673/M72012. Cited by: §6.2.1.
  • [7] S. Chen, B. Liu, C. Feng, C. Vallespi-Gonzalez, and C. Wellington (2021-01) 3D point cloud processing and learning for autonomous driving: impacting map creation, localization, and perception. IEEE Signal Processing Magazine. 38 (1), pp. 68–86. Cited by: §1.
  • [8] C. Choy, J. Gwak, and S. Savarese (2019) 4d spatio-temporal convnets: minkowski convolutional neural networks. In IEEE CVPR, Cited by: §2, §2.
  • [9] E. d’Eon, B. Harrison, T. Myers, and P. A. Chou (May 2016) 8i voxelized full bodies - a voxelized point cloud dataset. ISO/IEC JTC1/SC29 Joint WG11/WG1 (MPEG/JPEG) m38673/M72012. Cited by: §6.2.1.
  • [10] N. Engel, V. Belagiannis, and K. Dietmayer (2021) Point transformer. IEEE Access 9, pp. 134826–134840. Cited by: §2.
  • [11] C. Fu, G. Li, R. Song, W. Gao, and S. Liu (2022) OctAttention: octree-based large-scale contexts model for point cloud compression. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: §1, §1, §2, §2, §3.1, 3rd item, §4.2, §4.2, §4.2, §4.2, Table 2, Table 2, Table 3, §6.2.1, §6.2.1, §6.2.1, §6.2.1, §6.2.2, Table 5.
  • [12] A. F. R. Guarda, N. M. M. Rodrigues, and F. Pereira (2021) Adaptive deep learning-based point cloud geometry coding. IEEE Journal of Selected Topics in Signal Processing 15, pp. 415–430. Cited by: §1, §2.
  • [13] M. Guo, J. Cai, Z. Liu, T. Mu, R. R. Martin, and S. Hu (2021-04) PCT: point cloud transformer. Computational Visual Media 7 (2), pp. 187–199. External Links: ISSN 2096-0662, Link, Document Cited by: §2, §3.1.2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE CVPR, pp. 770–778. Cited by: 1st item, §3.1.2.
  • [15] L. Huang, S. Wang, K. Wong, J. Liu, and R. Urtasun (2020) OctSqueeze: octree-structured entropy model for lidar compression. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1310–1320. Cited by: §4.2.
  • [16] L. Huang, S. Wang, K. Wong, J. Liu, and R. Urtasun (2020-06) OctSqueeze: octree-structured entropy model for lidar compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.1, 3rd item, §4.2, §4.2.
  • [17] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022. Cited by: §3.1.2.
  • [18] M. Lu, P. Guo, H. Shi, C. Cao, and Z. Ma (2022) Transformer-based image compression. IEEE Data Compression Conference. Cited by: §1, §3.1.2.
  • [19] D. Meagher (1982) Geometric modeling using octree encoding. Computer graphics and image processing 19 (2), pp. 129–147. Cited by: §1.
  • [20] MPEG pcc dataset. Note: http://mpegfs.int-evry.fr/mpegcontent/Accessed: 2022 Cited by: §4.1.
  • [21] MPEG-pcc-tmc13. Note: https://github.com/MPEGGroup/mpeg-pcc-tmc13Accessed: 2022 Cited by: 1st item, §6.2.1, §6.2.2.
  • [22] M. Quach, J. Pang, T. Dong, G. Valenzise, and F. Dufaux (2022) Survey on deep learning-based point cloud compression. Frontiers in Signal Processing. Cited by: §1.
  • [23] M. Quach, G. Valenzise, and F. Dufaux (2020) Improved deep point cloud geometry compression. 2020 IEEE MMSP Workshop. Cited by: §1, §2.
  • [24] Z. Que, G. Lu, and D. Xu (2021-06) VoxelContext-net: an octree based framework for point cloud compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6042–6051. Cited by: §1, §2, 3rd item, §4.2, §4.2.
  • [25] Z. Que, G. Lu, and D. Xu (2021) VoxelContext-net: an octree based framework for point cloud compression. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.2.
  • [26] S. Schwarz, M. Preda, V. Baroncini, et al. (2019) Emerging mpeg standards for point cloud compression. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, pp. 133–148. Cited by: §1.
  • [27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §3.1.2.
  • [28] J. Wang, D. Ding, Z. Li, X. Feng, C. Cao, and Z. Ma (2021) Sparse tensor-based multiscale representation for point cloud geometry compression. arXiv preprint arXiv:2111.10633. Cited by: §1, §1, §1, §1, §1, §1, §2, §3.1.1, 2nd item, §4.2, §4.2, §4.2, Table 1, Table 2, Table 2, Table 3, §6.1, §6.2.1, §6.2.1, §6.2.2, §6.2.2, Table 5, Table 6.
  • [29] J. Wang, D. Ding, Z. Li, and Z. Ma (2021) Multiscale point cloud geometry compression. In IEEE Data Compression Conference (DCC), pp. 73–82. Cited by: §1, §2, §2.
  • [30] J. Wang, H. Zhu, H. Liu, and Z. Ma (2021-Dec.) Lossy point cloud geometry compression via end-to-end learning. IEEE Transactions on Circuits and Systems for Video Technology 31 (12), pp. 4909–4932. Cited by: §1, §2.
  • [31] WG7 (2021) Common test conditions for g-pcc. ISO/IEC JTC1/SC29/WG11 N00106. Cited by: §1, 1st item, §4.2, §4.2, §6.2.2.
  • [32] WG7 (2022) Dataset split for ai-pcc. ISO/IEC JTC1/SC29/WG7 m58754. Cited by: §6.2.2.
  • [33] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun (2021) Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16259–16268. Cited by: §2.
  • [34] W. Zhu, Y. Xu, D. Ding, Z. Ma, and M. Nilsson (2021) Lossy point cloud geometry compression via region-wise processing. IEEE Transactions on Circuits and Systems for Video Technology 31 (12), pp. 4575–4589. Cited by: §1.