Scatter Points in Space: 3D Detection from Multi-view Monocular Images

Jianlin Liu^{1}, Zhuofei Huang^{2}, Dihe Huang^{3}, Shang Xu^{1}, Ying Chen^{1}, Yong Liu^{1}

^{1} Tencent, ^{2} HKUST, ^{3} Tsinghua University

{jenningsliu, shangxu, mumuychen, choasliu}@tencent.com, zhuangbr@cse.ust.hk, hdh20@mails.tsinghua.edu.cn

Abstract

3D object detection from monocular image(s) is a challenging and long-standing problem in computer vision. To combine information from different perspectives without troublesome 2D instance tracking, recent methods tend to aggregate multi-view features by densely sampling a regular 3D grid in space, which is inefficient. In this paper, we attempt to improve multi-view feature aggregation by proposing a learnable keypoint sampling method that scatters pseudo surface points in 3D space in order to preserve data sparsity. The scattered points, augmented by multi-view geometric constraints and visual features, are then employed to infer object locations and shapes in the scene. To make up for the limitations of a single frame and to model multi-view geometry explicitly, we further propose a surface filter module for noise suppression. Experimental results show that our method achieves significantly better performance than previous works in terms of 3D detection (more than 0.1 AP improvement on some categories of ScanNet). The code will be publicly available.

Figure 1: Problem definition. The aim is to infer 3D oriented boxes and meshes from multi-view posed RGB images with 2D detection boxes. For multi-view feature aggregation, existing methods predominantly use dense grid points. Our method instead first learns to generate dynamic keypoints around object surfaces, which serve as containers for feature aggregation.

1 Introduction

3D object detection usually requires accurate active depth sensing techniques such as LiDAR and structured-light cameras, which are prohibitively expensive or limited to certain circumstances. Therefore, some researchers have turned to RGB image(s) instead. Remarkable progress has been made in the last few years on monocular 3D detection, yet performance is limited by two major problems. On the one hand, the field of view (FoV) is often too small to capture large objects. On the other hand, estimating depth from a single monocular image is an ill-posed problem. In contrast, leveraging multi-view images for 3D detection can help resolve these limitations. Making explicit use of multi-view visual clues usually requires matching instance boxes or pixels across 2D images, which is called data association. However, data association can be very challenging when there is a large transformation between an image pair. Therefore, several recent works perform monocular 3D object detection/reconstruction in 3D space, exhibiting a trend of using a dense 3D grid as anchor points for aggregating multi-view deep features. Compared to aggregating multi-view features by 2D data association, accumulating temporal visual features via 3D anchor points better preserves 3D geometry and doesn't require any explicit data association. Note that a dense 3D grid is used because no prior about scene structure is available. Although the idea is natural and effective, it overlooks the sparsity of scene structure. To ensure good coverage of object surfaces, the grid resolution must be high, which leads to an excessively large memory footprint. We claim that using continuously sampled points in 3D space to replace regularly sampled 3D grid points is more efficient and effective. To generate continuously sampled anchor points without a 3D structure prior, we propose a learnable 3D anchor point sampling method based on back-projecting multi-view depth estimates. These continuous anchor points are scattered around object surfaces instead of being evenly placed throughout the whole scene space.

In this paper, we focus on solving the problem of 3D detection with multiple posed RGB images. As mentioned above, we will demonstrate that a pseudo 3D point cloud built from the predicted depth maps is a good proxy for multi-view feature aggregation. This design not only saves memory but also boosts the performance of downstream tasks. Specifically, the procedure of multi-frame depth map back-projection and point cloud sampling is termed Point Scattering. With camera intrinsics and poses given, the scattered points are further augmented by fetching multi-view 2D features. Additionally, in order to remedy the inaccuracy of single-view depth prediction with multi-view geometry, we propose a surface filtering module to filter out error-prone scattered points. Based on these augmented points with refined features, we adopt sparse voxel convolution to extract features that are aware of both geometry and appearance. Various downstream tasks such as detection or reconstruction can be applied on top of the extracted 3D features. In this work, we build a VoteNet [14] based 3D detector with a reconstruction head to demonstrate the effectiveness of our method.

In summary, the key contributions of this paper are as follows:

We propose to aggregate multi-view features by scattering continuous points in space, which reveal a rough scene skeleton. Furthermore, the scattered points are refined by a surface filtering module to obtain a cleaner point cloud representation.
We propose a novel 3D detection method based on scattered points augmented with multi-view features, which outperforms state-of-the-art multi-view monocular 3D detection methods by a large margin.

2 Related Works

Point cloud 3D Object Detection.

Normally, 3D detection takes a single-frame point cloud or depth map as input. Our work focuses on 3D understanding given only a series of RGB images, which is much more challenging than point cloud object detection. [19] proposes to use sparse convolution to reduce computation and space complexity in point cloud detection, which shows the significance of preserving sparsity. Compared to other RGB fusion methods, [18] reports superior detection results from appending 2D categorical information in a point-wise manner. Different from these works, our method fuses multi-frame RGB features with a sparse pseudo point cloud and leverages multi-view stereo to enhance geometric features.

Monocular 3D Object Detection.

Single-view 3D object detection is a challenging task in computer vision due to inaccurate depth prediction. [15] weights image features in the frustum grid by predicting a discrete depth distribution, and performs 3D detection on a bird's-eye-view (BEV) feature map. Another voxel-based method [8] proposes to learn a regular grid of 3D voxel features from the input image, which is aligned with the 3D scene space via a 3D feature lifting operator. Based on the 3D voxel features, its CenterNet-3D detection head formulates 3D detection as keypoint detection in 3D space. [12] assumes each in-room object has a multi-lateral relation with its surroundings, and takes all of them into account when predicting its bounding box. However, most monocular schemes suffer from the small field of view (FoV) when capturing large objects (e.g., a large dining table or conference table). Besides, the above voxel-based methods [15] [8] can easily waste memory when building voxel grids, since very few voxels are occupied by objects. In this work we propose a more memory-efficient point-based framework, which only back-projects some keypoints into 3D space.

Multi-view 3D Object Detection.

How to combine information from multiple frames is one major problem of multi-view 3D understanding. To solve this, some works detect objects in each frame separately and combine results across frames by data association. In [17], a 3D ray clustering strategy is utilized to match 2D detection results, which may fail when objects are close to each other. From the associated multi-view 2D boxes, [17] uses [11] and post-optimization to reconstruct 3D objects. MOLTR [7] is a unified framework for object-centric mapping from RGB videos, which consists of monocular 3D detection, multiple-model Bayesian filter tracking and shape code prediction. However, as in [9], data association is achieved by tracking, so continuous video frames are preferred to avoid tracking failure. More recently, researchers found that accumulating multi-view features upon a dense 3D grid can be used to perform end-to-end 3D reconstruction. [10] is the seminal work that accumulates image features in a 3D volume along pixel rays, which enables 3D reconstruction from multi-view images. Similarly, [2] processes monocular video with a transformer network that fuses observations into a volumetric scene representation. Both are limited by the cubic space complexity caused by the dense grid representation. Our method avoids error-prone data association and is more scalable, since we sample continuous points around surfaces to preserve data sparsity.

3 Methodology

Figure 2: Overview. Our method predicts a depth map for each RGB frame and back-projects pixels within 2D boxes to 3D space (Multi-view Point Scattering). After an approximately even sampling of the scattered points, 2D features from different views are aggregated in a point-wise manner to facilitate downstream tasks. In addition, a Surface Filter module is introduced to down-weight error-prone scattered points.

Our method mainly consists of three modules that we will describe in sequence: monocular depth estimation, point scattering with multi-view augmentation, and point-based 3D detection. Monocular depth estimation aims to provide a rough observation of a partial scene. As a key contribution of this paper, continuous anchor points are generated from depth maps at different viewpoints and further augmented by multi-view features. The augmented points contain geometric information, captured by coordinates and by feature variance across frames, as well as visual clues from multiple image frames. Taking these augmented points as a cornerstone, downstream 3D understanding tasks such as 3D detection and reconstruction can be applied. In this paper, we build a VoteNet [14] based 3D detection and reconstruction module by voxelizing the aforementioned scattered points with augmented features. Our method takes RGB images with known camera poses and intrinsic parameters, as well as optional 2D detection results. 3D information is reconstructed by monocular depth estimation and multi-view stereo instead of active sensing. How to obtain 2D detections is beyond the scope of this paper; in practice they can be easily obtained with any off-the-shelf 2D detector such as [16]. To isolate our results from the performance of the 2D detector, we generate 2D boxes by projecting 3D ground-truth meshes to each frame. As described in Section 3.2, 2D detection is used in two ways: (1) to reduce the number of pixels converted to 3D points, and (2) to append categorical information to the scattered points.

3.1 Monocular Depth Estimation

Without a 3D prior of the scene, it is hard to build anchor points other than volume grid points. Therefore, we resort to monocular depth estimation as a proxy for learning 3D anchor points. For depth estimation, we follow the monocular depth estimation method [6] and use ordinal regression instead of direct regression. We simplify the network structure in [6] to reduce the number of parameters, since it works as a submodule in our end-to-end framework. We remove the Scene Understanding Layer the decoder from [6], and downsample the output resolution of the depth map by a factor of 4. As the depth range of our major dataset ScanNet is relatively small, we apply linear discretization instead of the spacing-increasing scheme for simplicity. However, the output depth value from ordinal regression is discrete. To obtain continuous depth from ordinal depth regression, we propose a conditional depth residual estimation layer. The depth residual layer takes the image feature $F$ and the coarse depth estimate $D$ as input, and infers the residual $D'$ between the ordinal depth regression and the ground truth:

$$D' = \mathrm{Convs}(D, F) \tag{1}$$

Hence, combined with the average of the pixelwise ordinal loss $L_{ordinal}$ defined in [6] over the entire image domain, our final loss function for monocular depth estimation is:

$$L_{depth} = L_{ordinal} + \| D + D' - D^{*} \| \tag{2}$$

where $D^{*}$ stands for continuous ground-truth depth map.
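As an illustration of how the coarse ordinal depth and the residual in Eq. 1 and Eq. 2 fit together, the following is a minimal NumPy sketch rather than the actual network code. It assumes a DORN-style decoding in which the predicted bin is the number of interior thresholds whose probability exceeds 0.5, and it assumes an L1 norm for the residual term, which the text does not specify. The function names and the validity mask are hypothetical.

```python
import numpy as np

def linear_bin_edges(d_min, d_max, num_bins):
    """Uniformly spaced bin edges t_0 < ... < t_K (linear discretization)."""
    return np.linspace(d_min, d_max, num_bins + 1)

def decode_ordinal(ordinal_prob, edges):
    """Decode ordinal probabilities of shape (K-1, H, W) into a coarse depth map D.

    ordinal_prob[k] is read as P(depth > t_{k+1}); this decoding rule is an
    assumption borrowed from DORN-style ordinal regression.
    """
    label = (ordinal_prob > 0.5).sum(axis=0)          # bin index in 0..K-1
    return 0.5 * (edges[label] + edges[label + 1])    # centre of the chosen bin

def residual_depth_loss(coarse, residual, gt, valid):
    """Residual term of Eq. 2, taken here as the mean L1 error of D + D' over valid pixels."""
    err = np.abs(coarse + residual - gt)
    return err[valid].mean()

# Toy example: 4 bins over [0 m, 4 m], two pixels.
edges = linear_bin_edges(0.0, 4.0, 4)
prob = np.array([[[0.9, 0.1]], [[0.8, 0.1]], [[0.2, 0.1]]])   # (K-1=3, H=1, W=2)
D = decode_ordinal(prob, edges)                               # discrete coarse depth
D_res = np.array([[0.1, -0.2]])                               # stand-in for Convs(D, F)
D_gt = np.array([[2.6, 0.3]])
loss = residual_depth_loss(D, D_res, D_gt, valid=D_gt > 0)
```

In this toy case the coarse depth can only take bin-centre values, and the residual moves it onto the continuous ground truth.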

3.2 Multi-view Points Scattering

Keypoints Generation.

Given the camera intrinsics, each predicted depth frame can be back-projected into a 3D point cloud. To keep the number of points tractable with a varying number of input frames, we need to discard duplicate points that are too close to existing points. Here, three strategies are adopted:

Only pixels within 2D detection boxes are converted to the point cloud. To keep the 3D distance between sampled points roughly consistent, we compute the pixel sampling stride as $\mathit{stride} = \frac{f \cdot r}{d}$, where $f$, $d$ and $r$ denote the focal length, the median depth inside the 2D box, and the predefined minimum 3D distance between points, respectively.
Before adding points from a new frame, a KNN search is used to filter out new points that are too close to any existing point.
After concatenating points from all frames, random sampling is used to limit the maximum number of scene points.
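The three strategies above can be sketched as follows. This is a simplified NumPy illustration, not the released code: it assumes a pinhole camera with intrinsic matrix `K`, camera-to-world poses as 4x4 matrices, integer boxes `(x0, y0, x1, y1)`, and a brute-force nearest-neighbour check in place of a proper KNN structure. All function names are hypothetical.

```python
import numpy as np

def pixel_stride(focal, median_depth, min_dist):
    """stride = f * r / d, clamped to at least one pixel."""
    return max(1, int(round(focal * min_dist / median_depth)))

def backproject_box(depth, K, cam_to_world, box, min_dist):
    """Back-project strided pixels inside one 2D box to world-space points (M, 3)."""
    x0, y0, x1, y1 = box
    stride = pixel_stride(K[0, 0], float(np.median(depth[y0:y1, x0:x1])), min_dist)
    vs, us = np.mgrid[y0:y1:stride, x0:x1:stride]
    us, vs = us.ravel(), vs.ravel()
    z = depth[vs, us]
    pix = np.stack([us, vs, np.ones_like(us)], axis=0).astype(np.float64)
    cam = (np.linalg.inv(K) @ pix) * z                 # X_cam = z * K^{-1} [u, v, 1]^T
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    return (cam_to_world @ cam_h)[:3].T

def merge_new_points(existing, new, min_dist):
    """Append only the new points that are at least min_dist from every existing point."""
    if len(existing) == 0:
        return new
    d2 = ((new[:, None, :] - existing[None, :, :]) ** 2).sum(-1)
    keep = d2.min(axis=1) >= min_dist ** 2
    return np.vstack([existing, new[keep]])

def scatter_points(frames, min_dist, max_points, rng):
    """frames: iterable of (depth, K, cam_to_world, boxes). Returns scene points (P, 3)."""
    points = np.zeros((0, 3))
    for depth, K, pose, boxes in frames:
        if not boxes:
            continue
        new = np.vstack([backproject_box(depth, K, pose, b, min_dist) for b in boxes])
        points = merge_new_points(points, new, min_dist)
    if len(points) > max_points:                       # final random sub-sampling
        points = points[rng.choice(len(points), max_points, replace=False)]
    return points

# Toy example: the same fronto-parallel frame seen twice adds no new points.
K = np.array([[100.0, 0.0, 20.0], [0.0, 100.0, 20.0], [0.0, 0.0, 1.0]])
frame = (np.full((40, 40), 2.0), K, np.eye(4), [(0, 0, 40, 40)])
pts = scatter_points([frame, frame], min_dist=0.1, max_points=1000,
                     rng=np.random.default_rng(0))
```

With a focal length of 100 pixels, a median depth of 2 m and $r = 0.1$ m, the stride is 5 pixels, so neighbouring samples are about 0.1 m apart in 3D.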

It should be mentioned that, with the help of the predicted depth map, the strategy for generating keypoints is flexible. For instance, we could sample several hypothesis points along the depth direction for each pixel. However, this would multiply the number of keypoints several times over. Considering that back-projection from multiple frames is approximately a random sampling around the object surface, we do not sample hypothesis points along the pixel ray.

Multi-view Feature Aggregation.

Inspired by [4], we augment the scattered points with multi-view image features. For a 3D point $X$, its projections to $N$ frames are denoted as $P = [p_{0}, p_{1}, \ldots, p_{N - 1}] \in R^{N \times 2}$. A feature sampling method such as bilinear interpolation can be used to fetch the corresponding features for $P$, giving $F = [f_{0}, f_{1}, \ldots, f_{N - 1}] \in R^{N \times C}$. We define an aggregation function as $f = \alpha (F, M) \in R^{C}$, where $M$ is the mask indicating valid projections, and let $\eta (\cdot)$ count the number of True entries. Two aggregation functions are used in our method:

Mean. $f = \bar{f} = \frac{1}{\eta (M)} \sum_{i = 0}^{N - 1} (f_{i} \cdot M_{i})$
Variance. $f = \frac{1}{\eta (M)} \sum_{i = 0}^{N - 1} M_{i} \cdot (f_{i} - \bar{f})^{2}$
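The two aggregation functions can be written compactly for a single point. The snippet below is a minimal NumPy sketch under the notation above ($F$ as an $N \times C$ array, $M$ as a boolean vector); the function name `aggregate` and the guard against points with no valid view are our own additions for illustration.

```python
import numpy as np

def aggregate(features, mask):
    """Masked mean and variance over views.

    features: (N, C) per-view features of one 3D point.
    mask:     (N,) boolean, True where the projection is valid.
    Returns (mean, variance), each of shape (C,).
    """
    features = np.asarray(features, dtype=np.float64)
    m = np.asarray(mask, dtype=np.float64)[:, None]    # (N, 1)
    count = max(m.sum(), 1.0)                          # eta(M), guarded against zero views
    mean = (features * m).sum(axis=0) / count
    var = (m * (features - mean) ** 2).sum(axis=0) / count
    return mean, var

# Toy example: the third view is invalid and must not affect the statistics.
F = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
M = np.array([True, True, False])
mean, var = aggregate(F, M)
```

A low variance means the valid views agree on the appearance of the point, which is the cue exploited by the surface filter.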

Note that, essentially, this procedure shares the same spirit as the cost volume used in deep multi-view stereo [20]. However, it differs in two aspects: (1) the anchor points are learnable and dynamic, as they are generated by our Multi-view Points Scattering module; (2) there is no reference frame, as each single frame can only observe part of the scene.

Following [18], we append a one-hot categorical feature to the feature map when 2D detection is available. Although instance masks can provide more accurate semantic information, 2D bounding boxes are used in our method for simplicity.

Figure 3: Number of outliers with and without the surface filter. Outliers far away from any object surface are marked in red, while blue points represent inliers. The surface filter module helps decrease the number of outliers.

Surface Filtering.

The point scattering described above reconstructs a skeleton of the scene, although it contains some noisy outliers. Here, outliers refer to points that are farther than a predefined threshold from any object surface, as shown in Figure 3. In this section, we discuss how to alleviate the impact of these outliers to obtain a cleaner point cloud representation for downstream tasks. Again, the basic idea stems from multi-view stereo, which assumes that if a pixel is warped to another frame using the correct depth, the appearance of the two corresponding pixels should be similar. Based on this assumption, the multi-view feature variance of a point is a suitable criterion for outlier classification.

In practice, we first lift the image to feature space with three standalone convolution layers, and then aggregate the mean and variance of the multi-view features for each point. The image feature used by the surface filter is designed to be independent of the main 2D backbone, so as not to interfere with the training of depth estimation. The obtained point cloud features are fed into several EdgeConv layers followed by MLP layers to predict a confidence score. As indicated in Figure 2, instead of completely eliminating outliers with a predefined threshold, the aggregated features of the scattered points are weighted by the predicted surface score in a soft way. Intuitively, noisy points that violate visual consistency are down-weighted, making downstream tasks easier. During training, a KNN search between the scattered points and the ground-truth point cloud is used for label assignment. We use a binary focal loss for pointwise classification. Let $X$ denote the scattered points, $Y$ the ground-truth scene points, and $p$ the estimated score for a point $x \in X$. With inlier distance threshold $\tau$ and focal-loss focusing parameter $\gamma$, the loss for $x$ is:

$$L_{surface} = - (1 - p_{t})^{\gamma} \log (p_{t}), \qquad \text{where } p_{t} = \begin{cases} p, & \text{if } \min_{y \in Y} \| x - y \| < \tau \\ 1 - p, & \text{otherwise} \end{cases} \tag{3}$$
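To make the label assignment and Eq. 3 concrete, here is a small NumPy sketch. It uses a brute-force nearest-neighbour search instead of a dedicated KNN structure, and the default $\gamma = 2$ is only a common choice for focal loss, not a value stated in this paper. Function names are hypothetical.

```python
import numpy as np

def surface_labels(scattered, gt_points, tau):
    """True for scattered points whose nearest ground-truth point is closer than tau."""
    d2 = ((scattered[:, None, :] - gt_points[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1)) < tau

def binary_focal_loss(p, is_inlier, gamma=2.0, eps=1e-7):
    """Per-point focal loss -(1 - p_t)^gamma * log(p_t), following Eq. 3."""
    p = np.clip(p, eps, 1.0 - eps)                 # avoid log(0)
    p_t = np.where(is_inlier, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# Toy example: one point on the surface, one far away, both scored 0.9.
X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
Y = np.array([[0.0, 0.0, 0.01]])
labels = surface_labels(X, Y, tau=0.1)
loss = binary_focal_loss(np.array([0.9, 0.9]), labels)
```

The confident, correct prediction on the inlier contributes very little loss, while the confident, wrong prediction on the outlier dominates.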

Figure 4: Qualitative result of 3D detection on ScanNet. First row: Ground-truth annotations from Scan2CAD. Second row: Our prediction results for 3D detection. Colors for categories: blue $\to$ garbage bin, red $\to$ chair, green $\to$ table, purple $\to$ cabinet.

3.3 3D Object Detection and Reconstruction

In this section, we build a 3D detector on top of the final points augmented with multi-view features. Different from traditional point cloud 3D detection, the pseudo point cloud is generated by the previous stages and associated with multi-view features. First, to extract 3D features, we voxelize the augmented scattered points and feed them into a sparse convolution U-Net backbone. Then, a VoteNet-based 3D detector is applied to infer oriented 3D bounding boxes of the objects in the scene. Following convention, a 3D bounding box is represented as $[x, y, z, w, h, d, r_{z}]$, where $[x, y, z]$ is the offset from the proposal center to the ground-truth object center, $[w, h, d]$ is the object size, and $r_{z}$ is the rotation around the $z$-axis (assuming zero roll and pitch). In addition, the object category is predicted by another classification head. The detection losses are the same as in [14].

As a bonus, a reconstruction head can be added in parallel with the detection heads. The reconstruction head shares the same center proposal feature. In order to (1) decouple object pose from shape and (2) obtain a complete object mesh from partial observations, the reconstruction head is only required to estimate a shape code for a pretrained 3D mesh decoder. Here we choose [13], pretrained on ShapeNet [3], as the mesh decoder. The ground-truth shape code is generated offline by the mesh decoder for every instance in our training data, which helps the reconstruction network learn faster. During training, the discrepancy between the GT shape code and the prediction is minimized. Besides the shape code loss $L_{code} = \| \hat{c} - c \|^{2}$, the chamfer distance between target and prediction as well as several regularization terms are also considered. Thus, the overall reconstruction loss is defined as $L_{recon} = L_{code} + L_{error} + L_{CD} + L_{edge} + L_{normal} + L_{smoothness}$. For the detailed definitions, please refer to the original paper [13].

4 Experiments

4.1 Dataset and Metrics

We conduct experiments on ScanNet (V2) [5] with Scan2CAD [1] annotations to demonstrate the effectiveness of the proposed method. ScanNet is an indoor dataset with ground-truth image poses, mesh reconstructions and 3D instance segmentation labels; however, no 3D box annotation is available. Following [17], the oriented box annotations in [1] are used in our experiments. Other data such as camera intrinsics and extrinsics remain the same. We follow the standard training/validation splits (1201 and 312 scans respectively) provided by the benchmark.

Mean Average Precision (mAP) is commonly used for evaluating 3D object detection. mAP is computed by averaging AP (Average Precision) over semantic classes. We adopt the AP of each semantic class as our evaluation metric, and also report the recall of each class.

Chamfer Distance and F-Score are used as reconstruction metrics. Let $G$ and $R$ denote the ground-truth point set and the reconstructed point set, with $N_{G}$ and $N_{R}$ points respectively. Chamfer Distance is formulated as,

$$L_{CD} = \frac{1}{N_{G}} \sum_{p \in G} \min_{q \in R} \| p - q \|^{2} + \frac{1}{N_{R}} \sum_{q \in R} \min_{p \in G} \| p - q \|^{2} \tag{4}$$

And F-Score is defined as,

$$\text{F-Score}(d) = \frac{2 P (d) R (d)}{P (d) + R (d)} \tag{5}$$
$$P (d) = \frac{1}{N_{R}} \sum_{q \in R} \left[ \min_{p \in G} \| p - q \|^{2} < d \right]$$
$$R (d) = \frac{1}{N_{G}} \sum_{p \in G} \left[ \min_{q \in R} \| p - q \|^{2} < d \right]$$
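Both reconstruction metrics reduce to nearest-neighbour squared distances between the two point sets. The following NumPy sketch reads Eq. 4 as the usual symmetric mean of nearest-neighbour squared distances, consistent with the definitions of $P(d)$ and $R(d)$ in Eq. 5. It uses a brute-force distance matrix, which is adequate for a few thousand sampled points. Function names are our own.

```python
import numpy as np

def nn_sq_dist(a, b):
    """Squared distance from each point in a (N, 3) to its nearest neighbour in b (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1)

def chamfer_distance(gt, rec):
    """Symmetric Chamfer Distance between ground-truth and reconstructed points (Eq. 4)."""
    return nn_sq_dist(gt, rec).mean() + nn_sq_dist(rec, gt).mean()

def f_score(gt, rec, d):
    """F-Score at squared-distance threshold d (Eq. 5), returned as a fraction in [0, 1]."""
    precision = (nn_sq_dist(rec, gt) < d).mean()   # reconstructed points near the GT
    recall = (nn_sq_dist(gt, rec) < d).mean()      # GT points covered by the reconstruction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: two points are reconstructed well, one reconstructed point is spurious.
G = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
R = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.1], [5.0, 0.0, 0.0]])
cd = chamfer_distance(G, R)
fs = f_score(G, R, d=0.02)
```

Here the spurious point lowers precision to 2/3 while recall stays at 1. Note that this sketch returns the F-Score as a fraction, so it would need to be multiplied by 100 to be read as a percentage.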

4.2 Implementation Details

Although our method can be trained end-to-end, we observed that it takes longer to converge due to the mutual adaptation of the different modules. Therefore, training is divided into three stages: (1) train Monocular Depth Estimation; (2) train Depth Estimation and the Surface Filter module from the pretrained model of stage 1; (3) train the whole network with the pretrained weights from stage 2. As it is redundant to feed all frames to the network, keyframes are chosen based on a strategy that prefers frames with 2D detections and adequate ego-motion. The number of keyframes is 32 for training and 50 for validation. The Adam optimizer with learning rate $5 \times 10^{- 4}$ is used for all experiments.

4.3 Evaluation Results

Method	Chair AP@0.5	Chair AP@0.25	Table AP@0.5	Table AP@0.25	Monitor AP@0.5	Monitor AP@0.25
FroDO	0.32	-	0.06	-	0.04	-
MOLTR	0.39	-	0.06	-	0.10	-
Total3D	0.35	0.45	0.12	0.42	0.04	0.16
MDR	0.20	0.54	0.05	0.22	0.02	0.06
Ours	0.57	0.81	0.33	0.77	0.05	0.30

Table 1: 3D detection comparison with previous methods on the ScanNet-V2 val split. Our method samples 50 frames for each scan. Our results reported here are obtained by training a separate model for each category.

Category	AP@0.25	R@0.25	AP@0.5	R@0.5
Chair	0.76	0.89	0.51	0.62
Table	0.67	0.83	0.29	0.41
Monitor	0.23	0.50	0.05	0.13
Sofa	0.52	0.81	0.24	0.39
Bed	0.76	0.91	0.52	0.57
Trashbin	0.33	0.51	0.11	0.13
Bathtub	0.30	0.47	0.14	0.16
Bookshelf	0.40	0.60	0.11	0.22
Cabinet	0.47	0.68	0.25	0.38
mean	0.49	0.69	0.25	0.33

Table 2: 3D detection results for 9 categories on the ScanNet-V2 val split. R@$t$ stands for recall with IoU threshold $t$. All 9 categories are trained with a single model.

Figure 5: Reconstruction results of chairs on ScanNet. Left: ground-truth mesh scans for reference. Right: reconstruction results.

As far as we know, [17] and [7] are the only two multi-frame methods that report metrics on the same Scan2CAD labels. To compare with more existing methods in the same setting, we extend two single-frame schemes [12][8] to fuse multi-frame predictions. During inference, we estimate 3D detections for every single frame, and then assemble all detections in the same scene coordinate frame. After that, non-maximum suppression (NMS) is applied with an IoU threshold of 0.01. To align evaluation metrics with FroDO [17] and MOLTR [7], the 11-point AP is calculated in all of our experiments.
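For reference, the 11-point AP can be sketched as below. This is the standard interpolated definition (mean of the maximum precision at recall levels 0, 0.1, ..., 1), written as a small NumPy illustration; it assumes detections have already been matched to ground truth at the chosen IoU threshold, so `is_true_positive` is taken as given, and the function name is hypothetical.

```python
import numpy as np

def eleven_point_ap(scores, is_true_positive, num_gt):
    """11-point interpolated Average Precision for one class.

    scores:           confidence of each detection.
    is_true_positive: 1 if the detection matched an unmatched GT box, else 0.
    num_gt:           number of ground-truth boxes of this class.
    """
    order = np.argsort(-np.asarray(scores, dtype=np.float64))   # sort by descending score
    tp = np.asarray(is_true_positive, dtype=np.float64)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        reached = recall >= t
        # Best precision among operating points with recall at least t.
        ap += precision[reached].max() if reached.any() else 0.0
    return ap / 11.0

# Toy example: three detections, the second is a false positive, two GT boxes.
ap = eleven_point_ap([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2)
```

In the toy example the precision envelope is 1.0 up to recall 0.5 and 2/3 beyond it, so the AP is $(6 \cdot 1 + 5 \cdot \tfrac{2}{3}) / 11$.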

As shown in Table 1, our method outperforms FroDO [17] by a margin of 0.24 AP on the chair class, without the time-consuming non-linear optimization used in FroDO. Due to the limited FoV of a single frame, MOLTR [7] and other methods that combine 3D detections from individual frames perform worse than our method, especially on large objects such as tables. Meanwhile, we train on all 9 categories with a single model, and the evaluation results are shown in Table 2.

To measure the reconstruction performance of our pipeline, (1) only those boxes that have IoU $> 0.25$ with any ground-truth box are kept, and (2) 2048 points are randomly sampled from both the predicted and the ground-truth mesh. The ground-truth meshes are taken from [13], where only the chair category is available. For F-score, the distance threshold is set to 0.004. The results are shown in Table 4. From Table 2 we find that when we train multiple categories within a single model, the corresponding AP scores for most categories are slightly lower than those in Table 1, where we train a separate model for each category, but still higher than those of the other methods we compare with.

Some qualitative results for 3D detection and reconstruction can be found in Figures 4 and 5 respectively. As the figures show, the reconstructed meshes are correctly located and resemble the ground-truth objects.

4.4 Ablation Studies

Point Sampling	Multi-view Feature	Surface Filter	AP
GS	✓	✗	0.33
PS	✗	✗	0.34
PS	✓	✗	0.56
PS	✓	✓	0.57

Table 3: Ablation study. We compare the Chair metrics with and without the key components of our method. GS means Grid Sampling, while PS refers to Point Scattering.

Category	CD	F-score
chair	0.0100	74.33

Table 4: Reconstruction result on ScanNet-V2 val split.

In order to validate the effectiveness of the components of our method, we conduct several ablation experiments on ScanNet. All reported metrics are evaluated on the chair class with AP@IoU=0.5.

Dense Grid vs. Scattered Points

We compare the two point sampling methods while keeping the other components the same as much as possible. Nevertheless, three changes are made for $GS$. (1) As using a dense grid leads to a significantly larger memory footprint, we have to increase the voxel size from 0.04m to 0.16m. (2) Obviously, without multi-view features, regular grid points carry no information for 3D detection. Therefore, multi-view features are kept when using $GS$ sampling. (3) We empirically found that $GS$ needs many more proposals due to the Farthest Point Sampling in VoteNet's proposal network. When using the same number of proposals (1024) as $PS$, $GS$ only obtains 0.16 AP. Thus, the number of proposals is increased to 8192. From the first two rows of Table 3, $PS$ slightly outperforms dense $GS$ even with no multi-view features and fewer proposals. With multi-view features, $PS$ achieves much better performance than $GS$.

Multi-view Feature augmentation.

As shown in Table 3, naively using back-projected points without augmenting them with multi-view features degrades performance significantly (0.34 vs. 0.56 AP). This result indicates that the multi-view features contain rich semantic information.

Surface Filter.

Although points augmented with multi-view features may capture multi-view geometry implicitly, modeling it explicitly leads to a further improvement on the 3D detection task. The surface filter module down-weights points that violate multi-view geometry, reducing noise for subsequent modules.

5 Conclusion

In this paper, we present an end-to-end method for 3D object detection and reconstruction from posed monocular images. The key novelty lies in the way the anchor points are constructed. By sampling from multi-view depth predictions, the dynamic anchor points are scattered around object surfaces. Next, these scattered points are further augmented by multi-view features and a surface filter module to serve as input to the 3D detector. Empirical results show that the proposed method achieves better performance than its counterparts.

References

[1] A. Avetisyan, M. Dahnert, A. Dai, M. Savva, A. X. Chang, and M. Nießner (2019) Scan2cad: learning cad model alignment in rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2614–2623.
[2] A. Bozic, P. Palafox, J. Thies, A. Dai, and M. Nießner (2021) Transformerfusion: monocular rgb scene reconstruction using transformers. Advances in Neural Information Processing Systems 34.
[3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012.
[4] R. Chen, S. Han, J. Xu, and H. Su (2019) Point-based multi-view stereo network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1538–1547.
[5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839.
[6] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2002–2011.
[7] K. Li, H. Rezatofighi, and I. Reid (2021) MOLTR: multiple object localization, tracking and reconstruction from monocular rgb videos. IEEE Robotics and Automation Letters 6 (2), pp. 3341–3348.
[8] F. Liu and X. Liu (2021) Voxel-based 3D Detection and Reconstruction of Multiple Objects from a Single Image.
[9] J. Luiten, T. Fischer, and B. Leibe (2019) Track to reconstruct and reconstruct to track. IEEE Robotics and Automation Letters, pp. 1803–1810.
[10] Z. Murez, T. van As, J. Bartolozzi, A. Sinha, V. Badrinarayanan, and A. Rabinovich (2020) Atlas: end-to-end 3d scene reconstruction from posed images. arXiv preprint arXiv:2003.10432.
[11] L. Nicholson, M. Milford, and N. Sünderhauf (2018) Quadricslam: dual quadrics from object detections as landmarks in object-oriented slam. IEEE Robotics and Automation Letters 4 (1), pp. 1–8.
[12] Y. Nie, X. Han, S. Guo, Y. Zheng, J. Chang, and J. J. Zhang (2020) Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes from a Single Image. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 52–61.
[13] J. Pan, X. Han, W. Chen, J. Tang, and K. Jia (2019) Deep mesh reconstruction from single rgb images via topology modification networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9964–9973.
[14] C. R. Qi, O. Litany, K. He, and L. J. Guibas (2019) Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9277–9286.
[15] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander (2021) Categorical Depth Distribution Network for Monocular 3D Object Detection.
[16] S. Ren, K. He, R. Girshick, and J. Sun (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39 (6), pp. 1137–1149.
[17] M. Runz, K. Li, M. Tang, L. Ma, C. Kong, T. Schmidt, I. Reid, L. Agapito, J. Straub, S. Lovegrove, et al. (2020) FroDO: from detections to 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14720–14729.
[18] S. Vora, A. H. Lang, B. Helou, and O. Beijbom (2020) Pointpainting: sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4604–4612.
[19] Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337.
[20] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) Mvsnet: depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783.