Learning 6D Pose Estimation from Synthetic RGBD Images
for Robotic Applications

Hongpeng Cao¹, Lukas Dirnberger¹, Daniele Bernardini, Cristina Piazza, Marco Caccamo
Technical University of Munich
Munich, Germany
cao.hongpeng@tum.de, lukas.dirnberger@tum.de, daniele.bernardini@tum.de,
cristina.piazza@tum.de, mcaccamo@tum.de

¹ The first two authors contributed equally.

Abstract

In this work, we propose a data generation pipeline that leverages the 3D suite Blender to produce synthetic RGBD image datasets with 6D poses for robotic picking. The proposed pipeline can efficiently generate large amounts of photo-realistic RGBD images for the object of interest. In addition, a collection of domain randomization techniques is introduced to bridge the gap between real and synthetic data. Furthermore, we develop a real-time two-stage 6D pose estimation approach by integrating the object detector YOLO-V4-tiny and the 6D pose estimation algorithm PVN3D for time-sensitive robotics applications. With the proposed data generation pipeline, our pose estimation approach can be trained from scratch using only synthetic data without any pre-trained models. The resulting network shows competitive performance compared to state-of-the-art methods when evaluated on the LineMOD dataset. We also demonstrate the proposed approach in a robotic experiment, grasping a household object from a cluttered background under different lighting conditions.

1 Introduction

Recognizing the precise translation and orientation of objects, the so-called 6D pose, is essential in many robotic applications, such as bin-picking [17] and human-robot collaboration [2].

Data-driven approaches train deep neural networks (DNNs) to predict the 6D object pose from RGB images or RGBD images, achieving promising performance. Estimating 6D poses from RGB images is challenging. Perspective ambiguities, where the appearances of the objects are similar under different viewpoints, hamper effective learning. This problem is further exacerbated by occlusions in cluttered scenarios [39]. Additionally, as in many computer vision tasks, the performance of the algorithms is vulnerable to environmental factors, such as lighting variations and cluttered backgrounds [24].

To overcome these challenges, RGBD-based 6D pose estimation algorithms leverage the additional modality of depth images, which provide geometric information that is independent of lighting and color. One way to leverage depth images is to refine a coarse pose predicted from RGB images [29, 16]. In this case, the initial poses are estimated from the RGB images using DNNs, and the depth information is used to optimize the pose with the Iterative Closest Point (ICP) algorithm to increase the accuracy. Another approach is to convert the depth image into a point cloud, from which the 6D pose is predicted [5, 9]. Due to the unstructured nature of the data, working directly on the point cloud is computationally expensive. Related work [5, 9] first employs an instance detection network to segment the target from the RGB images and crop the point cloud correspondingly. After that, point cloud networks work on the cropped point cloud to predict the 6D pose. Alternatively, the geometric features can be directly extracted from the point cloud using DNNs and merged with the RGB features [37, 18, 33, 11, 10, 19]. Typically, the features of both modalities are matched geometrically and concatenated before further processing [37, 18, 33, 11]. FFB6D [10] explores bidirectional feature fusion at different stages of learning. Similarly, [19] develops a two-stage approach by leveraging YOLO-V3 [28] and FFB6D [10] to achieve better performance.

Different approaches exist to derive the object pose from the extracted features. Direct regression uses dense neural networks to regress the object’s pose directly [41]. While this approach allows end-to-end learning and does not require decoding the inferred pose, the optimization of the DNNs is usually difficult due to the limitations of the mathematical representation of the orientation [41]. Another common approach is to predict orientation-less keypoints and retrieve the pose from their geometric correspondences. [11, 10, 19] use DNNs to predict the keypoints in 3D space, and then compute the 6D pose via geometric matching of the predicted keypoints with the corresponding ground-truth keypoints.

The DNNs presented in those RGBD-based approaches are typically complex structures with a large number of parameters. The training of these deep models is non-trivial due to the high demand for training data [15]. 6D pose labeling of images is time and labor intensive, which limits the availability of datasets. On the other hand, using modern simulations to generate synthetic data for training DNNs shows great potential with low cost and high efficiency. For RGB-based approaches, [29] and [16] render 3D meshes in OpenGL to generate synthetic RGB images with random backgrounds from commonly used computer vision datasets, for example Pascal VOC [7] or MS COCO [20]. Recent RGBD approaches [11, 19, 10, 33] use image composition in RGB and only render depth for the labeled objects. Modern simulations, such as Unity or Blender, enable realistic rendering for full RGBD images, making these engines popular to generate high-quality training datasets. [9] uses data generated from Blender [6], whereas [4] relies on Unity to simulate RGBD images.

Nevertheless, the performance of models solely trained on synthetic data often deteriorates when tested on real images due to the so-called reality-gap [29, 35]. To mitigate the reality-gap, domain randomization techniques are often applied to the synthetic data [32]. Domain randomization can be applied to different aspects of image generation. Before rendering, the scene can be randomized by varying the pose of objects, backgrounds, lighting, and the environment to cover as many scenarios as possible [29, 9, 21]. After rendering, the RGB and depth images can be directly altered, for example by changing image contrast and saturation or by adding Gaussian blur and color distortion [29, 16]. The depth images can be randomized by injecting Gaussian and Perlin noise [31] to approximate the noise present in the images of a real camera.

In this work, we focus on overcoming the challenge of annotating real RGBD images for training 6D pose estimation algorithms in robotic applications. To avoid manual data preparation, we propose an efficient data generation pipeline to render photo-realistic RGBD images with relative pose annotations. We also introduce a collection of domain randomization techniques to mitigate the gap between real and synthetic data. Furthermore, we develop a two-stage 6D pose estimation approach by integrating YOLO-V4-tiny [3] and PVN3D [11] to achieve 6D pose estimation in real time for robotic applications. With the proposed data generation pipeline, our pose estimation approach can be trained from scratch using only synthetic data without any pre-trained models. The model trained on our synthetic data achieves competitive performance compared to state-of-the-art methods when evaluated on the LineMOD dataset [13]. We also demonstrate the effectiveness of the proposed data generation pipeline and the two-stage pose estimation approach in a robotic grasping experiment. Our contributions can be summarized as follows:

We introduce a data generation pipeline with a collection of domain randomization techniques for preparing large-scale synthetic RGBD datasets.
We propose a real-time 6D pose estimation model based on a two-stage approach using YOLO-V4-tiny [3] and PVN3D [11].
We demonstrate the proposed data generation pipeline and the two-stage pose estimation approach in a robotic grasping experiment.
We also contribute the dataset and all the pretrained models used in this work to the community.

2 Method

In this section, we first introduce a data preparation pipeline for synthetic data generation and augmentation. Second, we present a two-stage approach to solve the 6D pose estimation problem in real time for robotic applications.

2.1 Synthetic data generation

In this work, the synthetic data is generated in Blender [6] by leveraging its state-of-the-art raycasting rendering functionality. To render RGBD images, a textured 3D model of the object is required, which can be derived from CAD data or collected by 3D scanning.

Image Generation. Given a set of objects, we generate a separate dataset for each object of interest, with the other objects and additional unrelated objects acting as distractors. For each scene to be rendered, we randomly place the objects in the camera’s view. To avoid overfitting to color during training, we recolor 25% of the distracting objects with the dominant color of the main object. Moreover, the distractors’ optical properties, such as surface roughness and reflectivity, are varied to further increase the variety of generated images. During simulation, the randomly placed distracting objects can severely occlude the main object so that it is no longer clearly visible, which results in invalid training data. To avoid this, we check whether the centroid of the main object is occluded; if so, we move the occluding objects behind the main object.

We sample images from SUN2012 [36] to use as backgrounds in Blender. Instead of adding the backgrounds to the images after rendering, we follow the idea of image-based lighting, where the background images are physically rendered as infinite spheres around the scene and emit light. Therefore, the backdrop images affect reflections and lighting conditions in the scene. Furthermore, a random number of point lights with arbitrary power and position is added to the scene. Once the scene is constructed, Blender renders the RGB and depth images.

Labels. The segmentation mask can be directly rendered using the object ID feature in Blender. The positions and rotations of all objects and the camera are known, so the ground truth transformation matrix $[R \mid t]$ of each object relative to the camera can be derived accordingly. The labels are saved to a separate JSON file for each image.
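The derivation of the relative pose label from known world poses can be sketched as follows. This is a minimal NumPy illustration, not the pipeline's actual code: the helper names `make_pose` and `object_pose_in_camera` and the example numbers are hypothetical. It also ignores camera axis conventions; Blender cameras look along their local -Z axis, so a fixed axis flip may be needed when a +Z-forward camera frame is expected.

```python
import numpy as np


def make_pose(rotation, translation):
    """Assemble a 4x4 homogeneous matrix [R|t] from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def object_pose_in_camera(world_from_object, world_from_camera):
    """Express the object pose in the camera frame: inv(T_world_cam) @ T_world_obj."""
    return np.linalg.inv(world_from_camera) @ world_from_object


# Hypothetical scene: camera 2 m and object 0.5 m along the world x-axis, both unrotated.
world_from_camera = make_pose(np.eye(3), [2.0, 0.0, 0.0])
world_from_object = make_pose(np.eye(3), [0.5, 0.0, 0.0])
camera_from_object = object_pose_in_camera(world_from_object, world_from_camera)
print(camera_from_object[:3, 3])  # translation of the object in the camera frame
```

In this toy scene the object ends up 1.5 m along the negative x-axis of the camera frame, and the rotation part stays the identity.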

2.2 Data augmentation

After rendering the images, we apply several augmentation techniques to mitigate the reality-gap. Transformations that change the object’s position in the image would invalidate the ground truth labels, the exception being rotations around the central axis of the imaging sensor. Therefore, an efficient method to increase the amount of training data is to rotate each image around the central axis and adjust the labels accordingly. We multiply our training data with this method and apply the following techniques separately to each rotated image.
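Adjusting a pose label for such an in-plane rotation can be sketched as below. The sketch assumes that the image is rotated about the principal point, so that the rotation corresponds to a rotation $R_z(\theta)$ of the camera frame about its optical axis; the function names are hypothetical, and the sign of the angle depends on the rotation convention of the image library in use.

```python
import math

import numpy as np


def rot_z(angle_rad):
    """Rotation matrix about the camera's optical (z) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rotate_pose_label(R, t, angle_rad):
    """Apply the same in-plane rotation to the label: R' = Rz R, t' = Rz t."""
    Rz = rot_z(angle_rad)
    return Rz @ np.asarray(R, dtype=float), Rz @ np.asarray(t, dtype=float)


# Hypothetical label: object 0.5 m in front of the camera, 0.1 m to the side.
R_new, t_new = rotate_pose_label(np.eye(3), [0.1, 0.0, 0.5], math.pi / 2)
print(t_new)
```

The depth component of the translation is unchanged by the rotation, which is why this augmentation keeps the depth image consistent with the adjusted label.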

RGB data augmentation. The synthetic RGB images are augmented by randomizing saturation, brightness, hue and contrast, as well as by sharpening and blurring. Moreover, we add Gaussian and smooth 2D Perlin noise as in [25] to each color channel to cover different environments and sensors.

Depth data augmentation. The synthetic depth images rendered from simulations are noiseless and almost perfect, which is not the case for images obtained from a real depth camera, where the depth values are often inconsistent and incomplete [22]. To approximate inconsistent depth values, we introduce Gaussian noise and Perlin noise to augment synthetic depth images. Similar to [31], pixel-level Gaussian noise is added to the synthetic depth images, resembling a blurring effect. Smooth Perlin noise has been shown to significantly increase performance when learning from synthetic depth data [30]. We create Perlin noise with random frequency and amplitude and add it directly to the depth channel. The introduced Perlin noise shifts each depth point along the perceived Z-axis, resulting in a warped point cloud, similar to the point clouds observed with real depth cameras. In real RGBD images, a misalignment can be observed between depth and RGB images. Similar to [38], we use Perlin noise again to additionally warp the depth image in the image plane. Instead of using a 3D vector field to warp the entire depth image, we restrict warping to the edges of the objects. We apply a Sobel filter to detect the edges and obtain edge masks. We then shift the pixels on the edges using a 2D vector field generated using Perlin noise.
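The edge-restricted warp described above can be illustrated with a small NumPy sketch. It is written under several assumptions: the function names and the threshold are hypothetical, the Sobel filter is implemented with array slicing rather than a GPU filter, and the 2D vector field is passed in by the caller (a constant one-pixel shift stands in for the Perlin-noise field in the example).

```python
import numpy as np


def sobel_edge_mask(depth, threshold):
    """Boolean mask of pixels whose Sobel gradient magnitude exceeds `threshold`."""
    p = np.pad(depth, 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy) > threshold


def warp_edges(depth, flow_x, flow_y, threshold=0.1):
    """Resample depth along a 2D vector field, but only on edge pixels."""
    h, w = depth.shape
    mask = sobel_edge_mask(depth, threshold)
    yy, xx = np.mgrid[0:h, 0:w]
    # Each edge pixel takes the depth value found at its displaced location.
    src_y = np.clip(yy + np.rint(flow_y).astype(int), 0, h - 1)
    src_x = np.clip(xx + np.rint(flow_x).astype(int), 0, w - 1)
    warped = depth[src_y, src_x]
    return np.where(mask, warped, depth), mask


# Toy depth image with a vertical step edge between columns 3 and 4.
depth = np.ones((8, 8))
depth[:, 4:] = 2.0
flow_x = np.ones_like(depth)   # stand-in for a Perlin-noise field: one pixel in x
flow_y = np.zeros_like(depth)
warped, edge_mask = warp_edges(depth, flow_x, flow_y, threshold=0.5)
```

In the toy example the step edge moves by one pixel while all non-edge pixels keep their original depth, which mimics a small misalignment between the depth and RGB images.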

The rendered depth images have no depth information where there is no 3D model, resulting in large empty areas between objects. However, it is also very important to simulate plausible depth values for the background [30].

The background depth is based on a randomly tilted plane, to which we add random Gaussian noise. The noise is sampled on a grid over the image and then interpolated. Additional Gaussian noise is sampled on a second grid and again interpolated. Due to the random and independent choices of grid size and interpolation for the two grids, we can achieve a wide variety of depth backgrounds. By adding an appropriate offset, we guarantee that the artificial background is in close proximity to the main object, which makes segmenting the object from the background more difficult. The artificial depth background then replaces the empty depth pixels in the original synthetic depth image.

In real depth images, some regions lack depth values and appear as holes due to strong reflections of the object or other limitations of the depth sensor [22]. To simulate the missing depth problem, we first generate a random 2D Perlin noise map, which is converted to a binary masking map based on a threshold. This binary masking map is then used to create missing regions in the synthetic depth image. While this method is not an accurate simulation, we found this approximation, in combination with the other augmentation strategies, useful to improve the accuracy of the neural network.
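The background synthesis and the hole simulation can be sketched together in NumPy. This is an illustration under stated assumptions rather than the original implementation: all function names and parameter values are hypothetical, bilinear interpolation of a coarse Gaussian grid is used as the smooth noise source (also in place of Perlin noise for the hole mask), and the offset rule that places the background just behind the farthest object pixel is one possible reading of "close proximity".

```python
import numpy as np


def smooth_noise(shape, grid, sigma, rng):
    """Gaussian noise sampled on a coarse grid x grid lattice, bilinearly interpolated."""
    h, w = shape
    coarse = rng.normal(0.0, sigma, size=(grid, grid))
    ys, xs = np.linspace(0, grid - 1, h), np.linspace(0, grid - 1, w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, grid - 1), np.minimum(x0 + 1, grid - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = coarse[y0][:, x0] * (1 - wx) + coarse[y0][:, x1] * wx
    bottom = coarse[y1][:, x0] * (1 - wx) + coarse[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def fill_background(depth, rng, max_slope=0.2, sigma=0.02, margin=0.05):
    """Replace empty (zero) depth pixels with a noisy, randomly tilted plane."""
    h, w = depth.shape
    yy, xx = np.mgrid[0:h, 0:w]
    a, b = rng.uniform(-max_slope, max_slope, size=2)
    background = a * (xx / w - 0.5) + b * (yy / h - 0.5)
    for _ in range(2):  # two independent grids with random resolution
        background = background + smooth_noise((h, w), int(rng.integers(2, 9)), sigma, rng)
    valid = depth > 0
    far = depth[valid].max() if valid.any() else 1.0
    # Assumed offset rule: the background starts `margin` behind the farthest object pixel.
    background = background + (far + margin - background.min())
    return np.where(valid, depth, background)


def punch_holes(depth, rng, grid=6, threshold=1.0):
    """Zero out regions where a smooth noise map exceeds `threshold`."""
    mask = smooth_noise(depth.shape, grid, 1.0, rng) > threshold
    out = depth.copy()
    out[mask] = 0.0
    return out, mask


rng = np.random.default_rng(0)
depth = np.zeros((48, 64))
depth[16:32, 24:40] = 0.8          # toy object at 0.8 m
filled = fill_background(depth, rng)
holes, hole_mask = punch_holes(filled, rng)
```

Raising `threshold` in `punch_holes` produces fewer and smaller holes, and the grid size controls how large and smooth the missing regions are.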

2.3 A two-stage 6D pose estimation approach

Figure 1: The two-stage pose estimation approach. Object detection with YOLO-V4-tiny localizes the object of interest in the first stage, followed by 6D object pose estimation with PVN3D-tiny in the second stage.

The goal of 6D pose estimation is to estimate the homogeneous matrix $[R \mid t] \in SE(3)$, which transforms the object from its coordinate system to the camera’s coordinate system. This transformation matrix consists of a rotation $R \in SO(3)$ and a translation $t \in \mathbb{R}^{3}$ of the target object. In this work, we use PVN3D [11] to infer the homogeneous matrix $[R \mid t]$ on the cropped region of interest (ROI) identified by a YOLO-V4-tiny [3] object detector. This two-stage approach is shown in Figure 1.

The RGB image is processed in the first stage using YOLO-V4-tiny, which provides several candidate bounding boxes and confidence scores. The bounding box with the highest confidence score for a specific object determines the ROI. Given the ROI, the cropped area is the smallest square that is centered on the ROI, contains it, and has a side length that is a multiple of the PVN3D input size (e.g. 80 x 80, 160 x 160, …). The square cropped images are then resized to 80 x 80 using nearest neighbor interpolation.
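The crop rule can be made concrete with a short sketch. The function name, the bounding box format `(x_min, y_min, x_max, y_max)`, and the example box are hypothetical; handling of crops that extend beyond the image border is not described in the text and is deliberately left out here.

```python
import math


def square_crop(bbox, base=80):
    """Return (left, top, side, scale) of the square crop around a bounding box.

    `side` is the smallest multiple of `base` that covers the longer box side,
    and `scale` is the integer factor by which the crop is later downsampled.
    """
    x_min, y_min, x_max, y_max = bbox
    width, height = x_max - x_min, y_max - y_min
    side = max(1, math.ceil(max(width, height) / base)) * base
    center_x = (x_min + x_max) / 2.0
    center_y = (y_min + y_max) / 2.0
    left = int(round(center_x - side / 2.0))
    top = int(round(center_y - side / 2.0))
    return left, top, side, side // base


# Hypothetical detection: a 100 x 130 pixel box.
left, top, side, scale = square_crop((300, 200, 400, 330))
print(left, top, side, scale)
```

Because the side length is an integer multiple of the input size, nearest neighbor resizing reduces to regular subsampling of the crop, for example `crop[::scale, ::scale]` on an array.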

Following PVN3D [11] and PointNet++ [26], the point cloud is enriched by appending point-wise R, G, B values and surface normals. We estimate the surface normal vectors by calculating the depth image’s gradients and deriving the pixel-wise normals geometrically as in [14]. Unlike the original PVN3D [11] implementation, which uses a nearest neighbor approach to compute the normals from unstructured point clouds, we calculate the normals from the structured depth image, which is more computationally efficient [23]. This also allows us to use a GPU-based gradient filter in TensorFlow. The resulting point cloud is then randomly subsampled to increase computational efficiency.
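One way to compute normals from a structured depth image is sketched below in NumPy. It is not the method of [14] verbatim: the sketch back-projects the depth image with assumed pinhole intrinsics (`fx`, `fy`, `cx`, `cy` are hypothetical values), takes finite differences of the point map along the image axes, and uses their cross product as the normal. Orienting the normals toward the camera is an additional assumption.

```python
import numpy as np


def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (H, W) to a structured point map (H, W, 3)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def normals_from_depth(depth, fx, fy, cx, cy):
    """Estimate unit surface normals from image-space gradients of the point map."""
    points = depth_to_points(depth, fx, fy, cx, cy)
    d_du = np.gradient(points, axis=1)  # change along image columns
    d_dv = np.gradient(points, axis=0)  # change along image rows
    normals = np.cross(d_du, d_dv)
    length = np.linalg.norm(normals, axis=-1, keepdims=True)
    normals = normals / np.maximum(length, 1e-12)
    # Flip normals that point away from the camera (assumed orientation convention).
    facing_away = np.sum(normals * points, axis=-1, keepdims=True) > 0
    return np.where(facing_away, -normals, normals)


# Toy example: a fronto-parallel plane 0.5 m in front of the camera.
depth = np.full((6, 8), 0.5)
normals = normals_from_depth(depth, fx=500.0, fy=500.0, cx=4.0, cy=3.0)
```

Since every step is an element-wise or fixed-stencil operation on the image grid, the same computation maps directly to GPU gradient filters, in contrast to a nearest neighbor search on an unstructured point cloud.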

In the second stage, PVN3D is used for the pose estimation, with PSPNet [40] and PointNet++ [26] as backbones to extract RGB and point cloud features separately. The extracted latent features are then fused by DenseFusionNet [33] at pixel level. Because of the resizing of the cropped RGB image, we map the resized features back to the nearest point in the point cloud. Shared MLPs are then used to regress the point-wise segmentation and the keypoint offsets $\{of_{i}\} \in \mathbb{R}^{3}$.

To obtain the final object pose, the point-wise segmentation filters out background points, and the keypoint offsets are added to the input point cloud to get keypoint candidates. In [11], the keypoint candidates are clustered using Mean-Shift clustering to obtain the final voted keypoints $\{\hat{kp}_{i}\} \in \mathbb{R}^{3}$. However, the Mean-Shift algorithm works iteratively, which prevents an efficient GPU implementation with deterministic execution time. To make the keypoint voting temporally deterministic, we first select, for each keypoint, a fixed number of points from the point cloud with the smallest predicted offsets. Compared to random sampling, this selection already removes outliers with large offsets. To eliminate any further outliers, we filter out any keypoint candidate whose distance to the mean prediction $\mu$ exceeds the standard deviation $\sigma$, i.e. the offsets $of_{i}$ are masked out if $|of_{i} - \mu| > \sigma$. After removing outliers, we average the remaining candidates along the $x$, $y$ and $z$ axes to obtain the voted keypoints $\{\hat{kp}_{i}\} \in \mathbb{R}^{3}$. We use singular value decomposition (SVD) to find the SE(3) transformation matrix $[R \mid t]$ between the predicted keypoints $\{\hat{kp}_{i}\}$ and the reference model keypoints $\{kp_{i}\}$.
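The voting and pose fitting steps can be sketched in NumPy as follows. This is an illustration, not the TensorFlow implementation used in this work: the function names and the candidate count are hypothetical, the one-sigma filter is applied per axis to the keypoint candidates (one possible reading of the criterion above), and the SVD step is the standard Kabsch solution for a least-squares rigid transform.

```python
import numpy as np


def vote_keypoint(points, offsets, num_candidates):
    """Vote one keypoint from (N, 3) object points and their predicted offsets."""
    # Fixed-size selection: points with the smallest predicted offset length.
    order = np.argsort(np.linalg.norm(offsets, axis=1))[:num_candidates]
    candidates = points[order] + offsets[order]
    # Mask candidates farther than one standard deviation from the mean (per axis).
    mu = candidates.mean(axis=0)
    sigma = candidates.std(axis=0)
    keep = np.abs(candidates - mu) <= sigma
    # Masked average along x, y and z; all array shapes are fixed in advance.
    return (candidates * keep).sum(axis=0) / np.maximum(keep.sum(axis=0), 1)


def fit_rigid_transform(model_kps, predicted_kps):
    """Least-squares R, t such that predicted_kps ~ model_kps @ R.T + t (Kabsch)."""
    mu_m, mu_p = model_kps.mean(axis=0), predicted_kps.mean(axis=0)
    H = (model_kps - mu_m).T @ (predicted_kps - mu_p)
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_p - R @ mu_m


# Synthetic check with a known pose and slightly noisy offset predictions.
rng = np.random.default_rng(0)
model_kps = rng.uniform(-0.1, 0.1, size=(8, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.05, -0.02, 0.6])
scene_kps = model_kps @ R_true.T + t_true
points = rng.uniform(-0.1, 0.1, size=(200, 3)) + t_true
voted = []
for kp in scene_kps:
    offsets = kp - points + rng.normal(0.0, 1e-3, size=points.shape)
    voted.append(vote_keypoint(points, offsets, num_candidates=64))
R_est, t_est = fit_rigid_transform(model_kps, np.array(voted))
```

The selection, masking and averaging use arrays of fixed size, so the amount of work does not depend on the data, unlike an iterative clustering that runs until convergence.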

The prediction accuracy is improved by cropping the image to the ROI, as only the relevant part of the data is processed. With the same number of sampling points, the sampled point cloud from the cropped image is denser, providing PointNet++ with richer geometric information for feature processing, which can also be observed in [19]. Given the cropped input, we could build the PVN3D with only about 8 million parameters, which is approximately 15% of the original implementation [12]. In our test on the LineMOD dataset, the reduced PVN3D performs similarly to the original model. We refer to the reduced PVN3D model as PVN3D-tiny.

3 Experimental Setup

In this section, we study the effectiveness of the proposed synthetic data preparation pipeline and the two-stage 6D pose estimation algorithm on the benchmark dataset LineMOD [13]. Specifically, we use the 3D models of the LineMOD dataset to generate synthetic RGBD data for training the proposed algorithm and use annotated real images to evaluate its performance.

3.1 Synthetic data preparation

In this work, we are interested in the 6D pose estimation of a single object in cluttered environments. For each LineMOD object, we render scenes in Blender to generate 20k synthetic images using the provided 3D model. We then augment the rendered dataset by rotating each image around its center 16 times, resulting in around 300k images. We discard the images where the object is out of view and pad the empty areas after rotation with zeros. Additionally, each image is augmented offline by applying the domain randomization techniques introduced in Section 2.2. For the training of PVN3D-tiny, we first crop the RGBD images to the region of interest according to the ground truth bounding boxes. Furthermore, we generate the point-wise semantic masks and keypoint offsets from the cropped RGBD images using the ground truth poses and segmentation masks.

3.2 Synthetic data inspection

To quantify the reality-gap between synthetic and real data, we sample 50 RGBD images each from the synthetic and the real dataset and compare global statistics of these two subsets. For RGB images, we compute the average and the standard deviation of brightness and saturation, as we qualitatively observed that these two factors have a strong influence on the appearance of the generated data. By comparing the brightness statistics of the synthetic and real subsets, we can tune the average power and randomization of the point lights and the light-emitting background in Blender. Similarly, with the saturation statistics, we can tune the color management in Blender. To study the statistics of depth images, we use the average power spectral density (PSD) and compare its average distribution over frequencies, as shown in Figure 3. Studying the PSD over frequencies allows us to inspect the structure of the depth images and to adjust the frequency of the Perlin noise used for depth augmentation accordingly. It can be seen that the augmented depth images are closer in frequency distribution to the real images than the non-augmented ones. This is an indication that depth augmentation reduces the gap between the synthetic and real data.
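A radially averaged PSD of this kind can be computed with NumPy's FFT routines, as in the following sketch. The function names and the number of frequency bins are hypothetical, and the exact normalization and binning behind Figure 3 are not specified in the text, so the choices below are assumptions.

```python
import numpy as np


def radial_psd(depth, num_bins=32):
    """Radially averaged power spectral density of one depth image.

    Returns bin-center frequencies (cycles per pixel) and the mean power per bin.
    """
    h, w = depth.shape
    spectrum = np.fft.fft2(depth - depth.mean())
    power = np.abs(spectrum) ** 2 / (h * w)
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.hypot(fx, fy)
    edges = np.linspace(0.0, radius.max(), num_bins + 1)
    index = np.clip(np.digitize(radius.ravel(), edges) - 1, 0, num_bins - 1)
    total = np.bincount(index, weights=power.ravel(), minlength=num_bins)
    count = np.bincount(index, minlength=num_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, total / np.maximum(count, 1)


def average_psd(images, num_bins=32):
    """Average the radial PSD over a set of equally sized depth images."""
    curves = [radial_psd(image, num_bins)[1] for image in images]
    return np.mean(curves, axis=0)


# Toy check: a sinusoidal pattern with 0.25 cycles per pixel along x.
x = np.arange(64)
image = np.tile(np.sin(2 * np.pi * 0.25 * x), (64, 1))
freqs, psd = radial_psd(image)
peak_frequency = freqs[np.argmax(psd)]
```

Computing `average_psd` once for real, synthetic and augmented synthetic depth images gives three curves that can be compared directly, which mirrors the comparison described above.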

The study of the global statistics for RGBD images is fairly cheap since no real annotations are required. The study also enables us to identify the reality-gap qualitatively and adjust the data generation parameters accordingly, for instance, brightness, depth frequencies, etc., to bring the synthetic data closer to the real data. Figure 2 shows some examples of real and synthetic images.

Figure 3: Average power spectral density (PSD) of depth images with respect to frequency for the class “cat”, computed over 50 randomly sampled images.

3.3 Implementation

The synthetic data generation pipeline is implemented in Python using Blender’s API. The data randomization and preprocessing are implemented using TensorFlow, accelerating the processing with GPUs. As for the two-stage 6D pose estimation approach, we use the original Darknet implementation [3] of YOLO-V4-tiny for the object detection at the first stage and PVN3D-tiny, implemented in TensorFlow, in the second stage.

3.4 Training and evaluation

We separately train a binary YOLO-V4-tiny model and a PVN3D-tiny model for each object of the LineMOD dataset. The YOLO-V4-tiny model is trained using the Darknet framework [27], and PVN3D-tiny is trained in TensorFlow [1]. All deep neural networks are trained from scratch using only synthetic data without any pretrained models. After training, we build the two-stage 6D pose estimation pipeline by combining YOLO-V4-tiny and PVN3D-tiny. We follow [11] to evaluate the 6D pose estimation performance on the annotated real images provided in LineMOD. The 6D pose estimation performance is measured using the ADD(S) metrics [13]. ADD measures the average distance between the point cloud transformed with the ground truth pose and the point cloud transformed with the predicted $R, t$, and is defined as follows:

$$\mathrm{ADD} = \frac{1}{m} \sum_{v \in O} \| (R v + t) - (R^{*} v + t^{*}) \|, \tag{1}$$

where $m$ is the number of sampled points, $R^{*}, t^{*}$ is the ground truth pose, and $v \in \mathbb{R}^{3}$ denotes a vertex of the object $O$. Similarly, the ADDS metric measures the average minimum distance between the two point clouds as:

$$\mathrm{ADDS} = \frac{1}{m} \sum_{v_{1} \in O} \min_{v_{2} \in O} \| (R v_{1} + t) - (R^{*} v_{2} + t^{*}) \|. \tag{2}$$

Compared to ADD, ADDS measures the distance to the nearest point instead of the corresponding mesh point. For symmetric objects, ADDS is better suited because ADD penalizes a predicted pose that differs from the ground truth even if the two poses are related by a rotation under which the object is invariant. The success rate on the test images is used to quantify the pose estimation performance. A prediction is typically classified as successful if its ADD(S) distance is below a threshold of 10% of the object’s diameter.
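Both metrics and the success criterion can be written in a few lines of NumPy. The function names and the toy object are hypothetical; the brute-force pairwise distance matrix in `adds_metric` is chosen for clarity and would typically be replaced by a nearest neighbor search for large point sets.

```python
import numpy as np


def add_metric(points, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding transformed model points."""
    pred = points @ R_pred.T + t_pred
    gt = points @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()


def adds_metric(points, R_pred, t_pred, R_gt, t_gt):
    """ADDS: mean distance from each predicted point to its nearest ground truth point."""
    pred = points @ R_pred.T + t_pred
    gt = points @ R_gt.T + t_gt
    pairwise = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)
    return pairwise.min(axis=1).mean()


def is_success(distance, diameter, fraction=0.1):
    """A prediction counts as correct if the distance is below 10% of the diameter."""
    return distance < fraction * diameter


# Toy symmetric object: the corners of a square, predicted with a 90 degree error about z.
square = np.array([[1.0, 1.0, 0.0], [-1.0, 1.0, 0.0], [-1.0, -1.0, 0.0], [1.0, -1.0, 0.0]])
R_gt, t_gt = np.eye(3), np.zeros(3)
R_pred = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
diameter = 2.0 * np.sqrt(2.0)
add_value = add_metric(square, R_pred, t_gt, R_gt, t_gt)
adds_value = adds_metric(square, R_pred, t_gt, R_gt, t_gt)
```

For this square, the 90 degree rotation maps the point set onto itself, so ADDS is zero and the prediction passes, while ADD is 2.0 and the prediction fails the 10% threshold.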

4 Results

In this section, we report the performance of the proposed two-stage 6D pose estimation algorithm on the LineMOD dataset after training on the synthetic data. We also apply the proposed data generation pipeline and 6D pose estimation algorithm for a household object and demonstrate the 6D pose estimation in a robotic grasping experiment.

4.1 6D pose estimation accuracy

We evaluate the performance of the proposed 6D pose estimation approach on all objects from the LineMOD dataset. We report the results in comparison to the state-of-the-art work in Table 1, in which the performance of PointFusion is taken from [19] and the performance of SSD-6D [16] is taken from [29]. Our approach generally performs well on most of the objects without pose refinement. In particular, it performs well on small objects like “ape” and “duck”, on which SSD-6D [16] and AAE [29] are less accurate. On the other hand, our model performs suboptimally on “Lamp” and “Holepuncher”. The reason could lie in the low-quality textures of the LineMOD models. Our approach, being trained end-to-end on RGBD data, could be more sensitive to less-detailed textures compared to refinement-based approaches.

Compared to the related work that only uses synthetic data for training, our approach outperforms AAE [29] and has competitive performance compared to SSD-6D [16]. Furthermore, [9] proposes a 6D pose estimation algorithm based on DGCNN [34] and reaches 98% average accuracy. However, it relies heavily on pose refinement, and it takes approximately one second to detect a single object. The table also shows that most of the work trained using real data outperforms the methods trained only on synthetic data.

The proposed two-stage approach is limited by the object detection performance at the first stage. When tested with ground truth bounding boxes, the performance of the 6D pose estimation at the second stage is improved by a large margin, achieving approximately 94% accuracy, as presented in Table 2. This result is comparable to the state-of-the-art methods trained using real data. One potential way to improve the object detection performance is to use RGBD images [8], so that the object detector can learn more robust features from both the appearance features provided by RGB images and the geometric features provided by depth images.

	Real data						Synthetic Data
	PointFusion	DenseFusion $^{*}$	G2L-Net	PVN3D	FFB6D	E2EK	AAE $^{*}$	SSD-6D $^{*}$	DGCNN $^{*}$	Ours
	[37]	[33]	[5]	[11]	[10]	[19]	[29]	[16]	[9]
Ape	70.4	92.3	96.8	97.3	98.4	98.7	20.55	65	97.7	78
Benchvise	80.7	93.2	96.1	99.7	100	100	64.25	80	99.8	92
Camera	60.8	94.4	98.2	99.6	99.9	99.9	63.20	78	98.3	66
Can	61.1	93.1	98	99.5	99.8	100	76.09	86	98.8	95
Cat	79.1	96.5	99.2	99.8	99.9	100	72.01	70	99.9	97
Driller	47.3	87	99.8	99.3	100	100	41.58	73	99.2	91
Duck	63	92.3	97.7	98.2	98.4	99.4	32.38	66	97.8	89
Eggbox	99.9	99.8	100	99.8	100	100	98.64	100	97.7	91
Glue	99.3	100	100	100	100	100	96.39	100	98.9	73
Holepuncher	71.8	92.1	99	99.9	99.8	100	49.88	49	94.1	28
Iron	83.2	97	99.3	99.7	99.9	100	63.11	78	100	94
Lamp	62.3	95.3	99.5	99.8	99.9	99.9	91.69	73	92.8	40
Phone	78.8	92.8	98.9	99.5	99.7	100	70.96	79	99.1	74
All	73.7	94.3	98.7	99.4	99.7	99.8	64.67	79	98.0	77.5
Speed (s)		0.06	0.044	0.19	0.075	0.068		0.1	1.0	0.046
$^{*}$ With refinement

Table 1: 6D pose estimation performance on LineMOD compared to state-of-the-art RGBD methods. The symmetric objects (Eggbox and Glue) were set in bold in the original table.

	Ape	Benchvise	Camera	Can	Cat	Driller	Duck	Eggbox	Glue	Holepuncher	Iron	Lamp	Phone	All
Predicted bboxes	78	92	66	95	97	91	89	91	73	28	94	40	74	77.5
GT bboxes	81	99	98	96	97	99	94	99	99	78	96	94	96	94.3

Table 2: 6D pose estimation ADD(S) scores, using predicted or ground truth bounding boxes

4.2 Running time

We evaluate the running time on a desktop with a Xeon Silver CPU (2.1 GHz) and an NVIDIA Quadro RTX 8000 graphics card. As shown in Table 3, the inference of PVN3D-tiny accounts for about half of the running time, and the other procedures run at similar speeds. Given 480x640 RGB and depth images, our approach takes approximately 46 ms for a single object pose estimation. Compared to the other approaches listed in Table 1, our approach achieves a sufficient running time for real-time tasks and comparable or better accuracy. As outlined in the next section, we found the accuracy to be sufficient for robotic grasping tasks.

Procedure	Time mean/std (ms)	Percent
YOLO-V4 Tiny	6.7/0.5	15%
Pcld Preproc.	8.2/5.9	18%
PVN3D Tiny	23.7/0.9	51%
Pose Regression	7.2/0.7	16%
All	45.8/6.2	100%

Table 3: Running time analysis of the proposed approach

4.3 6DoF pose estimation in robotic applications

We demonstrate the proposed approach on a household object, a rubber duck, as shown in Figure 4. We first use the proposed data generation pipeline to generate synthetic training data, for which the 3D model of the object is obtained using a Shining3D Transcan C 3D scanner. We train the proposed two-stage estimation approach using the synthetic data and use it for a robotic grasping task. The goal of the task is to grasp the rubber duck from a cluttered environment. The robot first estimates the 6D pose of the duck from an RGBD image and then attempts a grasp using one of the pre-registered grasp poses. If the object can be grasped and lifted without slipping, the grasp is regarded as a success, otherwise as a failure. After each grasp attempt, the object is deposited back on the table at a random location. Should the object be dropped in a position that allows no grasps, we skip this attempt and reset the experiment. We create three scenarios with normal, unbalanced and dark lighting conditions to test the algorithm’s robustness to different lighting levels. We repeat the grasp attempts 20 times for each test scenario.

As shown in Table 4, the robotic arm achieved 53 successful grasps out of 60 attempts, a success rate of approximately 88%. The three scenarios show similar success rates, indicating the algorithm’s robustness to different lighting levels. Notably, the proposed algorithm works well in low-lighting conditions. This could be attributed to two factors: first, training on domain-randomized synthetic data makes the algorithm learn more robust features; second, the use of depth can compensate for the lack of information in the RGB image when the lighting is poor.

Conditions	Success	Miss	Collision	Success rate
Normal	17	1	2	85%
Unbalanced	18	2	0	90%
Low	18	2	0	90%
All	53	5	2	88.3%

Table 4: Single object grasping experiments under different lighting conditions

5 Conclusions and Future Work

This work presents a pipeline to efficiently generate synthetic RGBD data for training 6D pose estimation algorithms. Additionally, domain randomization techniques are introduced to mitigate the gap between real and synthetic data. Furthermore, we develop a two-stage approach by integrating YOLO-V4-tiny [3] and PVN3D [11], which runs at approximately 20 FPS, to solve the 6D pose estimation problem for robotic applications.

We design a sim-to-real experiment to quantitatively evaluate the effectiveness of the proposed data generation pipeline and the 6D pose estimation approach, for which we only use the synthetic data to train the algorithm and report the ADD(S) accuracy on the LineMOD dataset. The experiment shows that the proposed two-stage detection approach achieves competitive performance compared to state-of-the-art methods. The pipeline also shows great potential for training on synthetic data and transferring to real data.

The performance of the pipeline could be improved by using the depth also in the first stage of the object detection since the first stage has been shown to be the bottleneck for the accuracy. Moreover, since the beginning of our work other point cloud networks have been introduced. We could use one of the novel approaches to further improve the performance of the second stage.

Moreover, we demonstrate the proposed approach in a robotic experiment, grasping a household object from a cluttered background under different lighting conditions. The 6D pose estimation algorithm trained on synthetic data proves robust to different lighting scenarios, as it performs well even in low lighting. With the proposed approach, the grasping achieves a success rate of approximately 88%, showing the feasibility of training a 6D pose estimation algorithm using only synthetic data and transferring the learned model to tackle robotic tasks in real life. In later work, we will extend the proposed approach to bin-picking, where multiple instances of the same object are present. Further optimization of the scalability and more exploration in data generation will be studied in future work.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Tamim Asfour, Mirko Waechter, Lukas Kaul, Samuel Rader, Pascal Weiner, Simon Ottenhaus, Raphael Grimm, You Zhou, Markus Grotz, and Fabian Paus. Armar-6: A high-performance humanoid for human-robot collaboration in real-world scenarios. IEEE Robotics & Automation Magazine, 26(4):108–121, 2019.
[3] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
[4] Steve Borkman, Adam Crespi, Saurav Dhakad, Sujoy Ganguly, Jonathan Hogins, You-Cyuan Jhang, Mohsen Kamalzadeh, Bowen Li, Steven Leal, Pete Parisi, et al. Unity perception: Generate synthetic data for computer vision. arXiv preprint arXiv:2107.04259, 2021.
[5] Wei Chen, Xi Jia, Hyung Jin Chang, Jinming Duan, and Ales Leonardis. G2l-net: Global to local network for real-time 6d pose estimation with embedding vector features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4233–4242, 2020.
[6] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.
[7] Mark Everingham and John Winn. The pascal visual object classes challenge 2012 (voc2012) development kit. Pattern Anal. Stat. Model. Comput. Learn., Tech. Rep, 2007:1–45, 2012.
[8] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from rgb-d images for object detection and segmentation. In European conference on computer vision, pages 345–360. Springer, 2014.
[9] Frederik Hagelskjær and Anders Glent Buch. Bridging the reality gap for pose estimation networks using sensor-based domain randomization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 935–944, 2021.
[10] Yisheng He, Haibin Huang, Haoqiang Fan, Qifeng Chen, and Jian Sun. Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3003–3013, 2021.
[11] Yisheng He, Wei Sun, Haibin Huang, Jianran Liu, Haoqiang Fan, and Jian Sun. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11632–11641, 2020.
[12] Yisheng He, Wei Sun, Haibin Huang, Jianran Liu, Haoqiang Fan, and Jian Sun. Supplementary material–pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. 2020.
[13] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision, pages 548–562. Springer, 2012.
[14] S. Holzer, R. B. Rusu, M. Dixon, S. Gedikli, and N. Navab. Adaptive neighborhood selection for real-time surface normal estimation from organized point cloud data using integral images. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2684–2689, 2012.
[15] Roman Kaskman, Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. Homebreweddb: Rgb-d dataset for 6d pose estimation of 3d objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
[16] Wadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, and Nassir Navab. Ssd-6d: Making rgb-based 3d detection and 6d pose estimation great again. In Proceedings of the IEEE international conference on computer vision, pages 1521–1529, 2017.
[17] Kilian Kleeberger, Christian Landgraf, and Marco F Huber. Large-scale 6d object pose estimation dataset for industrial bin-picking. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2573–2578. IEEE, 2019.
[18] Chi Li, Jin Bai, and Gregory D Hager. A unified framework for multi-view multi-class object pose estimation. In Proceedings of the european conference on computer vision (eccv), pages 254–269, 2018.
[19] Shifeng Lin, Zunran Wang, Yonggen Ling, Yidan Tao, and Chenguang Yang. E2ek: End-to-end regression network based on keypoint for 6d pose estimation. IEEE Robotics and Automation Letters, 7(3):6526–6533, 2022.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[21] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[22] Tanwi Mallick, Partha Pratim Das, and Arun Kumar Majumdar. Characterizations of noise in kinect depth images: A review. IEEE Sensors journal, 14(6):1731–1740, 2014.
[23] Yosuke Nakagawa, Hideaki Uchiyama, Hajime Nagahara, and Rin-Ichiro Taniguchi. Estimating surface normals with depth image gradients for fast and accurate registration. In 2015 International Conference on 3D Vision, pages 640–647, 2015.
[24] Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4561–4570, 2019.
[25] Ken Perlin. Improving noise. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 681–682, 2002.
[26] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
[27] Joseph Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016.
[28] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[29] Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the european conference on computer vision (ECCV), pages 699–715, 2018.
[30] Stefan Thalhammer, Kiru Park, Timothy Patten, Markus Vincze, and Walter G. Kropatsch. Sydd: Synthetic depth data randomization for object detection using domain-relevant background.
[31] Stefan Thalhammer, Timothy Patten, and Markus Vincze. Sydpose: Object detection and pose estimation in cluttered real-world depth images trained using only synthetic data. In 2019 International Conference on 3D Vision (3DV), pages 106–115. IEEE, 2019.
[32] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
[33] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3343–3352, 2019.
[34] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog), 38(5):1–12, 2019.
[35] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. 2018.
[36] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010.
[37] Danfei Xu, Dragomir Anguelov, and Ashesh Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 244–253, 2018.
[38] Sergey Zakharov, Benjamin Planche, Ziyan Wu, Andreas Hutter, Harald Kosch, and Slobodan Ilic. Keep it unreal: Bridging the realism gap for 2.5d recognition with geometry priors only. CoRR, abs/1804.09113, 2018.
[39] Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. Dpod: 6d pose object detector and refiner. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1941–1950, 2019.
[40] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
[41] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.
