FusionPortable: A Multi-Sensor Campus-Scene Dataset for Evaluation of Localization and Mapping Accuracy on Diverse Platforms
Abstract
Combining multiple sensors enables a robot to maximize its perceptual awareness of the environment and enhances its robustness to external disturbances, both of which are crucial to robotic navigation. This paper proposes the FusionPortable benchmark, a complete multi-sensor dataset with a diverse set of sequences for mobile robots, and makes three contributions. First, we present a portable and versatile multi-sensor suite that offers rich sensory measurements: 10Hz LiDAR point clouds, 20Hz stereo frame images, high-rate and asynchronous events from stereo event cameras, 200Hz inertial readings from an IMU, and 10Hz GPS signal. The sensors are temporally synchronized in hardware. The device is lightweight, self-contained, and has plug-and-play support for mobile robots. Second, we construct a dataset of 17 sequences that cover a variety of environments on the campus, collected with multiple robot platforms. Some sequences are challenging for existing SLAM algorithms. Third, we provide ground truth for the decoupled evaluation of localization and mapping performance. We additionally evaluate state-of-the-art SLAM approaches and identify their limitations. The dataset, consisting of raw sensor measurements, ground truth, calibration data, and evaluated algorithms, will be released.
I Introduction
I-A Motivation
Multi-sensor fusion for robust perception is fundamental to various robotic applications. Different sensors complement each other, so sensor fusion enhances a system’s perception capability. Over the past decades, research on multi-sensor SLAM has made substantial progress. High-quality open datasets, which collect multi-sensor data and provide a suite of benchmark tools, contribute significantly to this advancement. On one hand, these datasets waive prohibitive requirements on budget and workforce, such as system integration, calibration, and field operations. On the other hand, they investigate the advantages and limitations of current SLAM solutions and deliberately design practical but challenging sequences [pomerleau2012challenging, wang2020tartanair]. Several of them also introduce novel sensors and indicate future research opportunities [mueggler2017event]. Researchers can easily develop, validate, and rank their algorithms against others, thus accelerating breakthroughs. However, existing datasets were mostly collected with a single data collection platform or a simplified sensor configuration. Researchers may have only limited sensors with which to develop algorithms, which risks over-fitting to a benchmark. Hence, we consider that a desirable dataset should fulfill the following four requirements.
- Various sensors are required, making it possible to explore novel approaches that utilize them jointly.
- Algorithm evaluation should be fairly conducted on various mobile robots. These robots perform different motion patterns that may challenge the assumptions of several SLAM algorithms.
- Sequences have to cover environments from room scale (meter level) to large scale (kilometer level) to evaluate algorithms’ scalability.
- Ground-truth trajectories and 3D maps are required to evaluate algorithms’ localization and surface reconstruction accuracy, respectively.
Dataset | Platform | Environment | Sensor | GT Pose | GT Map | ||||
---|---|---|---|---|---|---|---|---|---|
IMU | GPS | LiDAR | Frame Cam. | Event Cam. | |||||
UZH-Event [mueggler2017event] | Handheld | In/Outdoors | ✓ | ✓ | Mocap | ||||
ETH-EuRoc [burri2016euroc] | MAV | Indoors | ✓ | ✓ | Mocap/LT | Nova MS50 | |||
TUM VI [schubert2018tum] | Handheld | In/Outdoors | ✓ | ✓ | Mocap | ||||
MIT DARPA [huang2010high] | Car | Urban | ✓ | ✓ | ✓ | ✓ | GPS/INS | ||
KITTI [geiger2013vision] | Car | Urban | ✓ | ✓ | ✓ | ✓ | RTK-GPS/INS | ||
Oxford RobotCar [maddern20171] | Car | Urban | ✓ | ✓ | ✓ | ✓ | GPS/INS/SLAM | ||
UrbanLoc [wen2020urbanloco] | Car | Urban | ✓ | ✓ | ✓ | ✓ | GPS/INS | ||
Newer College [ramezani2020newer] | Handheld | Outdoors | ✓ | ✓ | ✓ | 6DoF ICP | BLK | ||
NCLT [carlevaris2016university] | UGV | In/Outdoors | ✓ | ✓ | ✓ | ✓ | RTK-GPS/SLAM | ||
M2DGR [yin2021m2dgr] | UGV | In/Outdoors | ✓ | ✓ | ✓ | ✓ | ✓ | RTK-GPS/Mocap/LT | |
MVSEC [zhu2018multivehicle] | Handheld/UAV/Motorcycle/Car | In/Outdoors | ✓ | ✓ | ✓ | ✓ | ✓ | Mocap/SLAM | |
Ours (FusionPortable) | Handheld/Quad. Robot/UGV | In/Outdoors | ✓ | ✓ | ✓ | ✓ | ✓ | Mocap/RTK-GPS/6DoF NDT | BLK |
Mocap: Motion capture system. LT: Laser tracker. |
I-B Contributions
There appears to be an absence of compatible public datasets that satisfy these requirements, motivating us to propose a new SLAM benchmark.
This paper proposes the FusionPortable benchmark, a novel multi-sensor dataset with a set of sequences from diverse environments. Our contributions are three-fold. First, we build a portable and versatile multi-sensor device. Two RGB frame cameras are mounted on the left and right sides, one high-frequency and high-precision IMU is mounted internally, and one RTK-GPS is installed on top. Moreover, thanks to recent progress in sensor technology, both novel event cameras and high-resolution 3D LiDAR are available. Thus, we also integrate them into our sensor rig and investigate their performance. All these sensors are mounted on the same rigid aluminum-alloy parts, so their spatial relation has only a tiny dynamic deviation. The complete device has its own clock synchronization unit, processor, and battery, and is thus self-contained. Given its size, weight, and extensibility (see Fig. 1), we expect it to offer plug-and-play support for various mobile robots.
Second, we install the sensor rig on various platforms, namely a handheld mode with a gimbal stabilizer, a quadruped robot, and an autonomous vehicle, which perform distinct motions during dataset construction. Various structured or semi-structured environments on The Hong Kong University of Science and Technology (HKUST) campus, including the lab, garden, canteen, corridor, escalator, and outdoor road, are covered in the dataset. The collected sequences also present several environmental changes caused by external light, moving objects, and scene texture. These issues are challenging for SLAM algorithms.
Third, besides ground-truth poses, we also provide ground-truth maps of most indoor sequences. We consider that measuring the mapping accuracy is crucial for evaluation. We also benchmark several state-of-the-art (SOTA) SLAM systems, including two vision-based methods and four LiDAR-based approaches. To benefit the community, the dataset will be publicly released: https://ram-lab.com/file/site/multi-sensor-dataset.
II Related Work
There are extensive datasets for robotic perception. Here, we introduce related works with a focus on SLAM.
Several datasets were specifically designed for one type of sensor. Mueggler et al. [mueggler2017event] proposed the event camera dataset to overcome the illumination and motion blur issues that affect frame cameras. Pomerleau et al. [pomerleau2012challenging] proposed the point cloud dataset that covers a large spectrum of environmental structures to challenge registration algorithms. Handa et al. [handa2014benchmark] promoted the research on RGB-D cameras by publishing the ICL-NUIM dataset.
Complementing vision sensors with inertial measurements, visual-inertial odometry (VIO) approaches can tremendously improve camera tracking accuracy and robustness. Relevant datasets have been reported. Burri et al. [burri2016euroc] presented the EuRoc dataset collected by a micro aerial vehicle (MAV) in an industrial environment and a room. Schubert et al. [schubert2018tum] put forward the TUM VI benchmark by collecting handheld sequences with a careful photometric calibration.
The DARPA challenge has driven the development of autonomous vehicles. Huang et al. [huang2010high] presented the MIT DARPA dataset with over sequence. Geiger et al. [geiger2013vision] presented the KITTI driving benchmark where diverse perception tasks are explored. There are other datasets targeting long-term navigation [barnes2020oxford] and urban challenges [wen2020urbanloco].
Several datasets were collected by handheld devices and other types of ground robots. Ramezani et al. [ramezani2020newer] collected the Newer College Dataset with a handheld device. The NCLT dataset [carlevaris2016university] facilitated long-term SLAM research by collecting sequences on a college campus, over traverse and months. The M2DGR dataset, collected with a ground robot, covers various challenging scenarios such as entering lifts and indoor-outdoor traverse [yin2021m2dgr]. Zhu et al. [zhu2018multivehicle] proposed a multi-vehicle dataset for event-based perception.
Table I compares existing datasets with our work. In summary, our dataset is more complete in three aspects: 1) raw and rich sensory measurements; 2) data collection on three different platforms, including a legged robot; 3) ground-truth trajectories and 3D maps for algorithm evaluation.
III System Overview
This section introduces the sensors used in our dataset and how we achieve spatio-temporal calibration among them. Fig. 1 shows the handheld device equipped with multiple sensors and how it is mounted on three data collection platforms.
III-A Sensor Configuration
The sensors’ characteristics can be found in Table II. We use the Intel NUC to run sensor drivers, attach timestamps to sensor messages, and record messages into ROS bags on the Ubuntu system. The PC uses an i processor, TB solid-state drive (SSD), and GB DDR4 memory. Below, we provide a detailed description of these sensors.
III-A1 3D LiDARs
We configure the OS- LiDAR to provide accurate measurements of surrounding environments. This LiDAR has two attractive properties. First, an internal synchronized IMU outputs Hz linear accelerations and angular velocities. Second, it additionally outputs depth images, signal images, and ambient images of surroundings.
III-A2 Stereo Frame Cameras
Two FLIR BFS-U3-31S4C global-shutter color cameras are mounted on the two sides of the system, facing directly forward. They are synchronized by an external trigger and capture high-resolution images at fps. Their exposure time is set to a fixed value to minimize the relative latency. Our experiments show that the average difference in timestamps of these images is below .
III-A3 Stereo Event Cameras
Two event cameras are also configured. They possess several desirable properties: high temporal resolution, high dynamic range, and low power consumption. The cameras have a resolution and an internal high-rate IMU output. Event cameras are synchronized using the trigger signal generated from the left camera (master) to deliver sync pulses to the right (slave) through an external wire. But there is no way to synchronize the image acquisition (around - offset). To suppress the LiDAR’s laser light, both cameras are equipped with additional infrared filters. For indoor sequences, we manually set and fix the APS exposures, which helps to minimize the latency between cameras. For outdoor sequences, we use auto-exposure to avoid over- or under-exposure.
III-A4 Inertial Measurement Unit
A tactical-grade STIM IMU, rigidly mounted below the LiDAR, is employed as the main inertial sensor of the system. It features a high update rate (Hz) and measurements with low noise and low drift. Its bias instability is around .
III-A5 Global Positioning System
We additionally install a ZED-F9P RTK-GPS device on the top of the LiDAR. In outdoor scenes, the GPS is activated and provides accurate latitude, longitude, and altitude readings. But it may sometimes become unstable due to buildings’ occlusion.
III-B Sensor Calibration
We carefully calibrate the intrinsics of individual sensors, the extrinsics, and the overall time latency between sensors in advance. We define the coordinate system of the STIM IMU as the body frame. We provide calibration data and reports on the dataset website.
III-B1 Clock Synchronization
We use an FPGA to generate an external signal trigger to synchronize clocks of all sensors. This can guarantee data collection across multiple sensors with minimum latency. The FPGA receives a pulse-per-second (PPS) signal from the GPS and outputs Hz signal to the IMU, cameras, and LiDAR, respectively. The FPGA switches to use its internal clock to enable the time synchronization in GPS-denied scenes.
Sensor | Characteristics |
---|---|
3D LiDAR | |
| |
| |
IMU | |
GPS | |
III-B2 Stereo Camera Calibration
The intrinsics and extrinsics of our stereo frame and event cameras are estimated using the Matlab calibration toolbox with the pinhole camera model and the radial-tangential distortion model. We move the sensor suite in front of a checkerboard to collect a sequence of images. We evenly sample images as the calibration data and manually remove outliers with high reprojection errors.
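To make the camera model concrete, the following minimal sketch projects a 3D point with a pinhole model and radial-tangential distortion, and computes the reprojection error that can be used to spot outlier images. It is an illustration only, not the Matlab toolbox's implementation: the function names, the parameter names (fx, fy, cx, cy, k1, k2, p1, p2), and the choice of two radial coefficients are assumptions.

```python
def project_radtan(point_cam, fx, fy, cx, cy, k1, k2, p1, p2):
    """Project a 3D point (camera frame) to pixels.

    Assumed model: pinhole projection with two radial (k1, k2) and two
    tangential (p1, p2) distortion coefficients.
    """
    X, Y, Z = point_cam
    if Z <= 0:
        raise ValueError("point must lie in front of the camera")
    # Normalized image coordinates.
    x, y = X / Z, Y / Z
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    # Apply radial and tangential distortion.
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    # Map to pixel coordinates with the intrinsics.
    return fx * x_d + cx, fy * y_d + cy


def reprojection_error(point_cam, observed_px, **intrinsics):
    """Pixel distance between a projected point and its detected corner."""
    u, v = project_radtan(point_cam, **intrinsics)
    return ((u - observed_px[0]) ** 2 + (v - observed_px[1]) ** 2) ** 0.5
```

In a calibration pipeline, images whose mean corner error stays well above the rest would be candidates for manual removal.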
III-B3 Camera-IMU Extrinsic Calibration
The intrinsics of the IMUs are calibrated using the Allan variance toolbox (https://github.com/ori-drs/allan_variance_ros), which estimates the noise density and random walk of the gyroscope and accelerometer measurements. After that, the spatial and temporal parameters of a camera w.r.t. an IMU are obtained with Kalibr [furgale2013unified]. Our system contains several IMUs: the STIM, the ICM in the LiDAR, and two MPU in the DAVIS346 event cameras. Thus, we calibrate the intrinsics of these IMUs and estimate the extrinsics of these sensor pairs: (STIM, frame cameras), (STIM, event cameras), (left MPU, left DAVIS346), and (right MPU, right DAVIS346).
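As background for the IMU intrinsic calibration, the sketch below computes the overlapping Allan deviation of a single-axis rate signal for one averaging time. It is a generic textbook formulation, not the code of the toolbox cited above; the function name and arguments are assumptions. Noise density and random walk are typically read off the curve obtained by repeating this over many cluster sizes.

```python
import math


def allan_deviation(rates, sample_rate_hz, cluster_size):
    """Overlapping Allan deviation of a rate signal for tau = cluster_size / fs.

    rates: single-axis gyroscope or accelerometer samples (hypothetical input).
    Returns (tau, allan_deviation).
    """
    dt = 1.0 / sample_rate_hz
    tau = cluster_size * dt
    # Integrate the rate signal once (theta[0] = 0).
    theta = [0.0]
    for r in rates:
        theta.append(theta[-1] + r * dt)
    n_terms = len(theta) - 2 * cluster_size
    if n_terms <= 0:
        raise ValueError("not enough samples for this cluster size")
    # Sum of squared second differences of the integrated signal.
    acc = 0.0
    for k in range(n_terms):
        d = theta[k + 2 * cluster_size] - 2.0 * theta[k + cluster_size] + theta[k]
        acc += d * d
    avar = acc / (2.0 * tau * tau * n_terms)
    return tau, math.sqrt(avar)
```

A constant signal yields a deviation of zero, since a pure bias does not change between adjacent clusters.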
III-B4 Camera-LiDAR Extrinsic Calibration
Given initial extrinsics, we further refine the camera-LiDAR extrinsics. A checkerboard serves as the calibration target, providing distinctive corners and boundaries for data association. We extend the work of Zhou et al. [zhou2018automatic] by improving the feature extraction and matching steps. Instead, we extract the outer corners of the board from both point clouds and images. The extrinsics are optimized by minimizing the distance between all corresponding corners.
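The corner-based objective can be illustrated with a small cost function. This is only a sketch under stated assumptions: the distance is taken to be the pixel distance between LiDAR corners projected into the image and the detected image corners, the camera is an undistorted pinhole, and all names are hypothetical. The paper does not specify these details, and the optimizer that searches over the rotation and translation is not shown.

```python
import math


def transform_point(rotation, translation, point):
    """Apply p_cam = R * p_lidar + t, with R given as a 3x3 nested list."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def mean_corner_distance(rotation, translation, lidar_corners, image_corners,
                         fx, fy, cx, cy):
    """Mean pixel distance between projected LiDAR corners and image corners."""
    total = 0.0
    for p_lidar, (u_obs, v_obs) in zip(lidar_corners, image_corners):
        x, y, z = transform_point(rotation, translation, p_lidar)
        if z <= 0:
            raise ValueError("corner is behind the camera")
        # Undistorted pinhole projection (assumption).
        u = fx * x / z + cx
        v = fy * y / z + cy
        total += math.hypot(u - u_obs, v - v_obs)
    return total / len(lidar_corners)
```

A nonlinear least-squares solver would evaluate a cost of this kind repeatedly while adjusting the extrinsics, starting from the initial guess.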
IV Dataset Description
This section first introduces the overall features of different sequences, which stand as our basic criteria for data collection. Details are then described, including the ground truth estimation method and dataset format.
IV-A Sequences
The collected sequences should cover various environments, lighting conditions, motion patterns, dynamic objects, etc. We categorize major characteristics of our collected sequences as follows:
- Location: Environments are divided into indoors and outdoors. The GPS signal is available but sometimes unstable in outdoor environments.
- Structure: Structured environments can mainly be explained using geometric primitives (e.g., offices or buildings), while semi-structured environments contain both geometric and complex elements such as trees and sundries. Scenarios like narrow corridors are structured but may cause state estimators to degrade.
- Lighting Condition: Frame cameras are sensitive to external lighting conditions. Both weak and strong light may pose challenges to visual processing algorithms.
- Appearance: Texture-rich scenes help visual algorithms extract stable features (e.g., points and lines), while textureless scenes may negatively affect their performance. Also, many events are triggered in texture-rich scenes.
- Motion Pattern: Slow, normal, and fast motion may be performed. Regarding the mounted platforms, the handheld device performs arbitrary 6-DoF and jerky motions, the device installed on a gimbal stabilizer performs 6-DoF but stable motions, and the quadruped robot mostly performs planar but jerky motions. In contrast, the vehicle performs planar movements at a constant speed.
- Object Motion: In dynamic environments, several elements are moving while the data are captured. The longer the data capture, the more deformed these elements (e.g., pedestrians or cars) will be [pomerleau2012challenging]. In contrast, moving objects are few in static environments.
Platform | Sequence | T | D | Location | Structure | Lighting | Texture | Motion | Object | GT Pose | GT Map |
---|---|---|---|---|---|---|---|---|---|---|---|
Handheld | canteen_night | | | indoors | structured | weak | rich | 6-DoF | static | 6-DoF NDT | Yes |
Handheld | canteen_day | | | indoors | structured | normal | rich | 6-DoF | static | 6-DoF NDT | Yes |
Handheld | garden_night | | | indoors | structured | weak | rich | 6-DoF | static | 6-DoF NDT | Yes |
Handheld | garden_day | | | indoors | structured | normal | rich | 6-DoF | static | 6-DoF NDT | Yes |
Handheld | corridor_day | | | indoors | structured | weak | less | 6-DoF | static | 6-DoF NDT | Yes |
Handheld | escalator_day | | | indoors | structured | strong | rich | 6-DoF, height changes | dynamic | 6-DoF NDT | Yes |
Handheld | building_day | | | indoors | structured | normal | rich | 6-DoF | dynamic | 6-DoF NDT | Yes |
Handheld | MCR_slow | | | indoors | semi-structured | normal | rich | 6-DoF, jerky | static | OptiTrack | Yes |
Handheld | MCR_normal | | | indoors | semi-structured | normal | rich | 6-DoF, jerky | static | OptiTrack | Yes |
Handheld | MCR_fast | | | indoors | semi-structured | normal | rich | 6-DoF, jerky | static | OptiTrack | Yes |
Quadruped Robot | MCR_slow_ | | | indoors | semi-structured | normal | rich | planar, jerky | static | OptiTrack | Yes |
Quadruped Robot | MCR_slow_ | | | indoors | semi-structured | normal | rich | planar, jerky | static | OptiTrack | Yes |
Quadruped Robot | MCR_normal_ | | | indoors | semi-structured | normal | rich | planar, jerky | static | OptiTrack | Yes |
Quadruped Robot | MCR_normal_ | | | indoors | semi-structured | normal | rich | planar, jerky | static | OptiTrack | Yes |
Quadruped Robot | MCR_fast_ | | | indoors | semi-structured | normal | rich | planar, jerky | static | OptiTrack | Yes |
Quadruped Robot | MCR_fast_ | | | indoors | semi-structured | normal | rich | planar, jerky | static | OptiTrack | Yes |
Apollo | campus_road | | | outdoors | semi-structured | normal | rich | planar | dynamic | SLAM | No |
T: Total time. D: Total distance traveled. MCR: motion capture room. : Mean linear velocity. |
IV-B Ground-Truth Generation
Most sequences provide ground-truth poses for algorithm evaluation. In several indoor scenes, we also provide ground-truth maps of surrounding environments. The ground truth generation is detailed as follows:
- Ground-truth maps: In small- or middle-scale environments, we use the Leica BLK360 laser scanner to record a high-resolution, colorized, dense 3D map of the structure with millimeter accuracy from multiple locations. Fig. 4 visualizes three examples.
- Ground-truth poses: In the motion capture room, we use the OptiTrack system to measure the pose of the center of the reflective balls at Hz with millimeter accuracy. The OptiTrack is directly connected to the same PC to record poses, which minimizes the time latency. The extrinsics from the balls’ center to the body frame of the sensor rig are solved by hand-eye calibration. In middle-scale environments that are covered by the ground-truth maps, we employ NDT-based 6-DoF localization [koide2019portable] to estimate the LiDAR’s poses in a prior map as the ground-truth trajectory. In outdoor environments, we fuse the RTK-GPS signal with LiDAR-inertial measurements to obtain accurate trajectories based on LIO-SAM [shan2020lio].
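Once the hand-eye extrinsics are known, expressing a motion-capture pose in the body frame is a single transform composition. The sketch below illustrates this with 4x4 homogeneous matrices. The naming convention (T_world_marker for the measured pose of the reflective balls' center, T_marker_body for the calibrated extrinsics) and the yaw-only helper are assumptions made for the example, not the dataset's conventions.

```python
import math


def compose(a, b):
    """Return the product a * b of two 4x4 homogeneous transforms (nested lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_transform(yaw_rad, translation):
    """Build a transform from a rotation about z and a translation (demo helper)."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def marker_to_body_pose(T_world_marker, T_marker_body):
    """Pose of the body frame in the world: T_world_body = T_world_marker * T_marker_body."""
    return compose(T_world_marker, T_marker_body)


# Hypothetical values: marker rotated 90 degrees about z, body 1 m ahead of it.
T_world_marker = make_transform(math.pi / 2, (1.0, 0.0, 0.0))
T_marker_body = make_transform(0.0, (1.0, 0.0, 0.0))
T_world_body = marker_to_body_pose(T_world_marker, T_marker_body)
body_position = [row[3] for row in T_world_body[:3]]
```

If the calibration instead provides the inverse transform (body to marker), it would need to be inverted before composing.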
IV-C Data Format and Post-Processing
Data were collected in the ROS environment. We provide both ROS bags and individual data files for better usage:
- env.bag is the raw rosbag obtained from the data collection process. It can be parsed using ROS tools.
- env_ref.bag is the refined rosbag in which sensor data are post-processed with the steps described below.
- data/ stores individual sensor data extracted from env.bag. Each data file has a timestamp that can be retrieved from timestamps.txt.
- data_ref_kitti/ follows the KITTI format [geiger2013vision] to store sensor data from data/.
We post-process the raw data in three steps to generate env_ref.bag: 1) missing measurements caused by imperfect IMUs (such as the MPU) are linearly interpolated; 2) poses provided by the motion capture system are transformed into the body frame with the hand-eye calibration results; and 3) event packets are republished at around Hz for several event-based algorithms [zhou2021event].
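The first post-processing step can be sketched as follows. This is a minimal illustration of linear interpolation over dropped IMU samples, assuming samples are (timestamp, values) tuples and that the nominal sampling period is known; the function name and data layout are hypothetical and do not describe the released tooling.

```python
def fill_missing_samples(samples, period):
    """Insert linearly interpolated samples wherever measurements were dropped.

    samples: list of (timestamp, values) sorted by time, where values is a
             tuple such as (gx, gy, gz, ax, ay, az).
    period:  nominal sampling period, in the same unit as the timestamps.
    """
    if not samples:
        return []
    filled = [samples[0]]
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        gap = t1 - t0
        # Number of samples that should have arrived inside this gap.
        n_missing = round(gap / period) - 1
        for i in range(1, n_missing + 1):
            alpha = i / (n_missing + 1)
            t = t0 + alpha * gap
            v = tuple(a + alpha * (b - a) for a, b in zip(v0, v1))
            filled.append((t, v))
        filled.append((t1, v1))
    return filled
```

Gaps that are close to one period produce no extra samples, so regularly sampled data pass through unchanged.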
Unrectified RGB images are stored. Events are stored with timestamps, pixel locations, and polarity. IMU measurements are also stored with timestamps, gyroscope measurements, accelerometer measurements, and covariances. Calibration parameters are stored in yaml files.
V Experiment
As one of the applications, we can use this dataset to benchmark SOTA SLAM systems. Here, we evaluate several open-source systems with different sensor combinations and methodologies: VINS-Fusion (IMU+stereo frame cameras) [qin2019general], ESVO (stereo event cameras) [zhou2021event], A-LOAM (LiDAR-only) [zhang2014loam], LIO-Mapping (IMU+LiDAR) [ye2019tightly], LIO-SAM (IMU+LiDAR) [shan2020lio], and FAST-LIO2 (IMU+LiDAR) [xu2022fast]. Their data loaders are modified to fit our dataset format and also released. We calculate the mean absolute trajectory error (ATE) of estimated trajectories w.r.t. the ground truth. For LiDAR-based systems, we also report the mapping accuracy on two sequences by calculating the mean point-to-point error of algorithms’ maps w.r.t. the ground-truth maps.
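The two metrics can be sketched in a few lines. This is a simplified illustration under explicit assumptions: the estimated and ground-truth trajectories are already time-associated and expressed in the same frame (the alignment step that evaluation tools usually perform is omitted), and the map error uses a brute-force nearest-neighbor search. The function names are hypothetical.

```python
import math


def mean_ate(estimated_positions, ground_truth_positions):
    """Mean translational error between matched, pre-aligned 3D positions."""
    if not estimated_positions or len(estimated_positions) != len(ground_truth_positions):
        raise ValueError("trajectories must be non-empty and matched one-to-one")
    errors = [math.dist(p, q)
              for p, q in zip(estimated_positions, ground_truth_positions)]
    return sum(errors) / len(errors)


def mean_point_to_point_error(estimated_map, reference_map):
    """Mean distance from each estimated map point to its nearest reference point.

    Brute force, O(N * M); a k-d tree would be used for real point clouds.
    """
    if not estimated_map or not reference_map:
        raise ValueError("both point sets must be non-empty")
    nearest = [min(math.dist(p, q) for q in reference_map) for p in estimated_map]
    return sum(nearest) / len(nearest)
```

Both functions return values in the unit of the input coordinates, so trajectories and maps in meters give errors in meters.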
Platform | Sequence |
---|---|
Handheld | canteen_night |
Handheld | canteen_day |
Handheld | garden_night |
Handheld | garden_day |
Handheld | corridor_day |
Handheld | escalator_day |
Handheld | building_day |
Handheld | MCR_slow |
Handheld | MCR_normal |
Handheld | MCR_fast |
Quad. Robot | MCR_slow_ |
Quad. Robot | MCR_slow_ |
Quad. Robot | MCR_normal_ |
Quad. Robot | MCR_normal_ |
Quad. Robot | MCR_fast_ |
Quad. Robot | MCR_fast_ |
Apollo | campus_road_day |
The quantitative localization results are reported in Table IV. “LC” indicates that the loop closure module is used. “” means that algorithms fail to finish the sequence. ESVO’s results are not shown here since it cannot finish all sequences. It requires events to be continuously triggered to generate reliable time surface maps for camera tracking, but all these sequences contain textureless scenes or static motion. Its intermediate results on mapping and tracking are shown on the dataset website. VINS-Fusion and FAST-LIO2 fail in some cases since they cannot initialize well at the beginning of the sequence. Without the aid of the IMU, A-LOAM cannot handle jerky and rapid motion and thus performs poorly on two MCR sequences and all sequences on the quadruped robot. Although FAST-LIO2 has superior real-time performance thanks to its filter-based state estimator and efficient tree structure, it produces unreliable results on several sequences. Surprisingly, LIO-SAM performs well on all quadruped-robot sequences, even under large rotations and fast motion. The corridor_day sequence, where the scene is textureless and structureless, is challenging for all methods.
We also evaluate the mapping quality of A-LOAM and LIO-SAM on the corridor_day and garden_day sequences. The distance map is in Fig. 6. The mean distance is and respectively. Especially for the corridor mapping, A-LOAM’s map has a large drift on the -axis.
VI Conclusion
This paper presented the FusionPortable benchmark, a multi-sensor dataset from diverse campus scenes on various platforms. We presented a self-contained, plug-and-play multi-sensor rig that significantly enhances the perception capability of mobile robots. With the release of this dataset, we intend to challenge current SLAM approaches and encourage future research. As future work, we plan to extend this dataset beyond campus-scale environments.