FusionPortable: A Multi-Sensor Campus-Scene Dataset for Evaluation of Localization and Mapping Accuracy on Diverse Platforms

Jianhao Jiao^{2,4*}, Hexiang Wei^{2*}, Tianshuai Hu^{2*}, Xiangcheng Hu^{2*}, Yilong Zhu^{2}, Zhijian He^{2}, Jin Wu^{2}, Jingwen Yu^{2,5}, Xupeng Xie^{2}, Huaiyang Huang^{2}, Ruoyu Geng^{2}, Lujia Wang^{2,4}, Ming Liu^{1,2,3}

^{*} Equal contribution.

^{1} The Hong Kong University of Science and Technology (Guangzhou), Nansha, Guangzhou, 511400, Guangdong, China.

^{2} The Hong Kong University of Science and Technology, Hong Kong, China. jjiao@connect.ust.hk, eelium@ust.hk.

^{3} HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, Futian, Shenzhen, China.

^{4} Clear Water Bay Institute of Autonomous Driving, Hong Kong, China.

^{5} Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China.

This work was supported by Zhongshan Science and Technology Bureau Fund, under project 2020AG002, Foshan-HKUST Project no. FSUST20-SHCIRI06C, and the Project of Hetao Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone (HZQB-KCZYB-2020083), awarded to Prof. Ming Liu.

Abstract

Combining multiple sensors enables a robot to maximize its perceptual awareness of environments and enhance its robustness to external disturbance, which is crucial to robotic navigation. This paper proposes the FusionPortable benchmark, a complete multi-sensor dataset with a diverse set of sequences for mobile robots, and makes three contributions. First, we present a portable and versatile multi-sensor suite that offers rich sensory measurements: 10 Hz LiDAR point clouds, 20 Hz stereo frame images, high-rate and asynchronous events from stereo event cameras, 200 Hz inertial readings from an IMU, and a 10 Hz GPS signal. The sensors are temporally synchronized in hardware. The device is lightweight, self-contained, and offers plug-and-play support for mobile robots. Second, we construct a dataset of 17 sequences that cover a variety of environments on the campus, collected with multiple robot platforms. Some sequences are challenging for existing SLAM algorithms. Third, we provide ground truth for decoupled evaluation of localization and mapping performance. We additionally evaluate state-of-the-art SLAM approaches and identify their limitations. The dataset, consisting of raw sensor measurements, ground truth, calibration data, and evaluated algorithms, will be released.

I Introduction

I-A Motivation

Multi-sensor fusion for robust perception is fundamental to various robotic applications. Different sensors complement each other, so sensor fusion enhances a system’s perception capability. Over the past decades, research on multi-sensor SLAM has made substantial progress. High-quality open datasets, which collect multi-sensor data and provide a suite of benchmark tools, have contributed significantly to this advancement. On one hand, these datasets waive prohibitive requirements on budget and workforce, such as system integration, calibration, and field operations. On the other hand, they expose the advantages and limitations of current SLAM solutions through carefully designed, practical but challenging sequences [pomerleau2012challenging, wang2020tartanair]. Several of them also introduce novel sensors and indicate future research opportunities [mueggler2017event]. Researchers can easily develop, validate, and rank their algorithms against others, thus accelerating breakthroughs. However, existing datasets were mostly collected with a single data collection platform or a simplified sensor configuration. Researchers may only be able to use a limited set of sensors, and the resulting algorithms risk over-fitting to a benchmark. Hence, we consider that a desirable dataset should fulfill the following four requirements.

Various sensors are required, making it possible to explore novel approaches to utilize them jointly.
Algorithm evaluation should be conducted fairly on various mobile robots. These robots exhibit different motion patterns that may challenge the assumptions of several SLAM algorithms.
Sequences have to cover from room-scale (meter-level) to large-scale (kilometer-level) environments to evaluate algorithms’ scalability.
Ground-truth trajectories and 3D maps are required to evaluate algorithms’ localization and surface reconstruction accuracy, respectively.

Mocap: Motion capture system. LT: Laser tracker.
Dataset	Platform	Environment	IMU	GPS	LiDAR	Frame Cam.	Event Cam.	GT Pose	GT Map
UZH-Event [mueggler2017event]	Handheld	In/Outdoors	✓				✓	Mocap
ETH-EuRoc [burri2016euroc]	MAV	Indoors	✓			✓		Mocap/LT	Nova MS50
TUM VI [schubert2018tum]	Handheld	In/Outdoors	✓			✓		Mocap
MIT DARPA [huang2010high]	Car	Urban	✓	✓	✓	✓		GPS/INS
KITTI [geiger2013vision]	Car	Urban	✓	✓	✓	✓		RTK-GPS/INS
Oxford RobotCar [maddern20171]	Car	Urban	✓	✓	✓	✓		GPS/INS/SLAM
UrbanLoco [wen2020urbanloco]	Car	Urban	✓	✓	✓	✓		GPS/INS
Newer College [ramezani2020newer]	Handheld	Outdoors	✓		✓	✓		6DoF ICP	BLK360
NCLT [carlevaris2016university]	UGV	In/Outdoors	✓	✓	✓	✓		RTK-GPS/SLAM
M2DGR [yin2021m2dgr]	UGV	In/Outdoors	✓	✓	✓	✓	✓	RTK-GPS/Mocap/LT
MVSEC [zhu2018multivehicle]	Handheld/UAV/Motorcycle/Car	In/Outdoors	✓	✓	✓	✓	✓	Mocap/SLAM
Ours (FusionPortable)	Handheld/Quad. Robot/UGV	In/Outdoors	✓	✓	✓	✓	✓	Mocap/RTK-GPS/6DoF NDT	BLK360

TABLE I: Comparison with previous datasets on data-acquisition platform, environment, sensor type, and ground-truth method.

I-B Contributions

There appears to be an absence of compatible public datasets that satisfy these requirements, motivating us to propose a new SLAM benchmark.

This paper proposes the FusionPortable benchmark, a novel multi-sensor dataset with a set of sequences from diverse environments. Our contributions are threefold. First, we build a portable and versatile multi-sensor device. Two RGB frame cameras are mounted on the left and right sides, one high-frequency and high-precision IMU is mounted internally, and one RTK-GPS is installed on top. Moreover, thanks to recent progress in sensor technology, both novel event cameras and high-resolution 3D LiDAR are available, so we also integrate them into our sensor rig and investigate their performance. All sensors are mounted on the same rigid aluminum-alloy parts, so their spatial relation deviates only slightly under motion. The complete device has its own clock synchronization unit, processor, and battery, and is thus self-contained. Given its size, weight, and extensibility (see Fig. 1), we expect it to offer plug-and-play support for various mobile robots.

Second, we install the sensor rig on various platforms, namely a handheld mode with a gimbal stabilizer, a quadruped robot, and an autonomous vehicle, which perform distinct motions during dataset construction. Various structured or semi-structured environments on The Hong Kong University of Science and Technology (HKUST) campus, including the lab, garden, canteen, corridor, escalator, and outdoor road, are covered in the dataset. The collected sequences also present several environmental changes caused by external light, moving objects, and scene texture. These issues are challenging for SLAM algorithms.

Third, besides ground-truth poses, we also provide ground-truth maps of most indoor sequences. We consider that measuring the mapping accuracy is crucial for evaluation. We also benchmark several state-of-the-art (SOTA) SLAM systems, including two vision-based methods and four LiDAR-based approaches. To benefit the community, the dataset will be publicly released: https://ram-lab.com/file/site/multi-sensor-dataset.

II Related Work

There are extensive datasets for robotic perception. Here, we introduce related works with a focus on SLAM.

Several datasets were specifically designed for one type of sensor. Mueggler et al. [mueggler2017event] proposed the event camera dataset to overcome the illumination and motion blur issues that affect frame cameras. Pomerleau et al. [pomerleau2012challenging] proposed the point cloud dataset that covers a large spectrum of environmental structures to challenge registration algorithms. Handa et al. [handa2014benchmark] promoted research on RGB-D cameras by publishing the ICL-NUIM dataset.

By complementing vision sensors with inertial measurements, visual-inertial odometry (VIO) approaches can tremendously improve camera tracking accuracy and robustness. Relevant datasets have been reported. Burri et al. [burri2016euroc] presented the EuRoc dataset collected by a micro aerial vehicle (MAV) in an industrial environment and a room. Schubert et al. [schubert2018tum] put forward the TUM VI benchmark by collecting handheld sequences with careful photometric calibration.

The DARPA challenge has driven the development of autonomous vehicles. Huang et al. [huang2010high] presented the MIT DARPA dataset with over $90\,\mathrm{km}$ of sequences. Geiger et al. [geiger2013vision] presented the KITTI driving benchmark, where diverse perception tasks are explored. Other datasets target long-term navigation [barnes2020oxford] and urban challenges [wen2020urbanloco].

Fig. 1: The multi-sensor device and data collection platform: (a) CAD model of the sensor rig, where axis directions are colored: red: $X$, green: $Y$, blue: $Z$. The sensor rig is rigidly mounted on (b) a gimbal stabilizer, (c) a quadruped robot, and (d) an Apollo autonomous vehicle.

Several datasets were collected by handheld devices and other types of ground robots. Ramezani et al. [ramezani2020newer] collected the Newer College Dataset with a handheld device. The NCLT dataset [carlevaris2016university] facilitated long-term SLAM research by collecting sequences on a college campus, covering $147.4\,\mathrm{km}$ of traverse over $15$ months. The M2DGR dataset [yin2021m2dgr] covers various challenging scenarios, such as entering lifts and indoor-outdoor traverse, with a ground robot. Zhu et al. [zhu2018multivehicle] proposed a multi-vehicle dataset for event-based perception.

Table I compares existing datasets with our work. In summary, our dataset is more complete in three aspects: 1) raw and rich sensory measurements; 2) data collection on three different platforms including a legged robot; 3) ground-truth trajectories and 3D maps for algorithm evaluation.

III System Overview

This section introduces the sensors used in our dataset and how we achieve spatio-temporal calibration among them. Fig. 1 shows the handheld device equipped with multiple sensors and how it is mounted on the three data collection platforms.

III-A Sensor Configuration

Sensors’ characteristics can be found in Table II. We use an Intel NUC to run the sensor drivers, timestamp sensor messages, and record the messages into ROS bags on Ubuntu. The PC has an i7 processor, a $1$ TB solid-state drive (SSD), and $64$ GB of DDR4 memory. Below, we describe these sensors in detail.

III-A1 3D LiDAR

We use the OS1-128 LiDAR to provide accurate measurements of the surrounding environment. This LiDAR has two attractive properties. First, an internal synchronized IMU outputs linear accelerations and angular velocities at $100$ Hz. Second, it additionally outputs depth images, signal images, and ambient images of the surroundings.

III-A2 Stereo Frame Cameras

Two FLIR BFS-U3-31S4C global-shutter color cameras are mounted on the two sides of the system, facing directly forward. They are synchronized by an external trigger and capture high-resolution images at $20$ fps. Their exposure times are fixed to minimize the relative latency. Our experiments show that the average timestamp difference between the two images is below $1\,\mathrm{ms}$.

III-A3 Stereo Event Cameras

Two event cameras are also configured. They possess several desirable properties: high temporal resolution, high dynamic range, and low power consumption. The cameras have a $346 \times 260$ resolution and an internal high-rate IMU output. The event cameras are synchronized by a trigger signal: the left camera (master) delivers sync pulses to the right camera (slave) through an external wire. However, the image acquisition cannot be synchronized, which leaves an offset of around $10$-$20\,\mathrm{ms}$. To suppress the LiDAR’s laser light, both cameras are equipped with additional infrared filters. For indoor sequences, we manually set and fix the APS exposures, which helps to minimize the latency between cameras. For outdoor sequences, we use auto-exposure to avoid over- or under-exposure.

Fig. 2: Scene images of places of several sequences. (a) Garden.

III-A4 Inertial Measurement Unit

A tactical-grade STIM300 IMU, rigidly mounted below the LiDAR, serves as the main inertial sensor of the system. It features a high update rate ($200$ Hz) and low-noise, low-drift measurements. Its bias instability is around $0.3^{\circ}/\mathrm{h}$.

III-A5 Global Positioning System

We additionally install a ZED-F9P RTK-GPS device on top of the LiDAR. In outdoor scenes, the GPS is activated and provides accurate latitude, longitude, and altitude readings, but it may become unstable when buildings occlude the signal.

III-B Sensor Calibration

We carefully calibrate the intrinsics of individual sensors, the extrinsics, and the overall time latency between sensors in advance. We define the coordinate system of the STIM300 IMU as the body frame. Calibration data and reports are provided on the dataset website.

III-B1 Clock Synchronization

We use an FPGA to generate an external trigger signal that synchronizes the clocks of all sensors. This keeps the latency of data collection across sensors to a minimum. The FPGA receives a pulse-per-second (PPS) signal from the GPS and outputs $200$ Hz, $20$ Hz, and $10$ Hz signals to the IMU, cameras, and LiDAR, respectively. In GPS-denied scenes, the FPGA switches to its internal clock to maintain time synchronization.
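As a rough illustration of why triggers derived from one PPS edge keep the sensors aligned, the following minimal sketch generates trigger times at the three stated rates and checks that every LiDAR trigger coincides with a camera trigger and an IMU trigger. It does not model the actual FPGA logic; the function names and the integer `pps_second` input are hypothetical.

```python
from fractions import Fraction

# Rates stated in the text: IMU 200 Hz, cameras 20 Hz, LiDAR 10 Hz.
RATES_HZ = {"imu": 200, "camera": 20, "lidar": 10}


def trigger_times(pps_second, rate_hz, n_seconds=1):
    """Exact trigger times (Fractions, in seconds) counted from a PPS edge.

    pps_second: integer second marked by the PPS pulse (hypothetical input).
    rate_hz: integer trigger rate of the sensor.
    """
    period = Fraction(1, rate_hz)
    return [pps_second + k * period for k in range(rate_hz * n_seconds)]


def coincident(a, b):
    """True if every trigger time in `a` also occurs in `b`."""
    return set(a).issubset(b)
```

Because 200 and 20 are integer multiples of 10, the slower trigger trains are subsets of the faster ones, so each LiDAR scan start has a camera frame and an IMU sample stamped at the same instant.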

Sensor	Characteristics
3D LiDAR	OS1-128, $120$ m range @ $10$ Hz; FOV: $45^{\circ}$ vert., $360^{\circ}$ horiz.
	Image: $1028 \times 128$ @ $10$ Hz
	IMU: ICM20948 @ $100$ Hz, $9$-axis MEMS, intrinsic calibrated
Frame Camera	Stereo color cameras: $2 \times$ FLIR BFS-U3-31S4C
	Resolution: $1024 \times 768$, global shutter @ $20$ fps; FOV: $66.5^{\circ}$ vert., $82.9^{\circ}$ horiz.
Event Camera	Stereo color event cameras: $2 \times$ DAVIS346
	Resolution: $346 \times 260$; FOV: $67^{\circ}$ vert., $83^{\circ}$ horiz.
	IMU: MPU6150 @ $1000$ Hz, $6$-axis MEMS, intrinsic calibrated
IMU	STIM300 @ $200$ Hz, Bias Instability $0.3^{\circ}/\mathrm{h}$, Allan Var. @ $25^{\circ}$C
GPS	ZED-F9P RTK-GPS @ $10$ Hz, $4$ concurrent GNSS, L1/L2/L5 RTK

TABLE II: Sensors and characteristics

III-B2 Stereo Camera Calibration

Intrinsics and extrinsics of our stereo frame and event cameras are estimated using the Matlab calibration toolbox with the pinhole camera model and the radial-tangential distortion model. We move the sensor suite in front of a checkerboard to collect a sequence of images. We evenly sample images as the calibration data and manually remove outliers with high reprojection errors.

III-B3 Camera-IMU Extrinsic Calibration

The intrinsics of the IMUs are calibrated using the Allan variance toolbox (https://github.com/ori-drs/allan_variance_ros), which estimates the noise density and random walk of gyroscope and accelerometer measurements. After that, the spatial and temporal parameters of a camera w.r.t. an IMU are obtained with Kalibr [furgale2013unified]. Our system contains $4$ IMUs: the STIM300, the ICM20948 in the LiDAR, and two MPU6050 units in the DAVIS346 event cameras. Thus, we calibrate the intrinsics of these IMUs and estimate the extrinsics of these sensor pairs: $\langle$STIM300, frame cameras$\rangle$, $\langle$STIM300, event cameras$\rangle$, $\langle$left MPU6050, left DAVIS346$\rangle$, and $\langle$right MPU6050, right DAVIS346$\rangle$.
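For readers unfamiliar with the statistic behind such toolboxes, the sketch below computes a non-overlapping Allan deviation of a single IMU channel recorded at rest. It is an illustrative simplification, not the implementation of the cited toolbox; the function name and arguments are assumptions.

```python
import numpy as np


def allan_deviation(samples, rate_hz, cluster_sizes):
    """Non-overlapping Allan deviation of a 1-D rate signal.

    Returns (taus, adevs), where each tau equals cluster_size / rate_hz.
    """
    samples = np.asarray(samples, dtype=float)
    taus, adevs = [], []
    for m in cluster_sizes:
        n_clusters = samples.size // m
        if n_clusters < 2:
            continue  # need at least two clusters to form one difference
        # Average the signal over consecutive windows of m samples.
        means = samples[: n_clusters * m].reshape(n_clusters, m).mean(axis=1)
        # Allan variance: half the mean squared difference of adjacent averages.
        avar = 0.5 * np.mean(np.diff(means) ** 2)
        taus.append(m / rate_hz)
        adevs.append(np.sqrt(avar))
    return np.array(taus), np.array(adevs)
```

On a log-log plot of deviation against averaging time, white measurement noise appears as a slope of $-1/2$, and the noise density is commonly read off that segment at $\tau = 1$ s, while a random-walk component shows up as a slope of $+1/2$ at longer averaging times.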

III-B4 Camera-LiDAR Extrinsic Calibration

Given initial extrinsics, we further refine the camera-LiDAR extrinsics. A checkerboard serves as the calibration target, providing distinctive corners and boundaries for data association. We extend the work of Zhou et al. [zhou2018automatic] by improving the feature extraction and matching steps: we instead extract the outer corners of the board from both point clouds and images. The extrinsics are optimized by minimizing the distance between all corresponding corners.
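To make the idea of minimizing corner distances concrete, here is a minimal sketch that fits a rigid transform to corresponding 3D corners in closed form (the Kabsch method). It assumes the camera-side corners are already available as 3D points in the camera frame, for example recovered from the known board geometry; that assumption, the function names, and the closed-form solver are illustrative choices and may differ from the optimization actually used for the dataset.

```python
import numpy as np


def fit_rigid_transform(pts_lidar, pts_cam):
    """Least-squares R, t such that pts_cam ~ R @ pts_lidar + t (Kabsch method).

    pts_lidar, pts_cam: (N, 3) arrays of corresponding 3D corners, N >= 3,
    not all collinear.
    """
    p = np.asarray(pts_lidar, dtype=float)
    q = np.asarray(pts_cam, dtype=float)
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    # Cross-covariance of the centered point sets.
    h = (p - p_mean).T @ (q - q_mean)
    u, _, vt = np.linalg.svd(h)
    # Guard against a reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    trans = q_mean - rot @ p_mean
    return rot, trans


def mean_corner_distance(rot, trans, pts_lidar, pts_cam):
    """Mean Euclidean distance between mapped LiDAR corners and camera corners."""
    mapped = np.asarray(pts_lidar, dtype=float) @ rot.T + trans
    diff = mapped - np.asarray(pts_cam, dtype=float)
    return float(np.linalg.norm(diff, axis=1).mean())
```

The residual returned by `mean_corner_distance` is a simple sanity check: corners collected from several board placements should all map close to their camera-side counterparts under one transform.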

IV Dataset Description

Fig. 3: Sample sensor measurements. (a)-(d): images captured by the frame camera. (e)-(f): images augmented by positive events (red) and negative events (blue). (h)-(i): 3D point clouds of the LiDAR. (a) Canteen.

This section first introduces the overall features of different sequences, which stand as our basic criteria for data collection. Details are then described, including the ground truth estimation method and dataset format.

IV-A Sequences

The collected sequences should cover various environments, lighting conditions, motion patterns, dynamic objects, etc. We categorize major characteristics of our collected sequences as follows:

Location: Environmental locations are divided into indoors and outdoors. GPS signal is available but sometimes unstable in outdoor environments.
Structure: Structured environments can mainly be described using geometric primitives (e.g., offices or buildings), while semi-structured environments contain both geometric and complex elements like trees and sundries. Scenarios like narrow corridors are structured but may cause state estimators to degenerate.
Lighting Condition: Frame cameras are sensitive to external lighting conditions. Both weak and strong light may pose challenges for visual processing algorithms.
Appearance: Texture-rich scenes help visual algorithms extract stable features (e.g., points and lines), while textureless scenes may degrade their performance. Also, many events are triggered in texture-rich scenes.
Motion Pattern: Slow, normal, and fast motion may be performed. Regarding mounting platforms, the handheld device performs arbitrary 6-DoF and jerky motions, the device installed on a gimbal stabilizer performs 6-DoF but stable motions, and the quadruped robot mostly performs planar but jerky motions. In contrast, the vehicle performs planar movements at a constant speed.
Object Motion: In dynamic environments, several elements are moving while the data are captured. The longer the data capture takes, the more deformed these elements (e.g., pedestrians or cars) will appear [pomerleau2012challenging]. In contrast, static environments contain few moving objects.

Platform	Sequence	T [s]	D [m]	$\|\bar{v}\|$ [m/s]	Location	Structure	Lighting	Texture	Motion	Object	GT Pose	GT Map
Handheld	canteen_night	290	270	0.93	indoors	structured	weak	rich	6-DoF	static	6-DoF NDT	Yes
	canteen_day	230	250	1.09	indoors	structured	normal	rich	6-DoF	static	6-DoF NDT	Yes
	garden_night	280	265	0.94	indoors	structured	weak	rich	6-DoF	static	6-DoF NDT	Yes
	garden_day	170	173	1.02	indoors	structured	normal	rich	6-DoF	static	6-DoF NDT	Yes
	corridor_day	572	669	1.17	indoors	structured	weak	less	6-DoF	static	6-DoF NDT	Yes
	escalator_day	315	263	0.84	indoors	structured	strong	rich	6-DoF, height changes	dynamic	6-DoF NDT	Yes
	building_day	599	666	1.11	indoors	structured	normal	rich	6-DoF	dynamic	6-DoF NDT	Yes
	MCR_slow	48	50	1.03	indoors	semi-structured	normal	rich	6-DoF, jerky	static	OptiTrack	Yes
	MCR_normal	45	52	1.26	indoors	semi-structured	normal	rich	6-DoF, jerky	static	OptiTrack	Yes
	MCR_fast	34	59	1.76	indoors	semi-structured	normal	rich	6-DoF, jerky	static	OptiTrack	Yes
Quadruped Robot	MCR_slow_00	147	26	0.18	indoors	semi-structured	normal	rich	planar, jerky	static	OptiTrack	Yes
	MCR_slow_01	127	28	0.28	indoors	semi-structured	normal	rich	planar, jerky	static	OptiTrack	Yes
	MCR_normal_00	103	48	0.54	indoors	semi-structured	normal	rich	planar, jerky	static	OptiTrack	Yes
	MCR_normal_01	95	43	0.52	indoors	semi-structured	normal	rich	planar, jerky	static	OptiTrack	Yes
	MCR_fast_00	99	48	0.56	indoors	semi-structured	normal	rich	planar, jerky	static	OptiTrack	Yes
	MCR_fast_01	121	90	0.83	indoors	semi-structured	normal	rich	planar, jerky	static	OptiTrack	Yes
Apollo	campus_road	1186	1887	1.62	outdoors	semi-structured	normal	rich	planar	dynamic	SLAM	

T: Total time. D: Total distance traveled. MCR: motion capture room. $\|\bar{v}\|$: Mean linear velocity.

TABLE III: Some statistics and features of each sequence

Fig. 4: Ground-truth point cloud in color of the motion capture room, corridor, and building scenarios. Point cloud data was recorded by the Leica BLK360. (a) Motion Capture Room.

Table III summarizes key features of each sequence, Fig. 2 shows several scene pictures, and Fig. 3 illustrates sample sensor data. The motion capture room is abbreviated as MCR in the following sections.

IV-B Groundtruth Generation

Most sequences provide ground-truth poses for algorithm evaluation. In several indoor scenes, we also provide ground-truth maps of surrounding environments. The ground truth generation is detailed as follows:

Ground-truth maps: In small- or middle-scale environments, we use the Leica BLK360 laser scanner to record the structure’s high-resolution colorized 3D dense map with millimeter accuracy from multiple locations. Fig. 4 visualizes three examples.
Ground-truth poses: In the motion capture room, we use the OptiTrack system to measure the pose of the center of the reflective balls at $120$ Hz with millimeter accuracy. The OptiTrack is connected directly to the same PC to record poses, which minimizes the time latency. The extrinsics from the balls’ center to the body frame of the sensor rig are solved by hand-eye calibration. In middle-scale environments covered by the ground-truth maps, we employ NDT-based 6-DoF localization [koide2019portable] to estimate the LiDAR’s poses in the prior map and use them as the ground-truth trajectory. In outdoor environments, we fuse the RTK-GPS signal with LiDAR-inertial measurements using LIO-SAM [shan2020lio] to obtain accurate trajectories.
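The role of the hand-eye extrinsics can be sketched as a single pose composition: a motion capture pose of the marker center is turned into a body-frame pose by right-multiplying with the fixed marker-to-body transform. The frame names and helper functions below are illustrative assumptions, not identifiers from the dataset tools.

```python
import numpy as np


def make_pose(rot, trans):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rot
    pose[:3, 3] = trans
    return pose


def marker_to_body_pose(T_world_marker, T_marker_body):
    """Pose of the body frame in the world frame.

    T_world_marker: pose of the marker center measured by motion capture.
    T_marker_body: fixed extrinsics from hand-eye calibration (assumed given).
    """
    return T_world_marker @ T_marker_body
```

Since `T_marker_body` is constant for a rigid rig, the same matrix is applied to every motion capture sample of a sequence.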

IV-C Data Format and Post-Processing

Data were collected in the ROS environment. We provide both ROS bags and individual data files for better usage:

env.bag is the raw rosbag obtained from the data collection process. It can be parsed using ROS tools.
env_ref.bag is the refined rosbag in which sensor data are post-processed with the steps below.
data/ stores individual sensor data extracted from env.bag. Each data item has a timestamp that can be retrieved from timestamps.txt.
data_ref_kitti/ follows the KITTI format [geiger2013vision] to store sensor data from data/.

We post-process the raw data in three steps to generate env_ref.bag: 1) measurements missing because of imperfect IMUs (such as the MPU6050) are linearly interpolated; 2) poses provided by the motion capture system are transformed into the body frame using the hand-eye calibration results; and 3) event packets are republished at around $1000$ Hz for several event-based algorithms [zhou2021event].
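Step 1 can be illustrated with a minimal sketch that resamples an IMU stream with dropped samples onto a uniform time grid using per-channel linear interpolation. The function name, array layout, and the choice of a uniform grid are assumptions made for illustration; the released post-processing scripts may differ.

```python
import numpy as np


def fill_missing_imu(stamps, values, rate_hz):
    """Resample IMU readings onto a uniform grid, linearly interpolating gaps.

    stamps: (N,) increasing timestamps in seconds, possibly with dropped samples.
    values: (N, C) readings (e.g., 3 gyroscope + 3 accelerometer channels).
    Returns (grid, filled) with filled.shape == (len(grid), C).
    """
    stamps = np.asarray(stamps, dtype=float)
    values = np.asarray(values, dtype=float)
    n = int(round((stamps[-1] - stamps[0]) * rate_hz)) + 1
    grid = stamps[0] + np.arange(n) / rate_hz
    # np.interp handles one channel at a time.
    filled = np.column_stack(
        [np.interp(grid, stamps, values[:, c]) for c in range(values.shape[1])]
    )
    return grid, filled
```

Linear interpolation is reasonable for short gaps in angular-rate and acceleration signals; long gaps would be better flagged than filled.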

Unrectified RGB images are stored. Events are stored with timestamps, pixel locations, and polarity. IMU measurements are also stored with timestamps, gyroscope measurements, accelerometer measurements, and covariances. Calibration parameters are stored in yaml files.
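Given events stored as timestamps, pixel locations, and polarity, a common first step is to accumulate them into a per-pixel count image, similar in spirit to the event overlays in Fig. 3. The sketch below assumes polarity is encoded as 1 (positive) or 0 (negative); that encoding and the function name are assumptions and should be checked against the released files.

```python
import numpy as np


def accumulate_events(xs, ys, polarities, width, height):
    """Count positive and negative events per pixel.

    Returns an array of shape (height, width, 2): channel 0 holds positive
    counts and channel 1 holds negative counts.
    """
    counts = np.zeros((height, width, 2), dtype=np.int32)
    xs = np.asarray(xs, dtype=int)
    ys = np.asarray(ys, dtype=int)
    # Assumed encoding: truthy polarity means a positive event.
    channel = np.where(np.asarray(polarities, dtype=bool), 0, 1)
    # np.add.at accumulates correctly when the same pixel repeats.
    np.add.at(counts, (ys, xs, channel), 1)
    return counts
```

For the event cameras described above, `width=346` and `height=260` match the stated sensor resolution.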

V Experiment

As one of the applications, we can use this dataset to benchmark SOTA SLAM systems. Here, we evaluate several open-source systems with different sensor combinations and methodologies: VINS-Fusion (IMU+stereo frame cameras) [qin2019general], ESVO (stereo event cameras) [zhou2021event], A-LOAM (LiDAR-only) [zhang2014loam], LIO-Mapping (IMU+LiDAR) [ye2019tightly], LIO-SAM (IMU+LiDAR) [shan2020lio], and FAST-LIO2 (IMU+LiDAR) [xu2022fast]. Their data loaders are modified to fit our dataset format and also released. We calculate the mean absolute trajectory error (ATE) of estimated trajectories w.r.t. the ground truth. For LiDAR-based systems, we also report the mapping accuracy on two sequences by calculating the mean point-to-point error of algorithms’ maps w.r.t. the ground-truth maps.
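A mean ATE of this kind can be sketched as follows, assuming the estimated and ground-truth trajectories have already been expressed in a common frame (for instance after a rigid alignment). The sketch only covers timestamp association and the error average; the function name and the `max_dt` threshold are hypothetical and not taken from the released evaluation code.

```python
import numpy as np


def mean_ate(est_stamps, est_xyz, gt_stamps, gt_xyz, max_dt=0.02):
    """Mean translational error between time-associated positions.

    Each estimated position is matched to the ground-truth position with the
    nearest timestamp; pairs further apart than max_dt seconds are discarded.
    gt_stamps must be sorted and contain at least two entries.
    """
    est_stamps = np.asarray(est_stamps, dtype=float)
    gt_stamps = np.asarray(gt_stamps, dtype=float)
    est_xyz = np.asarray(est_xyz, dtype=float)
    gt_xyz = np.asarray(gt_xyz, dtype=float)
    # Index of the first ground-truth stamp >= each estimated stamp.
    right = np.clip(np.searchsorted(gt_stamps, est_stamps), 1, len(gt_stamps) - 1)
    left = right - 1
    use_left = (est_stamps - gt_stamps[left]) <= (gt_stamps[right] - est_stamps)
    nearest = np.where(use_left, left, right)
    keep = np.abs(gt_stamps[nearest] - est_stamps) <= max_dt
    if not np.any(keep):
        raise ValueError("no estimated pose has a ground-truth match within max_dt")
    errors = np.linalg.norm(est_xyz[keep] - gt_xyz[nearest[keep]], axis=1)
    return float(errors.mean())
```

Discarding pairs with a large time gap avoids comparing an estimate against a ground-truth sample from a noticeably different instant, which matters when ground truth comes from a lower-rate source.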

Fig. 5: Trajectories of the algorithms on four sequences: MCR_fast_00, campus_road_day, garden_day, and escalator_day w.r.t. the ground truth. (a) MCR_fast_00.

Platform	Sequence	VINS-Fusion (LC)	A-LOAM	LIO-Mapping	LIO-SAM	FAST-LIO2
Handheld	canteen_night	0.409	0.067	0.097	0.063	0.071
	canteen_day	0.691	0.057	0.088	0.053	0.057
	garden_night	0.328	0.567	0.242	0.254	0.205
	garden_day	0.518	0.528	0.097	0.069	0.068
	corridor_day	1.807	0.416	1.755	0.594	1.563
	escalator_day	2.127	0.981	0.346	0.207	4.193
	building_day	12.861	1.580	0.916	0.222	0.146
	MCR_slow	$\times$	0.087	0.042	0.063	0.114
	MCR_normal	0.168	0.328	0.052	0.082	0.121
	MCR_fast	$\times$	0.416	0.099	0.117	$\times$
Quad. Robot	MCR_slow_00	0.096	0.120	0.032	0.023	0.047
	MCR_slow_01	0.081	0.054	0.030	0.030	0.051
	MCR_normal_00	0.094	0.492	0.093	0.042	0.127
	MCR_normal_01	0.086	0.635	0.390	0.040	0.068
	MCR_fast_00	0.264	4.601	2.405	0.052	0.408
	MCR_fast_01	0.130	8.264	2.210	0.066	1.495
Apollo	campus_road_day	77.528	5.707	4.122	7.364	4.080

TABLE IV: Localization accuracy.

Fig. 6: Evaluation of (a) A-LOAM’s and (b) LIO-SAM’s mapping accuracy. (a) Corridor_day.

The quantitative localization results are reported in Table IV. “LC” indicates that the loop closure module is used, and “$\times$” means that the algorithm fails to finish the sequence. ESVO’s results are not shown here since it fails to complete the sequences. It requires events to be triggered continuously to generate reliable time surface maps for camera tracking, but all these sequences contain textureless scenarios or periods with little motion. Its intermediate results on mapping and tracking are shown on the dataset website. VINS-Fusion and FAST-LIO2 fail in some cases because they cannot initialize well at the beginning of the sequence. Without the aid of an IMU, A-LOAM cannot handle jerky and rapid motion and thus performs poorly on two MCR sequences and on all quadruped robot sequences. Although FAST-LIO2 offers superior real-time performance thanks to its filter-based state estimator and efficient tree structure, its results are unreliable on several sequences. Surprisingly, LIO-SAM performs well on all quadruped robot sequences, even under large rotations and fast motion. The corridor_day sequence, where the scene is textureless and structureless, is challenging for all methods.

We also evaluate the mapping quality of A-LOAM and LIO-SAM on the corridor_day and garden_day sequences. The distance maps are shown in Fig. 6. The mean distances are $0.938\,\mathrm{m}$ and $0.597\,\mathrm{m}$, respectively. For the corridor mapping in particular, A-LOAM’s map has a large drift along the $z$-axis.
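A mean point-to-point distance of this kind can be sketched as a nearest-neighbor query from each estimated map point to the ground-truth cloud. The brute-force version below assumes both clouds are already registered in the same frame and is only practical for small clouds; the function name and chunking are illustrative, and a k-d tree would normally replace the inner search.

```python
import numpy as np


def mean_point_to_point_error(map_pts, gt_pts, chunk=1024):
    """Mean distance from each estimated map point to its nearest ground-truth point.

    map_pts: (N, 3) estimated map points; gt_pts: (M, 3) ground-truth points.
    The search is brute force, processed in chunks to bound memory use.
    """
    map_pts = np.asarray(map_pts, dtype=float)
    gt_pts = np.asarray(gt_pts, dtype=float)
    nearest = np.empty(len(map_pts))
    for start in range(0, len(map_pts), chunk):
        block = map_pts[start:start + chunk]
        # Pairwise distances between this block and all ground-truth points.
        d = np.linalg.norm(block[:, None, :] - gt_pts[None, :, :], axis=2)
        nearest[start:start + chunk] = d.min(axis=1)
    return float(nearest.mean())
```

Keeping the per-point distances instead of only their mean would also allow rendering a colored distance map over the estimated cloud.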

VI Conclusion

This paper presented the FusionPortable benchmark, a multi-sensor dataset collected in diverse campus scenes on various platforms. We presented a self-contained, plug-and-play multi-sensor rig that significantly enhances the perception capability of mobile robots. With the release of this dataset, we intend to challenge current SLAM approaches and encourage future research. As future work, we plan to extend this dataset beyond campus-scale environments.
