Monocular Camera-based Complex Obstacle Avoidance via Efficient Deep Reinforcement Learning

Jianchuan Ding^$†$, Lingping Gao^$†$, Wenxi Liu, Haiyin Piao, Jia Pan, Zhenjun Du, Xin Yang^$*$, Baocai Yin J. Ding is with the School of Computer Science, Dalian University of Technology, Dalian 116024, China, and also with Hebei University of Water Resources and Electric Engineering, Cangzhou, 061016, China. (e-mail:djc_dlut@mail.dlut.edu.cn).L. Gao is with the School of Computer Science, Dalian University of Technology, Dalian 116024, China, and also with Alibaba Group, Hangzhou, 310000, China. (e-mail:gaolingping.glp@alibaba-inc).W. Liu is with the College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China (e-mail: wenxi.liu@hotmail.com).H. Piao is with Northwestern Polytechnical University, Xi’an 710072, China (e-mail: haiyinpiao@mail.nwpu.edu.cn).J. Pan is with the Department of Computer Science, The University of Hong Kong, Hong Kong, China (e-mail: jpan@cs.hku.hk).Z. Du is with SIASUN Robot & Automation Co., Ltd, Shenyang 110168, China (e-mail: duzhenjun@siasun.com).X. Yang and B. Yin are with Dalian University of Technology, Dalian 116024, China (e-mail: xinyang@dlut.edu.cn; ybc@dlut.edu.cn).^$†$ Equal contribution. ^$*$ Corresponding author.Copyright © 2022 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

Abstract

Deep reinforcement learning has achieved great success in laser-based collision avoidance works because the laser can sense accurate depth information without too much redundant data, which can maintain the robustness of the algorithm when it is migrated from the simulation environment to the real world. However, high-cost laser devices are not only difficult to deploy for a large scale of robots but also demonstrate unsatisfactory robustness towards the complex obstacles, including irregular obstacles, e.g., tables, chairs, and shelves, as well as complex ground and special materials. In this paper, we propose a novel monocular camera-based complex obstacle avoidance framework. Particularly, we innovatively transform the captured RGB images to pseudo-laser measurements for efficient deep reinforcement learning. Compared to the traditional laser measurement captured at a certain height that only contains one-dimensional distance information away from the neighboring obstacles, our proposed pseudo-laser measurement fuses the depth and semantic information of the captured RGB image, which makes our method effective for complex obstacles. We also design a feature extraction guidance module to weight the input pseudo-laser measurement, and the agent has more reasonable attention for the current state, which is conducive to improving the accuracy and efficiency of the obstacle avoidance policy. Besides, we adaptively add the synthesized noise to the laser measurement during the training stage to decrease the sim-to-real gap and increase the robustness of our model in the real environment. Finally, the experimental results show that our framework achieves state-of-the-art performance in several virtual and real-world scenarios.

Deep reinforcement learning, obstacle avoidance, robot vision, robot navigation.

I Introduction

Fig. 1: One-dimensional laser sensors have low robustness to the complex obstacles of certain types. As illustrated in the figure, they are (a) difficult to perceive the shape of irregular objects, (b) difficult to exploit the semantic information of complex ground scenes, and (c) also limited by some special materials. In addition, the influence of sunlight makes the depth of the RGB-D camera almost invalid.

Collision avoidance is a challenging research problem for robot navigation, which has been studied for decades and it can be applied in many important applications in real-world scenarios, e.g., unmanned driving [1] and freight logistics [38]. In recent years, deep reinforcement learning (DRL) based methods [42, 27, 6, 13] have been studied to address the collision avoidance problem in the robot system. Compared with deep learning-based methods [36, 8, 5, 57] the DRL-based methods can collect a large amount of data based on simulation without relying on manually-labeled data. Specifically, in virtual environments, the agent can be simulated to navigate, continuously interact with the environment in a ‘trial-error’ manner, and obtain the environmental feedback for learning. Therefore, in practice, their collision avoidance policies can be trained in simulation and then migrated to real robots.

As the major limitation of learning policy using simulation data, the large gap between simulation and real-world makes it difficult to directly migrate the learned policy from virtual agents to real robots. In order to reduce this gap, one feasible solution is to apply lasers to sense the surrounding environment and formulate the observation as the obstacle distance information of discretized orientations in the form of one-dimensional vector [27]. Such laser-based observation possesses the advantage of accuracy and simplicity, while it not only ignores the discrepancy of texture and color between the simulation environment and the real environment, but also contains less redundant information than image data, which benefits the training of DRL frameworks.

However, laser sensors notoriously have several shortcomings, i.e., one-dimensional laser observations cannot fully exploit the scene semantic information, and they are difficult to perceive the complex obstacles (as shown in Fig. 1, the irregular obstacles, the complex ground scenes, and special materials). Besides, its high price makes it unsuitable to deploy in large-scale industrial applications.

In this work, we aim to set up a collision avoidance framework for a real robot based on a monocular RGB camera using efficient DRL. Compared with laser sensors, RGB cameras are not only low-cost but also contain necessary semantic information for understanding the environment. However, the image data captured by RGB cameras contains some redundant information [4] (e.g., intensity, texture, etc.), which causes a large sim-to-real gap that makes DRL difficult to converge and reduces the robustness of robots in unseen scenes. Thus, to utilize the advantage of RGB cameras while reducing redundant information, we propose a novel pseudo-laser measurement generated from the single RGB image captured by a deployed monocular camera for the task of collision avoidance in the complex scene. Unlike [50] directly mapping the depth map to Pseudo-LiDAR to improve the accuracy of 3D bounding box prediction, our pseudo-laser measurement fuses the distance and semantic information of the complex objects and the scene, which thus can more effectively handle the complex obstacles than the laser-based measurement. On the other hand, this pseudo-laser measurement is represented by one-dimensional data, which can provide more concise input for DRL to narrow the sim-to-real gap and promote the effective convergence of DRL and the generalization of the model.

In particular, our approach for generating pseudo-laser measurement is composed of two stages: semantic masking and deep slicing. We first estimate the scene depth map based on the RGB image captured by a monocular camera. It is reasonable to assume that the ground or floor depth that occupies a larger area in the depth map has no substantive significance for obstacle avoidance. Previous researchers [6] usually used the method of projecting a depth map as a point cloud and then performing horizontal threshold-based cutting to remove the ground information. However, this method may fail in some complex ground scenes, e.g., water on the ground, clothes on the floor, and slopes (as shown in Fig. 1). Therefore, we exploit the semantic information of the scene to infer the traversable region, which locates these ‘special obstacles’ (e.g., clothes, water, glass, etc.) on the ground. We design the semantic mask to remove the distance information of the traversable region from the estimated depth map. In addition, we also extract the edge contours of obstacles and combine them with the later depth slicing module to improve the perception of complex obstacle shapes. In this way, the agent can fully exploit pixel-wise semantic information in the field of view (FOV), which is beneficial to make DRL efficient.

After culling out the traversable region from the depth map, we propose a depth slicing module to generate the pseudo-laser measurement. A naive way to accomplish pseudo-laser measurement is to slice the depth map along a horizontal axis. However, this approach ignores the shape of irregular obstacles. For instance, as shown in Fig. 1, if only slicing the feet of the table, the robot may collide with the upper part of it. Thus, we propose a dynamic local minimum pooling operation, which slices the depth information of each column from the semantic-depth map, performs minimum pooling for each column and composes them into a one-dimensional vector, i.e. the pseudo-laser measurement, in which each element of the vector represents the distance away from the nearest obstacle along the vertical axis.

With the pseudo-laser measurement, we adopt a DRL framework to sample the collision avoidance policy of the robot to efficiently avoid obstacles. In particular, to overcome the limitation of $90^{\circ}$ -FOV and enhance the performance of the policy, we introduce the LSTM [18] submodule to leverage the historical state of the agent to model its temporal actions. Then, we propose the Feature Extraction Guidance (FEG) module to improve the expressiveness of the pseudo-laser measurement. Generally, the state of the agent in DRL includes observation, goal, and velocity. There are two methods of state fusion as input to the DRL network: data-level fusion [6] and decision-level fusion [27]. However, they are not conducive to process the input information from different modalities, so they result in unreliable DRL policies. Thus, we design the FEG module to fully exploit the interrelationship of the input information across the channel domain and the spatial domain. Specifically, FEG weights the pseudo-laser measurement according to the destination, distance, and velocity data, then the weighted result (feature mask) is producted with the observation data. It can pay more attention to the pseudo-laser measurement that guides the current obstacle avoidance task thus the subsequent DRL framework is more purposeful. Due to the preprocessing of the input, the initial DRL stage is not so blind, which accelerates the network convergence.

Finally, to optimize the training process of DRL, we design several simulation environments with different levels of difficulty. The algorithm starts training from simple scenes, then we migrate the trained model to more complex scenes. In order to transfer our model from the simulation environment to the real world, we augment the observations collected from the simulation scenes by adding specially-designed noises similar to the real world to the training data, thereby narrowing the sim-to-real gap. We deploy 7 models in the Gazebo simulation environment, and the results show that our noise model has the highest success rate in complex environments. In addition, this model shows smooth motion and a high success rate in real scene experiments. Therefore, the data augmentation we adopted benefits the sim-to-real migration of our model.

The main contributions of our work are as follows:

We present a vision-based collision avoidance method based on deep reinforcement learning for robot navigation, which only relies on a single RGB monocular camera as the sensor.
To utilize the semantic information and share the advantage of laser sensors, we propose using RGB images to encode one-dimensional pseudo-laser measurement for collision avoidance, which greatly improves collision avoidance ability of robots in complex unknown scenes.
We design a feature extraction guidance module to improve the expressiveness of pseudo-laser measurement in the DRL process, thereby enhancing the ability of robots to understand and handle the obstacle avoidance task.
To accommodate our vision-based policy, we introduce a new data augmentation to enhance the robustness of the model and reduce the sim-to-real gap. Experiments show that the robot moves smoothly through the adaptive data augmentation.

Ii Related Work

Ii-a Laser-based navigation

Although the artificial potential fields [7], dynamic window approaches [11], these traditional navigation methods can achieve safe obstacle avoidance in simple scenarios. However, the map information needs to be known, and it takes a lot of time or even fails due to the sensitivity to hyperparameters and local minima in some complex dynamic scenes. For example, the advanced traditional method ORCA [47] needs to know the location information of other agents and static map information. Its generalization is poor in complex scenes due to tedious parameter adjustment.

In recent years, with the development of deep learning, [31, 8, 10, 5] and [12] have shown the effectiveness of learning-based navigation methods. Pfeiffer et al. [36] proposed a data-driven end-to-end motion planner. They used the data generated in the simulation environment through the ROS navigation package to train a model. It is able to learn the complex mapping from laser measurement and target positions to the required steering commands for the robot. This model can navigate the robot through a previously unseen environment and successfully react to sudden changes. Nonetheless, similar to the other supervised learning methods, the performance is seriously limited by the quality of the dataset. To solve the problem, Tai et al. [43] proposed a mapless motion planner based on DRL to map 10-dimensional range findings to actions. The robot can safely navigate through a priori unknown environment without collision in this way. However, due to the sparse laser and the simple training scene, the model does not perform well on dynamic obstacles. Long et al. [27] proposed a multi-scenario and multi-stage training framework based on the PPO [39] reinforcement learning method. By inputting three consecutive frames from a $180^{\circ}$ laser scanner, obstacle avoidance for dynamic objects is achieved, and it is extended to multi-robot applications. Fan et al. [9] integrated the learning algorithm of Long et al. [27] into the hybrid control framework to enable the agent to execute different decision models for different situations, further improving the robustness and effectiveness of the policy. On this basis, Choi et al. [6] proposed an LSTM agent with Local-Map Critic (LSTM-LMC) to reduce the FOV to $90^{\circ}$ and achieve the performance of $180^{\circ}$ . They used point cloud mapping and slicing to obtain laser data from the depth map. Due to the limitation of depth map slicing of point cloud mapping, the robot performs poorly on some complex ground (e.g., water surface, stairs, clothes, etc.) and irregular obstacles. Tang et al. [45] proposed a spiking deep deterministic policy gradient (SDDPG) framework to train a LIF-based spiking actor network (SAN) for mapless navigation. This method combines spike neural network(SNN) with DRL to realize a low-power obstacle avoidance framework based on laser sensors. However, the training scenarios limit their ability to adapt to dynamic obstacles, and the one-dimensional laser sensor makes it difficult to deal with complex obstacles. We continued the work of [27, 26] and reduced the FOV to $90^{\circ}$ in our way. And the robot has the ability to avoid obstacles to irregular objects through our pseudo-laser measurement.

Fig. 2: The pipeline of our entire framework. We fuse the distance and semantic information from the RGB image to generate pseudo-laser measurement and send them to the DRL network designed to obtain reasonable actions for navigating.

Ii-B Vision-based navigation

Due to laser scanners are expensive and can only capture limited information in the FOV. The camera can provide rich information about the operating environment of the robot and is low-cost, light-weight, and applicable for a wide range of platforms. There are a variety methods of vision-based navigation and obstacle avoidance, such as through optical flow [41, 30], detection of vanishing points [3], visual SLAM [32] and end-to-end mapping based on deep learning [23, 15]. Nguyen et al. [33] used Sum of Absolute Differences (SAD) [20, 48, 24] correlation method to calculate the optimal disparity of stereoscopic cameras to develop an obstacle avoidance system for autonomous wheelchairs. However, traditional difference matching methods have difficulty in distinguishing complex obstacles on the ground. Tai et al. [42] proposed an end-to-end model that can directly map depth maps to actions. However, operating a mobile robot to collect training data in the real world is inconvenient and time-consuming. [29, 57] and [2] trained policy for mapping from RGB image to action in the simulation environment. They perform well in training scenarios, but it is difficult to transfer to real-world environments. Wang et al. [49] achieved better results in robot navigation tasks by designing modular DRL methods from rewards and punishments. Although good results have been achieved in the simulator and simple real scenes, its state is expressed as a mixture of RGB images and laser data, which makes it difficult for its model to handle complex real scenes. (textures, colors, complex obstacles, etc.) Gordon et al. [16] used Habitat scene renderer [37] to train visual navigation models, thereby reducing the gap between sim-to-real. On this basis, Lu et al. [28] proposed a method combining graph neural network(GNN) and DRL to extract the structural relationship features of the scene, improve the learning efficiency of reinforcement learning, and further improve the accuracy of visual navigation. However, in unseen scenarios, these models need to be retrained. Similar works also include [58, 54, 55, 52] and [53]. Chen et al. [4] proposed a criterion for evaluating the gap between sim-to-real, and based on this proposed a visual obstacle avoidance method. However, due to the use of imitation learning and simple scenarios, the method is difficult to apply in challenging scenarios. Instead, we use a novel depth map slicing method that combines depth and semantic information to achieve mapping from a single RGB image to pseudo-laser measurement. The robot can safely avoid obstacles in more complex scenes in this way.

Ii-C Real-world random noise

Owing to the over-idealization of sensors in the simulation environment, it is difficult to apply the policy trained in the virtual environment to real scenarios. To overcome this problem, [44, 35] and [46] introduce random noise during the training process to simulate real sensor data. They showed that noise improves the robustness of the agents in challenging real-world. In this work, we also use data augmentation during training that simulates real scene depth slicing data to make the model better transfer to the real environment.

Iii Methodology

In this paper, we propose the pseudo-laser measurement, which is generated by distance and semantic information from an RGB image, to improve the environmental-perception ability of robots for the efficient DRL policy. In addition, we design an FEG module and a data augmentation method to make the pseudo-laser measurement play a more reasonable role in the DRL module. Based on this, our method can effectively perceive irregular obstacles and complex ground information in the scene, and reasonably model the current observation, to perform efficient DRL and obstacle avoidance in complex scenes.

As shown in Fig. 2, we obtain the corresponding depth map and semantic mask from the input RGB image. Then, we cull out the traversable region from the depth map through the semantic mask, since the depth of the traversable region under real conditions affects collision avoidance. After that, we propose a dynamic local minimum pooling operation that slices the semantic-depth map to generate the pseudo-laser measurement. Based on this, we input the pseudo-laser to the DRL network and sample the actions to guide the agent to avoid obstacles. In order to enable robots to capture the information that needs more reasonable attention in the input pseudo-laser, we design a FEG module. In the following, we first describe the method we proposed to generate pseudo-laser measurement, then introduce the DRL framework and submodules. Finally, we depict the data augmentation method of our model.

Iii-a Pseudo-laser measurement.

Our approach is inspired by the previous work [27] that uses the laser sensor to obtain the distance measurement away from the nearest obstacles and formulates them as a one-dimensional vector. The major advantage of this method is that they do not need to have the perfect sensing for the environment to perform collision avoidance, which significantly reduces the sim-to-real gap. The motivation of pseudo-laser measurement we proposed is to ignore the difference between simulation and reality. As simple and accurate one-dimensional data, pseudo-laser can also reduce the training difficulty of DRL while incorporating rich semantic information to improve robustness to complex obstacles. Pseudo-laser measurement is fused through depth information and semantic information of the current state so that our method is robust in complex scenes. In particular, our approach for generating pseudo-laser measurement is composed of two stages: semantic masking and deep slicing.

Semantic-depth map & Semantic mask module. We use the depth estimation model [51] and the image semantic mask model [21] as two parallel branches to obtain the depth map $M_{D}$ and semantic mask $M_{G}$ of the input RGB image $I$ ( $I \in R^{H \times W}$ ), respectively. The depth estimation branch employs pretrained MobileNet [19] as the encoder and nearest-neighbor interpolation upsampling layers with the depthwise separable convolution in the decoder. For indoor and outdoor scenes, we train the depth estimation branch on the NYUv2 [40] and KITTI [14] datasets, respectively. Intuitively, since the information of the traversable area (e.g., ground) cannot help the robot avoid obstacles, we need to delete the depth of the traversable area in the estimated depth map to avoid its impact on the perception of the environment.

Traversable region - Areas outside of any geometric and non-geometric substances that will affect robot navigation and obstacle avoidance.

To achieve this purpose, we leverage the second branch to produce a semantic mask providing pixel-level traversable region labels. Since the only concern at this stage is ‘which pixel in the image belongs to the traversable region’, our second branch can provide a binary mask $M_{G}$ , indicating which pixel of the image belongs to the traversable region, i.e., $M_{G} (i, j) = 0$ if the pixel at $(i, j)$ belongs to the traversable region, otherwise $M_{G} (i, j) = 1$ . Due to the useless depth of traversable region for collision avoidance task, we generate a depth map without traversable region depth via $M = M_{D} ⊙ M_{G}$ , where $⊙$ refers to the pixel-wise multiplication. Therefore, the semantic mask $M_{G}$ obtained by the Semantic Mask Module can effectively handle complex ground scenes. (Fig. 2 shows more details)

Depth slicing module. In order to obtain the pseudo-laser measurement $L$ , the naive solution is to slice the depth map along a specific row, i.e., $L = M (i, :)$ , where $i$ indicates the index of the row and thus $L \in R^{1 \times W}$ . However, the naive solution may not be robust for irregular obstacles. For instance, if the height of slicing is on the feet of the table, the robot is likely to collide with the upper part of the table. When there exists a small obstacle, such a naive slicing method may cause a similar problem. To address this problem, we propose an efficient solution, i.e., dynamic local minimum pooling operation. Specifically, we only keep the lower half of the depth map for computation, i.e., $^M = M (\frac{H}{2} : H, :)$ , as shown in Fig. 2 Depth Slicing Module. This is because the upper half of the image often provides little context information for collision avoidance. Then, we perform column-wise minimum pooling to select the minimal value along each column, i.e., the pseudo-laser measurement $L$ can be computed as $L(j)=min^M(:,j),∀j={1,⋯,W}$ ), where $^M (i, j) \neq 0$ . Intuitively, the minimum pooling operation approximately localizes the nearest obstacles of any height in the environment. Hence, the extracted pseudo-laser measurement with spatial semantic information about the scene are forwarded to the DRL framework.

Iii-B Deep reinforcement learning module.

In this part, we hope that agents can effectively avoid obstacles in complex unknown environments and rely on vision only to safely reach the location of a man-made designated target.

DRL problem setting. Affected by the FOV of the sensor, the agent cannot fully perceive the surrounding environment during the training process, thus we define the training process as a partially observable Markov decision process (POMDP). In simple terms, a POMDP is a cyclical process in which an agent takes action to change its state to obtain rewards and interact with the environment.

Because the action space of robots is continuous, we adopt the Actor-Critic (AC) framework based on policy gradient to implement our DRL obstacle avoidance policy.

Observation space. For the observation $o$ of the agent, we use the pseudo-laser measurement with $90^{\circ}$ horizontal FOV.
Action space. For the action $a$ of the agent, we use R $^{2}$ vectors for linear( $v$ ) and angular( $w$ ) velocities. Linear velocity of the agent is in range $v \in (0.0, 1.0)$ m/s and angular velocity is in range $w \in (- 90.0, 90.0)^{\circ}$ /s. We use normalized angular velocity which is in the range $w \in (- 1.0, 1.0)$ as outputs of the neural networks. Note that the robot cannot move backward (i.e. $v < 0.0$ m/s) because the laser sensor cannot cover the back area of the robot.
Reward function. Our objective is to avoid collisions during navigation and minimize the mean arrival time of robots. The reward function follows [27].

r_{i}^{t} = (r_{g o a l})_{i}^{t} + (r_{c o l l i s i o n})_{i}^{t} + (r_{r o t a t i o n a l})_{i}^{t}

(1)

The total reward $r$ received by robot $i$ at timestep $t$ is a sum of three terms, $r_{g o a l}$ , $r_{c o l l i s i o n}$ , $r_{r o t a t i o n a l}$ . $r_{g o a l}$ represents the reward of whether the agent has reached its goal. If the agent collides, $r_{c o l l i s i o n}$ is a penalty. And $r_{r o t a t i o n a l}$ encourages the agent to move smoothly.

(r_{g o a l})_{i}^{t} = {\begin{matrix} r_{a r r i v a l} & if ∥ p_{i}^{t} - g_{i} ∥ < 0.1 ω_{g} (∥ p_{i}^{t - 1} - g_{i} ∥ - ∥ p_{i}^{t} - g_{i} ∥) & otherwise \end{matrix}

(2)

$p_{i}^{t}$ is the position of robot $i$ at time $t$ and $g_{i}$ is the goal position of robot $i$ . In order to avoid collisions between agent and obstacles, we set $r_{c o l l i s i o n}$ as a penalty.

(r_{c o l l i s i o n})_{i}^{t} = ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ \begin{matrix} r_{c o l l i s i o n} & % if ∥ p_{i}^{t} - p_{j}^{t} ∥ < 2 R or ∥ p_{i}^{t} - B_{k} ∥ < R 0 & otherwise \end{matrix}

(3)

$B_{k}$ is the position of obstacle $k$ and $R$ is the radius of the robot. To encourage the robot to move smoothly, a small penalty $(r_{r o t a t i o n a l})_{i}^{t}$ is introduced to punish the large rotational velocities.

(r_{r o t a t i o n a l})_{i}^{t} = ω_{w} | w_{i}^{t} | if | w_{i}^{t} | > 0.7

(4)

We set $r_{a r r i v a l}$ = 15, $ω_{g}$ = 2.5, $r_{c o l l i s i o n}$ = -15 and $ω_{w}$ = -0.1 in the training procedure.

Our DRL framework is based on the module of [27]. Compared with their method, the FOV of our framework is much smaller (i.e., $90^{\circ}$ ) and the observation may be noisier. In order to make up for the limitation of the sensor FOV, we introduce LSTM into the decision policy network. At the same time, we design a FEG module to weight the input pseudo-laser measurement to further improve the ability of robots to understand and capture the surrounding environment.

LSTM module. As shown in Fig. 2, we introduce the LSTM into our network to solve the limited FOV problem. LSTM has been applied as a temporal model to address the long-term dependency issue [18]. Here, LSTM allows the agent to make more reasonable actions based on the current state and the historic states. In specific, we add LSTM behind the fully connected layer where the behavioral state (i.e., the goal g and velocity v of the robot) and the observation O are merged. The input of LSTM is a 256-dimensional feature and the output is also a 256-dimensional feature. Finally, the feature will be passed through another fully-connected layer and to map an action.

Feature extraction guidance module. In the DRL network, we take three consecutive pseudo-laser measurement ( $L_{t - 2}$ , $L_{t - 1}$ , $L_{t}$ ), goal ( $G_{t - 2}$ , $G_{t - 1}$ , $G_{t}$ ) and velocity ( $V_{t - 2}$ , $V_{t - 1}$ , $V_{t}$ ) as input. Such input can enable our agents to infer the motion of surrounding objects, and thus make more effective collision avoidance behaviors. However, the different times and positions of the goal and velocity data in the pseudo-laser measurement have different effects on the current obstacle avoidance of the agent. Data representation of different modality probably has hierarchical attention in the pseudo-laser measurement. In order to make the agent better capture the reasonable information in the pseudo-laser measurement to perform efficient DRL, we introduce a FEG module to predict a mask to weight the pseudo-laser measurement. Specifically, as shown in the feature extraction guidance module of Fig. 2, velocity vectors ( $R^{2}$ ) and goal vectors ( $R^{2}$ ) are tiled to match the shape of the pseudo-laser vector ( $R^{d_{l a s e r} * 2}$ ) where d $_{l a s e r}$ is the size of the pseudo-laser vector. Then, these tiled vector is concatenated to the pseudo-laser vector ( $R^{d_{l a s e r} * 1}$ ), resulting in a matrix that has a size of $R^{d_{l a s e r} * 5}$ . We stacked these matrices from 3 recent timesteps to obtain a network input tensor ( $R^{d_{l a s e r} * 5 * 3}$ ). This tensor first passes through several convolutional layers and deconvolution layers whose activation function is RELU, and finally gets the final mask M $_{G}$ ( $R^{d_{l a s e r} * 3}$ ) through the Sigmod activation function. After obtaining the mask, we multiply the mask with the pseudo-laser measurement to obtain the final weighted pseudo-laser measurement. Finally, the weighted pseudo-laser measurement can be used as an input to the DRL network, and the obstacle avoidance action of the agent is returned. The DRL model can pay attention to laser data in the goal direction, in the velocity direction, and in close proximity, by weighting the pseudo-laser measurement. (See Sec. IV for details.)

Training. Similar to the previous work [27], our training is conducted on a virtual environment, in which agents interact with the environment and receive rewards from the environment to guide their learning. Since the DRL algorithm performs backpropagation based on the reward given by the interactive environment, if the environment is too complex, or the task is too difficult, the environment will feedback a sparse reward to the agent, which will cause the DRL framework to converge too slowly. In order to improve the efficiency of algorithm training, following the multi-stage multi-scenario training strategy [27], we first train our DRL framework in simple scenarios for simple tasks and then transfer the trained model to a more complex environment for further training.

Fig. 3: The raw laser measurement captured in the virtual scene and the noisy laser measurement after our data augmentation.

Iii-C Data augmentation.

To enhance the robustness of the DRL based model in real-world scenarios, we add specific noise to the training laser measurement collected in the simulation environment to reduce the sim-to-real gap. In real-world scenes, the measurement errors of observations often occur around the boundaries of objects if some parts of one object block the other, i.e., the depth values of the boundary of the blocking part may be uncertain. Therefore, we intend to add noise to emulate such measurement errors in order to produce adversarial data during training.

To identify the junction boundary from the training laser measurement, i.e., a one-dimensional vector with precise distance information, we assume that, if the difference of two adjacent values in the vector is larger than the threshold $α$ , there may exist a junction boundary. To add noise, we locate the neighborhood of the junction boundary in the vector and replace all the values within the neighborhood by linearly interpolating the endpoints of the neighborhood (see Fig. 3). On the other hand, for each vector element outside the neighborhood of the junction boundary, we adaptively add Gaussian white noise whose variance is proportional to the vector value. The policy can accommodate the sim-to-real gap of pseudo-laser measurement by introducing this data augmentation.

Fig. 4: The training and testing scenarios built for the DRL framework in the Stage simulation environment. (1)-(3) are used for training. (4)-(6) are used for testing.

Iv Experimental Results

In this section, we first test the DRL framework and evaluate its performance. We also show the visualization results of the FEG module. Then, we conduct experiments on the entire monocular camera-based collision avoidance method. Finally, we apply our model to a robot in real-world scenarios.

Implementation details. Our algorithm is implemented in Pytorch. We train the DRL policy on a computer equipped with an i7-7700 CPU and NVIDIA GTX 1080Ti GPU for approximately 40 hours. We build training and testing scenarios based on the Stage mobile robot simulator (as shown in Fig. 4) and train 4 models with hyperparameters (see Table I) while adopting a multi-scene multi-stage alternating training method. During training, the convergence of our reward function is shown in Fig. 5. Specifically, we train our model in three training scenes (see Fig. 4) one after another. Note that we use a multi-agent collision avoidance task where each agent is assigned a shared collision avoidance policy, which can increase the robustness and training speed of our model.

Fig. 5: The average (dark curve) and standard deviation (light area) of rewards during the three stages of training episodes.

Hyper parameter	Value
Batch size	1024
Maximum time steps	150
Training episode	1,500,000
Discount factor $γ$	0.99
Learning rate	5e-5
LSTM unroll	20
Target network update ratio	0.01
KL penalty coefficient	15e-4
the noise threshold $α$	0.5

TABLE I: Hyper parameter settings in our experiments.

We use the following metrics to evaluate the performance of different navigation algorithms:

Success Rate. The number of times a robot reaches the target waypoint without collision and overtime.
Average Time. The average time it takes for a robot to reach the goal safely.

Fig. 6: The trajectories generated by different models during navigating an agent in a virtual scene with two waypoints.

Network structure experiment. To demonstrate that the Control Decision Stage of the $90^{\circ}$ -FOV is reliable, we evaluate 4 different network structures (CNN models with $180^{\circ}$ and $90^{\circ}$ FOV, LSTM based model and LSTM+Attention model with $90^{\circ}$ -FOV) in DRL models relying on laser sensors and ORCA [47] as baseline in the test scenarios shown in Fig. 4. Note that the CNN model with $180^{\circ}$ -FOV is [27], while the CNN model with $90^{\circ}$ -FOV is the same model yet trained with sensor observation of $90^{\circ}$ -FOV. The test scenario includes the collision avoidance tasks against static/dynamic obstacles: crossing, through walls, through fixed obstacles, and multi-robot obstacle avoidance, which are challenging scenes for evaluating the collision avoidance ability of our model against different types of obstacles. We randomly assign an initial position and a goal position to each robot and test it 100 times in different scenarios. The results are shown in Table II. We also visualize the example trajectories of each model, as shown in Fig. 6. Compared with peer models, our model can generate a smooth trajectory that successfully passes through two waypoints and reaches the goal without getting stuck or collisions.

Model

ORCA

[47]

CNN

[27]

CNN

LSTM

+ FEG

(Ours)

Field of View (FOV)

180^{\circ}

90^{\circ}

90^{\circ}

90^{\circ}

Scene 4

Success

rate

92%

98%

79%

87%

97%

Average

time

7.516

6.331

8.035

6.589

6.989

Scene 5

Success

rate

72%

86%

16%

78%

87%

Average

time

16.844

10.882

23.549

11.752

12.954

Scene 6

Success

rate

78%

95%

63%

80%

95%

Average

time

18.551

13.563

25.662

14.216

15.013

TABLE II: PERFORMANCE OF AGENTS WITH VARIOUS FOV AND ARCHITECTURES

We observe that our method is superior to the collision avoidance method ORCA [47] and comparable to the model with $180^{\circ}$ -FOV, while our model has a better success rate. For the $90^{\circ}$ -FOV model, the success rate drops significantly. With the LSTM module introduced into the model, the performance of the model in a limited FOV is improved. Based on this, we propose the FEG module that CNN has reasonable attention to the current observation. Intuitively, it can make the policy reach or even exceed the model with $180^{\circ}$ -FOV. However, due to the complexity of the network structure, the robot may occasionally decelerate significantly in dangerous situations, resulting in an increase of average time.

Fig. 8: The success rate of the three models at different times illustrates the convergence rate of the different models.

Convergence speed experiment. We find that after introducing the FEG module, the convergence speed of the algorithm is greatly accelerated, and the policy can achieve a better effect in a shorter time. Therefore, we conduct a comparative experiment of convergence speed. Specifically, we compare the three models (model without mask [27], model with the manually set mask [6], model with the network learning mask), and train a fixed number of episodes for each model. The value of episodes represents the training time. We conduct a test navigation success rate for different training time and different model policies. As shown in Fig. 8, the results show that our model has the fastest convergence rate, with a success rate of 75% at 10000 episodes, and a success rate of more than 90% at 15000 episodes. The model without mask only reaches a 90% success rate at 25000 episodes, showing a steady growth trend. Although the manually set mask has a faster convergence rate than the model without the mask, the effect is not ideal compared with the model where we obtain the mask through learning. In the same training times, our FEG module always maintains the highest success rate. Intrinsically, by obtaining masks that have hierarchical attention to the current observation, the network can more efficiently extract reasonable information for obstacle avoidance tasks and accelerate policy convergence.

Feature extraction guidance module analysis. To study the internal effect of the FEG module, we show the visualization results of the learned mask. Fig. 7(a) shows the current state of the robot, green represents the laser range, the red arrow is the current relative direction to the goal, the yellow arrow is the current velocity direction, and the red dot is the goal position. Fig. 7(b) to 7(d) are the visualization of laser input, feature mask and fusion result. Note that the visualization result is divided into three layers from top to bottom, corresponding to three consecutive times of laser measurement. We observe that the weight of the FEG module mask gradually increases from top to bottom. The network has different attention to different moments and pays more attention to the data at the current moment. Mask is obtained by fusion of laser input, goal, and velocity. Therefore the laser measurement in the two directions of the current position to goal direction and speed direction are given higher weight. The position of the obstacle (small value) in the laser input has a higher weight, thus the network pays more attention to the obstacles at closer distances, while the safety distance pays less attention.

Model

ORCA

[47]

CNN

[27]

CNN

LSTM

+ FEG

(Ours)

Field of View (FOV)

180^{\circ}

90^{\circ}

90^{\circ}

90^{\circ}

Scene 11

Success

rate

66%

72%

24%

63%

69%

Average

time

105.764

95.056

121.655

98.432

89.535

Scene 12

Success

rate

61%

70%

39%

64%

68%

Average

time

202.766

176.631

215.906

194.024

180.787

Scene 13

Success

rate

71%

75%

52%

68%

77%

Average

time

357.803

296.438

349.970

321.009

306.556

TABLE III: PERFORMANCE OF AGENTS WITH VARIOUS FOV AND ARCHITECTURES IN DENSE CROWD ENVIRONMENT

Dense crowd experiment. Our model is learned in a multi-agent scenario. Agents act as dynamic obstacles to each other, and all agents share the same policy. To show that our model is not only applicable to multi-agent scenarios with the same policy, but also to dense and complex crowd scenarios, we conduct dense crowd experiments. We use the pedestrian motion simulator [34] proposed by Billy et al. to generate pedestrian trajectories in RVIZ of ROS, use simple cubes instead of pedestrian models for testing in Gazebo, and transmit pedestrian coordinates from RVIZ to Gazebo in real-time, thus the cubes can simulate the movement of pedestrians. We deploy the robot model to test in three different scenarios (as shown in Fig. 9), randomly generate the initial position and the end position, compare the success rate and time of the five models, and each model is tested 100 times in each scenario. As shown in Table III, we can observe that it is similar to Table II. The success rate of our $90^{\circ}$ -FOV model is similar to the $180^{\circ}$ -FOV CNN model, while the $90^{\circ}$ -FOV CNN model performs poorly. Because pedestrians are too dense and have irregular movements, the success rate is generally low. Due to the deepening of the network, our policy has taken a relatively longer time.

Fig. 9: The virtual scene of the dense crowd experiment, the left is the simulator [34] in RVIZ, and the right is our scene in Gazebo where the pedestrian data of the simulator is synchronized.

Model

_{B t m . L a s .}

_{T o p . L a s .}

_{D e p .1 - D .}

_{D e p . P o l .}

_{D e p .1 - D . S e m .}

_{D e p . P o l . S e m .}

(proposed)

_{D e p . S e m . N o i .}

(proposed)

Scene 7

Success rate

23%

14%

37%

82%

89%

Average time

15.743

18.098

13.735

13.773

11.937

Scene 8

Success rate

100%

85%

77%

94%

85%

100%

Average time

18.075

18.553

19.011

21.251

18.721

18.771

18.972

Scene 9

Success rate

93%

87%

71%

94%

82%

94%

96%

Average time

23.736

22.268

27.469

27.087

24.242

27.005

24.738

Scene 10

Success rate

80%

70%

60%

70%

65%

80%

85%

Average time

147.515

160.742

201.024

174.732

198.995

184.951

168.546

TABLE IV: PERFORMANCE OF AGENTS WITH DIFFERENT MODELS IN 3D SCENES OF DIFFERENT COMPLEXITY.

Test scenario of a single irregular obstacle deployed in Gazebo, including coffee table, table, fire hydrant, construction cone and cabinet. — (a) cafe table

The entire monocular camera-based method. To demonstrate the effectiveness of our pseudo-laser measurement, we perform a visual simulation experiment. We build a 3D simulation environment based on Gazebo, as shown in Fig.11. Scene 7 consists only of challenging irregular objects, while Scene 8-10 are complex corridor scenes with pedestrians. We build a simple two-wheel differential robot in Gazebo, deploying laser sensors and RGB-D cameras. The height of the robot is higher than the height of the table in the scene. In order to make the depth estimation and semantic segmentation models suitable for the Gazebo simulation environment, we collect scene-specific image data and pretrain the model for 6 hours so that it can be applied to virtual scenes. In order to improve the efficiency of image processing, we reduce the depth map by 5 times. The model runs on a computer with NVIDIA GTX 1080Ti GPU and an i7-7700 CPU, which can reach 10 frames per second.

Fig. 11: Different 3D test scenarios are used for our experiments. (7) A scene consisting of only irregular objects. (8) A simple static scene. (9) A more complex scene with pedestrians. (10) A complex corridor scene.

We compare 7 models, lasers on the bottom of the robot (E $_{B t m . L a s .}$ ), laser on top of the robot (E $_{T o p . L a s .}$ ), one-dimensional slice of depth map (E $_{D e p .1 - D .}$ ), dynamic local minimum pooling of depth map (E $_{D e p . P o l .}$ ), one-dimensional slice of depth map combined with semantic mask (E $_{D e p .1 - D . S e m .}$ ), dynamic local minimum pooling of depth maps combined with semantic mask (E $_{D e p . P o l . S e m .}$ ), and dynamic local minimum pooling of depth map and semantic mask with noise (E $_{D e p . S e m . N o i .}$ ). We randomly generate a starting position and a goal position for a robot in each scene and test it 100 times. The corridor scene is set with 4 waypoints, and the robot moves circularly in the scene and tests for 20 times. The results are shown in Table IV.

Model

_{B t m . L a s .}

_{T o p . L a s .}

_{D e p .1 - D .}

_{D e p . P o l .}

_{D e p .1 - D . S e m .}

_{D e p . P o l . S e m .}

(proposed)

_{D e p . S e m . N o i .}

(proposed)

cafe_table

Success rate

72%

24%

68%

18%

90%

92%

table

Success rate

66%

18%

66%

22%

88%

90%

fire_hydrant

Success rate

94%

92%

86%

80%

88%

96%

98%

construction cone

Success rate

96%

44%

52%

86%

38%

92%

96%

cabinet

Success rate

26%

34%

38%

64%

56%

90%

94%

TABLE V: PERFORMANCE OF AGENTS WITH DIFFERENT MODELS FOR DIFFERENT IRREGULAR OBSTACLES.

We observe that, in Scene 7 consisting only of irregular objects, almost all the comparison models fail to accomplish the task. Only our slicing method takes the shape of the obstacle into account and achieves the best performance. Since the model E $_{B t m . L a s .}$ considers the shape of the bottom of obstacles only, it reacts poorly towards obstacles such as desks. Similarly, the model E $_{T o p . L a s .}$ can only scan obstacle information at a certain height, which is not able to handle lower objects and cone objects. The model E $_{D e p .1 - D .}$ does not perform well for all objects since it falsely integrates the depth of the traversable region. The model E $_{D e p . P o l .}$ shows some successful collision avoidance behaviors for some objects. However, the traversable region information is not considered, therefore the navigation success rate is still poor. Although the model E $_{D e p .1 - D . S e m .}$ considers the depth of the traversable region, it still only captures one-dimensional data, which is not effective. In contrast, the model E $_{D e p . P o l . S e m .}$ considers both the depth of traversable region and the overall shape of the obstacle, which can achieve a better success rate of obstacle avoidance. We also test in more complex scenes, i.e., Scene 8-10, and the results show that our method has better robustness and generalization, and even better than laser sensors. Based on this, we also introduce a model trained via our proposed data augmentation method E $_{D e p . S e m . N o i .}$ . During training, the data augmentation setting details are as below. For identifying junction boundaries in laser measurements, we set the threshold for adjacent values as 0.5, and the range of neighborhoods is set as 8. For non-boundaries, we add the Gaussian white noise to each value of the laser measurement for data augmentation. The scale of the noise is 0.07 times the value. Table IV shows that the addition of noise in data augmentation for training the model has the highest success rate. Without such data augmentation, the agent will behave hesitantly and it may rotate at a large angle due to the difference between the pseudo-laser estimated from the depth map and the actual laser sensing data in the virtual environment.

Single irregular object experiment. To further illustrate the effectiveness of our policy on obstacle avoidance for irregular objects, we tested 5 different irregular obstacles (cafe table, table, fire hydrant, construction cone, and cabinet) in the Gazebo environment (as shown in Fig. 10). For the robot to randomly generate an initial position in front of each obstacle and a target position after the obstacle, each obstacle is tested 50 times. The results are shown in Table V. We can observe that all strategies have a high success rate for fire hydrant objects because it is similar to a cylindrical shape and is more regular than other objects. However, for the conical object and the table, the simple laser obstacle avoidance policy and the simple depth slicing policy are not ideal. And our policy has a higher success rate of obstacle avoidance for all irregular objects, showing good generalization.

Fig. 12: Visualization of the process of acquiring pseudo-laser measurement using deep slicing in some complex ground scenes. Each column represents a scene (*i.e.*, a scene with a water surface, a scene with a slope and a scene with clothes on the floor), and each row represents the image processing procedure (including RGB image, depth image, semantic image, semantic-depth image and pseudo-laser measurement).

Complex ground scenes experiment. In order to further explain the role of our fusion of semantic information, as shown in Fig. 12, we visualize the process of acquiring pseudo-laser measurement in three complex scenarios. Taking Fig. 12(A) as an example, we obtain a depth map from an RGB image and a binary classification semantic mask with only the traversable region and background. After fusing into a semantic-depth map, we use the depth slicing method we proposed to obtain the final pseudo-laser measurement. The traditional method obtains laser measurement from the depth map by horizontal slice mapping of the point cloud. Since semantic information is not considered, the water surface is also cut off from the ground. However, in reality, the water surface is not the safe passable region of the robot. Therefore we use semantic information to view the water as an unpassable region. The depth information of the water surface is also taken into account during the slicing process, thus providing the robot with the safest pseudo-laser measurement. In Fig. 12(B), there is a highly misaligned slope. For a simple horizontal slice, it is difficult to set a threshold, resulting in ground depth affecting navigation. Through semantic information, we can obtain the traversable region of robots, so as to generate the most efficient and reasonable pseudo-laser measurement. There is a piece of clothing on the ground in Fig. 12(C). In fact, the clothing cannot be crushed and should be regarded as an obstacle. We can obtain the pseudo-laser measurement considering the depth of the clothing. Therefore, the method of obtaining the semantic mask of the traversable region through the semantic information and then performing depth slicing can enable the robot to effectively deal with scenes with the complex ground.

Fig. 13: Visualize the process of acquiring pseudo laser to face complex obstacles in real-world scenes. Each column represents a complex obstacle (*i.e.*, a chair, a table, clothes and glass), and each row represents the image processing procedure (including observation, RGB image, semantic-depth image, pseudo-laser measurement and bottom-laser device observation data).

Visualization experiment of complex obstacles in real-world scenes. We deploy our method in a real-world scene, and visualize the generation process of pseudo-laser measurement when faced with complex objects, as shown in Fig. 13. Compared with laser device observation data, our approach with semantic information can effectively segment the traversable region, extract the edge contours of irregular objects, and identify ‘special obstacles’ in complex ground scenes (as shown in the red boxes in Fig. 13). Thereby our framework can perform efficient DRL to obtain reasonable actions through the generated pseudo-laser measurement. See the supplementary video for more details.

Fig. 14: Experimental results of the limitations of the monocular-visual obstacle avoidance framework for obstacles of different sizes with different initial distances.

Fig. 15: The trajectories generated by different models during navigating an agent in an office scene with two waypoints and visualization of the results at key locations.

Monocular-visual limitation experiment. Visual obstacle avoidance will be affected by the current camera perspective. Although our policy will avoid obstacles in advance when facing obstacles, in order to study the limitations of our method, we manually initialize the location of obstacles to test the navigation success rate. We find that navigation failed when obstacles occupied most of the FOV, thus different obstacles have different navigation safety distances. We use walls of different sizes as test obstacles, and each wall tests multiple different distances.

Distance - The quotient of the distance between the robot and the obstacle and the maximum distance taken by the virtual camera (6.0m).

The result is shown in Fig. 14. The navigation success rate is lower when the initial position is closer to the obstacle. Obviously, the greater the space occupied by obstacles, the greater the distance for policy failure. When the obstacle occupies 65% of the maximum angle of view ( $90^{\circ}$ -FOV), our policy starts to perform poorly, and the success rate decreases. When it occupies 80%, our policy will fail, showing a hover or even collision.

Fig. 16: Illustration of the mobile robot platform used in our hardware experiments. It has 5 Intel Realsense D435 depth cameras, but we only use RGB images from the middle one of them. Two lasers at the top and bottom are used for testing.

Fig. 17: Our real test scene, a daily office.

Iv-a Hardware experiment

In order to see the overall performance of our method in the real world, we deploy a ROS-based mobile robot as shown in Fig. 16 in a real scene constructed of cardboard boxes, tables, and chairs. The robot transmits the acquired images through a router to a local host with an Nvidia GTX 1080Ti GPU. Then the host performs image processing action prediction locally and sends speed information to the robot. The image processing can reach 10 frames per second in real-time. Our method can safely navigate and avoid obstacles in unseen scenes, and shows good robustness and generalization to static complex objects and challenging dynamic pedestrians.

In addition, we test the effect of different models in a daily office scene as shown in Fig. 17, and visualized the trajectories of different models and the processing results of our method at key locations. As shown in Fig. 15, we visualize the trajectories of 7 models in the overhead point cloud of the office scene, and all 5 models collided except for our proposed method. Among them, the three models of E $_{B t m . L a s .}$ , E $_{T o p . L a s .}$ and E $_{D e p .1 - D .}$ are difficult to perceive the sofa and the shorter table, and they all collide. E $_{D e p . P o l .}$ has difficulty in distinguishing the useless traversable region depth and mistakes the table as a path thus colliding with the large table. In our proposed method, E $_{D e p . P o l . S e m .}$ does not collide but has a more tortuous route and behaves hesitantly, especially at locations with large depth variations. E $_{D e p . S e m . N o i .}$ shows the best results, and we visualize the pseudo laser measurement generation process of E $_{D e p . S e m . N o i .}$ at key locations where other models have a high chance of collision. The results show more intuitively that our proposed method can effectively deal with complex scenes. See the supplementary video for more details.

Fig. 18: Failure cases. Our method may fail in a scene with mirrors because the depth estimation does not perform well for the depth of the mirror.

V Conclusion and Future work

In this paper, we propose a framework that uses a low-cost monocular RGB camera to accomplish obstacle avoidance for mobile robots. Against the complex obstacles in challenging environments, we design the pseudo-laser, which fuses distance and semantic information and exhibits good collision avoidance performance in unseen complex scenes. The FEG module we proposed can reasonably weight the observed data, thus the mobile robot can capture the surrounding environment well and achieve efficient collision avoidance. In addition, we use data augmentation to improve the performance of the policy in the real-world and introduce an LSTM module to solve the problem with the limited FOV. However, our method has limitations. Due to the poor performance of the depth estimation method for the mirror, our method is invalid in some indoor scenes with mirrors [56], as shown in Fig. 18. In future work, we will consider more complex concepts, such as robot positioning via pure vision [17], visual servoing [22] and the application of hierarchical reinforcement learning [25] in obstacle avoidance tasks.

Acknowledgment

This work was supported in part by National Key Research and Development Program of China (2021ZD0112400), the National Natural Science Foundation of China under Grant 61972067/U21A20491/U1908214, and the Innovation Technology Funding of Dalian (2020JJ26GX036).

References

[1] W. A. Arokiasami, P. Vadakkepat, K. C. Tan, and D. Srinivasan (2016) Interoperable multi-agent framework for unmanned aerial/ground vehicles: towards robot autonomy. Compl. Intel. Syst. 2 (1), pp. 45–59. Cited by: §I.
[2] H. Bharadhwaj, Z. Wang, Y. Bengio, and L. Paull (2019) A data-efficient framework for training and sim-to-real transfer of navigation policies. In Proc. IEEE Int. Conf. Robot. Autom., pp. 782–788. Cited by: §II-B.
[3] C. Bills, J. Chen, and A. Saxena (2011) Autonomous mav flight in indoor environments using single image perspective cues. In Proc. IEEE Int. Conf. Robot. Autom., pp. 5776–5783. Cited by: §II-B.
[4] G. Chen, H. Yu, W. Dong, X. Sheng, X. Zhu, and H. Ding (2019) Learning to navigate from simulation via spatial and semantic information synthesis. arXiv:1910.05758. Cited by: §I, §II-B.
[5] Y. F. Chen, M. Everett, M. Liu, and J. P. How (2017) Socially aware motion planning with deep reinforcement learning. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., pp. 1343–1350. Cited by: §I, §II-A.
[6] J. Choi, K. Park, M. Kim, and S. Seok (2019) Deep reinforcement learning of navigation in a complex and crowded environment with a limited field of view. In Proc. IEEE Int. Conf. Robot. Autom., pp. 5993–6000. Cited by: §I, §I, §I, §II-A, §IV.
[7] F. A. Cosío and M. P. Castañeda (2004) Autonomous robot navigation using adaptive potential fields. Mathematical and Computer Modelling 40 (9-10), pp. 1141–1156. Cited by: §II-A.
[8] T. Fan, X. Cheng, J. Pan, D. Manocha, and R. Yang (2018) Crowdmove: autonomous mapless navigation in crowded scenarios. arXiv:1807.07870. Cited by: §I, §II-A.
[9] T. Fan, P. Long, W. Liu, and J. Pan (2020) Distributed multi-robot collision avoidance via deep reinforcement learning for navigation in complex scenarios. The International Journal of Robotics Research 39 (7), pp. 856–892. Cited by: §II-A.
[10] A. Faust, K. Oslund, O. Ramirez, A. Francis, L. Tapia, M. Fiser, and J. Davidson (2018) PRM-rl: long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In Proc. IEEE Int. Conf. Robot. Autom., pp. 5113–5120. Cited by: §II-A.
[11] D. Fox, W. Burgard, and S. Thrun (1997) The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine 4 (1), pp. 23–33. Cited by: §II-A.
[12] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik (2015) Learning visual predictive models of physics for playing billiards. arXiv:1511.07404. Cited by: §II-A.
[13] L. Gao, J. Ding, W. Liu, H. Piao, Y. Wang, X. Yang, and B. Yin (2021) A vision-based irregular obstacle avoidance framework via deep reinforcement learning. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., pp. 9262–9269. Cited by: §I.
[14] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. Int. J. Robot. Res. 32 (11), pp. 1231–1237. Cited by: §III-A.
[15] A. Giusti, J. Guzzi, D. C. Cireşan, F. He, J. P. Rodríguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al. (2015) A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robot. Autom. Lett. 1 (2), pp. 661–667. Cited by: §II-B.
[16] D. Gordon, A. Kadian, D. Parikh, J. Hoffman, and D. Batra (2019) Splitnet: sim2sim and task2task transfer for embodied visual navigation. In Proc. IEEE Int. Conf. Comput. Vis., pp. 1022–1031. Cited by: §II-B.
[17] N. Hirose, F. Xia, R. Martín-Martín, A. Sadeghian, and S. Savarese (2019) Deep visual mpc-policy learning for navigation. IEEE Robot. Autom. Lett. 4 (4), pp. 3184–3191. Cited by: §V.
[18] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. Cited by: §I, §III-B.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Cited by: §III-A.
[20] Y. Huang, S. Chien, B. Hsieh, and L. Chen (2004) Global elimination algorithm and architecture design for fast block matching motion estimation. IEEE Transactions on circuits and systems for Video technology 14 (6), pp. 898–907. Cited by: §II-B.
[21] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) Ccnet: criss-cross attention for semantic segmentation. In Proc. IEEE Int. Conf. Comput. Vis., pp. 603–612. Cited by: §III-A.
[22] S. Hutchinson, G. D. Hager, and P. I. Corke (1996) A tutorial on visual servo control. IEEE Trans. Robot. Autom. 12 (5), pp. 651–670. Cited by: §V.
[23] D. K. Kim and T. Chen (2015) Deep neural network for real-time autonomous indoor navigation. arXiv:1511.04668. Cited by: §II-B.
[24] K. Lengwehasarit and A. Ortega (2001) Probabilistic partial-distance fast matching algorithms for motion estimation. IEEE Transactions on Circuits and Systems for Video Technology 11 (2), pp. 139–152. Cited by: §II-B.
[25] C. Li, F. Xia, R. Martin-Martin, and S. Savarese (2020) Hrl4in: hierarchical reinforcement learning for interactive navigation with mobile manipulators. In Conference on Robot Learning, pp. 603–616. Cited by: §V.
[26] T. Liu (2018) Robot collision avoidance via deep reinforcement learning. GitHub. Note: \urlhttps://github.com/Acmece/rl-collision-avoidance.git Cited by: §II-A.
[27] P. Long, T. Fanl, X. Liao, W. Liu, H. Zhang, and J. Pan (2018) Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In Proc. IEEE Int. Conf. Robot. Autom., pp. 6252–6259. Cited by: §I, §I, §I, §II-A, 3rd item, §III-A, §III-B, §III-B, TABLE II, TABLE III, §IV, §IV.
[28] Y. Lu, Y. Chen, D. Zhao, and D. Li (2021) MGRL: graph neural network based inference in a markov network with reinforcement learning for visual navigation. Neurocomputing 421, pp. 140–150. Cited by: §II-B.
[29] L. Ma, Y. Liu, J. Chen, and D. Jin (2019) Learning to navigate in indoor environments: from memorizing to reasoning. arXiv:1904.06933. Cited by: §II-B.
[30] C. McCarthy and N. Bames (2004) Performance of optical flow techniques for indoor navigation with a mobile robot. In Proc. IEEE Int. Conf. Robot. Autom., Vol. 5, pp. 5093–5098. Cited by: §II-B.
[31] J. Michels, A. Saxena, and A. Y. Ng (2005) High speed obstacle avoidance using monocular vision and reinforcement learning. In Proc. Int. Conf. Mach. Learn., pp. 593–600. Cited by: §II-A.
[32] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE Trans. Robot. 31 (5), pp. 1147–1163. Cited by: §II-B.
[33] T. H. Nguyen, J. S. Nguyen, D. M. Pham, and H. T. Nguyen (2007) Real-time obstacle detection for an autonomous wheelchair using stereoscopic cameras. In 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 4775–4778. Cited by: §II-B.
[34] B. Okal and K. O. Arras (2014) Towards group-level social activity recognition for mobile robots. In IROS Assistance and Service Robotics in a Human Environments Workshop, Cited by: Fig. 9, §IV.
[35] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. In Proc. IEEE Int. Conf. Robot. Autom., pp. 1–8. Cited by: §II-C.
[36] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena (2017) From perception to decision: a data-driven approach to end-to-end motion planning for autonomous ground robots. In Proc. IEEE Int. Conf. Robot. Autom., pp. 1527–1533. Cited by: §I, §II-A.
[37] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al. (2019) Habitat: a platform for embodied ai research. In Proc. IEEE Int. Conf. Comput. Vis., pp. 9339–9347. Cited by: §II-B.
[38] S. Schroeder, M. Zilske, G. Liedtke, and K. Nagel (2012) Towards a multi-agent logistics and commercial transport model: the transport service provider’s view. Procedia-Social and Behavioral Sciences 39, pp. 649–663. Cited by: §I.
[39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv:1707.06347. Cited by: §II-A.
[40] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In Proc. Eur. Conf. Comput. Vision., pp. 746–760. Cited by: §III-A.
[41] K. Souhila and A. Karim (2007) Optical flow based robot obstacle avoidance. Int. J. Adv. Robot. Syst. 4 (1), pp. 2. Cited by: §II-B.
[42] L. Tai, S. Li, and M. Liu (2016) A deep-network solution towards model-less obstacle avoidance. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., pp. 2759–2764. Cited by: §I, §II-B.
[43] L. Tai, G. Paolo, and M. Liu (2017) Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., pp. 31–36. Cited by: §II-A.
[44] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv:1804.10332. Cited by: §II-C.
[45] G. Tang, N. Kumar, and K. P. Michmizos (2020) Reinforcement co-learning of deep and spiking neural networks for energy-efficient mapless navigation with neuromorphic hardware. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., pp. 6090–6097. Cited by: §II-A.
[46] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., pp. 23–30. Cited by: §II-C.
[47] J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha (2011) Reciprocal n-body collision avoidance. In Robot. Res., pp. 3–19. Cited by: §II-A, TABLE II, TABLE III, §IV, §IV.
[48] J. Vanne, E. Aho, T. D. Hamalainen, and K. Kuusilinna (2006) A high-performance sum of absolute difference implementation for motion estimation. IEEE transactions on circuits and systems for video technology 16 (7), pp. 876–883. Cited by: §II-B.
[49] J. Wang, S. Elfwing, and E. Uchibe (2021) Modular deep reinforcement learning from reward and punishment for robot navigation. Neural Networks 135, pp. 115–126. Cited by: §II-B.
[50] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger (2019) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Cited by: §I.
[51] D. Wofk, F. Ma, T. Yang, S. Karaman, and V. Sze (2019) Fastdepth: fast monocular depth estimation on embedded systems. In Proc. IEEE Int. Conf. Robot. Autom., pp. 6101–6108. Cited by: §III-A.
[52] Q. Wu, D. Manocha, J. Wang, and K. Xu (2020) Neonav: improving the generalization of visual navigation via generating next expected observations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 10001–10008. Cited by: §II-B.
[53] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian (2018) Building generalizable agents with a realistic and rich 3d environment. arXiv:1801.02209. Cited by: §II-B.
[54] Y. Wu, Z. Rao, W. Zhang, S. Lu, W. Lu, and Z. Zha (2019) Exploring the task cooperation in multi-goal visual navigation.. In IJCAI, pp. 609–615. Cited by: §II-B.
[55] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi (2018) Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543. Cited by: §II-B.
[56] X. Yang, H. Mei, K. Xu, X. Wei, B. Yin, and R. W. Lau (2019) Where is my mirror?. In Proc. IEEE Int. Conf. Comput. Vis., pp. 8809–8818. Cited by: §V.
[57] W. Zhang, C. Ma, Q. Wu, and X. Yang (2020) Language-guided navigation via cross-modal grounding and alternate adversarial learning. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §I, §II-B.
[58] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proc. IEEE Int. Conf. Robot. Autom., pp. 3357–3364. Cited by: §II-B.