BioSLAM: A Bio-inspired Lifelong Memory System for General Place Recognition

Peng Yin¹,*, Abulikemu Abuduweili¹,*, Shiqi Zhao², Changliu Liu¹, and Sebastian Scherer¹

¹ Peng Yin, Abulikemu Abuduweili, Changliu Liu, and Sebastian Scherer are with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. (pyin2, abulikea, cliu6, basti)@andrew.cmu.edu.
² Shiqi Zhao is with the University of California San Diego, La Jolla, CA 92093, USA. (s2zhao@eng.ucsd.edu).
* Authors Peng Yin and Abulikemu Abuduweili contributed equally. Corresponding author: Peng Yin (pyin2@andrew.cmu.edu)

Abstract

We present BioSLAM, a lifelong SLAM framework for learning various new appearances incrementally while maintaining accurate place recognition for previously visited areas. Unlike humans, artificial neural networks suffer from catastrophic forgetting and may forget previously visited areas when trained on new arrivals. In humans, researchers have discovered a memory replay mechanism in the brain that keeps neurons active for previous events. Inspired by this discovery, BioSLAM designs a gated generative replay to control the robot’s learning behavior based on feedback rewards. Specifically, BioSLAM provides a novel dual-memory mechanism for maintenance: 1) a dynamic memory to efficiently learn new observations and 2) a static memory to balance new and old knowledge. When combined with a visual- or LiDAR-based SLAM system, the complete processing pipeline helps the agent incrementally update its place recognition ability and remain robust to the increasing complexity of long-term place recognition.

We demonstrate BioSLAM in two incremental SLAM scenarios. In the first scenario, a LiDAR-based agent continuously travels through a city-scale environment along a 120km trajectory and encounters different types of 3D geometries (open streets, residential areas, commercial buildings). We show that BioSLAM can incrementally update the agent’s place recognition ability and outperforms the state-of-the-art incremental approach, Generative Replay, by 24%. In the second scenario, a LiDAR-vision-based agent repeatedly travels through a campus-scale area on a 4.5km trajectory. Here BioSLAM outperforms the state-of-the-art approaches by 15% in place recognition accuracy under different appearances. To our knowledge, BioSLAM is the first memory-enhanced lifelong SLAM system to help incremental place recognition in long-term navigation tasks.

Lifelong SLAM, Incremental Place Recognition, Continuous Localization

Fig. 1: Challenges in Real-world Robotic Localization. In real-world field applications, robotic localization usually encounters the following challenges: 1) changing appearance under long-term environmental variations, 2) diverse geometric differences across large-scale areas, 3) mixed structured/unstructured environments, and 4) the non-stop operation requirement of long-term autonomy.

I Introduction

An essential capability for long-term robotic autonomy in the open world without human assistance is lifelong Simultaneous Localization and Mapping (SLAM) [11]. In the context of lifelong SLAM, the system must operate over the long term in large-scale environments and under diverse environmental conditions, as depicted in Fig. 1. Current SLAM methods are mainly evaluated in single-type environments, where the environmental conditions (such as illumination, weather, seasons, etc.) are consistent and the environments are mostly static. Recent works attempt to relax the single-type assumption to accommodate diverse environments by incorporating domain adaptation techniques into model learning with deep neural networks. However, the place descriptors learned under new scenarios can degrade the localization accuracy in previous scenarios, an effect known as “catastrophic forgetting”.

In real-world long-term navigation, the robot may encounter complicated 3D environments, such as campus areas, open streets, residential blocks, commercial buildings, etc., and each place has its unique patterns in place recognition. The robot platform can’t collect datasets under all scenarios at once and train the localization module in a supervised manner. A naive solution for incremental observations is to source additional data for model adaptation with a new scenario; however, this adaptation is not feasible when the goal is to ensure the uninterrupted and long-term operation of the robot, since it causes catastrophic forgetting of previous knowledge. Moreover, changes in environments can be sudden, e.g., rapid illumination and weather changes, while it may take too long for traditional learning-based approaches to react to the changes. As depicted in Fig. 1, the main challenges for lifelong place recognition include:

Various environmental conditions: the appearances of the same area under different environmental conditions will be represented with different patterns.
Diverse scenarios: the robot platform will encounter different 3D environments in large-scale navigation tasks, and most areas are a combination of different types.
Non-stop training: for long-term autonomy challenges, the robot will accumulate new datasets, and model fine-tuning is usually required to improve localization performance for new scenarios.

With the above challenges, traditional SLAM methods mainly work for short-term navigation tasks and can hardly deal with long-term data association. The lack of domain adaptation in existing methods has become a major hurdle to achieving long-term robotic autonomy because robots will encounter boundless new scenarios in real applications. For most place recognition methods [17, 32], the addressed domain adaptation only considers unidirectional knowledge transfer from a single domain to another fixed domain, which cannot be generalized to open world situations, where new environments that the robot can encounter are infinite and previously known environments can be visited under diverse conditions.

In this work, we propose a lifelong localization system, BioSLAM, which defines a lifelong SLAM framework that can continuously adapt to new environments without sacrificing performance in previously seen environments. In our previous work [57], we noticed that cross-domain appearance differences significantly affect the localization performance; the localization module encounters the catastrophic forgetting problem, where it is only robust to the most recently trained scenarios. In contrast, humans and animals do not suffer from catastrophic forgetting: short-term and long-term memory mechanisms exist within the hippocampus [41] and the frontal lobe of the brain [49], which play the main role in lifelong knowledge updating. Recently, new evidence from fMRI studies in humans [7] suggests that the hippocampus may ‘act as a librarian to retrieve the cortical books of memory’, i.e., the hippocampus can index the memories for fast retrieval. Inspired by this biological mechanism, we design two memory zones for BioSLAM, namely a static memory zone (SMZ) for historical memory encoding at low frequency and a dynamic memory zone (DMZ) for quick memory replay, and propose a dual-memory selection mechanism to balance short-term adaptation to new observations and long-term retention of historical knowledge. BioSLAM also develops a sleeping cycle for memory consolidation within the SMZ, which is inspired by a similar mechanism in the hippocampus [30]. Based on the above mechanisms, BioSLAM is able to achieve long-term place recognition.

The evaluation methods [60] for traditional place recognition using supervised learning approaches do not apply to lifelong systems. The performance of lifelong systems is reflected by the adaptation capability with respect to new observations and the long-term memory retention of previously visited areas. In this work, we formulate two metrics, namely adaptation efficiency (AE) analysis and retention ability (RA) analysis, and perform extensive evaluation using two long-term datasets: 1) City Dataset, which is focused on changing geometric patterns, and 2) Campus Dataset, which is focused on changing illumination patterns. The major contributions of this paper are as follows:

BioSLAM provides a systematic framework to learn about ever-changing environments without interruption. Using this framework, we enable the incremental place feature learning in the long-term autonomy.
Within BioSLAM, we develop a dual-memory module, which includes 1) a dynamic memory zone with high-frequency updates for fast adaptation of new patterns and 2) a static memory zone with low-frequency updates for long-term memory retention.
BioSLAM can perform non-stop online learning for new environments and provide lifelong re-localization ability for previously visited areas even under changing environmental conditions. Furthermore, the modular design of BioSLAM allows it to be combined with arbitrary place descriptor learning modules.
We developed extensive lifelong localization datasets and related metrics to evaluate the lifelong localization performance, and provide a detailed analysis of adaptation efficiency and long-term memory retention ability.

In the rest of the paper, we will introduce the related works for place recognition and lifelong incremental learning in section II. Section III gives the structural overview of BioSLAM. Section IV and section V explain the details of the general place feature learning and bio-inspired lifelong memory, respectively. The experiment setup and qualitative/quantitative analysis are given in section VI and section VII.

II Related Works

There are two important modules in lifelong localization: 1) place recognition and 2) lifelong learning. Place recognition (PR), or loop closure detection (LCD), has been studied for decades, as stated in [37, 5, 64], and mainly serves as the data association step for large-scale re-localization and map optimization in SLAM tasks. Lifelong learning, also known as continual, incremental, or sequential learning, aims at incrementally building up knowledge from a sequential data stream [16, 33], which is essential for long-term localization, where robots will encounter an unbounded variety of environments. In the following subsections, we mainly introduce the related works in visual/LiDAR place recognition and recent lifelong learning works from a robotics perspective.

II-A Long-term Place Recognition

Place recognition aims to identify the same area under different perspectives and environmental conditions [37]. It is mainly addressed with two types of approaches, namely visual-based and LiDAR-based place recognition.

For visual place recognition, the visual inputs are usually affected by illumination and viewpoint. Traditional geometry descriptors (e.g., scale-invariant feature transform (SIFT) [39] and oriented FAST and rotated BRIEF (ORB) [43]) are widely used in visual place recognition because of their invariance to scale, orientation, and illumination changes. Based on these handcrafted features, FAB-MAP [15] builds a Bag-of-visual-words (BoW) architecture to achieve large-scale visual re-localization. iBoW-LCD [21] uses an incremental BoW scheme based on binary descriptors to retrieve matched images more efficiently. An et al. introduce FILD++ [1], an incremental loop closure detection approach that constructs a hierarchical small-world graph. With the rise of deep learning, new deep network features, such as VGG [47], ResNet [23], and Transformer [53], provide significant improvements in feature/semantic extraction. NetVLAD [2] combines CNN features with a differentiable VLAD [3] layer to enable deep learning for visual place recognition; building on [2], recent deep learning approaches [26, 19, 25] further improve the recognition accuracy by combining it with different networks.

For LiDAR-based place recognition, LiDAR inputs are largely unaffected by environmental conditions such as illumination, weather, and season differences. In non-learning-based 3D localization, M2DP [24] and Scan-context [28] utilize histograms of LiDAR projections to achieve long-term 3D re-localization. With the development of 3D deep feature extraction, learning-based approaches have also received increasing attention. PointNetVLAD [50] combines point-based feature extraction and a VLAD layer for 3D place recognition. LPDNet [35] extends [50] by including local geometric features. PCAN [63], from another perspective, uses an attention-enhanced VLAD layer to improve feature association for accurate localization. OverlapNet [12] provides a differentiable projection layer to estimate the similarity of local 3D sub-maps. In our previous works, FusionVLAD [58] provides a fusion-based approach to improve feature adaptation between different perspectives, and SphereVLAD [55] provides a viewpoint-invariant place descriptor by combining spherical harmonics [56] and sequential matching [40].

Despite the success of existing place recognition methods, the non-learning-based approaches are sensitive to parameter tuning under different scenarios, and learning-based techniques are trained in a supervised manner, which restricts their generalization ability to the offline training datasets. However, in real-world localization tasks, the data stream is infinite and combines different areas under varying environmental conditions; meanwhile, robotic systems can’t stop and wait for the network model to update for newly encountered scenarios. In this work, we target lifelong learning, where the place observations can be viewed only once and in sequential order [16]. In lifelong localization, one important evaluation criterion is the re-localization ability after long-term navigation (i.e., the degree of catastrophic forgetting), while most existing place recognition methods mainly focus on short-term localization or fixed-pattern localization [60]. To the best of our knowledge, this is the first work to handle real-world long-term/large-scale lifelong localization.

II-B Lifelong Learning for Robotics

Lifelong learning, also known as continual learning, aims at providing incrementally updated knowledge in ever-changing environments. Though this area has been studied for a long time, most approaches are still restricted to simulation or toy datasets [16] and cannot be applied in real robotic applications. As mentioned in [33], the fundamental challenge for lifelong learning is not necessarily finding solutions that work in the real world but rather finding stable algorithms that can learn in the real world and overcome the catastrophic forgetting problem. Recent works can be roughly divided into four families: dynamic architecture, regularization-based, rehearsal, and generative replay approaches.

Fig. 2: Lifelong Localization System Framework. The lifelong localization system contains the following modules: 1) the general place encoder within the BioSLAM network, which extracts the place feature from different domains; 2) the dual-memory lifelong learning mechanism within the BioSLAM network can provide short-term and long-term assistance to capture new knowledge and maintain old knowledge.

Dynamic architecture-based methods either 1) add additional parameters to the models, such as LwF [34], which uses shared early feature extraction layers and fixed task layers; or 2) use model adaptation to avoid catastrophic forgetting, such as PackNet [38], which defines a mask layer to protect weights when learning new tasks. Regularization-based methods add constraints to avoid overfitting to new tasks and to preserve inference ability on previous tasks, such as Elastic Weight Consolidation (EWC) [29] and Synaptic Intelligence (SI) [61]. However, the above methods must deal with specific network structures and can quickly converge to undesired local optima for complex tasks. Rehearsal-based methods, on the other hand, use memory replay to reinforce the knowledge from previous tasks or processes; examples include iCaRL [42] and GEM [36], which use a small subset of the previous dataset to balance the knowledge distribution across tasks. Instead of maintaining the knowledge based on past data samples, generative replay [52] combines the actual raw data and generated artificial data for model updating. In [13], the authors use a dual teacher-student generative replay method for incremental learning, where the teacher network is frozen to guide new networks, and the networks switch roles when the student network surpasses the teacher.

Hence, the ideal approach would tackle the real-world localization problem on an embodied platform: an autonomous agent that can efficiently and incrementally update its localization ability with limited computational resources. Our BioSLAM method combines rehearsal, generative replay mechanisms, and dedicated dynamic and static memories to tackle long-term complex environments.

III Problem Formulation & System Overview

In this work, BioSLAM represents an incremental place recognition method, which includes: 1) a general place feature extraction module to encode place features under different domains, and 2) the bio-inspired lifelong memory system for online adaptation in place recognition. In this section, we will first formulate the problem in lifelong localization, then briefly introduce the two key modules in the BioSLAM system.

III-A Problem Formulation

We define a sequence of place observations under domain $D$ (e.g., visual, LiDAR, etc.) as $O^{D} = \{O_{1}, \ldots, O_{M}\}$, and a set of query observations under the same domain as $Q^{D} = \{Q_{1}, \ldots, Q_{N}\}$. The task of traditional place recognition is to learn a feature extraction function $F$ with parameters $θ$ that helps each frame in $Q^{D}$ find the matched (positive) place from the reference set $O^{D}$. Let $d (\cdot, \cdot)$ denote the distance metric (e.g., Euclidean distance). The objective is to make the feature distance to positive (matched) places smaller than that to negative (unmatched) places under the feature extraction function $F_{θ^{*}}$.

	$L (Q_{k}^{D}) = d (F_{θ} (Q_{k}^{D}), F_{θ} (O_{\approx}^{D})) - d (F_{θ} (Q_{k}^{D}), F_{θ} (O_{\neq}^{D}))$
	$θ^{*} = \arg\min_{θ} \sum_{k = 1}^{N} L (Q_{k}^{D})$		(1)

where $Q_{k}^{D}$ is the current $k$ -th query, $O_{\approx}^{D}$ is the positive reference within a predefined neighbor range (e.g. 3m) near $Q_{k}^{D}$ , and $O_{\neq}^{D}$ is the negative reference away from $Q_{k}^{D}$ more than a predefined threshold (e.g. 10m).
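
As a minimal sketch of Eq. 1, the following Python snippet splits the references into positives and negatives by ground-truth position and evaluates the loss on precomputed descriptors. The 3m and 10m thresholds are the examples given above; the function names, the toy positions and descriptors, and the choice of the first positive and first negative are illustrative assumptions, not the authors' implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance d(., .) between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split_references(query_pos, ref_positions, pos_range=3.0, neg_range=10.0):
    """Indices of positives (within pos_range) and negatives (beyond neg_range)."""
    positives, negatives = [], []
    for idx, ref_pos in enumerate(ref_positions):
        dist = euclidean(query_pos, ref_pos)
        if dist <= pos_range:
            positives.append(idx)
        elif dist > neg_range:
            negatives.append(idx)
        # references between the two thresholds are ignored
    return positives, negatives

def eq1_loss(query_desc, pos_desc, neg_desc):
    """L(Q_k) = d(F(Q_k), F(O_pos)) - d(F(Q_k), F(O_neg))."""
    return euclidean(query_desc, pos_desc) - euclidean(query_desc, neg_desc)

# Hypothetical toy data: 2D positions in metres and 2D descriptors.
ref_positions = [(0.0, 0.0), (2.0, 0.0), (6.0, 0.0), (20.0, 0.0)]
ref_descs = [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5), (0.0, 1.0)]
query_pos, query_desc = (1.0, 0.0), (1.0, 0.0)

positives, negatives = split_references(query_pos, ref_positions)
loss = eq1_loss(query_desc, ref_descs[positives[0]], ref_descs[negatives[0]])
```

A negative loss means the query descriptor is already closer to its positive than to its negative; minimizing the sum over all queries pushes every query toward that state.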

In lifelong localization, both the reference set $O_{t}^{D}$ and the query set $Q_{t}^{D}$ are obtained incrementally, and the environmental domains can also vary with environmental conditions (illumination, weather, etc.) or sensor modalities. As depicted in Fig. 2, the lifelong localization problem is to incrementally learn and update the feature extraction function $F_{θ}$ so that it quickly adapts to the newest domain $\{O^{D_{T}}, Q^{D_{T}}\}$ while maintaining its feature discriminability for previous domains $\{O^{D_{t}}, Q^{D_{t}}\} |_{t = 1, \ldots, T - 1}$. Since we are considering continual learning [16], raw data of different domains are fed sequentially for one-time usage and cannot be stored for offline training. Thus, when optimizing the feature extraction function $F_{θ_{T}}$ at the current domain $\{O^{D_{T}}, Q^{D_{T}}\}$, we cannot access the previous raw data from $\{O^{D_{t}}, Q^{D_{t}}\} |_{t = 1, \ldots, T - 1}$. Recalling Eq. 1, lifelong localization at time step $T$ can be formulated as,

	$L^{t} (Q_{k}^{D_{t}}) = d (F_{θ} (Q_{k}^{D_{t}}), F_{θ} (O_{\approx}^{D_{t}})) - d (F_{θ} (Q_{k}^{D_{t}}), F_{θ} (O_{\neq}^{D_{t}}))$
	$θ_{T} = \arg\min_{θ} \sum_{t = 1}^{T} \sum_{k = 1}^{N} L^{t} (Q_{k}^{D_{t}})$		(2)

III-B General Place Feature Extraction

For the lifelong purpose of long-term localization, we developed a General Place Descriptor (GPD) based on our previous works in visual [57] and LiDAR-based [56] localization, which serves as the feature extraction function $F_{θ}$ in Eq. 2. We use a shared spherical convolution network to handle LiDAR and visual place localization with the same model. The spherical harmonic-based convolution gives the learned descriptor a viewpoint-invariant property when recognizing the same place. The major difference between the current GPD and our previous works is that GPD does not contain any domain-transfer module, which was previously used to reduce the feature differences for the same areas under different domains [57]. We make this modification because we want to evaluate the adaptation ability of the same network. Likewise, there are no task-specific network layers as used in the dynamic architecture-based lifelong modules described in section II-B. We want to avoid uncertain parameters and focus only on how memory mechanisms can help incremental learning for real-world applications.

Fig. 3: BioSLAM Network Structure. The structure of BioSLAM includes the General Place Learner (GPL) network, the rewarding mechanism to guide memory storing and consolidation, and the dual-memory module with static and dynamic memory zones. New memory encoding proceeds as follows: 1) new observations $q_{k}$ are fed into the networks only once and in a sequential manner; 2) in GPL, the memory encoder converts the inputs $q_{k}$ to encoded memory $z_{k}$, followed by spherical convolution and a VLAD layer to generate the place feature descriptor $F_{k}$; 3) the rewarding mechanism estimates the external reward $R_{ex}$ and the internal reward $R_{in}$ to guide the memory operations; 4) the dual-memory module conducts memory storing/consolidation and retrieves (replays) the important (high-reward) memory for generative replay; 5) the GPD network module is updated with both observations and replayed memories.

III-C Bio-inspired Lifelong Memory

Inspired by the memory system in humans and other mammals [7], we provide a dual-memory (i.e., dynamic memory zone (DMZ) and static memory zone (SMZ)) enhanced lifelong learning mechanism to deal with catastrophic forgetting in continual localization. As studied in [48], to create long-term memories, the brain goes through a so-called sleep cycle during sleep:

1) the brain encodes daily observations into the hippocampus zone, where they decay over time;
2) then, a consolidation mechanism is triggered between the hippocampus and the neocortex to store essential memory traces and forget the rest traces;
3) finally, humans can retrieve the relative memory traces based on the consolidated ones in the neocortex.

In BioSLAM, we rebuild the ‘sleep cycle’ for the lifelong localization task. As shown in Fig. 2, the memory system of BioSLAM also includes the place feature ‘encoding’ procedure for new observations, the memory ‘consolidation’ controlled by a behavior cost module that selects the necessary traces for longer-term storage, and the ‘retrieved’ memory that reinforces the long-term place recognition ability. Based on the above architecture, BioSLAM comprises two major systems, the General Place Learning (GPL) system and the Bio-inspired Lifelong Memory (BiLM) system, which are investigated in section IV and section V respectively.
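
The encode/consolidate/retrieve cycle can be pictured with a toy buffer. The sketch below is not BioSLAM's memory system, which is detailed in section V; it only assumes, for illustration, that each trace carries a scalar reward, that the short-term zone is a fixed-size queue, and that consolidation keeps the highest-reward traces. The class name, buffer sizes, and ranking rule are all hypothetical.

```python
from collections import deque

class ToyDualMemory:
    """Hypothetical two-zone buffer: a short-term queue and a reward-ranked store."""

    def __init__(self, dmz_size=4, smz_size=3):
        self.dmz = deque(maxlen=dmz_size)  # short-term zone: oldest traces fall out
        self.smz = []                      # long-term zone: kept sorted by reward
        self.smz_size = smz_size

    def encode(self, trace, reward):
        """Store a new (reward, trace) pair in the short-term zone."""
        self.dmz.append((reward, trace))

    def consolidate(self):
        """Keep only the highest-reward traces long term, then clear the short-term zone."""
        merged = sorted(self.smz + list(self.dmz), key=lambda item: item[0], reverse=True)
        self.smz = merged[: self.smz_size]
        self.dmz.clear()

    def retrieve(self, count):
        """Return the `count` highest-reward traces across both zones for replay."""
        pool = sorted(self.smz + list(self.dmz), key=lambda item: item[0], reverse=True)
        return [trace for _, trace in pool[:count]]

memory = ToyDualMemory()
for trace, reward in [("a", 0.1), ("b", 0.9), ("c", 0.4), ("d", 0.7), ("e", 0.2)]:
    memory.encode(trace, reward)
memory.consolidate()
replay_before = memory.retrieve(2)
memory.encode("f", 0.8)
replay_after = memory.retrieve(2)
```

In this toy run the oldest trace is dropped from the short-term queue before consolidation, and a newly encoded high-reward trace becomes available for replay immediately, before the next consolidation.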

IV General Place Learning

As shown in Fig. 3, the general place learning (GPL) system (blue dashed box) mainly contains two sub-modules: a place memory encoding module (upper part of the blue dashed box) and a generative memory replay module (lower part). All the data under different domains (LiDAR inputs or visual inputs under day/night conditions) are fed into the system sequentially and only once during the online training procedure. The GPL system uses symmetric encoder-decoder networks to encode new observations and decode memory traces. In this section, we introduce the design of the encoder, the decoder, and the place feature learning within the GPL system, and leave the memory system to the next section.

IV-A Place Memory Encoding

GPL applies the encoder module $E$, built on VGG [47]-based networks, to convert raw sensor observations into ‘memory codes’, which are also the basic materials of the BiLM system. In parallel, GPL constructs a decoder module that mirrors the encoder and reconstructs stored memory into synthetic observations. In lifelong localization, both viewpoint differences and environmental appearance changes affect the final localization performance in real-world applications. Based on the orientation-equivariant property of spherical harmonics, we use spherical convolution [57, 56] in the encoder module to provide a viewpoint-invariant descriptor and reduce the effect of viewpoint differences in long-term re-localization. As shown in Fig. 3, the extracted place feature descriptor is not involved in the BiLM memory system, which indicates that our lifelong localization is mainly designed for long-term domain differences; this also helps simplify quantitative and qualitative analysis.

The GPL system encodes both panorama camera images and 3D local point clouds with the same encoding network structure $E$. For the visual inputs, we convert the raw image to $[H \times W]$ spherical perspectives, which are fed into the encoder module for memory storage and place descriptor extraction. For the LiDAR inputs, instead of a single scan, we generate dense local 3D maps using a voxel mapping mechanism similar to that of our previous work [58] and map the points onto spherical projections, which have the same omnidirectional view as a panorama camera. The default size for visual and LiDAR views is $H = W = 64$, with $3$ channels for both visual and LiDAR inputs (for LiDAR, the depth channel, ranging from $0$ to $1$, is repeated three times). We can obtain the encoded ‘memory’ $z_{k}$ from observation $q_{k}$ by,

	$z_{k} = E (q_{k})$		(3)
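
The LiDAR input preparation described above (points mapped onto an $H \times W$ spherical view, with depth scaled to $[0, 1]$ and repeated over three channels) can be sketched as follows. This is an illustrative sketch only: the maximum range used for normalization, the rule of keeping the closest return per cell, and the function names are assumptions not taken from the paper, and the local map accumulation step is omitted.

```python
import math

def spherical_projection(points, height=64, width=64, max_range=50.0):
    """Project 3D points onto an H x W spherical depth image with values in [0, 1].

    max_range is a hypothetical normalization constant, not a value from the paper.
    """
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0 or r > max_range:
            continue
        azimuth = math.atan2(y, x)   # angle around the vertical axis, in [-pi, pi]
        polar = math.acos(z / r)     # angle from the vertical axis, in [0, pi]
        col = min(int((azimuth + math.pi) / (2.0 * math.pi) * width), width - 1)
        row = min(int(polar / math.pi * height), height - 1)
        depth = r / max_range
        # Keep the closest return per cell; 0.0 marks an empty cell.
        if image[row][col] == 0.0 or depth < image[row][col]:
            image[row][col] = depth
    return image

def to_three_channels(image):
    """Repeat the depth channel three times to match the 3-channel visual input."""
    return [[row[:] for row in image] for _ in range(3)]

# Hypothetical points: two along +x (same cell), one along +y, one straight up.
points = [(10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 5.0, 0.0), (0.0, 0.0, 25.0)]
depth = spherical_projection(points)
tensor = to_three_channels(depth)
```

The resulting `tensor` has shape 3 x 64 x 64, the same layout as a spherical image from the panorama camera, which is what allows both modalities to share one encoder.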

To extract the orientation-invariant place descriptor from $z_{k}$, we use spherical convolution based on spherical harmonics [18]. In theory, spherical convolution can avoid space-varying distortions in Euclidean space by convolving spherical signals in the harmonic domain. Let $f$ be a spherical signal that satisfies the orientation-equivariant [14] property with the signal $E$,

	$[f ⋆_{SO(3)} [H_{R} E]] (q_{k}) = [H_{R} [f ⋆_{SO(3)} E]] (q_{k})$		(4)

where $H_{R} (R \in S O (3))$ is the rotation operator for spherical signals. $f ⋆_{S O (3)} E$ denotes the spherical convolution between $f$ and $E$ . Practically, the spherical convolution is computed in three steps. We first expand $f$ and $E (q_{k})$ to their spherical harmonic basis, then compute the point-wise product of harmonic coefficients, and finally invert the spherical harmonics.
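
The three steps can be illustrated with a one-dimensional analogue. The sketch below is not the $SO(3)$ spherical convolution used in BioSLAM: it works on a ring of samples, where the discrete Fourier basis plays the role of the spherical harmonics and a cyclic shift plays the role of the rotation $H_{R}$. It then checks numerically the equivariance stated in Eq. 4. All names and the toy signals are illustrative assumptions.

```python
import cmath

def dft(signal, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), sufficient for a small demo."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, value in enumerate(signal):
            acc += value * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def harmonic_convolution(f, g):
    """Step 1: expand both signals; step 2: multiply coefficients; step 3: invert."""
    f_hat, g_hat = dft(f), dft(g)
    product = [a * b for a, b in zip(f_hat, g_hat)]
    return [value.real for value in dft(product, inverse=True)]

def rotate(signal, shift):
    """A rotation of the ring is a cyclic shift of its samples."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift] if shift else signal[:]

kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 4.0]

conv_then_rotate = rotate(harmonic_convolution(kernel, signal), 3)
rotate_then_conv = harmonic_convolution(kernel, rotate(signal, 3))
max_gap = max(abs(a - b) for a, b in zip(conv_then_rotate, rotate_then_conv))
```

Here `max_gap` is zero up to floating-point error, i.e. convolving a rotated signal gives the rotated convolution. An orientation-invariant descriptor is then obtained by aggregating such equivariant features, which in BioSLAM is the role of the VLAD layer.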

Let $V$ be the unsupervised VLAD layer [2]. Then the feature extraction function $F_{θ} = V \circ [f ⋆_{SO(3)} E]$ is orientation-invariant, where $θ$ denotes the learnable parameters of the feature extraction function. Given the data sample $q_{k}$, the place descriptor (or learned feature) $F_{k}$ can be written as

	$F_{k} = F_{θ} (q_{k}) = V \circ [f ⋆_{SO(3)} z_{k}]$		(5)

For details about the viewpoint-invariant analysis, please refer to our previous works [57, 56].

The above procedure corresponds to the biological ‘encoding’ procedure within the ‘sleep cycle’ mentioned in section III, and the extracted ‘memory codes’ $z_{k}$ are used for ‘memory consolidation’ in section V and for ‘retrieval’ in the generative replay of section IV-B.

IV-B Generative Memory Replay for Place Recognition

As depicted in Fig. 3, in our generative memory replay for the lifelong localization task, the data stream under different environmental conditions is fed into the system sequentially, and we generate synthetic samples from stored place memories through a deep generative memory replay framework. In particular, the retrieved ‘memories’ include the abstracted place latent codes under different domains, as shown in Fig. 4, which forces the generative memory replay to draw a portion of historical samples from all domains in order to maintain the localization performance. The next section relates the memory extraction mechanism to the BiLM memory management.

To ensure the generalization ability of synthetic samples, we use a generative adversarial network (GAN) to reduce the distribution difference between the raw data and the synthetic samples, in parallel with an $L_{1}$ reconstruction loss between the encoder and decoder modules. The GAN-based generative model defines a zero-sum minimax game between the memory decoder $G$ and the discriminator $D$, as stated in [22]; the objective function is defined by,

	$\min_{G} \max_{D} L_{gan} (G, D) = \min_{G} \max_{D} E_{q \sim P_{data}} [\log D (q)] + E_{z^{'} \sim P_{z}} [\log (1 - D (G (z^{'})))]$		(6)

where $P_{z}$ is the memory buffer retrieved from the BiLM system, and $P_{data}$ is the set of newly observed data samples. $L_{gan}$ denotes the generator and discriminator losses [22]. The detailed generative replay strategy for lifelong learning can be found in [46, 51].
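
To make the two terms of Eq. 6 concrete, the sketch below evaluates a sample estimate of the value function from discriminator outputs alone. The discriminator and decoder networks are not modelled; the function name and the probability values are made up for illustration.

```python
import math

def gan_value(d_real, d_fake):
    """Sample estimate of E[log D(q)] + E[log(1 - D(G(z')))].

    d_real: discriminator outputs D(q) on new observations, each in (0, 1).
    d_fake: discriminator outputs D(G(z')) on decoded memories, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from replayed samples well ...
confident = gan_value([0.9, 0.8], [0.1, 0.2])
# ... versus one that cannot tell them apart and outputs 0.5 everywhere.
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to raise this value while the memory decoder tries to lower it; when the decoder's samples are indistinguishable from real observations the value settles at $-2 \log 2$, which is what `fooled` evaluates to.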

Given the combination of both new streaming data and retrieved synthetic samples $\{P_{data}, P_{z}\}$, we can obtain the observation pair $\{Q, O\}$ with tuple sets $(q_{k}, \{o_{k}^{pos}\}, \{o_{k}^{neg}\})$, where for each query sample $q_{k}$ we have a set of potential positives (close-by samples) $\{o_{k}^{pos}\}$ and a set of negatives (far-away samples) $\{o_{k}^{neg}\}$. The localization loss metric is defined by:

	$L_{loc} (q_{k}) = \max_{i, j} (∥ F (q_{k}) - F (\{o_{k}^{pos}\}_{i}) ∥^{2} + α - ∥ F (q_{k}) - F (\{o_{k}^{neg}\}_{j}) ∥^{2}, 0)$		(7)
	$L_{loc} = E_{q_{k} \sim P_{data}} [L_{loc} (q_{k})] + E_{z^{'} \sim P_{z}} [L_{loc} (G (z^{'}))]$		(8)

The above equation is the triplet-loss version of Eq. 1, where $L_{loc} (q_{k})$ is the localization loss for a single query $q_{k}$, $α > 0$ is a margin that controls the feature difference threshold, and $L_{loc}$ is the localization loss over the joint $\{P_{data}, P_{z}\}$ sets.
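
A minimal sketch of Eqs. 7 and 8 on precomputed descriptors is given below. It takes the maximum hinge violation over all positive/negative pairs of a query and then adds the mean loss over new observations to the mean loss over replayed samples. The margin value, the toy descriptors, and the function names are illustrative assumptions; in the full system the replayed queries would be decoded memories $G (z^{'})$ passed through the feature extractor.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def localization_loss(query, positives, negatives, margin=0.5):
    """Eq. 7: largest hinge violation over all (positive, negative) pairs."""
    worst = 0.0
    for pos in positives:
        for neg in negatives:
            violation = squared_distance(query, pos) + margin - squared_distance(query, neg)
            worst = max(worst, violation)
    return worst

def joint_localization_loss(real_triplets, replayed_triplets, margin=0.5):
    """Eq. 8: mean loss on new observations plus mean loss on replayed samples."""
    def mean_loss(triplets):
        return sum(localization_loss(q, p, n, margin) for q, p, n in triplets) / len(triplets)
    return mean_loss(real_triplets) + mean_loss(replayed_triplets)

query = (0.0, 0.0)
# A negative at (0.5, 0) sits too close to the query, so the margin is violated.
hard = localization_loss(query, [(0.1, 0.0), (0.3, 0.0)], [(1.0, 0.0), (0.5, 0.0)])
# A single far-away negative satisfies the margin, so the loss is zero.
easy = localization_loss(query, [(0.1, 0.0)], [(2.0, 0.0)])

total = joint_localization_loss(
    real_triplets=[(query, [(0.1, 0.0), (0.3, 0.0)], [(1.0, 0.0), (0.5, 0.0)])],
    replayed_triplets=[(query, [(0.1, 0.0)], [(2.0, 0.0)])],
)
```

Only queries whose negatives come within the margin of their positives contribute to the loss, so well-separated places cost nothing.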

Fig. 4: Generative Memory Replay within the GPL system. In lifelong localization, new observations $O^{D_{t}}$ under domains $D_{t}$ are streamed into the BioSLAM system sequentially. GPL’s generative memory replay module can generate synthetic samples $G (E (O^{D_{t}}))$ from stored memories.

To keep the consistency of memory encoding-decoding, we further use a reconstruction loss of the memory $z^{'}$ and the generated memory $E (G (z^{'}))$ with,

	$L_{rec} = E_{z^{'} \sim P_{z}} [∥ E (G (z^{'})) - z^{'} ∥]$		(9)

And the joint loss metric for the generative memory replay enhanced place recognition can be written as,

	$L_{joint} = L_{loc} + L_{rec} + L_{gan} (G, D)$		(10)
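
As a small numerical illustration of Eqs. 9 and 10, the sketch below computes the reconstruction term as a mean $L_{1}$ distance between stored codes and their re-encoded counterparts and sums the three terms without weights, as written in Eq. 10. The encoder and decoder are not modelled: the re-encoded codes $E (G (z^{'}))$ and the values of the other two terms are made-up placeholders.

```python
def reconstruction_loss(memories, reencoded):
    """Mean L1 distance between stored codes z' and re-encoded codes E(G(z'))."""
    total = 0.0
    for z, z_hat in zip(memories, reencoded):
        total += sum(abs(a - b) for a, b in zip(z, z_hat))
    return total / len(memories)

def joint_loss(loc, rec, gan):
    """Eq. 10: unweighted sum of localization, reconstruction, and GAN terms."""
    return loc + rec + gan

# Hypothetical stored codes and what the decode-encode round trip returned.
memories = [(1.0, 2.0), (0.0, 0.0)]
reencoded = [(1.5, 2.0), (0.0, -1.0)]

rec = reconstruction_loss(memories, reencoded)
total = joint_loss(loc=0.34, rec=rec, gan=-1.0)
```

A reconstruction term of zero would mean that decoding a stored memory and encoding it again returns exactly the same code, which is the consistency this term is meant to encourage.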

The major difference between our work and traditional generative replay [46] is that BioSLAM manages the retrieved memory based on its long-term behavior instead of treating all data as lying on the same manifold distribution. The next section examines the lifelong memory system in detail.

V Bio-inspired Lifelong Memory

As analyzed in section II-B, most current lifelong learning methods target toy examples and do not generalize to complex real-world environments. In BioSLAM, as shown in the green dashed box of Fig. 3, the bio-inspired lifelong memory system mainly contains two modules: 1) a behavior configurator module that arranges memory consolidation and selection based on the importance of memory traces (measured by reward calculation) to long-term place recognition; and 2) a dual-memory module, comprising a static memory zone and a dynamic memory zone, that cooperates with the behavior configurator for long-/short-term memory storage and importance-based retrieval with limited space usage.

V-A Behavior Configurator

When the memory system encounters new place ‘memory traces’ $z$, we define a hybrid cost to control the learning behavior: an external reward $R_{ex}$, which indicates localization ability, and an internal reward $R_{in}$, which represents the intrinsic familiarity with the observations.

V-A1 External Reward

The external reward is related to the learning difficulty of new data samples, which indicates how distinguishable they are in the place recognition task. In the standard training paradigm, samples of all difficulty levels are used almost equally to optimize the model. However, humans and animals spend more energy and time learning more complex concepts. Inspired by animal training [31] and curriculum learning [6], it is practically useful to sort the data samples into different difficulty levels, i.e., “easy”, “medium” and “hard”. For lifelong localization, the “hard” samples may need more ‘energy’ to encode in the training model $F$, i.e., more retrievals in the replay procedure described in section IV-B. To give the “harder” samples a higher chance of re-training, we use the triplet loss to measure the distinguishability of features. Based on the place recognition loss metric $L_{loc}$, we define the external reward for every single query as,

	$R_{ex}(q_{k}) = L_{loc}(q_{k})$	(11)

which means that if query $q_{k}$ has a higher loss than other queries, it requires more ‘energy’, i.e., more iterations in model training. “Harder” samples tend to have higher $R_{ex}$, so the BiLM memory system has a higher chance of retrieving them for memory replay, as stated in section V-B4.

V-A2 Internal Reward

The internal reward is related to the robustness of feature representations. Let $A(q_{k})$ denote a data augmentation (i.e., random rotation and random translation) of query $q_{k}$. The internal reward $R_{in}$ for query $q_{k}$ is defined as the cosine distance between the encoded features of the query and those of its augmented version,

	$R_{in}(q_{k}) = 1 - \frac{E(q_{k}) \cdot E(A(q_{k}))}{\| E(q_{k}) \|_{2} \cdot \| E(A(q_{k})) \|_{2}}$	(12)

The internal reward $R_{in}$ also indicates the network’s familiarity with the observations. Familiar observations are common in large-scale structured areas, such as LiDAR-based place recognition in city-scale environments. In that case, similar place patterns (street views, buildings, trees) are frequently visited from different views, so the encoder $E$ has a robust representation of, and hence a lower internal reward for, frequently visited places. Thus, the internal reward $R_{in}$ can serve as an indicator that guides the memory system on whether or not to pay more attention to such areas, providing an intrinsic property analysis for the memory encoder.

The final reward for $q_{k}$ can be obtained by combining the external reward with the internal reward,

	$R_{k} = R_{ex}(q_{k}) + R_{in}(q_{k})$	(13)

Based on this rewarding mechanism, we can evaluate all the queries and obtain a set of memory traces $m_{k} = (z_{k}, p_{k}, R_{k})$, where $p_{k}$ is the estimated location of $q_{k}$ from our previous re-localization system [57, 54]. $m_{k}$ is then the main factor used in the memory operations of section V-B.
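The reward of Eq. 11 to Eq. 13 can be sketched as follows, assuming the localization loss and the two encoded feature vectors (of the query and of its augmented copy) have already been computed elsewhere. The function names and the list-based vectors are hypothetical, and both vectors are assumed to have non-zero norm.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity, as in Eq. 12 (assumes non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def total_reward(
    loc_loss: float,
    z: Sequence[float],
    z_augmented: Sequence[float],
) -> float:
    """R_k = R_ex + R_in (Eq. 13).

    loc_loss    : localization loss of the query, used directly as R_ex (Eq. 11)
    z           : encoder output E(q_k)
    z_augmented : encoder output E(A(q_k)) for the augmented query
    """
    r_ex = loc_loss
    r_in = cosine_distance(z, z_augmented)
    return r_ex + r_in
```

A query that is both hard to localize and unstable under augmentation therefore receives the largest reward, and is the most likely to be replayed.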

V-B Dual-Memory & Memory Operations

Human memory relies on long-term memory (the neocortex) and short-term memory (the hippocampus) mechanisms within the brain. BioSLAM constructs a similar paired dual-memory mechanism:

Static Memory $M_{S}$ is similar to the long-term memory of human beings and belongs to the rehearsal-based mechanisms with large storage for lifelong learning. Static memory stores the selected memory traces $\{m_{k}\}$ in the static memory zone through memory consolidation.
Dynamic Memory $M_{D}$ is similar to the short-term memory of human beings: a quick-access memory holding a portion of the pre-stored historical memory traces. Dynamic memory is automatically refreshed from the static memory and connected to the memory decoder module; it belongs to the generative replay mechanisms with a small memory buffer for lifelong learning.

Based on the dual-memory structure, we construct two important operations for static memory: memory consolidation and forgetting, and two important operations for dynamic memory: memory refreshing and memory replay.

Input: Static memory $M_{S}$, new memory traces $\{m_{k}\}$, maximum number of clusters $K_{max}$
Output: Updated static memory
1 Construct feature-spatial codes $\{c_{k}\} = \{z_{k}, p_{k} | m_{k}\}$;
2 Calculate clusters $\{S_{i}^{T}\}|_{i=1}^{K}$ and centroids $\{\mu_{i}^{T}\}|_{i=1}^{K}$ for $\{c_{k}\}$ based on Eq. 14;
3 Downsample within clusters, based on Eq. 15, to generate smaller clusters $\{\tilde{S}_{i}^{T}\}|_{i=1}^{K}$ and centroids $\{\tilde{\mu}_{i}^{T}\}|_{i=1}^{K}$;
4 Append new clusters $\{\tilde{S}_{i}^{T}\}|_{i=1}^{K}$ and centroids $\{\tilde{\mu}_{i}^{T}\}|_{i=1}^{K}$ to $M_{S}$;
5 Calculate the total cluster number $C^{(M_{s})}$ of $M_{S}$;
6 if $C^{(M_{s})} > K_{max}$ then
7     Memory Forgetting with Algorithm 2;
8 end
9 Update static memory $M_{S}$;

Algorithm 1 Memory Consolidation

V-B1 Static Memory Consolidation

As stated in [10], memory consolidation is a time-dependent process by which recently learned experiences are transformed into long-lasting forms to extend the long-term memory circle. In the long-term and large-scale place recognition task, observations may differ in both the spatial domain (Euclidean distance) and the feature domain (feature distance). The data stream is also unbounded in real-world navigation, so memory consolidation is essential to abstract concise representations and guarantee memory efficiency. To provide memory consolidation within the BioSLAM system, we construct a feature-spatial code $c_{k} = [z_{k}, p_{k}]$ for each memory trace $m_{k}$, which captures both spatial and feature properties.

Given time step $T$ and observations $\{q_{k}\}^{T} \subset Q^{D_{T}}$, the new memory traces $\{m_{k}\}^{T} = \{(z_{k}, p_{k}, R_{k})\}^{T}$ (Eq. 13 and Eq. 3) usually contain a large number of samples. We obtain a diverse and smaller subset of memory traces (abstraction) via K-means-based unsupervised clustering. K-means clustering partitions $\{m_{k}\}^{T}$ into $K$ sets $S^{T} = \{S_{1}^{T}, S_{2}^{T}, \dots, S_{K}^{T}\}$ by the feature-spatial code $c_{k} = [z_{k}, p_{k}]$ so as to minimize the following,

	$S^{T} = \arg\min_{S^{T}} \sum_{i=1}^{K} \frac{1}{|S_{i}^{T}|} \sum_{c_{x}, c_{y} \in S_{i}^{T}} \| c_{x} - c_{y} \|^{2}$	(14)
	$\mu_{i}^{T} = \frac{1}{|S_{i}^{T}|} \sum_{c_{x} \in S_{i}^{T}} c_{x}$

where $\mu_{i}^{T}$ is the centroid of cluster $S_{i}^{T}$. The memory traces within a cluster can be regarded as mutually homogeneous, so storing all of them is redundant and memory-exhaustive. Down-sampling is therefore applied within each cluster $S_{i}^{T}$ to restrict its number of samples, which also improves memory retrieval efficiency.

	$(\tilde{S}_{i}^{T}, \tilde{\mu}_{i}^{T}) = \mathrm{downsampling}(S_{i}^{T}, \mu_{i}^{T}), \quad |\tilde{S}_{i}^{T}| < N_{max}$	(15)

where $N_{max}$ is the predefined threshold for the maximum number of samples in each cluster. After sampling, we obtain $K$ smaller clusters $\tilde{S}^{T} = \{\tilde{S}_{1}^{T}, \tilde{S}_{2}^{T}, \dots, \tilde{S}_{K}^{T}\}$ and centroids $\tilde{\mu}^{T} = \{\tilde{\mu}_{1}^{T}, \tilde{\mu}_{2}^{T}, \dots, \tilde{\mu}_{K}^{T}\}$ from the new traces $\{m_{k}\}^{T}$.

BioSLAM combines the new clusters $\tilde{S}^{T}$ with the existing clusters from previous steps, so the clusters in the static memory are $S^{(M_{s})} = \{\tilde{S}^{1}, \tilde{S}^{2}, \dots, \tilde{S}^{T}\}$, with centroids $\mu^{(M_{s})} = \{\tilde{\mu}^{1}, \tilde{\mu}^{2}, \dots, \tilde{\mu}^{T}\}$. When the total cluster number $C^{(M_{s})} = |\mu^{(M_{s})}|$ exceeds the maximum threshold $K_{max}$, redundant similar clusters are removed to avoid memory overflow, as described in the Memory Forgetting section V-B2. The consolidation mechanism is shown in Algorithm 1.
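The clustering and down-sampling steps (Eq. 14 and Eq. 15) can be illustrated with a small, dependency-free sketch. It uses plain Lloyd iterations, which only reach a local minimum of the K-means objective, and it keeps the samples nearest to each centroid when down-sampling. The paper does not specify the down-sampling rule, so that choice, along with all function names and the omission of reward bookkeeping, is an assumption made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Code = Tuple[float, ...]  # feature-spatial code c_k = [z_k, p_k]


def sq_dist(a: Code, b: Code) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: Sequence[Code]) -> Code:
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(codes: Sequence[Code], k: int, iters: int = 20, seed: int = 0) -> List[List[Code]]:
    """Plain Lloyd iterations; returns the non-empty clusters."""
    rng = random.Random(seed)
    centers = rng.sample(list(codes), k)
    clusters: List[List[Code]] = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for c in codes:
            nearest = min(range(k), key=lambda i: sq_dist(c, centers[i]))
            clusters[nearest].append(c)
        # Keep the old center if a cluster received no points.
        centers = [centroid(cl) if cl else centers[i] for i, cl in enumerate(clusters)]
    return [cl for cl in clusters if cl]


def downsample(cluster: List[Code], n_max: int) -> List[Code]:
    """Keep fewer than n_max codes (Eq. 15), nearest to the centroid first.

    The nearest-to-centroid rule is an assumption, not the paper's rule.
    """
    if n_max < 2:
        raise ValueError("n_max must be at least 2 so a cluster keeps one sample")
    mu = centroid(cluster)
    return sorted(cluster, key=lambda c: sq_dist(c, mu))[: n_max - 1]


def consolidate(traces, k: int, n_max: int):
    """traces: iterable of (z_k, p_k) pairs; returns [(cluster, centroid), ...]."""
    codes = [tuple(z) + tuple(p) for z, p in traces]
    kept = [downsample(cl, n_max) for cl in kmeans(codes, k)]
    return [(cl, centroid(cl)) for cl in kept]
```

In practice the feature and position parts of the code would likely need to be scaled relative to each other before clustering; the sketch leaves them as given.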

Input: Static memory $M_{S}$, maximum number of clusters $K_{max}$
1 Load clusters $S^{(M_{s})}$ and centroids $\mu^{(M_{s})}$ from static memory $M_{S}$;
2 Calculate the number of forgettable clusters $K^{*} = |\mu^{(M_{s})}| - K_{max}$;
3 Calculate the distance matrix $d_{(i, j)}$ between every two clusters based on Eq. 16;
4 repeat $K^{*}$ times
5     Find the most similar cluster pair $(i^{*}, j^{*})$ based on Eq. 17;
6     Remove cluster $i^{*}$ from static memory $M_{S}$ and distance matrix $d_{(i, j)}$;
7 end

Algorithm 2 Memory Forgetting

V-B2 Static Memory Forgetting

As stated in the last section, the space within static memory is bounded in long-term lifelong learning. As a core operation on static memory, memory forgetting eliminates redundant memory clusters that are too similar to other existing clusters. If the current cluster number exceeds the maximum number of clusters, $C^{(M_{s})} > K_{max}$, the memory forgetting mechanism removes $K^{*} = C^{(M_{s})} - K_{max}$ clusters. We first calculate the cluster similarity from the distance matrix $d_{(i, j)}$ between every two cluster centroids.

	$d_{(i, j)} = \| \mu_{i} - \mu_{j} \|, \quad \forall \mu_{i}, \mu_{j} \in \mu^{(M_{s})}$	(16)

Then we find corresponding cluster pairs $(i^{*}, j^{*})$ with the minimum distance and remove one of the clusters $i^{*}$ from the selected pairs.

	$(i^{*}, j^{*}) = \arg\min_{i, j} d_{(i, j)}$	(17)

The removal process will be repeated $K^{*}$ times. In this manner, we can efficiently keep the diversity of memory clusters and eliminate the ‘redundant’ clusters. The memory forgetting mechanism is shown in Algorithm 2.
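The forgetting loop can be sketched as a greedy procedure over centroids, assuming clusters are identified by integer ids and that only distinct pairs are compared. For brevity the sketch recomputes pairwise distances in every round instead of maintaining the distance matrix of Eq. 16. The function name and dictionary layout are hypothetical.

```python
import math
from typing import Dict, Sequence


def forget(centroids: Dict[int, Sequence[float]], k_max: int) -> Dict[int, Sequence[float]]:
    """Drop one member of the closest centroid pair until k_max clusters remain."""
    if k_max < 1:
        raise ValueError("k_max must be at least 1")
    kept = dict(centroids)
    while len(kept) > k_max:
        ids = sorted(kept)
        # Eq. 16 and Eq. 17: closest pair of distinct centroids (Euclidean norm).
        i_star, _j_star = min(
            ((i, j) for pos, i in enumerate(ids) for j in ids[pos + 1:]),
            key=lambda pair: math.dist(kept[pair[0]], kept[pair[1]]),
        )
        del kept[i_star]  # remove cluster i*, keep its near-duplicate j*
    return kept
```

Because one cluster of each closest pair always survives, every removed cluster still has a nearby representative in static memory.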

V-B3 Dynamic Memory Refreshing

Dynamic memory is brief and storage-limited, just like human short-term memory. To replay important memory traces effectively from dynamic memory, we need to refresh it periodically by transferring memory traces from static memory. In the memory refreshing mechanism, dynamic memory $M_{d}$ obtains memory traces $\{m_{k}\}$ from static memory $M_{s}$ by importance sampling,

	$M_{d} = \mathrm{importance\_sampling}(\{m_{k}\}, \{w_{k}\})$	(18)
	$m_{k} = (z_{k}, p_{k}, R_{k}) \sim M_{s}, \quad w_{k} = \gamma^{n(m_{k})} \cdot R_{k}$

where the importance weights $w_{k}$ are determined by the reward $R_{k}$ and the time-decaying factor $\gamma^{n(m_{k})}$. $\gamma$ ($0 \leq \gamma \leq 1$) is a predefined decay parameter, and $n(m_{k})$ denotes the number of times the trace $m_{k}$ has been replayed (or revisited). On the one hand, traces with higher rewards have higher sampling weights: higher rewards mean lower localization ability and robustness, so BioSLAM needs to pay more attention to these samples. On the other hand, newer traces have higher sampling weights: because the network’s ability to learn from samples with many occurrences has reached an upper limit, there is no need to spend precious dynamic memory on samples that have already been replayed many times. The decay mechanism also encourages dynamic memory to stay curious about new traces. It is inspired by the decaying factor in human memory [8], which indicates that repeated learning of the same things yields a diminishing boost in memorization.
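A minimal sketch of the refresh step in Eq. 18 follows. It assumes sampling with replacement through Python's `random.choices`, a decay value of 0.9, and a separate list of replay counts; none of these details are specified in the paper, and the function names are illustrative.

```python
import random
from typing import List, Optional, Sequence, Tuple

Trace = Tuple[Sequence[float], Sequence[float], float]  # (z_k, p_k, R_k)


def importance_weights(
    rewards: Sequence[float],
    replay_counts: Sequence[int],
    gamma: float = 0.9,  # hypothetical decay value
) -> List[float]:
    """w_k = gamma ** n(m_k) * R_k, as in Eq. 18."""
    return [gamma ** n * r for r, n in zip(rewards, replay_counts)]


def refresh_dynamic_memory(
    static_traces: Sequence[Trace],
    replay_counts: Sequence[int],
    size: int,
    gamma: float = 0.9,
    seed: Optional[int] = None,
) -> List[Trace]:
    """Draw `size` traces from static memory in proportion to their weights.

    Sampling is with replacement (an assumption); random.choices raises
    ValueError if every weight is zero.
    """
    weights = importance_weights([t[2] for t in static_traces], replay_counts, gamma)
    rng = random.Random(seed)
    return rng.choices(list(static_traces), weights=weights, k=size)
```

With $\gamma < 1$, a trace replayed $n$ times has its weight scaled by $\gamma^{n}$, so a frequently replayed trace is gradually displaced by newer ones of similar reward.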

V-B4 Dynamic Memory Replay

During lifelong learning, BioSLAM retrieves memory traces from dynamic memory for the generative replay training described in section IV-B. In dynamic memory replay, we again use importance sampling to obtain the replayed memories $\{z_{k}'\}$ from dynamic memory $M_{d}$, with the same weights as in the memory refreshing mechanism.

	$\{z_{k}'\} = \mathrm{importance\_sampling}(\{z_{k}\}, \{w_{k}\})$	(19)
	$m_{k} = (z_{k}, p_{k}, R_{k}) \sim M_{d}, \quad w_{k} = \gamma^{n(m_{k})} \cdot R_{k}$

Then we use the memory decoder $G$ to generate replayed samples $\{\hat{q}_{k}\}$ from the memories $\{z_{k}'\}$,

	$\hat{q}_{k} = G(z_{k}')$	(20)

Both the new observations $\{q_{k}\}$ and the generated samples $\{\hat{q}_{k}\}$ are used to train the General Place Learner (GPL) network by minimizing the total loss in Eq. 10. The overall lifelong learning algorithm of BioSLAM is shown in Algorithm 3.

Input: Initial place feature extraction model $F_{\theta}$ with parameters $\theta$, initial static memory $M_{s} = \emptyset$ and dynamic memory $M_{d} = \emptyset$
1 for $T = 1, 2, \dots$ do
2     Obtain observation set $\{q_{k}\}$;
3     while not converged do
4         Generate replayed samples $\{\hat{q}_{k}\}$ from dynamic memory based on Eq. 20 and 19;
5         Calculate loss $L_{joint}$ using real samples $\{q_{k}\}$ and replayed samples $\{\hat{q}_{k}\}$ based on Eq. 10;
6         Calculate gradient $\frac{d L_{joint}}{d \theta}$, then optimize $F_{\theta}$ with parameters $\theta$ by gradient descent;
7     end
8     Calculate rewards $\{R_{k}\}$ of observations $\{q_{k}\}$ based on Eq. 13;
9     Static memory $M_{s}$ consolidation based on Algorithm 1;
10    Dynamic memory $M_{d}$ refreshing based on Eq. 18;
11 end

Algorithm 3 Lifelong Learning with BioSLAM

Fig. 5: Data Collection Platform. The platform can record omnidirectional visual inputs, Velodyne VLP-16 LiDAR inputs, and Xsens MTI IMU data on an Nvidia Jetson AGX Xavier. We utilize LiDAR odometry [62] to generate the relative odometry for each trajectory, and GNSS or Generalized-ICP [44] to estimate the relative transformation between different trajectories.

Fig. 6: City Dataset for Lifelong Localization. The City dataset includes 50 trajectories (110 km) within the city of Pittsburgh. The dataset includes three areas (colored in blue, yellow, and red) covering commercial buildings, parks, and residential areas.

Fig. 7: Campus Dataset for Lifelong Localization. For the Campus dataset, omnidirectional camera and LiDAR data are recorded for 2D-to-2D and 2D-to-3D place recognition within CMU. The data were collected from 08/2021 to 10/2021, mainly under normal daylight (2 pm to 5 pm) and dawn-light (5 am to 6 am or 7 pm to 8 pm).

VI Experiment Setup and Criteria

In this section, we introduce the experiment setup for lifelong localization. Unlike traditional localization tasks, lifelong localization requires recorded data that include either long-term differences or large-scale geometric differences. For this reason, we built our own data collection platform and our own lifelong localization datasets. We also briefly describe our evaluation metrics.

VI-A Data Collection Platform

Fig. 5 shows our data collection platform, which includes an omnidirectional camera, a Velodyne VLP-16 LiDAR device, an inertial measurement unit (Xsens MTI 30, ${0.5}^{\circ}$ error in roll/pitch, $1^{\circ}$ error in yaw, 550 mW), and an embedded GPU device (Nvidia Xavier, 8 GB memory). To collect time-synced LiDAR projections and omnidirectional images, we first generate dense 3D maps through a well-known LiDAR odometry method [62]. We then project the point cloud within a certain distance (30 m by default) onto spherical projections, which have the same perspective as the omnidirectional images. In lifelong localization, the same area is revisited under large-scale and long-term assumptions. To provide the relative ground-truth position between different visits, we rely on the GNSS system and Generalized-ICP [44] to estimate the relative transformation in outdoor environments, and mainly on Generalized-ICP in indoor environments. Please note that we cannot guarantee meter-level global absolute localization, but we can provide accurate relative localization, which is sufficient for the lifelong localization task. Based on the collected datasets, we have hosted a General Place Recognition Competition for long-term place recognition. For more details on the data collection platform and the datasets, please refer to our dataset paper (https://github.com/MetaSLAM/ALITA) and competition site (http://gprcompetition.com/).

Dataset	Environments	Scales (km)
City	Street, Residential, Terrain	$120 \times 1$
Campus	Campus area	$4.5 \times 8$

TABLE I: Comparison between different datasets.

VI-B Lifelong Localization Datasets

We intend to analyze lifelong performance from two perspectives: large-scale and long-term. To this end, our localization datasets include two tracks:

City dataset: shown in Fig. 6, which targets large-scale lifelong performance. We collected $50$ trajectories within the city of Pittsburgh. Since we mainly care about large-scale localization, we only collected LiDAR inputs within a short-term drive. The total trajectory distance for this dataset is $110$ km.
Campus dataset: shown in Fig. 7, which targets long-term lifelong performance. We selected $10$ trajectories within Carnegie Mellon University. Each trajectory is revisited $8$ times at different times of day and night to satisfy the long-term requirements.

For both datasets, we feed the sequential data stream into the BioSLAM training procedure as depicted in Fig. 2. Please note that each data sample is fed into the system only once, and BioSLAM does not save a copy of it. For the Campus dataset, the LiDAR inputs and the day- and night-time visual inputs are fed into the system one by one. For the City dataset, we only feed the continuous LiDAR inputs to the system.

VI-C Performance Evaluation

To evaluate the localization performance on the large-scale City dataset and the long-term Campus dataset under incremental learning, we divide each trajectory into individual segments and feed them into the different place recognition systems sequentially. We evaluate the online localization performance mainly through the Weighted Recall (WR) of the top-6 retrievals over incremental training, defined as $WR = \sum_{k=1}^{6} \omega_{k} r_{k}$ with $\omega_{1} = 0.5$ and $\omega_{k} = 0.1$ for $k \neq 1$, where $r_{k}$ ($1 \leq k \leq 6$) denotes the recall@k. Recall@k is the proportion of queries for which a matched reference is found in the top-$k$ retrievals. Specifically, a query image is deemed correctly localized (matched) if at least one of the top-$k$ retrieved reference images is within the predefined neighbor range (e.g., 3 m) of the ground-truth position.
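The metric can be sketched in a few lines, assuming each query is summarised by the distances in metres between its ground-truth position and its retrieved references, listed in rank order. That input layout, the function names, and the inclusive `<=` comparison against the 3 m radius are illustrative assumptions.

```python
from typing import Sequence


def recall_at_k(
    ranked_dists_m: Sequence[Sequence[float]],
    k: int,
    radius_m: float = 3.0,
) -> float:
    """Fraction of queries with at least one top-k retrieval within radius_m."""
    hits = sum(
        1 for dists in ranked_dists_m if any(d <= radius_m for d in dists[:k])
    )
    return hits / len(ranked_dists_m)


def weighted_recall(
    ranked_dists_m: Sequence[Sequence[float]],
    radius_m: float = 3.0,
) -> float:
    """WR = sum_{k=1..6} w_k * recall@k with w_1 = 0.5 and w_k = 0.1 otherwise."""
    weights = [0.5] + [0.1] * 5
    return sum(
        w * recall_at_k(ranked_dists_m, k, radius_m)
        for k, w in enumerate(weights, start=1)
    )
```

Since the weights sum to one, WR lies in $[0, 1]$, and half of the score depends on the top-1 retrieval alone.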

Fig. 8: Place recognition methods incrementally trained on trajectory observations from 3 different areas. The shaded region shows the standard deviation. (a) Weighted recall in different areas.

VI-D Baselines

Since our place recognition task involves different sensor modalities, non-learning/learning methods, and lifelong/non-lifelong methods, it is impossible to cover all the relevant state-of-the-art approaches. We focus on the performance comparison from a 2D perspective and ignore Point-like [50] 3D methods. For comparison, we select the following well-known non-learning methods (Bag-of-Words (BOW) [20], CoHOG [59]), learning-based methods (NetVLAD [2], RegionVLAD [27]), and lifelong learning methods (Generative Replay (GR) [45], Synaptic Intelligence (SI) [61]). Among these, GR and SI are the baselines most closely related to BioSLAM. Although BioSLAM and GR both use memory replay, BioSLAM has more efficient and effective replay mechanisms, because 1) BioSLAM replays samples according to reward (importance), while GR replays randomly and evenly; and 2) BioSLAM uses static memory to refresh the dynamic memory buffer, which keeps diverse and important memory traces and makes it easier to adapt to new trajectory observations.

VII Experiment Analysis

In this section, we analyze the lifelong place recognition results on both the large-scale City areas and the long-term Campus scenarios. As shown in [16], generative replay outperforms other continual learning approaches, and the training quality depends strongly on how well the generated samples represent the entire data distribution. In our BioSLAM system, this ability is determined by the generative memory replay described in section IV-B and our BiLM system described in section V. Specifically, we investigate how different methods handle geometric and domain changes and how BioSLAM achieves long-lasting memorization through its lifelong memory system.

Fig. 9: Comparison of BioSLAM and baselines in terms of recall@k on City dataset.

VII-A Large-scale City Place Recognition

We evaluate the performance of BioSLAM in a large-scale lifelong learning scenario with the City dataset. In localization under city-scale environments, robots may encounter multiple types of 3D geometric structures, such as open streets, bridges, parks, big buildings, and residential areas. We divide the $50$ trajectories with $120$ km distance within the city into $3$ different areas based on their geometric properties: area 1 for commercial buildings, area 2 for parks, and area 3 for residential districts. For the large-scale City dataset, observations from different areas can be treated as different domains $D_{t}$ in Eq. 2.

In the training procedure, we incrementally feed the place recognition methods with trajectory observations from the 3 different areas. Fig. 8(a) shows the weighted recall curves of trajectory observations within area 1, area 2, and area 3, respectively. Fig. 8(b) shows the average weighted recall curve of all trajectories during training. As can be seen, BioSLAM outperforms the other methods during training and is at least $14\%$ better than the baselines in terms of final average recall. More importantly, BioSLAM keeps the knowledge about previous trajectory observations when trained with new ones. For example, at epoch 240, when the training observations switch from area 2 to area 3, the performance drop on previous trajectories is much smaller for BioSLAM than for the other methods, as shown in Fig. 8(b), because BioSLAM retains important previous knowledge by replaying related memory traces. Note that BioSLAM replays important, highly rewarded memory traces, while GR only replays randomly. BioSLAM therefore converges much faster and reaches a higher final performance than the baselines.

After training, we evaluate the generalization ability of the final trained model on the fixed test set. Fig. 9 compares BioSLAM with the baselines in terms of top-$k$ recall on the test set of the City dataset. As can be seen, although BioSLAM learns incrementally, it still performs better than the baselines on the classic (non-lifelong learning) offline test set evaluation, even though some non-lifelong learning methods are designed or trained for offline evaluation.

VII-B Cross-domain Campus Place Recognition

Fig. 10: Place recognition methods incrementally trained on trajectory observations from Lidar, day-time visual, and night-time visual inputs. The dashed region shows the standard deviation. (a) Recalls per task (domain).

Fig. 11: Comparison of BioSLAM and baselines in terms of recall@k on Campus dataset.

Training place recognition models on independent domains is inefficient because no information is shared. We thus demonstrate the merit of BioSLAM in a more reasonable setting where the model benefits from solving place recognition across multiple domains (Lidar, day-time vision, night-time vision). A place recognition model operating in multiple domains has several advantages. First, the knowledge of one domain can help to understand other domains better and faster, because the domains are not completely independent in place recognition tasks. Second, generalization over multiple domains may result in more universal knowledge that applies to unseen domains. Such a phenomenon is also observed in infant learning [4, 9].

We evaluate the performance of BioSLAM in a long-term and cross-domain lifelong learning scenario with the Campus dataset. In the training procedure, we incrementally feed the place recognition methods with trajectory observations from different domains (in the order Lidar, day-time vision, night-time vision), and evaluate the performance on all domains. Observations from Lidar, day-time vision, and night-time vision can be treated as different domains $D_{t}$ in Eq. 2.

Fig. 12: Comparison between BioSLAM and its variants. (a) Ablation study on City dataset.

Fig. 10(a) compares BioSLAM with the baselines on the different domains of the Campus dataset. In epochs $0 \sim 600$, we train the place recognition model on the Lidar domain, and the performance of all methods on all domains increases. This verifies that the knowledge of one domain can help to understand other domains better and faster. In epochs $600 \sim 1200$, we train the model on the day-time visual domain. For all methods, the performance in the day-time visual domain increases, but the Lidar performance decreases around the switching point at epoch 600. This is reasonable because the backgrounds of Lidar and day-time visual images are totally different. In the Lidar domain, the performance drop of BioSLAM is much smaller than that of the other methods, which shows that BioSLAM can learn observations in a new domain without forgetting observations from past domains. In epochs $1200 \sim 1800$, we train the model on the night-time visual domain. In addition to the increase in the night-time visual domain, the performance of BioSLAM also increases in the Lidar domain: with the efficient replay mechanisms of BioSLAM, the knowledge of one domain can help to understand other domains better. The average weighted recall over all domains is shown in Fig. 10(b). As can be seen, the performance of all methods is similar in the beginning, but as observations from new domains are added, BioSLAM converges faster and to a better result than the baselines, outperforming them by at least 10% in terms of final average recall.

After training, we evaluate the cross-domain generalization ability of the final trained model on the fixed test set. Fig. 11 compares BioSLAM with the baselines in terms of top-$k$ recall on the test set of the Campus dataset. As can be seen, although BioSLAM learns incrementally, it still performs better than the baselines on the classic (non-lifelong learning) offline test set evaluation, even though some non-lifelong learning methods are designed or trained for offline evaluation.

Category	Method	City WR (%)	Campus WR (%)
Non-learning	BOW	5.7	60.1
Non-learning	CoHOG	70.1	85.1
Learning based (not lifelong)	RegionVLAD	45.3	75.1
Learning based (not lifelong)	NetVLAD	47.8	72.2
Lifelong learning	SI	65.1	73.7
Lifelong learning	GR	68.4	76.1
Lifelong learning	BioSLAM	73.6	91.2

TABLE II: Comparison of weighted recall (%) on city and campus datasets.

Table II compares the different methods on the fixed test sets of the City and Campus datasets. On the City dataset, BioSLAM outperforms the state-of-the-art lifelong learning method GR by 7.6% and the non-learning method CoHOG by 5% (relative improvement in weighted recall). On the Campus dataset, BioSLAM outperforms GR by 19.8% and CoHOG by 7.2%. Note that this paper focuses on incremental and lifelong learning scenarios, so the most important evaluation metric is the recall curve over incremental learning (as shown in Fig. 10 and 8). In terms of recall on the fixed test set, some non-lifelong learning methods (e.g., CoHOG) may perform very well, but these methods cannot learn incrementally, which limits their performance as the environment changes. Lifelong learning methods thus have higher potential in wider and changing real environments.
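The percentages quoted for Table II are consistent with relative (not absolute) differences in weighted recall. As a check under that reading, using only the values in the table:

```latex
\begin{align*}
\text{City, vs. GR:}      &\quad \frac{73.6 - 68.4}{68.4} \approx 7.6\% \\
\text{City, vs. CoHOG:}   &\quad \frac{73.6 - 70.1}{70.1} \approx 5.0\% \\
\text{Campus, vs. GR:}    &\quad \frac{91.2 - 76.1}{76.1} \approx 19.8\% \\
\text{Campus, vs. CoHOG:} &\quad \frac{91.2 - 85.1}{85.1} \approx 7.2\%
\end{align*}
```

The corresponding absolute gaps are 5.2, 3.5, 15.1, and 6.1 percentage points.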

VII-C Ablation Study

As mentioned in section V, BioSLAM has several novel mechanisms that differ from previous lifelong learning methods: (1) the external reward $R_{ex}$, which indicates localization performance; (2) the internal reward $R_{in}$, which indicates the robustness of the feature representation; (3) static memory consolidation, which abstracts concise memory traces and whose key step is clustering (Eq. 14); and (4) dynamic memory refreshing, which replays important memories effectively and for which the time-decay mechanism of the importance weight (Eq. 18) is critical. We evaluate the effectiveness of these mechanisms by comparing BioSLAM with the following variants: (1) w/o $R_{ex}$: the external reward is not applied, so Eq. 13 becomes $R_{k} = R_{in}(q_{k})$; (2) w/o $R_{in}$: the internal reward is not applied, so Eq. 13 becomes $R_{k} = R_{ex}(q_{k})$; (3) w/o consolidation-clustering: clustering is not used in static memory consolidation, so Algorithm 1 reduces to directly storing all memory traces in static memory; (4) w/o time-decay: the time-decay factor is not used in importance sampling, which is equivalent to setting $\gamma = 1$ in the dynamic memory refresh of Eq. 18. Note that each variant removes exactly one mechanism while keeping the others fixed (the controlled-variable method). Together they cover all the important mechanisms of BioSLAM without overlapping functionality.

The results of the ablation study on the City dataset are shown in Fig. 12(a), and the results on the Campus dataset in Fig. 12(b). As can be seen, BioSLAM outperforms its variants on both datasets, and removing any component degrades performance. In particular, the performance gain of BioSLAM over “w/o time-decay” validates the necessity of decaying the weights (rewards) used for importance sampling during dynamic memory refresh. The largest performance drop is caused by removing the internal reward, which means that this indicator of feature-representation robustness is critical for retrieving memories. The performance of “w/o consolidation-clustering” is close to that of BioSLAM, but BioSLAM is much more memory-efficient thanks to clustering and downsampling.

Fig. 13: Similarity matrices of BioSLAM features over training. (a) City dataset. The left column shows the trajectories within area 1, area 2, and area 3. The right three columns show the corresponding similarity matrices over training. (b) Campus dataset. The left column shows Lidar, day-time visual, and night-time visual observations. The right three columns show the corresponding similarity matrices over training.

Fig. 14: PCA visualization of BioSLAM features. (a) City dataset. Visualization of observations from different areas with PCA. (b) Campus dataset. Visualization of observations from different trajectories and different domains with PCA.

VII-D BioSLAM Feature Property

In this section, we visualize and evaluate the features learned by BioSLAM (place descriptor $F(q_{k})$, Eq. 5) using the similarity matrix and Principal Component Analysis (PCA). The similarity matrix $M_{sim}$ is defined by the cosine similarity between reference $O^{D}$ and query $Q^{D}$ features, with $M_{sim}(i, j) = \cos(F(O_{i}^{D}), F(Q_{j}^{D}))$. A high-contrast similarity matrix indicates that the learned feature $F$ has strong expressive and discriminative ability.
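The similarity matrix defined above can be computed directly from two lists of descriptors. The sketch below assumes the features are already extracted and non-zero; the function names and the nested-list output are illustrative choices.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero descriptors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_matrix(
    ref_feats: Sequence[Sequence[float]],
    query_feats: Sequence[Sequence[float]],
) -> List[List[float]]:
    """M_sim[i][j] = cos(F(O_i), F(Q_j)); rows index references, columns queries."""
    return [[cosine(o, q) for q in query_feats] for o in ref_feats]
```

When references and queries follow the same trajectory in the same order, a discriminative descriptor gives values close to 1 on the diagonal and clearly lower values elsewhere, which is the high contrast referred to in the text.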

The similarity matrices of BioSLAM over training on the City dataset are shown in Fig. 13(a). The left column shows the sampled trajectories from area 1, area 2, and area 3. The right three columns show the similarity matrices of the corresponding trajectories (from left to right) after incrementally training on area 1, area 2, and area 3, respectively. After training on area 1, the similarity matrices of all areas gain contrast. After subsequently training on area 2 and area 3, the similarity matrix of area 1 shows almost no decay. This means that BioSLAM retains strong expressive ability on past trajectories while learning from different areas.

The similarity matrices of BioSLAM over training on the Campus dataset are shown in Fig. (b)b. The left column represents the sampled observations from LiDAR, day-time visual, and night-time visual inputs. The right three columns represent the similarity matrices of the corresponding observations (from left to right) after incrementally training on the LiDAR, day-time visual, and night-time visual domains, respectively. After training on the LiDAR domain, the contrast of the similarity matrix of the LiDAR observations increases. After incrementally training the model on the day-time and night-time visual domains, the similarity matrix of the corresponding domain becomes more contrastive, while the similarity matrix of LiDAR shows almost no decay. This means that BioSLAM can remember past domains when learning from totally different domains.

We use PCA to reduce the dimension of the features learned by BioSLAM to 2D. The PCA visualization of learned features of observations from different areas on the City dataset is shown in Fig. (a)a. The sub-figures from left to right represent the PCA visualization results at the initial step and after incrementally training on area 1, area 2, and area 3. As can be seen, with BioSLAM training, observations within the same area are mostly clustered together, and the clusters of the different areas become easier to discriminate over incremental learning.

The PCA visualization of learned features from different domains and trajectories on the Campus dataset is shown in Fig. (b)b (for clearer visualization, we only visualize three trajectory segments in each domain). The sub-figures from left to right represent the PCA visualization results at the initial step and after incrementally training on the LiDAR, day-time visual, and night-time visual domains. As can be seen, BioSLAM not only differentiates the domains (3 clusters from left to right) but also differentiates the trajectories within each domain. As shown in the right sub-figure in Fig. (b)b, the same trajectories from different domains are relatively close. For example, for trajectory 1, the PCA results of the LiDAR and night-time visual domains are close to each other and located in the lower part of the visualization. This means that BioSLAM can incrementally learn trajectories from different domains and has the potential to find cross-domain relationships between place observations.

VII-E BioSLAM Memory Activity

(a) Proportion of different trajectories in dynamic memory

(a) Proportion of different domains in dynamic memory

As described in Section V-B, the static memory $M_{s}$ stores memory traces selected by memory consolidation. Because clustering and downsampling are based on feature and spatial properties (Algorithm 1), the static memory stores observations that are concise and diverse in both respects. The dynamic memory $M_{d}$ then samples memory traces from the static memory with importance sampling, where the sampling weight of each trace is proportional to its reward value.
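The reward-proportional sampling step can be sketched as follows. This is only an illustration under stated assumptions: traces are drawn with replacement, rewards are non-negative scalars, and the names `refresh_dynamic_memory`, `static_traces`, and the toy reward values are hypothetical. The actual refresh procedure is the one described in Section V-B.

```python
import random
from collections import Counter
from typing import List, Sequence


def refresh_dynamic_memory(
    static_traces: Sequence[str],
    rewards: Sequence[float],
    size: int,
    seed: int = 0,
) -> List[str]:
    """Draw `size` traces from static memory with probability proportional to reward.

    Assumption: sampling is done with replacement (random.choices).
    """
    if len(static_traces) != len(rewards):
        raise ValueError("each trace needs exactly one reward")
    if any(r < 0 for r in rewards) or sum(rewards) <= 0:
        raise ValueError("rewards must be non-negative with a positive sum")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.choices(static_traces, weights=rewards, k=size)


# Toy static memory (hypothetical): trajectory 1 has high rewards, trajectory 3 has none.
static_traces = ["traj1_a", "traj1_b", "traj2_a", "traj3_a"]
rewards = [0.45, 0.45, 0.10, 0.0]
dynamic = refresh_dynamic_memory(static_traces, rewards, size=1000)

# Count how many sampled traces come from each trajectory.
counts = Counter(trace.split("_")[0] for trace in dynamic)
```

With these weights, trajectory 1 dominates the sampled buffer and the zero-reward trace is never drawn, which mirrors the qualitative behavior described above.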

The memory traces in the dynamic memory zone have a direct impact on lifelong learning performance because the replayed samples from dynamic memory are used for training at every iteration. In this section, we visualize the dynamic memory zone to see the proportion of memory traces from different domains or trajectory segments. Note that if the memory traces of a domain have higher rewards than those of other domains, the dynamic memory zone holds more memory traces (samples) from that domain. Thus, to better understand the proportion between different domains, we also visualize the (normalized) reward ratio of each domain or trajectory segment. For trajectory $i$, $reward\_ratio(i) = \frac{\bar{R}(i)}{\sum_{j} \bar{R}(j)}$, where $\bar{R}(i)$ is the average reward for observations from trajectory $i$.
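As a small worked example of the normalized reward ratio, the sketch below averages the per-observation rewards of each trajectory and divides by the sum of these averages. The function name `reward_ratio` and the reward values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def reward_ratio(rewards_by_traj: Dict[str, List[float]]) -> Dict[str, float]:
    """Average reward of each trajectory divided by the sum of all average rewards."""
    means = {name: sum(vals) / len(vals) for name, vals in rewards_by_traj.items()}
    total = sum(means.values())
    if total <= 0:
        raise ValueError("the average rewards must have a positive sum")
    return {name: mean / total for name, mean in means.items()}


# Toy rewards (hypothetical): trajectory 1 averages 0.3, trajectory 2 averages 0.1.
ratios = reward_ratio({"traj1": [0.2, 0.4], "traj2": [0.1, 0.1]})
```

Here trajectory 1 receives a ratio of about 0.75 and trajectory 2 about 0.25, and the ratios sum to 1 by construction.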

For the City dataset, the proportion of observations from different trajectory segments in the dynamic memory zone is shown in Fig. (a)a. As new trajectory segments are incrementally fed into BioSLAM (every 60 epochs), the trajectory diversity in the dynamic memory zone increases. Fig. (b)b shows the reward ratio of different trajectories over training. As can be seen, the proportion of different trajectories in the dynamic memory zone is consistent with the reward ratio of the corresponding trajectories. Because a higher reward for a trajectory segment means worse performance on it, BioSLAM uses higher sampling weights to retrieve more memory traces from high-reward trajectories and thereby improve performance. For example, from epoch 360 to 420, the reward ratio of trajectory 1 increases in Fig. (b)b, and the dynamic memory samples more memory traces of trajectory 1.

For the Campus dataset, the proportion of observations from different domains in the dynamic memory zone is shown in Fig. (a)a, and Fig. (b)b shows the reward ratio of different domains over training. As can be seen, the proportion of different domains in the dynamic memory zone is consistent with the reward ratio of the corresponding domains. For example, in the final step, the reward ratio of the night-visual domain is lower than that of the other domains in Fig. (b)b, which means BioSLAM already achieves better performance in the night-visual domain. The dynamic memory therefore tends to retain only a small number of night-visual memory traces, leaving valuable memory space for other high-reward domains (i.e., LiDAR).

Thus, we have the following relationship between the dynamic memory zone and rewards: the performance on a domain (or trajectory) $i$ is lower $\to$ higher reward on domain (or trajectory) $i$ $\to$ more memory traces of domain (or trajectory) $i$ in dynamic memory $\to$ training on more replayed samples from domain (or trajectory) $i$ $\to$ the performance on domain (or trajectory) $i$ may increase. This relationship between dynamic memory and rewards serves as feedback compensation. Note that the performance on domain (or trajectory) $i$ may be saturated and not increase, but the relationship still works as feedback compensation.

VII-F Incremental Confidence

Figure: (a) City dataset. Sampled trajectories and confidence maps for area 1, area 2, and area 3. (b) Campus dataset. Sampled trajectories and confidence maps for LiDAR, day-time visual, and night-time visual observations.

To illustrate the incremental learning property, we evaluate BioSLAM with a confidence map during training. The confidence score of a place is defined by the cosine similarity between the features of the corresponding reference $O_{k}$ and query $Q_{k}$, with $confidence (k) = \cos (F (O_{k}), F (Q_{k}))$. A higher $confidence (k)$ denotes a more robust feature representation of place $Q_{k}$, because the feature representations of the query and the reference at the same place are similar. The confidence map is obtained by calculating the confidence score for all observations and visualizing the scores on the real trajectories.
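A confidence map pairs each place on the trajectory with its confidence score. The sketch below assumes that descriptors are already extracted and that each place $k$ has a known 2D position; the names `confidence_map` and `positions` and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: a zero-norm descriptor yields zero confidence.
    return dot / norm if norm > 0.0 else 0.0


def confidence_map(
    positions: Sequence[Point],
    ref_feats: Sequence[Sequence[float]],
    query_feats: Sequence[Sequence[float]],
) -> List[Tuple[Point, float]]:
    """confidence(k) = cos(F(O_k), F(Q_k)), attached to the position of place k."""
    if not (len(positions) == len(ref_feats) == len(query_feats)):
        raise ValueError("positions, references, and queries must be aligned")
    return [
        (pos, cosine(r, q))
        for pos, r, q in zip(positions, ref_feats, query_feats)
    ]


# Toy data (hypothetical): the first place is recognized well, the second is not.
positions = [(0.0, 0.0), (10.0, 0.0)]
ref = [[1.0, 0.0], [1.0, 0.0]]
qry = [[3.0, 0.0], [0.0, 2.0]]
cmap = confidence_map(positions, ref, qry)
```

Unlike the full similarity matrix, only matched reference/query pairs are compared here, so the result is one score per place that can be colored directly on the trajectory.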

The confidence maps of BioSLAM over training on the City dataset are shown in Fig. (a)a. The left column represents the sampled trajectories from area 1, area 2, and area 3. The right three columns represent the confidence maps of the corresponding trajectories (from left to right) after incrementally training on area 1, area 2, and area 3, respectively. After training on area 1, the confidence maps of all areas improve, which means that training BioSLAM on one trajectory also helps to improve the performance on others. After subsequent training on area 2 and area 3, the confidence map of area 1 shows almost no decay.

The confidence maps of BioSLAM over training on the Campus dataset are shown in Fig. (b)b. The left column represents the sampled observations from the LiDAR, day-time visual, and night-time visual inputs of the same trajectory. The right three columns represent the confidence maps of the corresponding domains after incrementally training on the LiDAR, day-time visual, and night-time visual inputs of the trajectory. After training on the LiDAR domain, the confidence of the LiDAR domain increases. After incrementally training the model on the day-time and night-time visual domains, the confidence of the corresponding domain increases, while that of the other domains remains almost unchanged. This means that BioSLAM can remember past domains when learning from other domains.

Method	NetVLAD	SI	GR	BioSLAM
GPU Memory (MB)	1261	1265	1695	1695

TABLE III: Comparison of GPU memory (Megabyte) of different methods.

VII-G Run-time Analysis

In this section, we report the memory and time usage of lifelong learning. All methods are evaluated on an Ubuntu 18.04 system with an Nvidia RTX 2080 Ti (12 GB) graphics processing unit (GPU), an Intel Core i9-7900X processor, and $64$ gigabytes (GB) of memory. Table III shows the total GPU memory usage on the City dataset. The GPU memory usage of BioSLAM is acceptable for current embedded systems.

Fig. 18 shows the time usage of the BioSLAM lifelong learning procedure on the City dataset when incrementally feeding new trajectory segments. Each segment is about 2 km long and is composed of about 200 observation frames. Data inference takes $< 1$ s for each trajectory segment, which is efficient for real-world inference. Memory consolidation and forgetting take around $40$ s on average for each trajectory, which is fast enough to analyze the newly captured memory traces and update the memory system. Memory replay takes $2$ s to generate replayed samples for place recognition training. Finally, place recognition optimization takes $40$ s to optimize a new $2$ km trajectory in one epoch. For multi-epoch training, BioSLAM runs inference, replay, and optimization multiple times, but runs memory consolidation only once. In general, training on a trajectory segment for about 50 epochs is enough to converge. The total learning time for a 2 km trajectory segment is therefore $40\,\text{s} + (1 + 2 + 40)\,\text{s} \times 50 = 2190\,\text{s} \approx 36\,\text{min}$. Given that the distance between neighboring keyframes is $10$ m, BioSLAM can learn $100$ m of new area in around 1.8 minutes.
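The time budget above can be checked with a few lines of arithmetic. The variable names are hypothetical, and the inference time is taken at its stated upper bound of 1 s.

```python
# Per-segment timings reported in the text, in seconds.
consolidation_s = 40   # memory consolidation and forgetting, run once per segment
inference_s = 1        # upper bound, since inference takes less than 1 s
replay_s = 2           # memory replay, run every epoch
optimization_s = 40    # place recognition optimization, run every epoch
epochs = 50            # approximate number of epochs needed to converge

segment_length_m = 2000  # each trajectory segment is about 2 km

# Consolidation runs once; inference, replay, and optimization run every epoch.
total_s = consolidation_s + (inference_s + replay_s + optimization_s) * epochs
total_min = total_s / 60
minutes_per_100m = total_min / (segment_length_m / 100)
```

This gives `total_s = 2190`, `total_min = 36.5`, and `minutes_per_100m` of about 1.8, in line with the figures quoted in the text.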

The most important property is that the time usage of the above memory operations is not affected by the spatial or temporal scale of the task. This is mainly due to our memory forgetting mechanism, which maintains the size of the search space of $M_{s}$ and $M_{d}$ and keeps up-to-date memory traces to balance the localization. These properties indicate that BioSLAM can be applied to low-cost robotic systems for long-term place recognition tasks on NVIDIA embedded systems.

Fig. 18: Time usage (inference, optimization, replay, consolidation time) of the BioSLAM lifelong learning on City dataset when incrementally learning new trajectories.

VIII Discussion & Limitations

BioSLAM provides robust lifelong learning ability for long-term and large-scale place recognition tasks. For long-term localization tasks, BioSLAM can pre-store long-lasting memory traces in the static memory $M_{s}$ and retrieve generative memories from the dynamic memory $M_{d}$, which maintains the recognition ability under diverse conditions. This dual-memory mechanism enables efficient place feature learning for new types of observations while maintaining lifelong memorization of old knowledge. We also notice that the model benefits from solving place recognition across multiple domains in lifelong learning: in the evaluation on the Campus dataset, the knowledge of one domain helps BioSLAM understand other domains better and faster.

Finally, an essential property of the BioSLAM framework is its extensibility. A similar lifelong learning framework can be developed for other perception, navigation, or reinforcement learning tasks. Recall the network structure shown in Fig. 3: the functional modules related to the place recognition task are mainly the place descriptor extraction $F_{\theta}$ and the relative external reward $R_{ex}$ in the behavior configurator. For other tasks (such as 3D segmentation or local navigation), one can replace the place descriptor extraction network with a task-relevant representation network and the external reward with a task-specific objective reward, without needing to replace the remaining blocks of the lifelong memory system. Another potential option is to develop a parallel hybrid lifelong learning system for multiple tasks, since the encoder module $E$ can be shared.

In general, BioSLAM provides a memory system for lifelong place recognition. Robots can develop more general navigation and decision-making approaches based on the incrementally updated localization ability, given that most current systems can only work in short-term and local-scale environments. On the other hand, BioSLAM also provides a new option for other lifelong learning tasks.

IX Conclusion

Real-world robots encounter diverse environmental changes under long-term autonomy. In the place recognition task, robots continuously observe new scenarios, which are unbounded under varying conditions. To alleviate this problem, we proposed BioSLAM, a lifelong place recognition method. BioSLAM combines a general place learning (GPL) system and a bio-inspired lifelong memory (BiLM) system. The GPL system utilizes a viewpoint-invariant place descriptor and a generative replay module to achieve ‘memory encoding’ and ‘memory replay’ for continual place feature learning. The BiLM system provides a dual-memory mechanism, controlled by a behavior configurator that guides ‘memory consolidation’, ‘memory forgetting’, and ‘memory replay’ to enhance the memorization of long-term traces. We investigate large-scale and long-term place recognition ability in experiments with city-scale 3D point-cloud maps and campus-scale visual-LiDAR hybrid inputs. Both results show that BioSLAM can balance the place learning ability for new observations with the memorization ability for historical observations.

In practice, our method can be applied to low-cost mobile robots with current embedded devices, since its lightweight memory system does not need to save massive streaming datasets. An interesting direction for future work is to enable memory sharing between client agents and a cloud server; in this case, the server can be synced with data from all kinds of scenarios collected by various robots to update a more general place recognition model. Finally, the BioSLAM system can be used in other perception tasks by modifying the objective functions in the behavior configurator as required.

X Acknowledgment

This research is supported by grants from NVIDIA and utilized NVIDIA SDKs (CUDA Toolkit, TensorRT, and Omniverse). This research is supported by the ARL grant No. W911QX20D0008 and partially supported by the National Science Foundation (NSF) under Grant No. 2144489. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of ARL and NSF.

References

[1] S. An, H. Zhu, D. Wei, K. A. Tsintotas, and A. Gasteratos (2022) Fast and incremental loop closure detection with deep features and proximity graphs. Journal Of Field Robotics 39 (4), pp. 473–493. Cited by: §II-A.
[2] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic (2016) NetVLAD: cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5297–5307. Cited by: §II-A, §IV-A, §VI-D.
[3] R. Arandjelovic and A. Zisserman (2013) All about vlad. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1578–1585. Cited by: §II-A.
[4] D. A. Baldwin, E. M. Markman, and R. L. Melartin (1993) Infants’ ability to draw inferences about nonobvious object properties: evidence from exploratory play. Child Development 64 (3), pp. 711–728. Cited by: §VII-B.
[5] T. Barros, R. Pereira, L. Garrote, C. Premebida, and U. J. Nunes (2021) Place recognition survey: an update on deep learning approaches. arXiv preprint arXiv:2106.10458. Cited by: §II.
[6] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §V-A1.
[7] G. Berdugo-Vega and J. Graeff (2022) Inquiring the librarian about the location of memory. Cognitive Neuroscience 0 (0), pp. 1–3. Cited by: §I, §III-C.
[8] M. G. Berman, J. Jonides, and R. L. Lewis (2009) In search of decay in verbal short-term memory. Journal of Experimental Psychology: Learning, Memory, and Cognition 35 (2), pp. 317. Cited by: §V-B3.
[9] M. H. Bornstein and M. E. Arterberry (2010) The development of object categorization in young children: hierarchical inclusiveness, age, perceptual attribute, and group versus individual analyses. Developmental psychology 46 (2), pp. 350. Cited by: §VII-B.
[10] J. Byrne (2017) Learning and memory: a comprehensive reference. Academic Press. Cited by: §V-B1.
[11] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard (2016) Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Transactions on robotics 32 (6), pp. 1309–1332. Cited by: §I.
[12] X. Chen, T. Läbe, A. Milioto, T. Röhling, O. Vysotska, A. Haag, J. Behley, and C. Stachniss (2020) OverlapNet: Loop Closing for LiDAR-based SLAM. In Proceedings of Robotics: Science and Systems (RSS), Cited by: §II-A.
[13] Y. Choi, M. El-Khamy, and J. Lee (2021) Dual-teacher class-incremental learning with data-free generative replay. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3543–3552. Cited by: §II-B.
[14] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling (2018) Spherical cnns. In 6th International Conference on Learning Representations, ICLR, Cited by: §IV-A.
[15] M. Cummins and P. Newman (2008) FAB-map: probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research 27 (6), pp. 647–665. Cited by: §II-A.
[16] M. Delange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2021) A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-A, §II-B, §II, §III-A, §VII.
[17] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) Superpoint: self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 224–236. Cited by: §I.
[18] C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis (2018) Learning SO(3) equivariant representations with spherical cnns. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–68. Cited by: §IV-A.
[19] J. M. Facil, D. Olid, L. Montesano, and J. Civera (2019) Condition-invariant multi-view place recognition. arXiv preprint arXiv:1902.09516. Cited by: §II-A.
[20] D. Gálvez-López and J. D. Tardós (2012) Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics 28 (5), pp. 1188–1197. Cited by: §VI-D.
[21] E. Garcia-Fidalgo and A. Ortiz (2018) Ibow-lcd: an appearance-based loop-closure detection approach using incremental bags of binary words. IEEE Robotics and Automation Letters 3 (4), pp. 3051–3057. Cited by: §II-A.
[22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems 27. Cited by: §IV-B.
[23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §II-A.
[24] L. He, X. Wang, and H. Zhang (2016) M2DP: a novel 3d point cloud descriptor and its application in loop closure detection. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 231–237. Cited by: §II-A.
[25] L. Hui, H. Yang, M. Cheng, J. Xie, and J. Yang (2021) Pyramid point cloud transformer for large-scale place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6098–6107. Cited by: §II-A.
[26] A. Khaliq, S. Ehsan, Z. Chen, M. Milford, and K. McDonald-Maier (2020) A holistic visual place recognition approach using lightweight cnns for significant viewpoint and appearance changes. IEEE Transactions on Robotics 36 (2), pp. 561–569. Cited by: §II-A.
[27] A. Khaliq, S. Ehsan, Z. Chen, M. Milford, and K. McDonald-Maier (2020) A holistic visual place recognition approach using lightweight cnns for significant viewpoint and appearance changes. IEEE Transactions on Robotics 36 (2), pp. 561–569. Cited by: §VI-D.
[28] G. Kim and A. Kim (2018) Scan context: egocentric spatial descriptor for place recognition within 3d point cloud map. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4802–4809. Cited by: §II-A.
[29] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §II-B.
[30] J. G. Klinzing, N. Niethard, and J. Born (2019) Mechanisms of systems memory consolidation during sleep. Nature neuroscience 22 (10), pp. 1598–1610. Cited by: §I.
[31] K. A. Krueger and P. Dayan (2009) Flexible shaping: how learning in small steps helps. Cognition 110 (3), pp. 380–394. Cited by: §V-A1.
[32] Y. Latif, R. Garg, M. Milford, and I. Reid (2018) Addressing challenging place recognition tasks using generative adversarial networks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2349–2355. Cited by: §I.
[33] T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, and N. D. Rodríguez (2020) Continual learning for robotics: definition, framework, learning strategies, opportunities and challenges. Information Fusion 58, pp. 52–68. Cited by: §II-B, §II.
[34] Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §II-B.
[35] Z. Liu, S. Zhou, C. Suo, P. Yin, W. Chen, H. Wang, H. Li, and Y. Liu (2019) LPD-net: 3d point cloud learning for large-scale place recognition and environment analysis. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV, pp. 2831–2840. Cited by: §II-A.
[36] D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. Advances in neural information processing systems 30. Cited by: §II-B.
[37] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford (2016) Visual place recognition: a survey. IEEE Transactions on Robotics 32 (1), pp. 1–19. Cited by: §II-A, §II.
[38] A. Mallya and S. Lazebnik (2018) Packnet: adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7765–7773. Cited by: §II-B.
[39] K. Mikolajczyk and C. Schmid (2005) A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (10), pp. 1615–1630. Cited by: §II-A.
[40] M. J. Milford and G. F. Wyeth (2012) SeqSLAM: visual route-based navigation for sunny summer days and stormy winter nights. In 2012 IEEE international conference on robotics and automation, pp. 1643–1649. Cited by: §II-A.
[41] E. I. Moser, E. Kropff, and M. Moser (2008) Place cells, grid cells, and the brain’s spatial representation system. Annu. Rev. Neurosci. 31, pp. 69–89. Cited by: §I.
[42] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) Icarl: incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §II-B.
[43] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: an efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision, pp. 2564–2571. Cited by: §II-A.
[44] A. Segal, D. Haehnel, and S. Thrun (2009) Generalized-icp. In Robotics: science and systems, Vol. 2, pp. 435. Cited by: Fig. 5, §VI-A.
[45] H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. Advances in neural information processing systems 30. Cited by: §VI-D.
[46] H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems 30 (NIPS 2017), I. Guyon, U. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Cited by: §IV-B.
[47] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §II-A, §IV-A.
[48] R. Stickgold (2005) Sleep-dependent memory consolidation. Nature 437 (7063), pp. 1272–1278. Cited by: §III-C.
[49] J. Stretton and P. Thompson (2012) Frontal lobe function in temporal lobe epilepsy. Epilepsy research 98 (1), pp. 1–13. Cited by: §I.
[50] M. A. Uy and G. H. Lee (2018) PointNetVLAD: deep point cloud based retrieval for large-scale place recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, pp. 4470–4479. Cited by: §II-A, §VI-D.
[51] G. M. van de Ven, H. T. Siegelmann, and A. S. Tolias (2020) Brain-inspired replay for continual learning with artificial neural networks. Nature Communications 11, pp. 4069. Cited by: §IV-B.
[52] G. M. Van de Ven and A. S. Tolias (2018) Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635. Cited by: §II-B.
[53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §II-A.
[54] P. Yin, L. Haowen, S. Zhao, R. Fu, C. Ivan, R. Ge, I. Cisneros, R. Fu, J. Zhang, H. Choset, and S. Scherer (2022) AutoMerge: a framework for map assembling and smoothing in city-scale environments. arXiv preprint arXiv:2205.19737. Cited by: §V-A2.
[55] P. Yin, F. Wang, A. Egorov, J. Hou, Z. Jia, and J. Han (2021) Fast sequence-matching enhanced viewpoint-invariant 3-d place recognition. IEEE Transactions on Industrial Electronics 69 (2), pp. 2127–2135. Cited by: §II-A.
[56] P. Yin, F. Wang, A. Egorov, J. Hou, J. Zhang, and H. Choset (2020) Seqspherevlad: sequence matching enhanced orientation-invariant place recognition. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5024–5029. Cited by: §II-A, §III-B, §IV-A, §IV-A.
[57] P. Yin, L. Xu, J. Zhang, H. Choset, and S. Scherer (2021) I3dLoc: image-to-range cross-domain localization robust to inconsistent environmental conditions. In Proceedings of Robotics: Science and Systems (RSS ’21), Cited by: §I, §III-B, §IV-A, §IV-A, §V-A2.
[58] P. Yin, L. Xu, J. Zhang, and H. Choset (2021) Fusionvlad: a multi-view deep fusion networks for viewpoint-free 3d place recognition. IEEE Robotics and Automation Letters 6 (2), pp. 2304–2310. Cited by: §II-A, §IV-A.
[59] M. Zaffar, S. Ehsan, M. Milford, and K. McDonald-Maier (2020) CoHOG: a light-weight, compute-efficient, and training-free visual place recognition technique for changing environments. IEEE Robotics and Automation Letters 5 (2), pp. 1835–1842. Cited by: §VI-D.
[60] M. Zaffar, S. Garg, M. Milford, J. Kooij, D. Flynn, K. McDonald-Maier, and S. Ehsan (2021) VPR-bench: an open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change. International Journal of Computer Vision 129 (7), pp. 2136–2174. ISSN 1573-1405. Cited by: §I, §II-A.
[61] F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995. Cited by: §II-B, §VI-D.
[62] J. Zhang and S. Singh (2014) LOAM: lidar odometry and mapping in real-time. In Robotics Science and Systems, Vol. 2, pp. 1–9. Cited by: Fig. 5, §VI-A.
[63] W. Zhang and C. Xiao (2019) PCAN: 3d attention map learning using contextual information for point cloud based retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12436–12445. Cited by: §II-A.
[64] X. Zhang, L. Wang, and Y. Su (2021) Visual place recognition: a survey from deep learning perspective. Pattern Recognition 113, pp. 107760. Cited by: §II.