ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer

Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan
HKUST; Apple Inc.

Abstract

Generating robust and reliable correspondences across images is a fundamental task for a diversity of applications. To capture context at both global and local granularity, we propose ASpanFormer, a Transformer-based detector-free matcher that is built on hierarchical attention structure, adopting a novel attention operation which is capable of adjusting attention span in a self-adaptive manner. To achieve this goal, first, flow maps are regressed in each cross attention phase to locate the center of search region. Next, a sampling grid is generated around the center, whose size, instead of being empirically configured as fixed, is adaptively computed from a pixel uncertainty estimated along with the flow map. Finally, attention is computed across two images within derived regions, referred to as attention span. By these means, we are able to not only maintain long-range dependencies, but also enable fine-grained attention among pixels of high relevance that compensates essential locality and piece-wise smoothness in matching tasks. State-of-the-art accuracy on a wide range of evaluation benchmarks validates the strong matching capability of our method.

Keywords:
Image Matching, Visual Localization, Pose Estimation, Transformer
Figure 1: An illustration of the proposed adaptive attention span (top row) and final dense matching results (bottom row). Particularly, in the top row, a rectangle with uniform sampling grid is drawn to explain the position and size of adaptive attention span. In addition, three typical types of correspondences are visualized. Easy match in green with rich texture, which can be well localized and matched with small local contexts. Hard match in yellow with little texture, which requires larger contexts to guide matching. Impossible match in red in non-overlapping or occluded region, which has a very large attention span to avoid falsely fitting to certain regions. With this design, we enable Transformer to adaptively capture necessary context according to matching difficulty.

1 Introduction

Image matching lays the foundation for various geometric computer vision tasks, including Structure from Motion (SfM) [37, 28], visual localization [36], and Simultaneous Localization And Mapping (SLAM) [26, 27]. As a widely accepted pipeline for image matching, cross-image correspondences are usually established by matching a set of detected and described sparse keypoints, such as SIFT [20], ORB [32], or their learning-based counterparts [29, 6, 23, 8, 21]. Despite its general effectiveness, this detector-based matching pipeline struggles in extreme situations, including large view point changes and textureless areas, due to the reliance on keypoint detector and context loss in feature description.

Concurrent with detector-based matching, another line of work focuses on generating correspondences directly from raw images [39, 15, 48, 30, 18, 50, 24, 38, 42], where richer context can be leveraged and the keypoint detection step can be eschewed. Earlier works [30, 31, 18] in detector-free matching often rely on iterative convolution upon correlation or cost volumes to discover potential neighbourhood consensus. Recently, some works [39, 15, 42] base their methods on a Transformer [51, 7] backbone for better modeling of long-range dependencies. As a representative work, LoFTR utilizes self and cross attention blocks to update cross-view features, where Linear Transformer [16] is adopted to replace global full attention in order to achieve a manageable computation cost. Although proven effective, a concern about LoFTR is the lack of fine-level local interaction among pixel tokens, which could limit its capability to extract highly accurate and well-localized correspondences. This concern is deepened by the findings of Tang et al. [42], which reveal that the cross attention map generated by LoFTR's Linear Transformer tends to diffuse over large areas instead of sharply focusing on the actual corresponding regions.

To capture both global context and local details, we propose a Transformer-based detector-free matcher equipped with a hierarchical attention framework. Our basic processing block, referred to as the Global-Local Attention (GLA) block, performs coarse-level global attention at low resolution to acquire long-range dependencies, while conducting fine-level local attention at high resolution only within a concentrated region around the current correspondence found through dense flow prediction.

The key challenge for fine-level local attention is to determine the size of the attention span. A naive approach is to regard its size as a fixed hyper parameter, which, however, neglects the intrinsic matchability of different regions where the dependency of context varies. As shown in Figure 1, regions in rich texture areas can be easily matched within a small neighbourhood, while the textureless areas are more uncertain about their correspondences and require larger context for matching, not to mention areas that lie out of overlapping regions and are impossible to be matched. To mitigate this issue, we introduce an adaptive attention span driven by probabilistic modelling, which can be adjusted for different locations based on underlying matching difficulty. We summarize our contributions in three aspects:

  • A hierarchical attention framework is proposed for feature matching, where attention operations are performed at different scales to enable both global context awareness and fine-grained matching.

  • A novel uncertainty-driven scheme, based on probabilistic modelling of flow prediction, is proposed to adaptively adjust local attention span. Through this design, our network assigns varying size of contexts to different locations according to their essential matchability and context richness.

  • State-of-the-art results are achieved on an extensive set of benchmarks. Our method outperforms both detector-free and detector-based matching baselines in two-view pose estimation. Further experiments on challenging visual localization also prove our method's potential to be integrated into complicated downstream applications.

2 Related Works

2.1 Detector-Free Image Matching

Differing from detector-based matching methods, which typically involve detecting [8, 6, 23, 29], describing [25, 45, 22, 21, 52] and matching [34, 3, 56, 54, 40, 2, 1] a set of keypoints, detector-free matching consumes a pair of images and outputs correspondences in one shot. Thanks to the removal of the keypoint detection stage, detector-free matching is able to capture richer contexts from the original images, and thus exhibits strong potential to match in extreme situations, such as low-texture areas and repetitive patterns.

Despite the potential merits of detector-free matching, it remained less popular than detector-based methods in the early deep learning era due to the intrinsic difficulty of obtaining robust and distinctive features. Recently, with the help of deep neural networks, high-performance detector-free matching frameworks based on deep features have been explored, which can roughly be classified into two categories: cost volume-based methods [30, 50, 48, 31, 18, 47] and Transformer-based methods [39, 15, 42]. Both kinds of methods leverage strong signals in the correlation of deep features, either in the form of a correlation layer or cross attention, to guide correspondence search and feature update. Our method follows the Transformer-based line of work and employs multilevel cross attention for mutual feature update, encoding two-view contexts into the original features for both global and local consensus.

2.2 Global-Local Structure

Balancing receptive field and interaction granularity is a long-standing issue for both cost volume-based and Transformer-based matching. To ensure a global receptive field, cost volume-based methods are often designed to perform convolution on a large global correlation volume, while Transformer-based methods need to conduct attention among all pixel tokens in the image pair. Due to the high cost of global interaction, the input features are usually downsampled to a coarse resolution [18, 50, 15] or projected into a low-rank space [39], which to some degree limits the network's capability for fine-grained feature update.

Complementary to global interaction, some methods propose to perform local interaction only within a certain region instead of a global field, which makes it possible to process fine-level features under a limited computation budget. This practice is especially common in cost volume-based methods, where it is referred to as a local correlation layer [50, 14, 48, 43]: the cost volumes/vectors are only constructed around the neighbourhood of each correspondence estimate. Intuitively, the idea of complementary global-local interaction can also be introduced to Transformer-based matchers. In our method, a global-local attention block is proposed for message passing across images, ensuring both a global receptive field and fine-level feature processing. In particular, instead of fixing the span for local attention, we design an adaptive mechanism to determine the size of the area that each pixel should attend to.

2.3 Flow Regression and Uncertainty Modeling

Flow maps depict correspondence coordinates, which can be either absolute or relative, for each location in an image. Predicting correspondence coordinates from an image pair has been intensively investigated by works in optical flow estimation [10, 14, 55, 43] and general dense image matching [50, 48, 47]. In these works, the flow maps are regressed from structured correlation volumes which are implicitly position-aware. Recently, a Transformer-based method, COTR [15], has shown that flows can also be retrieved from positionally embedded features after several rounds of attention update.

Naturally, the reliability of flow estimation is not equal across locations, and predicting associated confidence scores is essential in many scenarios. As an elegant framework for uncertainty prediction, some works [48, 58, 11, 13, 5] propose to use a probabilistic model to jointly explain both flow estimations and their confidence. Inspired by the above works, we propose a network that regresses a flow map for each attention block to guide the local attention region and adjusts the attention span adaptively based on the uncertainty prediction.

3 Methodology

Figure 2: We use CNN backbone to extract initial features. After initialization, the features are fed into iterative GLA blocks for updating. A matching module is used to determine final matches.

We present an overview of our network structure in Figure 2. Taking an image pair as input, our network produces reliable correspondences across images. The matching process starts with a CNN-based encoder that extracts initial features for both images separately. After initialization, these features are fed into the proposed Adaptive Span Transformer (ASpanFormer) module for updating, which is composed of iterative global-local attention (GLA) blocks with a hierarchical structure. Particularly, for each GLA block, we regress auxiliary flow maps describing correspondence coordinates (flows) and their uncertainty. Instead of adopting these flow maps as our correspondence output, we use them to guide local cross attention, enabling an adaptive local attention span according to matching uncertainty. After the GLA blocks, the updated features are used to construct coarse-level matches, which are further refined into final correspondences.

In the following part, we demonstrate the details of each individual block as well as the underlying insights.

3.1 Preliminary

Before introducing the structure of our network, we first clarify necessary notations and concepts.

3.1.1 Attention.

As the key operation in Vision Transformer, attention is defined over a set of query ($Q$), key ($K$) and value ($V$) vectors as

$$M = \mathrm{softmax}(QK^{\top})V, \qquad (1)$$

where $Q, K, V$ are linear projections of upstream features and $M$ is the retrieved message. More specifically, in the context of cross attention, $Q$ is derived from source features $F_s$, while the $K, V$ vectors are derived from target features $F_t$. $M$ is used to update the source features through a feed forward network (FFN), which involves concatenation, layer normalization and linear layers.

(2)

Typically, in each pass, the position of source/target features can be switched and cross attention is performed symmetrically.
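As an illustration only, the cross attention described above can be sketched in NumPy as follows. The projection matrices `Wq`, `Wk`, `Wv`, the feature sizes and the $1/\sqrt{d}$ scaling are assumptions of this sketch rather than details of our implementation, and the FFN update is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(src, tgt, Wq, Wk, Wv):
    """Retrieve one message per source token from the target tokens."""
    Q = src @ Wq                                  # (Ns, d) queries from source features
    K = tgt @ Wk                                  # (Nt, d) keys from target features
    V = tgt @ Wv                                  # (Nt, d) values from target features
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # scaling is an assumption of this sketch
    return softmax(scores, axis=-1) @ V           # (Ns, d) retrieved message

rng = np.random.default_rng(0)
d = 8
src = rng.normal(size=(5, d))                     # 5 source tokens
tgt = rng.normal(size=(7, d))                     # 7 target tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

msg_s = cross_attention(src, tgt, Wq, Wk, Wv)     # source attends to target
msg_t = cross_attention(tgt, src, Wq, Wk, Wv)     # roles switched (symmetric pass)
print(msg_s.shape, msg_t.shape)
```

Calling the same function twice with the arguments swapped mirrors the symmetric treatment of the two images in each pass.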

3.1.2 Flow map.

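Flow maps depict the correspondence relationship between an image pair $(I_s, I_t)$, such that for any location $(x, y)$ in an image, $I_s(x, y) \leftrightarrow I_t(\phi(x, y))$. Here, $\leftrightarrow$ denotes that the points on two sides are correspondences.

Instead of depicting simple correspondences, a stream of works [48, 58, 11, 13, 5] proposes to model the flow field with a probabilistic framework. Following these works, we model the flow field as a Gaussian distribution defined by a set of parameters. More specifically, assuming conditional independence among pixels, two flow maps are predicted, such that each location stores $(\mu_x, \mu_y, \sigma_x, \sigma_y)$, where $(\mu_x, \mu_y)$ are estimated correspondence coordinates and $(\sigma_x, \sigma_y)$ are standard deviations. The probability for a candidate correspondence $(u, v)$ is given by

$$P(u, v \mid \mu, \sigma) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left(-\frac{(u-\mu_x)^2}{2\sigma_x^2}-\frac{(v-\mu_y)^2}{2\sigma_y^2}\right). \qquad (3)$$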

Instead of thresholding flow estimation with uncertainty, we use it to adjust the search region for subsequent network interaction, as described in later sections.

3.2 Feature Extractor

As the first part of our network, a convolutional neural network (CNN) is used to extract down-sampled initial features for each image. As shown in previous works [23, 29, 6, 21, 22, 45, 25, 52], CNNs exhibit a strong capability to capture local context and generate high-level features, which can be directly used to perform nearest neighbour matching. However, these features are processed independently for each image, so critical cross-view contexts are missed. To enrich the features with long-range and cross-view contexts, the initial features are further fed into our proposed Transformer module for updating.

3.3 Initialization

Our Transformer-module starts with a fast initialization block, which conducts (1) positional encoding and (2) two-view contexts initialization.

Positional encoding. As validated in Transformer networks [39, 34, 15], positional encoding is critical in maintaining spatial information for the flattened tokens. Following the same formulation as LoFTR [39], 2D sinusoidal signals at different frequencies are used to encode position information and are added to the initial features. In particular, we apply normalization when the testing resolution differs from the training resolution. We provide more details in Appendix A.5.
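For readers unfamiliar with 2D sinusoidal encodings, a minimal NumPy sketch is given below. The channel layout (sine and cosine of x, then of y, interleaved every four channels), the base of 10000 and the function name are assumptions of this sketch, and the resolution normalization mentioned above is not modelled.

```python
import numpy as np

def sine_position_encoding_2d(d_model, height, width):
    """Return a (d_model, height, width) encoding; d_model must be divisible by 4."""
    assert d_model % 4 == 0
    ys, xs = np.meshgrid(np.arange(height, dtype=float),
                         np.arange(width, dtype=float), indexing="ij")
    # Geometric ladder of d_model / 4 frequencies (assumed base of 10000).
    freqs = np.exp(-np.log(10000.0) * np.arange(0, d_model // 2, 2) / (d_model // 2))
    freqs = freqs[:, None, None]
    pe = np.zeros((d_model, height, width))
    pe[0::4] = np.sin(xs[None] * freqs)   # x position, sine
    pe[1::4] = np.cos(xs[None] * freqs)   # x position, cosine
    pe[2::4] = np.sin(ys[None] * freqs)   # y position, sine
    pe[3::4] = np.cos(ys[None] * freqs)   # y position, cosine
    return pe

pe = sine_position_encoding_2d(8, 3, 4)
features = np.zeros((8, 3, 4)) + pe       # the encoding is added to the initial features
print(pe.shape)
```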

Two-view contexts initialization. At each local attention phase, our network regresses an auxiliary flow map as guidance, which requires cross-view contexts. To this end, we pass the positionally embedded features to a light-weight cross attention block. More specifically, these features are downsampled to low resolution and two global cross attention blocks are used for feature processing. After initialization, the features are upsampled back to the original input resolution and sent to the iterative global-local attention blocks for further processing.

3.4 Global-Local Attention Block

The basic structure of our Transformer network is global-local attention (GLA) block. As is shown in Figure 2, for each GLA block, attention is performed upon a 3-level coarse-to-fine feature pyramid built by strided average pooling.
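The 3-level pyramid can be pictured with the following NumPy sketch of non-overlapping (strided) average pooling. The strides of 2 and 4 and the function names are hypothetical and chosen only for illustration; in particular, the sketch uses fixed strides, whereas our network keeps the coarsest resolution constant.

```python
import numpy as np

def avg_pool(feat, stride):
    """Non-overlapping average pooling of an (H, W, C) map; H and W must be divisible by stride."""
    h, w, c = feat.shape
    assert h % stride == 0 and w % stride == 0
    return feat.reshape(h // stride, stride, w // stride, stride, c).mean(axis=(1, 3))

def build_pyramid(feat, medium_stride=2, coarse_stride=4):
    """Return (fine, medium, coarse) feature maps; the strides are illustrative assumptions."""
    return feat, avg_pool(feat, medium_stride), avg_pool(feat, coarse_stride)

rng = np.random.default_rng(0)
fine, medium, coarse = build_pyramid(rng.normal(size=(8, 12, 16)))
print(fine.shape, medium.shape, coarse.shape)
```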

For the $i$-th GLA block, global attention is conducted on the coarsest downsampled features, while local attention with adaptive span is used to pass messages between medium-resolution features and between fine-level features. Note that we keep the coarsest resolution constant, making the complexity of global full attention unaffected by input size. Retrieved messages from the coarse, medium and fine levels are upsampled to the same resolution, concatenated and fused with an MLP to update the source features.

(4)
(5)

The FFN in our network is defined as

(6)

LN stands for layer normalization. In particular, we adopt a convolution in the FFN for locality modeling, which compensates for the absence of self attention within each feature map. Empirically, we find that convolution in the FFN works better than the combination of a linear-projection FFN and self attention; more details can be found in Appendix A.5.

Figure 3: Illustration of local cross attention. The query map is partitioned into cells of a fixed window size; each cell retrieves its prediction from the flow map and generates an attention span. Here we only show the attention span for one cell (marked in red).

Local cross attention with adaptive attention span. To facilitate fine-grained attention with modest cost, we adopt local attention on medium and fine level feature maps, where attention span focuses on the neighbourhood regions around current correspondences estimation.

A key problem for local attention is how to define the size of neighbourhood region. A naive approach is to define neighbourhood with a fixed radius for all pixels, neglecting the fact that the optimal attention span for different regions varies. For instance, it is sufficient to match regions with distinctive features using small contexts, while regions that are harder to match require larger contexts. Instead of using fixed attention span for all pixels, we propose to adaptively adjust the attention span according to the uncertainty of flow estimation. This design lets each area balance their local receptive fields with uncertainty awareness. Regions with high confidence in flow estimation can sharply focus on a small region for fine level matching, while larger contexts are extracted in low confidence areas for better convergence.

Formally, for the $i$-th GLA block, we first regress flow maps from the fine-level input features with an MLP, while the medium-level flow maps are obtained by strided average pooling. For each scale level, we partition the corresponding query map into cells of size $w \times w$. For each cell, we use the mean flow estimation to generate a rectangular region upon the target feature map and uniformly sample a fixed number of tokens. Attention is performed between each cell and the sampled tokens. The detailed process is defined in Algorithm 1. An illustration of local attention is given in Fig. 3. Since the number of sampled tokens for each location is fixed, the whole process retains linear complexity.

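Algorithm 1 Local Cross Attention

Input: query map $Q$, key and value maps $K, V$, flow map $\phi$, span coefficient $n$, sample number $k$, window size $w$
Output: Retrieved message $M$

1: Partition $Q$ into a cell set $\{q_c\}$ with window size $w$; for an $H \times W$ query map there will be $HW/w^2$ cells in total
2: $M \leftarrow \emptyset$
3: for each cell $q_c$ in $\{q_c\}$ do in parallel
4:     Retrieve flow $(\mu_c, \sigma_c)$ from flow map $\phi$ according to the location of $q_c$
5:     Let $(s_x, s_y) = n \cdot \sigma_c$
6:     Let $R_c$ be a rectangle area with center $\mu_c$, width $s_x$ and height $s_y$
7:     Uniformly sample $k$ tokens in region $R_c$ from $K, V$, denoted as $K_c, V_c$
8:     $m_c \leftarrow \mathrm{Attention}(q_c, K_c, V_c)$
9:     Append $m_c$ to $M$
10: Reshape $M$ into the shape of $Q$
11: return $M$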
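To make the procedure concrete, a simplified NumPy sketch of local cross attention is given below. It is not our implementation: the sequential loop over cells, the nearest-neighbour sampling with clamping at the border, the $k \times k$ sampling grid, the rectangle extent of $n\sigma$ per axis and all default values are assumptions made for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_cross_attention(Q, K, V, flow, n=5.0, k=4, w=2):
    """Q: (H, W, D) queries; K, V: (Ht, Wt, D) target maps;
    flow: (H, W, 4) with (mu_x, mu_y, sigma_x, sigma_y) in target pixel units."""
    H, W, D = Q.shape
    Ht, Wt = K.shape[:2]
    M = np.zeros((H, W, V.shape[-1]))
    offsets = np.linspace(-0.5, 0.5, k)              # uniform grid across the rectangle
    for i in range(0, H, w):                         # cells are independent of each other
        for j in range(0, W, w):
            cell = Q[i:i + w, j:j + w]
            ch, cw = cell.shape[:2]
            # Mean flow and uncertainty of the cell give the span centre and size.
            mu_x, mu_y, sig_x, sig_y = flow[i:i + w, j:j + w].reshape(-1, 4).mean(axis=0)
            xs = np.clip(np.rint(mu_x + offsets * n * sig_x), 0, Wt - 1).astype(int)
            ys = np.clip(np.rint(mu_y + offsets * n * sig_y), 0, Ht - 1).astype(int)
            Ks = K[ys[:, None], xs[None, :]].reshape(-1, D)            # (k*k, D) sampled keys
            Vs = V[ys[:, None], xs[None, :]].reshape(-1, V.shape[-1])  # (k*k, D) sampled values
            attn = softmax(cell.reshape(-1, D) @ Ks.T / np.sqrt(D))
            M[i:i + w, j:j + w] = (attn @ Vs).reshape(ch, cw, -1)
    return M

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 4, 8))
K = rng.normal(size=(6, 6, 8))
V = rng.normal(size=(6, 6, 8))
flow = np.zeros((4, 4, 4))
flow[..., 0], flow[..., 1] = 1.0, 2.0        # every query points at target pixel (x=1, y=2)
M = local_cross_attention(Q, K, V, flow)     # sigma = 0, so the span collapses to one pixel
print(M.shape)
```

With zero uncertainty every sampled token coincides with the predicted correspondence, so the retrieved message is simply the value at that pixel; larger $\sigma$ spreads the same number of samples over a wider rectangle, which is why the cost per cell stays fixed.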

3.5 Matches Determination

We inherit the scheme in LoFTR [39] to generate final correspondences, including a coarse matching stage and a sub-pixel refinement stage.

After being updated by GLA blocks, we flatten the output features of the two images and construct a correlation matrix $C \in \mathbb{R}^{N_s \times N_t}$ from their temperature-scaled inner products, where $N_s, N_t$ are the feature numbers of the two images. By applying dual-direction softmax in both column and row dimensions, a score matrix is given by $S = \mathrm{softmax}_{\mathrm{row}}(C) \odot \mathrm{softmax}_{\mathrm{col}}(C)$, from which we retain coarse-level matches by mutual nearest neighbour (MNN) and by filtering out scores below a certain threshold. The coarse matches are further fed into a correlation-based refinement block, which is the same as in LoFTR [39], to obtain the final matching results.
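The coarse matching stage can be sketched as follows. This is an illustrative NumPy example only: the function name, the threshold value of 0.2 and the choice of multiplying the inner products by the temperature are assumptions of the sketch.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coarse_matches(Fs, Ft, temperature=10.0, threshold=0.2):
    """Fs: (Ns, D), Ft: (Nt, D) flattened features. Returns (matches, score matrix)."""
    C = temperature * (Fs @ Ft.T)                    # correlation matrix (scaling assumed)
    S = softmax(C, axis=1) * softmax(C, axis=0)      # dual-direction softmax
    row_best = S.argmax(axis=1)                      # best target for every source token
    col_best = S.argmax(axis=0)                      # best source for every target token
    idx = np.arange(S.shape[0])
    mutual = col_best[row_best] == idx               # mutual nearest neighbour check
    keep = mutual & (S[idx, row_best] > threshold)   # drop low-confidence matches
    return np.stack([idx[keep], row_best[keep]], axis=1), S

Fs = np.eye(3)                  # three orthogonal unit features
Ft = Fs[[2, 0, 1]]              # the same features in a shuffled order
matches, S = coarse_matches(Fs, Ft)
print(matches.tolist())
```

Each returned row is a (source index, target index) pair; in the toy example the shuffled order of `Ft` is recovered.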

3.6 Loss Formulation

We formulate the final loss from three parts, (1) coarse matches loss , (2) fine-level loss and (3) flow estimation loss

(7)

For the coarse-level loss, the ground-truth matches are determined by reprojection using the depth maps and camera poses in the datasets. We supervise the dual-softmax score matrix with a cross entropy loss

(8)

The fine-level loss is supervised directly with L2-distance between each refined coordinates and ground truth reprojection coordinates, which are further normalized by the coordinate variance.

For flow estimation supervision, we minimize the negative log-likelihood of the ground-truth flow under each estimated distribution. Formally, given the flow estimation from each layer and the ground-truth flow, the flow estimation loss is defined as

(9)

In our implementation, this log-likelihood formulation can be further substituted and decomposed into a more compact form, which is elaborated in Appendix B.
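As a rough illustration of this supervision, the negative log-likelihood of an axis-independent Gaussian can be written in NumPy as below. The per-axis independence, the validity mask and the function name are assumptions of this sketch; the compact form actually used is the one described in Appendix B.

```python
import numpy as np

def flow_nll(mu, log_sigma, gt, valid):
    """Mean Gaussian negative log-likelihood of ground-truth flow.

    mu, log_sigma, gt: (H, W, 2) predicted means, log standard deviations and
    ground-truth coordinates; valid: (H, W) boolean mask of supervised pixels."""
    sigma2 = np.exp(2.0 * log_sigma)
    nll = log_sigma + 0.5 * (gt - mu) ** 2 / sigma2 + 0.5 * np.log(2.0 * np.pi)
    return nll.sum(axis=-1)[valid].mean()            # sum over x/y, average over pixels

mu = np.zeros((2, 2, 2))
log_sigma = np.zeros((2, 2, 2))                      # sigma = 1 everywhere
valid = np.ones((2, 2), dtype=bool)
loss_exact = flow_nll(mu, log_sigma, np.zeros((2, 2, 2)), valid)
loss_off = flow_nll(mu, log_sigma, np.ones((2, 2, 2)), valid)
print(loss_exact, loss_off)
```

The `log_sigma` term penalizes large uncertainty, while the squared error divided by the variance penalizes confident but wrong predictions, so the two terms balance each other during training.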

3.7 Implementation Details

Our network shares the same ResNet-18 [12] CNN feature extractor with LoFTR. After feature extraction and flow initialization, we use 4 GLA blocks for updating. For the adaptive attention span, we set the span coefficient to 5, meaning that 5 standard deviations are used to crop the local neighbourhood region for each token. We uniformly sample features in each cropped local region.

We train two different models specified for indoor and outdoor scenes respectively. Both models are optimized using Adam with learning rate for 30 epochs on 8 V-100 GPUs. Indoor model is trained on ScanNet [4] dataset with batch size 24, where the training consumes 5 days. Outdoor model is trained on MegaDepth [19] with batch size 8, taking 2 days to converge. More details about implementation are introduced in Appendix A.3.

4 Experiments

In this section, we demonstrate the performance of our method on two-view pose estimation and visual localization tasks, among both indoor and outdoor scenes. Besides, we conduct ablation study to validate the effectiveness of key design components of our method.

4.1 Two-view Pose Estimation

We resort to two popular datasets, ScanNet [4] and MegaDepth [19], introduced below, to demonstrate the matching ability of our method in indoor scenes and outdoor scenes, respectively. We also provide additional results on YFCC100M [44] and Image Matching Challenge(IMC) 2022 in Appendix C.

Indoor two-view matching dataset. The ScanNet dataset [4] is composed of 1613 sequences, each of which contains RGB images exhibiting large view changes and repetitive or textureless patterns, with associated ground-truth depth maps and camera poses. For fair comparison, we follow the same training and testing protocols used by SuperGlue [34] and LoFTR [39], where 230M and 1.5K image pairs are sampled for training and testing, respectively. Consistent with LoFTR, we resize all test images to $640 \times 480$.

Outdoor two-view matching dataset. MegaDepth [19] consists of 196 3D reconstructions from 1M Internet images, whose camera poses and depth maps are initially computed from COLMAP [37] and then refined as ground-truth. We perform two view pose estimation on 1.5k testing pairs. All test images are resized so that their longest dimension is 1152.

Evaluation protocols. Following previous works [34, 39], we train and evaluate our method separately on the two datasets. The two-view pose is recovered by solving the essential matrix from the produced correspondences, while pose accuracy is measured by AUC at multiple error thresholds (5°, 10° and 20°). A pose is only considered accurate if both its angular rotation error and its translation error are under a certain threshold compared with the ground-truth pose.
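For reference, the AUC metric can be computed along the lines of the NumPy sketch below, which takes the larger of the rotation and translation angular errors as the pose error of each pair and integrates the recall curve up to each threshold. The function name and the toy inputs are assumptions of this sketch.

```python
import numpy as np

def pose_auc(rot_err_deg, trans_err_deg, thresholds=(5.0, 10.0, 20.0)):
    """Area under the recall-vs-error curve, normalised by each threshold."""
    # A pose is accurate only if both errors are small, so use the larger one.
    errors = np.sort(np.maximum(rot_err_deg, trans_err_deg))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)                       # first error >= t
        e = np.concatenate([errors[:last], [t]])                # clip the curve at t
        r = np.concatenate([recall[:last], [recall[last - 1]]])
        area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)  # trapezoid rule
        aucs.append(area / t)
    return aucs

rot = np.array([1.0, 2.5, 30.0])      # toy rotation errors in degrees
trans = np.array([2.5, 1.0, 3.0])     # toy translation angular errors in degrees
print(pose_auc(rot, trans))
```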

Comparative methods. We compare the proposed method with 1) detector-based approaches, including SuperGlue [34] and SGMNet [3] that are equipped with SuperPoint(SP) [6] as local feature extractor, 2) detector-free approaches, including DRC-Net [18], PDC-Net [48, 49], LoFTR [39], QuadTree Attention [42], MatchFormer [53] and DKM [9].

Results. As presented in Table 1 and Table 2, our method consistently achieves the best accuracy in both indoor and outdoor scenes. The visualization in Figure 4 qualitatively demonstrates the performance of our method against other matchers. More visualizations are provided in Appendix D.

Figure 4: Qualitative results of dense matching in different scenarios.

Table 1: Two-view pose estimation results on ScanNet dataset [4] in indoor scenes.

Method                     Pose Estimation AUC
                           @5°     @10°    @20°
SP [6]+SuperGlue [34]      16.2    33.8    51.8
SP [6]+SGMNet [3]          15.4    32.1    48.3
DRC-Net [18]                7.7    17.9    30.5
PDC-Net+(H) [49]           20.2    39.4    57.1
LoFTR [39]                 22.0    40.8    57.6
QuadTree [42]              24.9    44.7    61.8
MatchFormer [53]           24.3    43.9    61.4
DKM [9]                    24.8    44.4    61.9
Ours                       25.6    46.0    63.3

Table 2: Two-view pose estimation results on MegaDepth dataset [19] in outdoor scenes.

Method                     Pose Estimation AUC
                           @5°     @10°    @20°
SP [6]+SuperGlue [34]      42.2    61.2    75.9
SP [6]+SGMNet [3]          40.5    59.0    73.6
DRC-Net [18]               27.0    42.9    58.3
PDC-Net+(H) [49]           43.1    61.9    76.1
LoFTR [39]                 52.8    69.2    81.2
QuadTree [42]              54.6    70.5    82.2
MatchFormer [53]           53.3    69.7    81.8
DKM [9]                    54.5    70.7    82.3
Ours                       55.3    71.5    83.1

4.2 Visual Localization

Apart from evaluation on two-view pose estimation task, we further integrate our network into a visual localization pipeline, and use two popular datasets, InLoc [41] and Aachen Day-Night v1.1 [57, 36, 35] datasets, to demonstrate performance on multi-view matching in indoor scenes and outdoor scenes, respectively.

Indoor localization dataset. InLoc dataset [41] contains a database of RGBD indoor images that are geometrically registered to form the reference scene model, while RGB query images are provided for visual localization, annotated with manually verified camera poses. Great challenge is posed in matching textureless or repetitive patterns under large perspective differences.

Outdoor localization dataset. The Aachen Day-Night v1.1 dataset [57] depicts a city whose reference scene model is built upon day-time images. For visual localization, the dataset provides additional day-time and night-time images as queries. The main challenge lies in identifying correspondences in night-time images under extremely large illumination changes.

Evaluation protocols. We follow the instructions from Long-Term Visual Localization Benchmark [46] to compute query poses. For both datasets, we use pre-trained HLoc [33] to retrieve candidate pairs, and recover camera poses with the model trained on MegaDepth dataset following SuperGlue [34] and LoFTR [39]. More details on localization pipeline are elaborated in Appendix A.4.

Results. On the InLoc dataset, as shown in Table 3, our method achieves the overall best results among the compared methods. On Aachen v1.1, as shown in Table 4, our method outperforms all other methods except SuperGlue. We partially ascribe this to the fact that we use only coarse matches for database reconstruction (see Appendix A.4), causing localization error that harms pose estimation. In general, our method generalizes well in practical pipelines.

Table 3: Visual localization results on InLoc dataset [41].

Method                                DUC1                   DUC2
                                      (0.25m, 2°) / (0.5m, 5°) / (1m, 10°)
HLoc [33] + SP [6]+SuperGlue [34]     49.0 / 68.7 / 80.8     53.4 / 77.1 / 82.4
HLoc [33] + LoFTR [39]                47.5 / 72.2 / 84.8     54.2 / 74.8 / 85.5
HLoc [33] + Ours                      51.5 / 73.7 / 86.4     55.0 / 74.0 / 81.7

Table 4: Visual localization results on Aachen V1.1 dataset [57].

Method                                Day                    Night
                                      (0.25m, 2°) / (0.5m, 5°) / (1m, 10°)
Localization with matching pairs provided in dataset
R2D2 [29] + NN                        -                      71.2 / 86.9 / 98.9
ASLFeat [23] + NN                     -                      72.3 / 86.4 / 97.9
SP [6]+SuperGlue [34]                 -                      73.3 / 88.0 / 98.4
SP [6]+SGMNet [3]                     -                      72.3 / 85.3 / 97.9
Localization with matching pairs generated by HLoc
SP [6]+SuperGlue [34]                 89.8 / 96.1 / 99.4     77.0 / 90.6 / 100.0
LoFTR [39]                            88.7 / 95.6 / 99.0     78.5 / 90.6 / 99.0
Ours                                  89.4 / 95.6 / 99.0     77.5 / 91.6 / 99.5

4.3 Ablation Study

To validate the effectiveness of different design components of our method, we conduct ablation experiments on ScanNet dataset [4] following the same setting in Section 4.1. Specifically, we compare three designs of attention structure:

  • Single-Level Attn.: A design with only global attention at coarsest feature maps without the need of flow estimation. In this design, global context is well captured, whereas essential locality in motion smoothness is omitted and fine-grained message exchange becomes difficult.

  • Multi-Level Attn.: A design with the hierarchical attention framework proposed in this paper, except that the size of local attention span is fixed to 13 px, i.e., the statistical mean of the adaptive attention span used in our network.

  • Adaptive Span Attn.: Our full design that enables hierarchical attention with adaptive attention span. By this means, the need of context for different pixels is dynamically decided regarding different matchability.

As presented in Table 5, both hierarchical global-local attention and adaptive attention span improve overall performance by a considerable margin, validating the necessity of our network designs.

Table 5: Ablation study on ScanNet dataset [4].

Method                  Pose Estimation AUC
                        @5°      @10°     @20°
Single-Level Attn.      22.65    40.72    59.06
Multi-Level Attn.       24.85    44.86    62.71
Adaptive Span Attn.     25.61    46.04    63.33

Table 6: Flow estimation accuracy.

Stage      Prec. @6px (%)    5σ (px)
                             Matchable    Unmatchable    Total
Iter#1     69.1              9.2          19.4           13.4
Iter#2     71.2              8.2          20.2           12.5
Iter#3     72.0              7.8          23.8           12.6
Iter#4     72.3              7.7          27.1           13.3

4.4 Understanding ASpanFormer

Flow estimation. We analyze the flow estimation through multiple iterations. As shown in Table 6, precision of flow regression is gradually improved as attention iterations are performed and converges after four iterations.

As for uncertainty estimation, we split all pixels into two categories, matchable and unmatchable pixels, identified by ground-truth camera poses and depths, and report their mean standard deviation ($\sigma$). On the one hand, the mean $\sigma$ decreases with iterations for matchable pixels, as the network becomes more certain about its flow prediction in later stages. On the other hand, the network gradually increases the uncertainty values of unmatchable pixels to prevent over-confident fitting to a certain region.

Uncertainty map. In Figure 5, we provide visualization of uncertainty map of flow prediction. Overlapping and non-overlapping regions are firstly distinguished, while uncertainty values in textureless regions are usually larger, indicating context of larger size is required during cross attention.

Figure 5: Visualization of uncertainty map which is predicted along with flows, warmer color indicates smaller uncertainties.

Runtime evaluation. We evaluate the runtime of the proposed method and compare it with LoFTR [39], as both methods use a Transformer backend. The runtime is tested on 100 randomly sampled ScanNet image pairs ($640 \times 480$) with an NVIDIA V100 GPU. Compared with LoFTR, the runtime differs only in the Attention Module, as we adopt the same Local Feature CNN backbone and coarse-to-fine matching module. As shown in Table 7, the proposed method is overall slightly slower than LoFTR due to the more complicated attention operation.

Table 7: Runtime evaluated on 640×480 images.

Stage                  Runtime (ms)
                       LoFTR     Ours
Local Feature CNN      32.2      32.2
Attention Module       24.6      40.5
Matching               40.9      40.8
Total                  97.7      113.5

5 Conclusion

In this paper, we have proposed a novel Transformer framework based on a feature hierarchy, whose attention span is adaptively decided so as to capture both long-range dependencies and fine-grained details in local regions. State-of-the-art results validate the effectiveness of our method. With more engineering optimizations, we look forward to wider application of our method in real use.

Appendix

Appendix A Implementation Details

In this section, we provide more details about our network implementation.

A.1 Network Settings

We use the same ResNet-18 as in LoFTR for the initial feature extractor, which outputs feature maps at two resolutions, a coarse one and a fine one. The coarse feature map is passed into our Transformer-based network for updating, while the fine one is used in coordinate refinement of fine matches. For dual-softmax in coarse matching, we adopt a learnable temperature which is initialized as 10.

We use four GLA blocks to update features. For hierarchical attention, we fix the coarsest feature maps in resolution , where for indoor settings and for outdoor settings.

A.2 Flow Regression

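As stated in Sec. 3.4, we use an MLP to regress an auxiliary flow map in each GLA block. Given the $D$-dimensional feature at a pixel, we use an MLP with shape $(D, 64, 4)$ to regress a 4-dimensional vector $(o_1, o_2, o_3, o_4)$. For the corresponding coordinates $(\mu_x, \mu_y)$, we normalize the first two values with a sigmoid function and rescale them to the range of the image resolution. For the standard deviations $(\sigma_x, \sigma_y)$, we regress the last two values as their logarithms. Formally,

$$\mu_x = W \cdot \mathrm{sigmoid}(o_1), \quad \mu_y = H \cdot \mathrm{sigmoid}(o_2), \quad \sigma_x = \exp(o_3), \quad \sigma_y = \exp(o_4), \tag{10}$$

where $H, W$ are the image height and width.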
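A minimal sketch of how the four regressed values could be decoded into a flow coordinate and its standard deviation, following the description above (sigmoid plus rescaling for the coordinates, exponential for the log-deviations). The function name `decode_flow` and the NumPy formulation are illustrative assumptions, not the released implementation.

```python
import numpy as np

def decode_flow(raw, height, width):
    """Decode raw MLP outputs of shape (..., 4) into (mu, sigma).

    raw[..., :2] -> coordinates via sigmoid, rescaled to (width, height).
    raw[..., 2:] -> log standard deviations, mapped back with exp.
    """
    raw = np.asarray(raw, dtype=np.float64)
    sig = 1.0 / (1.0 + np.exp(-raw[..., :2]))                 # values in (0, 1)
    mu = sig * np.array([width, height], dtype=np.float64)    # (x, y) in pixels
    sigma = np.exp(raw[..., 2:])                              # strictly positive
    return mu, sigma

# An all-zero output decodes to the image centre with unit deviation.
mu, sigma = decode_flow(np.zeros((1, 4)), height=480, width=640)
print(mu, sigma)
```

Regressing the logarithm keeps the predicted deviation positive without an explicit constraint.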

A.3 Training Details

For both indoor and outdoor training, we adopt the same multi-step training strategy as in the officially released LoFTR code. More specifically, the learning rate is linearly warmed up in the first epoch and then halved every two or three epochs. The learning rate curve is illustrated in Fig. 6.
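The schedule described above can be sketched as a simple function of the training step. The halving interval (2 epochs here), the warm-up start factor (0.1) and all names are assumptions chosen for illustration; the actual values follow the LoFTR training configuration.

```python
def learning_rate(step, base_lr, steps_per_epoch, halve_every=2, warmup_start=0.1):
    """Linear warm-up during epoch 0, then halve the rate every `halve_every` epochs.

    warmup_start is the fraction of base_lr used at step 0 (assumed value).
    """
    epoch, offset = divmod(step, steps_per_epoch)
    if epoch == 0:
        frac = offset / steps_per_epoch
        return base_lr * (warmup_start + (1.0 - warmup_start) * frac)
    return base_lr * 0.5 ** (epoch // halve_every)

# Print the rate at the start of each of the first six epochs.
for e in range(6):
    print(e, learning_rate(e * 100, base_lr=1.0, steps_per_epoch=100))
```

With these assumed settings the rate reaches `base_lr` at the start of epoch 1 and drops to half of it at epoch 2.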

A.4 Visual Localization Details

We follow the hierarchical localization pipeline (https://github.com/cvg/Hierarchical-Localization) to perform visual localization experiments on the Aachen Day-Night and InLoc datasets.

For Aachen Day-Night, we first triangulate reference models by using only coarse matches across images. We then generate fine level matches between query images and database images, where the database images are taken as left images, so that the fine level matches can be registered to triangulated 3D tracks.

For InLoc dataset, we directly generate fine level matches between query and database images, where the 2D match points on reference images are projected to 3D space through the provided depth map. We omit image pairs with fewer than 25 matches.
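The InLoc step above (lifting 2D match points to 3D through a depth map and skipping pairs with too few matches) can be sketched as below. The pinhole intrinsics matrix `K`, the nearest-pixel depth lookup and the function names are assumptions for illustration; only the threshold of 25 matches comes from the text.

```python
import numpy as np

MIN_MATCHES = 25  # pairs with fewer matches are omitted, as stated in the text

def backproject(points_2d, depth, K):
    """Lift (N, 2) pixel coordinates (x, y) to 3D camera-frame points.

    depth: (H, W) depth map; K: (3, 3) pinhole intrinsics (assumed camera model).
    Returns (N, 3) points and a boolean mask of points with valid depth.
    """
    pts = np.asarray(points_2d, dtype=np.float64).reshape(-1, 2)
    cols = np.round(pts[:, 0]).astype(int)
    rows = np.round(pts[:, 1]).astype(int)
    h, w = depth.shape
    inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    z = np.zeros(len(pts))
    z[inside] = depth[rows[inside], cols[inside]]      # nearest-pixel lookup
    valid = inside & (z > 0)
    homog = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    rays = homog @ np.linalg.inv(K).T                  # rays with unit z
    return rays * z[:, None], valid

def keep_pair(num_matches):
    """Decide whether an image pair has enough matches to be used."""
    return num_matches >= MIN_MATCHES
```

Points outside the image or with zero depth are flagged invalid so they can be dropped before pose estimation.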

Figure 6: Learning rate curve across iterations.

A.5 Some Effective Designs

We provide ablations for some additional useful designs in our network: (1) learnable temperature for softmax at each level, (2) convolution-based FFN, and (3) normalized positional encoding when testing resolution differs from training resolution. Ablation studies for these techniques are provided in Tab. 8 and Tab. 9.

Learnable Temperature. As stated in Sec. 3.4, messages are computed from different levels of feature maps through global or local attention, where softmax is applied to different numbers of tokens. A concern about softmax is that the number of tokens largely affects the final distribution. To balance the impact of different token numbers in global/local attention, we adopt three learnable temperature parameters for softmax in fine-, medium- and coarse-level features respectively.
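The effect of token count on softmax, and how a temperature can compensate for it, can be illustrated with a small sketch. Here the temperature is assumed to multiply the attention scores and is a fixed number rather than a learned parameter; the function name is hypothetical.

```python
import numpy as np

def softmax_with_temperature(scores, temperature):
    """Softmax over the last axis with scores scaled by `temperature` (assumed form)."""
    z = np.asarray(scores, dtype=np.float64) * temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# One relevant token (score 1) among irrelevant ones (score 0).
few = np.zeros(4)
few[0] = 1.0
many = np.zeros(64)
many[0] = 1.0

p_few = softmax_with_temperature(few, 1.0)      # local attention, few tokens
p_many = softmax_with_temperature(many, 1.0)    # global attention, many tokens
p_sharp = softmax_with_temperature(many, 5.0)   # larger temperature for many tokens
print(p_few[0], p_many[0], p_sharp[0])
```

With the same scores, the relevant token receives far less weight among 64 tokens than among 4; a per-level temperature lets each level recover a comparable sharpness.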

Convolutional FFN. As shown in Sec. 3, our network is fully based on cross attention for cross-view message passing, while self attention is absent. Deviating from the common practice of employing self attention for intra-image message passing, we find in our experiments that replacing self attention and the MLP-based FFN with a convolution-based FFN leads to better overall performance.

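Normalized Positional Encoding. Positional encoding (PE) in LoFTR is defined as

$$PE^{i}_{x,y} = \begin{cases} \sin(\omega_k \cdot x), & i = 4k \\ \cos(\omega_k \cdot x), & i = 4k+1 \\ \sin(\omega_k \cdot y), & i = 4k+2 \\ \cos(\omega_k \cdot y), & i = 4k+3 \end{cases} \qquad \omega_k = \frac{1}{10000^{2k/d}},$$

where $(x, y)$ is the pixel coordinate, $i$ is the channel index and $d$ is the feature dimension.

A concern about this PE is that unseen coordinates will be used in the encoding when the testing resolution differs from the training resolution, which harms the network's capability of precise localization and boundary awareness. To mitigate the issue, we adopt a simple normalization technique,

$$x' = x \cdot \frac{W_{train}}{W_{test}}, \tag{11}$$
$$y' = y \cdot \frac{H_{train}}{H_{test}}, \tag{12}$$

where $W_{train}, H_{train}$ and $W_{test}, H_{test}$ are the width/height of the training/testing image, and $(x', y')$ replaces $(x, y)$ in the encoding. We find that this normalization boosts the performance of our method when the training and testing image resolutions differ. Aligning the testing PE with the training PE is especially critical for precise flow prediction, since flow regression relies on PE to regress flow coordinates.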
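A minimal sketch of the normalization idea: coordinates at test time are rescaled by the ratio of training to testing size before being fed to a sinusoidal encoding, so the encoded coordinate range matches what was seen in training. The 0-based coordinates, the frequency formula and the function name `sine_pe_2d` are assumptions for this example and may differ from the released code.

```python
import numpy as np

def sine_pe_2d(height, width, d_model, train_hw=None):
    """Return a (d_model, H, W) 2D sinusoidal positional encoding.

    If train_hw = (H_train, W_train) is given, coordinates are rescaled by
    H_train / height and W_train / width before encoding.
    """
    if d_model % 4 != 0:
        raise ValueError("d_model must be divisible by 4")
    y = np.arange(height, dtype=np.float64)[:, None]   # (H, 1)
    x = np.arange(width, dtype=np.float64)[None, :]    # (1, W)
    if train_hw is not None:
        y = y * (train_hw[0] / height)
        x = x * (train_hw[1] / width)
    k = np.arange(d_model // 4, dtype=np.float64)
    omega = (1.0 / 10000.0 ** (2.0 * k / d_model))[:, None, None]  # (d/4, 1, 1)
    pe = np.zeros((d_model, height, width))
    pe[0::4] = np.sin(omega * x)   # broadcast along rows
    pe[1::4] = np.cos(omega * x)
    pe[2::4] = np.sin(omega * y)   # broadcast along columns
    pe[3::4] = np.cos(omega * y)
    return pe

# Training grid 6x8, testing grid 12x16: a test pixel at the same relative
# position receives the same encoding as its training counterpart.
pe_train = sine_pe_2d(6, 8, d_model=8)
pe_test = sine_pe_2d(12, 16, d_model=8, train_hw=(6, 8))
print(np.allclose(pe_test[:, 6, 8], pe_train[:, 3, 4]))
```

Without the rescaling, test pixels beyond the training grid would be encoded with coordinates never seen during training.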

In Tab. 9, we provide ablation study results for normalized positional encoding (NPE). The results are obtained on MegaDepth dataset with all images resized to 1152 resolution, while the models are trained in 832 resolution.

| Method | AUC@5° | AUC@10° | AUC@20° |
|---|---|---|---|
| ASpanFormer w/o learnable temperature | 25.0 | 45.7 | 62.3 |
| ASpanFormer w/ SA+MLP-FFN | 24.8 | 45.5 | 62.0 |
| ASpanFormer | 25.6 | 46.0 | 63.3 |

Table 8: Ablations on network designs on the ScanNet [4] dataset (pose estimation AUC). SA+MLP-FFN means adopting 1/4 downsampled self attention after each GLA block and replacing all convolutions in the FFN of both self/cross attention with MLPs.

| Method | AUC@5° | AUC@10° | AUC@20° | Flow Acc. |
|---|---|---|---|---|
| ASpanFormer w/o NPE | 52.8 | 69.6 | 81.1 | 22.6 |
| ASpanFormer | 55.3 | 71.5 | 83.1 | 72.3 |

Table 9: Ablation study of Normalized Positional Encoding (NPE) on the MegaDepth dataset [19] (pose estimation AUC and flow accuracy).

Appendix B Flow Loss

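We formulate flow supervision as maximum-likelihood estimation for a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$,

$$\mathcal{L}_{flow} = -\sum_{i} \log p\left(\hat{\phi}_i \mid \mu_i, \sigma_i\right), \tag{13}$$

where $\hat{\phi}_i = (\hat{x}_i, \hat{y}_i)$ is the ground truth flow and $\mu_i = (\mu_{x,i}, \mu_{y,i})$, $\sigma_i = (\sigma_{x,i}, \sigma_{y,i})$ are predicted parameters at location $i$. Substituting into the Gaussian distribution formula, with the $x$ and $y$ components treated independently, we have

$$\mathcal{L}_{flow} = -\sum_{i} \log\left[\frac{1}{2\pi\sigma_{x,i}\sigma_{y,i}} \exp\left(-\frac{(\hat{x}_i-\mu_{x,i})^2}{2\sigma_{x,i}^2} - \frac{(\hat{y}_i-\mu_{y,i})^2}{2\sigma_{y,i}^2}\right)\right] \tag{14}$$
$$= \sum_{i} \left[\frac{(\hat{x}_i-\mu_{x,i})^2}{2\sigma_{x,i}^2} + \frac{(\hat{y}_i-\mu_{y,i})^2}{2\sigma_{y,i}^2} + \log\sigma_{x,i} + \log\sigma_{y,i} + \log 2\pi\right]. \tag{15}$$

In implementation, we let $r = \log\sigma$ and omit constant terms, then

$$\mathcal{L}_{flow} = \sum_{i} \left[\frac{(\hat{x}_i-\mu_{x,i})^2}{2e^{2r_{x,i}}} + \frac{(\hat{y}_i-\mu_{y,i})^2}{2e^{2r_{y,i}}} + r_{x,i} + r_{y,i}\right]. \tag{16}$$

Intuitively, this loss formulation is a weighted sum of L2-distances between estimated flows and ground truth flows. The term $r_{x,i} + r_{y,i}$ is a regularization term encouraging lower uncertainty. The overall effect of the flow loss is to minimize uncertainty and flow estimation error simultaneously.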
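A minimal sketch of the Gaussian negative log-likelihood described in this appendix, with the log standard deviation regressed directly and constant terms dropped. The array shapes, the optional validity mask, the averaging over pixels and the function name are assumptions for illustration rather than the released training code.

```python
import numpy as np

def flow_nll_loss(mu, log_sigma, target, valid=None):
    """Gaussian NLL for flow with per-axis uncertainty (constants omitted).

    mu, log_sigma, target: arrays of shape (..., 2) holding (x, y) components.
    valid: optional boolean mask of shape (...) selecting supervised pixels.
    """
    mu = np.asarray(mu, dtype=np.float64)
    log_sigma = np.asarray(log_sigma, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    sq_err = (target - mu) ** 2
    # Squared error weighted by 1 / (2 sigma^2), plus log sigma as regularizer.
    per_axis = 0.5 * sq_err * np.exp(-2.0 * log_sigma) + log_sigma
    per_pixel = per_axis.sum(axis=-1)
    if valid is not None:
        per_pixel = per_pixel[np.asarray(valid, dtype=bool)]
    return per_pixel.mean()   # averaged over pixels here (assumption)

# A one-pixel error of 1 in both axes with unit sigma (log sigma = 0).
print(flow_nll_loss(mu=[[0.0, 0.0]], log_sigma=[[0.0, 0.0]], target=[[1.0, 1.0]]))
```

A large predicted deviation down-weights the squared error but is penalized by the log term, so the network cannot lower the loss simply by declaring every pixel uncertain.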

Appendix C Additional Quantitative Results

In this part, we provide additional experimental results on the YFCC100M dataset and the Image Matching Challenge 2022 (IMC 2022) Kaggle benchmark.

C.1 Results on YFCC100M

YFCC100M contains a collection of internet images across various tourism landmarks. We adopt the test set from 4 selected landmark sequences, as is done in previous works [56, 34, 3]. 1000 image pairs are sampled from each sequence, which yields a test set of 4000 pairs in total. We use OpenCV RANSAC for two-view pose estimation, where the RANSAC threshold for all methods is set to in normalized image coordinate space. Experiment results are given in Tab. 10, where our method outperforms all compared methods.

C.2 Results on Image Matching Challenge 2022

We submit our method to Image Matching Challenge (IMC) 2022 and report the results in Tab. 11. We resize the input image to a fixed resolution [1472,832] and use OpenCV USAC_MAGSAC to estimate fundamental matrix, where the RANSAC threshold is set to 0.2 pixel. The results show that our method consistently outperforms other strong comparative baselines.

| Method | AUC@5° | AUC@10° | AUC@20° |
|---|---|---|---|
| SP [6]+SuperGlue [34] | 38.1 | 58.8 | 74.7 |
| RootSIFT+SGMNet [3] | 35.5 | 55.2 | 71.9 |
| DRC-Net [18] | 29.5 | 50.1 | 66.8 |
| PDC-Net+(H) [49] | 39.1 | 60.1 | 76.5 |
| LoFTR [39] | 42.4 | 62.5 | 77.3 |
| Ours | 44.5 | 63.8 | 78.4 |

Table 10: Two-view pose estimation results (AUC) on the YFCC100M [44] dataset in outdoor scenes.

| Method | mAA (Private) | mAA (Public) |
|---|---|---|
| SP [6]+SuperGlue [34] | 0.724 | 0.728 |
| LoFTR [39] | 0.783 | 0.772 |
| MatchFormer [53] | 0.783 | 0.774 |
| QuadTree [42] | 0.817 | 0.812 |
| Ours | 0.838 | 0.833 |

Table 11: Two-view pose estimation results (mAA) on the IMC 2022 Kaggle benchmark. The results of MatchFormer and QuadTree attention are reported by the 4th solution on the Kaggle discussion forum [17].

Appendix D Additional Visualizations

We provide more visualization results in this part. In Fig. 7, we provide qualitative comparisons between SuperGlue, LoFTR and our method. In Fig. 8, we provide flow predictions across GLA block iterations. In Fig. 9, we provide additional visualizations of uncertainty heatmaps and the corresponding adaptive attention spans.

Figure 7: Visualizations of matches obtained through SuperGlue, LoFTR and ASpanFormer (ours). Our method produces more accurate and denser matches compared with both SOTA sparse and dense matching networks.
Figure 8: Visualizations of flow prediction across GLA iterations. We filter out flow predictions with high uncertainty. Note that the flow maps are in () resolution. As more GLA blocks are employed for feature updating, the flow map gradually prunes occluded or non-overlapping regions.
Figure 9: Visualizations of uncertainty heatmap (left) and adaptive attention span (right). Our network sharply focuses on regions with rich and distinctive textures using a small attention span, while larger contexts are extracted for low-texture or uncertain regions. In particular, very large attention spans are generated for non-overlapping or occluded areas, preventing the network from falsely focusing on certain regions.

References

  • [1] J. Bian, W. Lin, Y. Liu, L. Zhang, S. Yeung, M. Cheng, and I. Reid (2020) GMS: grid-based motion statistics for fast, ultra-robust feature correspondence. IJCV. Cited by: §2.1.
  • [2] L. Cavalli, V. Larsson, M. R. Oswald, T. Sattler, and M. Pollefeys (2020) Handcrafted outlier detection revisited. In ECCV, Cited by: §2.1.
  • [3] H. Chen, Z. Luo, J. Zhang, L. Zhou, X. Bai, Z. Hu, C. Tai, and L. Quan (2021) Learning to match features with seeded graph matching network. In ICCV, Cited by: §C.1, Table 11, §2.1, §4.1, Table 2, Table 4.
  • [4] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In CVPR, Cited by: Table 9, §3.7, §4.1, §4.1, §4.3, Table 2, Table 6.
  • [5] M. Danelljan, L. Gool, and R. Timofte (2020) Probabilistic regression for visual tracking. In CVPR, Cited by: §2.3, §3.1.2.
  • [6] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) Superpoint: self-supervised interest point detection and description. In CVPRW, Cited by: Table 11, §1, §2.1, §3.2, §4.1, Table 2, Table 4.
  • [7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §1.
  • [8] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler (2019) D2-net: a trainable cnn for joint description and detection of local features. In CVPR, Cited by: §1, §2.1.
  • [9] J. Edstedt, M. Wadenbäck, and M. Felsberg (2022) Deep kernelized dense geometric matching. Preprint. Cited by: §4.1, Table 2.
  • [10] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox (2015) FlowNet: learning optical flow with convolutional networks. In ICCV, Cited by: §2.3.
  • [11] J. Gast and S. Roth (2018) Lightweight probabilistic deep networks. In CVPR, Cited by: §2.3, §3.1.2.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.7.
  • [13] E. Ilg, Ö. Çiçek, S. Galesso, A. Klein, O. Makansi, F. Hutter, and T. Brox (2018) Uncertainty estimates and multi-hypotheses networks for optical flow. In ECCV, Cited by: §2.3, §3.1.2.
  • [14] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In CVPR, Cited by: §2.2, §2.3.
  • [15] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, and K. M. Yi (2021) COTR: correspondence transformer for matching across images. In CVPR, Cited by: §1, §2.1, §2.2, §2.3, §3.3.
  • [16] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret (2020) Transformers are rnns: fast autoregressive transformers with linear attention. In ICML, Cited by: §1.
  • [17] I. Lashkov 4th solution of imc 2022. Note: https://www.kaggle.com/competitions/image-matching-challenge-2022/discussion/328805 Cited by: Table 11.
  • [18] X. Li, K. Han, S. Li, and V. Prisacariu (2020) Dual-resolution correspondence networks. In NeurIPS, Cited by: Table 11, §1, §2.1, §2.2, §4.1, Table 2.
  • [19] Z. Li and N. Snavely (2018) Megadepth: learning single-view depth prediction from internet photos. In CVPR, Cited by: Table 9, §3.7, §4.1, §4.1, Table 2.
  • [20] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. IJCV. Cited by: §1.
  • [21] Z. Luo, T. Shen, L. Zhou, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan (2019) ContextDesc: local descriptor augmentation with cross-modality context. In CVPR, Cited by: §1, §2.1, §3.2.
  • [22] Z. Luo, T. Shen, L. Zhou, S. Zhu, R. Zhang, Y. Yao, T. Fang, and L. Quan (2018) GeoDesc: learning local descriptors by integrating geometry constraints. In ECCV, Cited by: §2.1, §3.2.
  • [23] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan (2020) Aslfeat: learning local features of accurate shape and localization. In CVPR, Cited by: §1, §2.1, §3.2, Table 4.
  • [24] J. Min and M. Cho (2021) Convolutional hough matching networks. In CVPR, Cited by: §1.
  • [25] A. Mishchuk, D. Mishkin, F. Radenović, and J. Matas (2017) Working hard to know your neighbor’s margins: local descriptor learning loss. In NeurIPS, Cited by: §2.1, §3.2.
  • [26] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-SLAM: a versatile and accurate monocular slam system. IEEE transactions on robotics. Cited by: §1.
  • [27] R. Mur-Artal and J. Tardos (2016) ORB-SLAM2: an open-source slam system for monocular, stereo and rgb-d cameras. IEEE Transactions on Robotics. Cited by: §1.
  • [28] A. Resindra, A. Torii, and M. Okutomi (2018) Structure from motion using dense cnn features with keypoint relocalization. IPSJ Transactions on Computer Vision and Applications. Cited by: §1.
  • [29] J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, and M. Humenberger (2019) R2D2: repeatable and reliable detector and descriptor. In NeurIPS, Cited by: §1, §2.1, §3.2, Table 4.
  • [30] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic (2018) Neighbourhood consensus networks. In NeurIPS, Cited by: §1, §2.1.
  • [31] I. Rocco, R. Arandjelović, and J. Sivic (2020) Efficient neighbourhood consensus networks via submanifold sparse convolutions. In ECCV, Cited by: §1, §2.1.
  • [32] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski (2011) ORB: an efficient alternative to sift or surf.. In ICCV, Cited by: §1.
  • [33] P. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk (2019) From coarse to fine: robust hierarchical localization at large scale. In CVPR, Cited by: §4.2, Table 4.
  • [34] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2020) Superglue: learning feature matching with graph neural networks. In CVPR, Cited by: §C.1, Table 11, §2.1, §3.3, §4.1, §4.1, §4.1, §4.2, Table 2, Table 4.
  • [35] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, F. Kahl, and T. Pajdla (2018) Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions. In CVPR, Cited by: §4.2.
  • [36] T. Sattler, T. Weyand, B. Leibe, and L. Kobbelt (2012) Image retrieval for image-based localization revisited.. In BMVC, Cited by: §1, §4.2.
  • [37] J. L. Schonberger and J. Frahm (2016) Structure-from-motion revisited. In CVPR, Cited by: §1, §4.1.
  • [38] X. Shen, F. Darmon, A. Efros, and M. Aubry (2020) RANSAC-flow: generic two-stage image alignment. In ECCV, Cited by: §1.
  • [39] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou (2021) LoFTR: detector-free local feature matching with transformers. In CVPR, Cited by: Table 11, §1, §2.1, §2.2, §3.3, §3.5, §3.5, §4.1, §4.1, §4.1, §4.2, §4.4, Table 2, Table 4.
  • [40] W. Sun, W. Jiang, A. Tagliasacchi, E. Trulls, and K. M. Yi (2020) Attentive context normalization for robust permutation-equivariant learning. In CVPR, Cited by: §2.1.
  • [41] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii (2018) InLoc: indoor visual localization with dense matching and view synthesis. In CVPR, Cited by: §4.2, §4.2, Table 4.
  • [42] S. Tang, J. Zhang, S. Zhu, and P. Tan (2021) QuadTree attention for vision transformers. In ICLR, Cited by: Table 11, §1, §2.1, §4.1, Table 2.
  • [43] Z. Teed and J. Deng (2020) RAFT: recurrent all-pairs field transforms for optical flow. In ECCV, Cited by: §2.2, §2.3.
  • [44] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li (2016) YFCC100M: the new data in multimedia research. Communications of the ACM. Cited by: Table 11, §4.1.
  • [45] Y. Tian, B. Fan, and F. Wu (2017) L2-net: deep learning of discriminative patch descriptor in euclidean space. In CVPR, Cited by: §2.1, §3.2.
  • [46] C. Toft, W. Maddern, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, T. Pajdla, et al. (2020) Long-term visual localization revisited. TPAMI. Cited by: §4.2.
  • [47] P. Truong, M. Danelljan, L. Gool, and R. Timofte (2020) GOCor: bringing globally optimized correspondence volumes into your neural network. In NeurIPS, Cited by: §2.1, §2.3.
  • [48] P. Truong, M. Danelljan, L. V. Gool, and R. Timofte (2021) Learning accurate dense correspondences and when to trust them. In CVPR, Cited by: §1, §2.1, §2.2, §2.3, §2.3, §3.1.2, §4.1.
  • [49] P. Truong, M. Danelljan, R. Timofte, and L. Van Gool (2021) PDC-Net+: enhanced probabilistic dense correspondence network. Preprint. Cited by: Table 11, §4.1, Table 2.
  • [50] P. Truong, M. Danelljan, and R. Timofte (2020) GLU-Net: global-local universal network for dense flow and correspondences. In CVPR, Cited by: §1, §2.1, §2.2, §2.2, §2.3.
  • [51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §1.
  • [52] Q. Wang, X. Zhou, B. Hariharan, and N. Snavely (2020) Learning feature descriptors using camera pose supervision. In ECCV, Cited by: §2.1, §3.2.
  • [53] Q. Wang, J. Zhang, K. Yang, K. Peng, and R. Stiefelhagen (2022) MatchFormer: interleaving attention in transformers for feature matching. Preprint. Cited by: Table 11, §4.1, Table 2.
  • [54] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua (2018) Learning to find good correspondences. In CVPR, Cited by: §2.1.
  • [55] Z. Yin and J. Shi (2018) GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In CVPR, Cited by: §2.3.
  • [56] J. Zhang, D. Sun, Z. Luo, A. Yao, L. Zhou, T. Shen, Y. Chen, L. Quan, and H. Liao (2019) Learning two-view correspondences and geometry using order-aware network. In ICCV, Cited by: §C.1, §2.1.
  • [57] Z. Zhang, T. Sattler, and D. Scaramuzza (2021) Reference pose generation for long-term visual localization via learned features and view synthesis. IJCV. Cited by: §4.2, §4.2, Table 4.
  • [58] L. Zhou, Z. Luo, T. Shen, J. Zhang, M. Zhen, Y. Yao, T. Fang, and L. Quan (2020) KFNet: learning temporal camera relocalization using kalman filtering. In CVPR, Cited by: §2.3, §3.1.2.