SSORN: Self-Supervised Outlier Removal Network for Robust Homography Estimation

Yi Li1, Wenjie Pei1, Zhenyu He2
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China.
ly_res@163.com, wenjiecoder@outlook.com, zhenyuhe@hit.edu.cn.

* Equal contribution.

† Corresponding author.

Abstract

The traditional homography estimation pipeline consists of four main steps: feature detection, feature matching, outlier removal and transformation estimation. Recent deep learning models intend to address the homography estimation problem using a single convolutional network. While these models are trained in an end-to-end fashion to simplify the homography estimation problem, they lack the feature matching step and/or the outlier removal step, which are important steps in the traditional homography estimation pipeline. In this paper, we attempt to build a deep learning model that mimics all four steps in the traditional homography estimation pipeline. In particular, the feature matching step is implemented using the cost volume technique. To remove outliers in the cost volume, we treat this outlier removal problem as a denoising problem and propose a novel self-supervised loss to solve the problem. Extensive experiments on synthetic and real datasets demonstrate that the proposed model outperforms existing deep learning models.

Index Terms: Homography estimation, outlier removal, self-supervised, cost volume.

I Introduction

Homography estimation is a fundamental task in many computer vision applications, such as Augmented Reality (AR) [14, 39, 32], image stitching [6, 7, 26, 33], and Simultaneous Localization and Mapping (SLAM) [23, 24, 30]. Although this task has been extensively studied in the past, designing a robust homography estimation method remains a challenging problem.

The traditional homography estimation pipeline mainly includes four steps: feature detection, feature matching, outlier removal and transformation estimation. Among these four steps, feature detection is generally considered to be the most important step. Therefore, a large number of popular handcrafted features, such as SIFT [20], SURF [4] and ORB [27], have been proposed over the past few decades. However, the representation capabilities of these features are limited, especially compared to deep learning features.

Currently, there are two lines of research that use deep convolutional networks (CNNs) for homography estimation. The first replaces one of the four steps in the traditional homography estimation pipeline with CNNs or other neural networks [34, 9, 5, 10, 38]. For example, SuperPoint [9] adopts a VGG-like architecture for interest point detection and description. SuperGlue [28] solves the matching problem based on a graph neural network. DSAC [5] uses a neural network to mimic the RANSAC algorithm [10]. However, optimizing each step individually does not necessarily improve the performance of the homography estimation task.

The second line of research builds an end-to-end model for homography estimation. DeTone et al. [8] propose HomographyNet, the first end-to-end model for homography estimation. It adopts a typical CNN with 8 convolutional layers and 2 fully connected layers, where the convolutional layers serve as the feature detector/extractor and the fully connected layers serve as the homography estimator. However, it lacks the feature matching and outlier removal steps, which are essential in the traditional homography estimation pipeline. Li et al. [17] propose the SRHEN model, which explicitly implements the feature matching step through a correspondence layer. SRHEN performs much better than HomographyNet, which demonstrates the importance of the feature matching step. However, the output of the correspondence layer in SRHEN contains many outliers. It is difficult for a homography estimator consisting of only fully connected layers to reject these outliers on its own.

In this paper, we propose a Self-Supervised Outlier Removal Network (SSORN) for robust homography estimation. The model is built upon a Siamese structure. It consists of four components, i.e. a feature extractor, a feature matching module, an outlier removal module, and a homography estimator, to mimic all four steps in the traditional homography estimation pipeline. The model first projects two input images into the same deep space and obtains two feature maps. Then, the feature matching module, which is based on the cost volume technique [31], computes a cost volume between the extracted feature maps. The cost volume generated by the feature matching module usually contains many outliers. To remove the outliers in the cost volume, we treat this outlier removal problem as a denoising problem and propose a novel self-supervised loss to train the outlier removal module. Finally, the homography estimator learns the mapping from the clean cost volume produced by the outlier removal module to the homography matrix. Table I summarizes the differences between our model and existing deep learning models. Experiments on both synthetic and real datasets show that the proposed model outperforms existing deep learning models.

	Feature Extractor	Feature Matching	Outlier Removal	Homography Estimator
HomographyNet [8]	✓			✓
UnsupervisedNet [25]	✓			✓
Zhang et al. [36]	✓			✓
Koguciuk et al. [15]	✓			✓
SRHEN [17]	✓	✓		✓
Ours	✓	✓	✓	✓

TABLE I: Difference between our model and existing deep learning models. Our model is the only one that contains all four steps in the traditional homography estimation pipeline.

Fig. 1: The overall architecture of the proposed model. The model is built upon a Siamese structure. It mimics the traditional homography estimation pipeline using four components: a feature extractor, a feature matching module, an outlier removal module and a homography estimator.

II Related Work

In this section, we mainly focus on end-to-end homography estimation models. These models can be divided into supervised models and unsupervised models.

DeTone et al. [8] propose the first supervised deep learning model, named HomographyNet, for homography estimation. Their model employs a VGG-like network structure with eight outputs on top (corresponding to the eight parameters of a homography represented by the 4-point parameterization [1]). It takes a concatenated image as input and regresses the homography directly. They also propose a data generation rule, based on the COCO dataset [18], for generating the training and test datasets used to train and evaluate homography estimation models. Japkowicz et al. [12] propose a hierarchical model consisting of multiple DNN models with the same structure to handle large transformations. Each DNN model is trained using image pairs with a different scale of transformation, enabling the entire model to perform homography estimation in a coarse-to-fine fashion. Li et al. [17] adopt a similar coarse-to-fine framework, but they use different feature maps at different stages and construct multi-level correspondence layers to mimic the feature matching step. The above models are designed to solve the general homography estimation problem, while other models are designed for specific scenarios, such as dynamic scenes [16], cross-resolution scenes [29] and scenes with multiple planes [26].

Nguyen et al. [25] treat the homography estimation problem as an unsupervised learning task. The predicted homography, as an intermediate output variable of their model, is used to warp one of the input images. Then, they compute the pixel-wise intensity loss between the warped image and the other input image. By minimizing this loss, the model can be trained without ground truth labels. However, the pixel-level intensity loss is very sensitive to illumination changes. For this reason, Zhang et al. [36] propose to compute the loss in the feature space rather than the intensity space. Koguciuk et al. [15] further extend this idea by computing the perceptual loss [13], which significantly improves the robustness of their model against illumination changes. Ye et al. [35] propose to use a homography flow, instead of the commonly used 4-point parameterization, as an intermediate representation of the homography. While their model produces good results on image pairs with small viewpoint changes, it cannot handle image pairs with large viewpoint changes.

III SSORN

III-A Network Architecture

Our model is built upon a Siamese structure. It takes two grayscale images $I_{a}$ and $I_{b}$ as input, and produces a homography matrix $H_{a b}$ from $I_{a}$ to $I_{b}$ as output. The entire architecture is composed of four components: a feature extractor $f (\cdot)$ , a feature matching module $m (\cdot)$ , an outlier removal module $r (\cdot)$ , and a homography estimator $h (\cdot)$ . The resulting model is illustrated in Fig. 1.

Feature extractor. The backbone of the feature extractor follows the ResNet-34 structure. We use the output of layer2 in the ResNet-34. For a pair of input images $I_{a}$ and $I_{b}$ of size $H \times W \times 1$ , the feature extractor $f (\cdot)$ produces feature maps $F_{a}$ and $F_{b}$ of size $H / 8 \times W / 8 \times C$ :

F_{a} = f (I_{a}), F_{b} = f (I_{b})

(1)

Feature matching module. Since homography is characterized by pixel-to-pixel correspondences rather than image features [11], feature matching (or establishing correspondences) is an important step in homography estimation. However, the way of establishing pixel-to-pixel correspondences in the traditional homography estimation pipeline is a non-differentiable operation. An alternative solution for feature matching in deep models is the cost volume technique, which has been widely used in optical flow estimation [31]. The feature matching module $m (\cdot)$ adopts the cost volume technique to bridge the gap between image features and homography. It computes a cost volume $C$ between the extracted feature maps $F_{a}$ and $F_{b}$ :

C = m (F_{a}, F_{b})

(2)

Outlier removal module. Removing outliers is crucial for robust homography estimation. Traditional outlier removal algorithms like RANSAC [10] or more recent algorithms like DSAC [5] are designed to deal with outliers in pixel-to-pixel correspondences. Unfortunately, these algorithms are not able to handle outliers in a cost volume. In this paper, we treat the problem of removing outliers in a cost volume as a denoising problem and solve it using an outlier removal module. The outlier removal module $r (\cdot)$ takes the cost volume $C$ as input and produces a new cost volume $C^{'}$ as output:

C^{'} = r (C)

(3)

Homography estimator. The homography estimator simply consists of two fully connected layers. It learns the mapping function from the cost volume $C^{'}$ to the homography matrix $H_{a b}$ :

H_{a b} = h (C^{'})

(4)

Fig. 2: The feature matching module adopts the cost volume technique. It explicitly computes a cost volume between two feature maps. We show an example of the cost volume in 2D form on the right.

Fig. 3: The outlier removal model uses the UNet structure as the backbone. It successfully removes most of the outliers in the original cost volume $C$ produced by the feature matching module, obtaining a clean cost volume $C^{'}$ .

III-B Feature Matching Module

Unlike other components in our model, the feature matching module has no trainable parameters. It explicitly computes a cost volume $C$ between two feature maps $F_{a}$ and $F_{b}$ . Intuitively, the cost volume can be thought of as a similarity matrix in 3D form. It stores dense feature matching costs of two sets of feature vectors. The process of computing the cost volume is shown in Fig. 2. For clarity, we use the superscripts “2D” and “3D” to denote 2D and 3D tensors, respectively.

To compute the 3D cost volume $C^{3 D}$ from $F_{a}^{3 D}$ to $F_{b}^{3 D}$ , we first reshape $F_{a}^{3 D}$ and $F_{b}^{3 D}$ into the corresponding 2D tensors $F_{a}^{2 D}$ and $F_{b}^{2 D}$ , respectively. Then, the matching cost $C^{2 D} (i, j)$ between the $i$-th feature vector in $F_{a}^{2 D}$ and the $j$-th feature vector in $F_{b}^{2 D}$ is implemented as the correlation between the feature vectors:

C^{2 D} (i, j) = \frac{1}{D} (F_{a}^{2 D} (i))^{T} ⊙ F_{b}^{2 D} (j)

(5)

where $⊙$ denotes the dot product, $T$ is the transpose operator, and $D$ is the dimension of feature vectors. Accordingly, the full cost volume $C^{2 D}$ is defined as:

C^{2 D} = \frac{1}{D} (F_{a}^{2 D})^{T} \otimes F_{b}^{2 D}

(6)

where $\otimes$ denotes the matrix product. Finally, the 2D cost volume $C^{2 D}$ will be reshaped into the corresponding 3D cost volume $C^{3 D}$ .
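The computation in Eqs. 5 and 6 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: feature maps are nested lists of shape $H \times W \times D$, the function name `cost_volume` is hypothetical, and the output layout $H \times W \times (H W)$ is an assumed convention, since the text does not specify the shape of $C^{3 D}$ .

```python
def cost_volume(feat_a, feat_b):
    """Correlation cost volume between two H x W x D feature maps (nested lists).

    Returns an H x W x (H*W) nested list (assumed layout): entry [y][x][j] is
    the matching cost between location (y, x) of feat_a and the j-th location
    (row-major order) of feat_b.
    """
    height, width = len(feat_a), len(feat_a[0])
    dim = len(feat_a[0][0])
    # Reshape each 3D map into a 2D list of H*W feature vectors (row-major).
    flat_a = [vec for row in feat_a for vec in row]
    flat_b = [vec for row in feat_b for vec in row]
    # Eq. 5: dot product of every pair of feature vectors, scaled by 1/D.
    cost_2d = [
        [sum(p * q for p, q in zip(va, vb)) / dim for vb in flat_b]
        for va in flat_a
    ]
    # Reshape the (H*W) x (H*W) matrix back into an H x W x (H*W) volume.
    return [cost_2d[y * width:(y + 1) * width] for y in range(height)]


# Toy example with H = 1, W = 2, D = 2.
feat_a = [[[1.0, 0.0], [0.0, 1.0]]]
feat_b = [[[1.0, 0.0], [1.0, 1.0]]]
volume = cost_volume(feat_a, feat_b)
```

Note that the function contains no trainable parameters, which matches the description of the feature matching module; in a real model the same computation would be a single batched matrix product on the GPU.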

III-C Outlier Removal Module

In the traditional view, outliers usually mean incorrect correspondences. In a similar spirit, we can treat those incorrect matching costs in a cost volume as outliers in the cost volume. Since a cost volume stores dense matching costs, it often contains a large number of outliers due to the existence of many similar image features. It is difficult for the homography estimator to predict an exact homography matrix from this cost volume. To this end, we propose the outlier removal model, which treats the outlier removal problem as a cost volume denoising problem. We believe that, for a specific homography matrix, there exists a specific pattern (i.e. a clean cost volume) corresponding to that homography matrix. The goal of the outlier removal module is to recover this pattern from the noisy cost volume produced by the feature matching module.

Clearly, the cost volume is different from natural images, so the priors used for natural image denoising are not suitable for cost volume denoising. To address the training problem, we propose a novel self-supervised loss function as described in Sec. III-D. Theoretically, any deep denoising model can be used as the outlier removal module. In practice, we adopt the UNet structure as the backbone of the outlier removal module. After training with the proposed self-supervised loss function, the outlier removal module is able to remove most of the outliers in the original cost volume $C$ produced by the feature matching module, obtaining a clean cost volume $C^{'}$ , as shown in Fig. 3. As expected, for a specific homography matrix, the outlier removal module successfully finds a specific pattern (i.e. the clean cost volume $C^{'}$ ) corresponding to that homography matrix.

Fig. 4: The training process of the whole model. Our model requires two image pairs instead of one image pair for training. The two image pairs are related by the same ground truth homography matrix $H^{g t}$ . The training loss consists of three self-supervised signals (Eq. 8) and two supervised signals (Eq. 9). For these two image pairs, the two cost volumes $C_{a b}$ and $C_{c d}$ directly produced by the feature matching module may look very different, but the two clean cost volumes $C_{a b}^{'}$ and $C_{c d}^{'}$ produced by the outlier removal module always look the same.

III-D Self-Supervised Training

Given a pair of images $I_{a}$ and $I_{b}$ and the ground truth homography matrix $H_{a b}^{g t}$ , one simple way to train the model is to use the following supervised loss:

L = ∥ H_{a b} - H_{a b}^{g t} ∥_{2}^{2}

(7)

where $∥ \cdot ∥_{2}^{2}$ is the squared L2 norm.

However, this approach does not take full advantage of the outlier removal module, because the above loss function does not ensure that the outlier removal module learns to remove outliers in the cost volume. To this end, we propose an additional self-supervised loss function to train the outlier removal module. We exploit the fact that two image pairs related by the same homography matrix should correspond to a similar clean cost volume. We force the outlier removal module to produce similar outputs for these two image pairs. Fig. 4 illustrates the training process of the whole model.

Formally, given two image pairs $(I_{a}, I_{b})$ and $(I_{c}, I_{d})$ related by the same ground truth homography matrix $H^{g t}$ , the self-supervised loss for training the outlier removal module is designed as:

L_{s s} = λ_{1} ∥ C_{a b}^{'} - C_{c d}^{'} ∥_{1} + λ_{2} (∥ C_{a b}^{'} - C_{a b} ∥_{1} + ∥ C_{c d}^{'} - C_{c d} ∥_{1})

(8)

where $∥ \cdot ∥_{1}$ is the L1 norm. The first term forces the outlier removal module to produce similar clean cost volumes for the two image pairs, and the second term keeps each clean cost volume close to its input, which avoids trivial solutions, i.e. $C_{a b}^{'} = C_{c d}^{'} = 0$ .
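To make the role of the two terms in Eq. 8 concrete, the following sketch evaluates the loss on flattened toy cost volumes. It is an illustration, not the training code: the function names are hypothetical, the L1 norm is taken as a plain sum (the text does not say whether it is averaged), and the default weights 0.5 and 0.25 are the values reported in the experimental setup.

```python
def l1_distance(u, v):
    """L1 norm of the difference between two equally sized flat lists."""
    return sum(abs(x - y) for x, y in zip(u, v))


def self_supervised_loss(clean_ab, clean_cd, raw_ab, raw_cd, lam1=0.5, lam2=0.25):
    """Eq. 8 on flattened cost volumes (hypothetical helper)."""
    # First term: the two clean cost volumes should agree, because both
    # image pairs are related by the same homography.
    consistency = l1_distance(clean_ab, clean_cd)
    # Second term: each clean volume should stay close to its raw input,
    # which penalizes the trivial all-zero output.
    fidelity = l1_distance(clean_ab, raw_ab) + l1_distance(clean_cd, raw_cd)
    return lam1 * consistency + lam2 * fidelity


raw_ab = [1.0, 1.0, 0.0]   # raw cost volume of pair (a, b), with one outlier
raw_cd = [1.0, 0.0, 1.0]   # raw cost volume of pair (c, d), with one outlier
shared = [1.0, 0.0, 0.0]   # a common clean pattern
zeros = [0.0, 0.0, 0.0]    # the trivial solution

loss_shared = self_supervised_loss(shared, shared, raw_ab, raw_cd)
loss_trivial = self_supervised_loss(zeros, zeros, raw_ab, raw_cd)
```

In this toy case the shared pattern yields a loss of 0.5 while the all-zero output yields 1.0, so the second term is what makes the trivial solution unattractive.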

To train the homography estimator, we still need the following supervised loss:

L_{s} = ∥ H_{a b} - H^{g t} ∥_{2}^{2} + ∥ H_{c d} - H^{g t} ∥_{2}^{2}

(9)

Combining the above two losses, we get the final loss for training the whole model:

L_{f} = L_{s} + L_{s s}

(10)
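As a small worked example of Eqs. 9 and 10, the sketch below combines the supervised terms for both image pairs with a precomputed self-supervised term. The function names are hypothetical, and homographies are assumed to be given as flat lists of eight values (the 4-point parameterization used in the experiments).

```python
def squared_l2(u, v):
    """Squared L2 norm of the difference between two flat lists."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def supervised_loss(pred_ab, pred_cd, gt):
    """Eq. 9: both predictions are compared against the same ground truth."""
    return squared_l2(pred_ab, gt) + squared_l2(pred_cd, gt)


def final_loss(pred_ab, pred_cd, gt, self_supervised_term):
    """Eq. 10: supervised loss plus the self-supervised loss of Eq. 8."""
    return supervised_loss(pred_ab, pred_cd, gt) + self_supervised_term


gt = [0.0, 0.0, 4.0, -2.0, 1.0, 3.0, -1.0, 0.5]
pred_ab = list(gt)                                    # perfect prediction
pred_cd = [1.0, -1.0, 4.0, -2.0, 1.0, 3.0, -1.0, 0.5]  # first corner off by (1, -1)

loss_s = supervised_loss(pred_ab, pred_cd, gt)
loss_f = final_loss(pred_ab, pred_cd, gt, self_supervised_term=0.5)
```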

III-E Discussion

If the idea that a specific homography corresponds to a specific clean cost volume is true, then it seems we should be able to derive the ground truth cost volume directly from a ground truth homography, thus allowing for simple supervised training. In theory, given a ground truth homography and the coordinate of a pixel on one image, we can always compute the coordinate of the unique corresponding pixel on the other image, from which we can derive the ground truth cost volume. However, this derivation does not take into account the local self-similarity property of images, that is, adjacent pixels usually have similar (deep) features. In other words, a pixel on one image often has multiple high-scoring matches on the other image, not just one match (see the visualization of the clean cost volume in Fig. 3). It is not trivial to take the local self-similarity property of images into account when deriving the ground truth cost volume from a ground truth homography. This motivates us to propose the self-supervised training, which avoids the difficulty of directly deriving the ground truth cost volume.

IV Experiments

Fig. 5: Some examples from the datasets used in experiments. Each example consists of two images related by a specific homography matrix. The goal of homography estimation is to predict the homography matrix between the input image pair.

IV-A Datasets and Evaluation Metric

Although there exist several datasets [22, 2, 21] for homography estimation, these datasets are too small for training deep homography estimation models. To this end, DeTone et al. [8] proposed a synthetic dataset, S-COCO, based on the COCO dataset [18]. S-COCO is generated by applying random projective transformations to the COCO dataset. By using the same data generation rule, we can easily generate two image pairs with the same homography matrix for training our model.

Based on S-COCO, we additionally generate two datasets, called I-COCO and O-COCO, for testing the robustness of different models to illumination and occlusion. In I-COCO, we apply the photometric distortion techniques in [19] to S-COCO. The illumination change is controlled by a parameter $δ$ ranging from 8 to 32 with an interval of 8. In O-COCO, we simulate occlusion by overlaying a random image patch of size $p$ on one of the input images, where $p$ ranges from 0 to 80 with an interval of 10.
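The occlusion simulation used for O-COCO can be sketched as follows. This is a hedged illustration only: the text states that a random image patch of size $p$ is overlaid on one input image, and the sketch additionally assumes a square $p \times p$ patch, a uniformly random position, and images stored as nested lists of pixel values. The helper name `overlay_patch` is hypothetical.

```python
import random


def overlay_patch(image, patch, rng):
    """Return a copy of `image` with the square `patch` pasted at a random position.

    Assumes the patch is no larger than the image; p = 0 leaves the image unchanged.
    """
    p = len(patch)
    out = [row[:] for row in image]  # copy so the input image is not modified
    if p == 0:
        return out
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - p)    # randint is inclusive on both ends
    left = rng.randint(0, width - p)
    for dy in range(p):
        out[top + dy][left:left + p] = patch[dy]
    return out


rng = random.Random(0)
image = [[0] * 4 for _ in range(4)]
patch = [[9, 9], [9, 9]]
occluded = overlay_patch(image, patch, rng)
unchanged = overlay_patch(image, [], rng)
```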

In addition to the above three synthetic datasets, we also conduct experiments on a real dataset, HPatches [2]. The dataset contains two subsets: the illumination subset (i.e. I-HPatches) and the viewpoint subset (i.e. V-HPatches). As this dataset only contains 580 pairs of images, we only use it for testing.

Fig. 5 shows some examples from the above datasets. All images are resized to $256 \times 256$ for training and testing. To measure the performance of different models, we adopt the Mean Average Corner Error (MACE) [8] as the evaluation metric.
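For reference, the evaluation metric can be written down in a few lines. The sketch assumes the usual definition of the corner error, namely the Euclidean distance between the predicted and ground truth positions of each of the four image corners, averaged over the corners of one pair and then over all pairs; the function names are hypothetical.

```python
import math


def average_corner_error(pred_corners, gt_corners):
    """Mean Euclidean distance over the four corners of one image pair."""
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred_corners, gt_corners)
    ]
    return sum(distances) / len(distances)


def mace(pred_list, gt_list):
    """Mean of the per-pair average corner errors over a dataset."""
    errors = [average_corner_error(p, g) for p, g in zip(pred_list, gt_list)]
    return sum(errors) / len(errors)


gt = [(0.0, 0.0), (255.0, 0.0), (255.0, 255.0), (0.0, 255.0)]
pred_1 = [(3.0, 4.0), (255.0, 0.0), (255.0, 255.0), (0.0, 255.0)]  # one corner off by 5 px
pred_2 = list(gt)                                                  # perfect prediction

score = mace([pred_1, pred_2], [gt, gt])
```

Here the first pair has an average corner error of 1.25 pixels and the second pair has zero error, so the resulting MACE is 0.625.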

IV-B Experimental Setup

We compare our model with five recently proposed deep learning models (i.e. HomographyNet [8], UnsupervisedNet [25], SRHEN [17], and the models of Zhang et al. [36] and Koguciuk et al. [15]) and two representative keypoint-based methods (i.e. SIFT [20] and SuperPoint [9]).

To ensure that the performance difference between deep learning methods comes from better network structure and loss function design, rather than from a better CNN backbone, we implement all of these methods with the same ResNet-34 backbone. Using the same backbone and the same training settings, most deep learning models achieve very similar performance, but this is not the case for SRHEN and our model (see experiments in the following sections). For keypoint-based methods, we compare two feature matchers: the nearest neighbor (NN) matcher with Lowe’s ratio test (ratio is set to 0.9) and SuperGlue (SG). MAGSAC++ [3] is used as the robust estimator.

We adopt the 4-point parameterization [1] to represent the homography matrix instead of the $3 \times 3$ matrix form. The weight parameters $λ_{1}$ and $λ_{2}$ in Eq. 8 are empirically set to 0.5 and 0.25, respectively.
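The relation between the 4-point parameterization and the $3 \times 3$ matrix form can be illustrated with a small direct linear solve. This is a sketch under stated assumptions: the eight parameters are taken to be the $(x, y)$ offsets of the four image corners, the matrix is normalized so that its bottom-right entry is 1, and the helper names are hypothetical. A library routine would normally be used instead of the hand-written solver.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes the system is non-singular (no three corners collinear).
    """
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def four_point_to_matrix(corners, offsets):
    """Convert four corner offsets into a 3 x 3 homography with H[2][2] = 1."""
    rows, rhs = [], []
    for (x, y), (dx, dy) in zip(corners, offsets):
        u, v = x + dx, y + dy  # displaced corner position
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def apply_homography(h, point):
    """Map a 2D point through the homography (with perspective division)."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


corners = [(0.0, 0.0), (255.0, 0.0), (255.0, 255.0), (0.0, 255.0)]
offsets = [(10.0, 5.0), (-8.0, 12.0), (6.0, -7.0), (-4.0, -9.0)]
H = four_point_to_matrix(corners, offsets)
mapped = [apply_homography(H, c) for c in corners]
```

By construction, each corner is mapped to its displaced position, which is why an error on the eight offsets translates directly into the corner error used for evaluation.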

Fig. 6: MACE for different methods on the S-COCO dataset. Lower MACE means better performance. NN - nearest neighbor. SG - SuperGlue.

Fig. 7: Results of different methods under four types of geometric transformations: (a) Translation, (b) Scale, (c) Rotation, and (d) Perspective.

IV-C Results on S-COCO Dataset

We first conduct experiments on the S-COCO dataset. Fig. 6 reports the MACE of different methods on this dataset. In general, deep learning methods outperform keypoint based methods on this dataset, mainly due to the powerful feature representation capability of deep learning features. Compared with those deep models without the feature matching module, SRHEN with the feature matching module achieves a better result. This demonstrates the necessity of the feature matching module not only in the traditional homography estimation pipeline but also in deep homography estimation models. Our model contains not only the feature matching module but also an outlier removal module. It leverages the strength of the traditional homography estimation pipeline and the strength of deep learning features. It obtains a much better result than SRHEN on the S-COCO dataset, which indicates that the outlier removal module is as important as the feature matching module for deep homography estimation models.

To analyze the robustness of different models to different geometric transformations, we generated four subsets according to the data generation rule in [8]. Each subset corresponds to a specific type of transformation, namely translation, scale, rotation and perspective transformations. We plot the results in Fig. 7. The results show that all methods perform well when the magnitude of the geometric transformation is small. However, the performance degrades as the magnitude of the geometric transformation increases. Specifically, deep learning models that do not employ the feature matching module perform poorly for large geometric transformations, especially for large translations. Overall, our model is more robust to these four types of geometric transformations than other methods.

Fig. 8: Results of the first experimental configuration: all models are trained on the S-COCO dataset and tested on the (a) I-COCO dataset and (b) O-COCO dataset.

Fig. 9: Results of the second experimental configuration: all models are (a) trained and tested on the I-COCO dataset and (b) trained and tested on the O-COCO dataset.

IV-D Robustness to Occlusion and Illumination

To investigate the robustness or the generalization ability of deep learning models to illumination and occlusion, we conduct experiments on the I-COCO and O-COCO datasets. We employ two experimental configurations. In the first experiment, we train all deep learning models on the S-COCO dataset and test them on the I-COCO/O-COCO dataset. In the second experiment, all models are trained and tested on the I-COCO/O-COCO dataset. The experimental results are shown in Fig. 8 and Fig. 9.

Interestingly, we observe that the deep learning models without the feature matching module (i.e. HomographyNet, UnsupervisedNet, and the models of Zhang et al. and Koguciuk et al.) obtain significantly different results under the two experimental configurations. This suggests that learning the mapping function directly from image features to the homography matrix may lead to a generalization problem. In contrast, SRHEN and our model achieve more stable results under these two experimental configurations. Compared to SRHEN, our model is more robust to illumination and occlusion, as the outlier removal module is able to recover the “correct” clean cost volume in most cases (see Fig. 10).

IV-E Results on HPatches Dataset

In the previous sections, we have tested different methods on three synthetic datasets. In order to investigate the generalization ability of these methods to real images, we conduct experiments on the HPatches dataset. Since there is no large-scale real dataset for training the deep learning models, we train all the models on the S-COCO dataset and test them on the HPatches dataset. The experimental results are reported in Fig. 11.

As shown in Fig. 11, our model achieves a distinct advantage on the HPatches dataset among all deep learning models. However, it is inferior to SIFT and SuperPoint. A major reason may be that the distributions of training images and test images are quite different. The images in the HPatches dataset contain many local illumination variations and non-planar regions. These challenging factors are difficult to simulate by synthetic means. We show some success and failure cases of our model in Fig. 12.

Fig. 10: All image pairs in this figure are related by the same homography matrix. While the cost volume $C$ produced by the feature matching module may be different due to illumination changes (the second row) or occlusions (the third row), the outlier removal model is able to remove most of the outliers in $C$ and produce the “correct” clean cost volume $C^{'}$ .

Fig. 11: MACE for different methods on the HPatches dataset. Lower MACE means better performance.

Fig. 12: Some success and failure cases of our model on the HPatches dataset. We warp the image $I_{b}$ with the inverse of the predicted homography matrix $H_{a b}$ . The more accurate the predicted homography matrix, the more aligned $I_{a}$ and the warped image will be.

Currently, deep learning models do not achieve satisfactory results on the HPatches dataset. However, they have the advantage of being able to generalize to a variety of scenarios when trained on a large number of images, as demonstrated by the previous experiments on the synthetic datasets. We believe that the performance of our model, as well as other deep learning models, on the HPatches dataset will be further improved if large-scale real datasets become available for training in the future.

	FH	FMH	FMRH-s	FMRH-ss
MACE	1.79	1.38	1.26	0.73

TABLE II: MACE for four variants of our model on the S-COCO dataset. Lower MACE means better performance.

IV-F Ablation Study

In this section we conduct experiments to investigate the effectiveness of each proposed technique in our model. To this end, we perform ablation experiments on four variants of our model:

FH, which only contains the feature extractor and the homography estimator. It learns the mapping function from image features to the homography matrix.
FMH, which augments FH with a feature matching module. It learns the mapping function from the cost volume produced by the feature matching module to the homography matrix.
FMRH-s, which further augments FMH with an outlier removal module. It mimics the traditional homography estimation pipeline in a single deep learning model. It is trained with the supervised loss in Eq. 9 only.
FMRH-ss, which has exactly the same network structure as FMRH-s, but is trained with the final loss in Eq. 10, i.e. the supervised loss in Eq. 9 plus the self-supervised loss in Eq. 8.

Table II reports comparison results on the S-COCO dataset between these four variants. FH, which uses neither the feature matching module nor the outlier removal module, is similar to HomographyNet but with a different feature learning structure. Thus, FH performs similarly to HomographyNet on the S-COCO dataset. The performance comparison between FH and FMH (MACE of 1.79 vs. 1.38) shows the benefit of leveraging the feature matching module to bridge the gap between image features and the homography matrix. Although FMRH-s is further equipped with an outlier removal module on top of FMH, the outlier removal module cannot be well trained with the supervised loss in Eq. 9 alone. As a consequence, FMRH-s is only slightly better than FMH (1.26 vs. 1.38). The performance improves substantially from FMRH-s to FMRH-ss (1.26 vs. 0.73), which is attributable to the proposed self-supervised loss for training the outlier removal module.

Besides the network structure and the loss function, in practice, we find that the backbones of the feature extractor and the outlier removal module also have a significant impact on the performance of our model. Early deep learning models usually employ VGG as the backbone of the feature extractor. However, recent studies [36, 15] have shown that replacing VGG with ResNet-34 results in better performance in the homography estimation task. Therefore, we choose ResNet-34 as the backbone of the feature extractor in our model. Using ResNet-34, our model achieves a MACE of 0.73 on the S-COCO dataset, whereas with VGG the MACE is 2.36. For fair comparison, in the previous experiments, we implemented all deep learning models with the same ResNet-34 backbone for the feature extractor. As for the outlier removal module, we started with a classic deep denoising model, DnCNN [37], but later found that this model performed slightly worse than the currently used UNet.

V Conclusion

In this work, we have presented a new deep learning model that mimics the traditional homography estimation pipeline. Specifically, the proposed model consists of four components: a feature extractor, a feature matching module, an outlier removal module, and a homography estimator. The feature matching module is implemented using the cost volume technique and has no trainable parameters. The outlier removal module is built on the UNet structure to remove outliers in the cost volume. We propose a novel self-supervised loss to train the outlier removal module. The entire model can be trained in an end-to-end fashion with a combination of a supervised loss and the proposed self-supervised loss. Extensive experiments on synthetic and real datasets show that the proposed model outperforms existing deep learning models on the synthetic datasets by a clear margin and achieves the best result among deep learning models on the real HPatches dataset.

References

[1] S. Baker, A. Datta, and T. Kanade (2006-03) Parameterizing homographies. Technical Report CMU-RI-TR-06-11, Carnegie Mellon University.
[2] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5173–5182.
[3] D. Barath, J. Matas, and J. Noskova (2019) MAGSAC: marginalizing sample consensus. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 10197–10205.
[4] H. Bay, T. Tuytelaars, and L. V. Gool (2006) SURF: speeded up robust features. In European Conference on Computer Vision, pp. 404–417.
[5] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother (2017) DSAC - differentiable RANSAC for camera localization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6684–6692.
[6] C. Chang, C. Chou, and E. Y. Chang (2017) CLKN: cascaded Lucas-Kanade networks for image alignment. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3777–3785.
[7] B. Chung and C. Yim (2019) Bi-sequential video error concealment method using adaptive homography-based registration. IEEE Transactions on Circuits and Systems for Video Technology 30 (6), pp. 1535–1549.
[8] D. DeTone, T. Malisiewicz, and A. Rabinovich (2016) Deep image homography estimation. arXiv.
[9] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) SuperPoint: self-supervised interest point detection and description. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236.
[10] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
[11] R. Hartley and A. Zisserman (2000) Multiple view geometry in computer vision. Cambridge University Press.
[12] N. Japkowicz, F. E. Nowruzi, and R. Laganiere (2017) Homography estimation from image pairs with hierarchical convolutional networks. In IEEE International Conference on Computer Vision Workshops, pp. 913–920.
[13] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
[14] G. Klein and D. W. Murray (2007) Parallel tracking and mapping for small AR workspaces. In IEEE International Symposium on Mixed and Augmented Reality, pp. 1–10.
[15] D. Koguciuk, E. Arani, and B. Zonooz (2021) Perceptual loss for robust unsupervised homography estimation. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 4274–4283.
[16] H. Le, F. Liu, S. Zhang, and A. Agarwala (2020) Deep homography estimation for dynamic scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7652–7661.
[17] Y. Li, W. Pei, and Z. He (2020) SRHEN: stepwise-refining homography estimation network via parsing geometric correspondences in deep latent space. In ACM International Conference on Multimedia, pp. 3063–3071.
[18] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §II, §IV-A.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Cited by: §IV-A.
[20] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110. Cited by: §I, §IV-B.
[21] J. Ma, X. Jiang, A. Fan, J. Jiang, and J. Yan (2021) Image matching from handcrafted to deep features: a survey. International Journal of Computer Vision 129 (1), pp. 23–79. Cited by: §IV-A.
[22] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool (2005) A comparison of affine region detectors. International Journal of Computer Vision 65 (1), pp. 43–72. Cited by: §IV-A.
[23] R. Murartal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics 31 (5), pp. 1147–1163. Cited by: §I.
[24] R. Murartal and J. D. Tardos (2017) ORB-slam2: an open-source slam system for monocular, stereo, and rgb-d cametas. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: §I.
[25] T. Nguyen, S. W. Chen, S. S. Shivakumar, C. J. Taylor, and V. Kumar (2018) Unsupervised deep homography: a fast and robust homography estimation model. In IEEE International Conference on Robotics and Automation, pp. 2346–2353. Cited by: TABLE I, §II, §IV-B.
[26] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao (2021) Depth-aware multi-grid deep homography estimation with contextual correlation. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §I, §II.
[27] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: an efficient alternative to sift or surf. In IEEE International Conference on Computer Vision, pp. 1–9. Cited by: §I.
[28] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2020) Superglue: learning feature matching with graph neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4938–4947. Cited by: §I.
[29] R. Shao, G. Wu, Y. Zhou, Y. Fu, L. Fang, and Y. Liu (2021) LocalTrans: a multiscale local transformer network for cross-resolution homography estimation. arXiv. Cited by: §II.
[30] X. Shao, L. Zhang, T. Zhang, Y. Shen, and Y. Zhou (2021) MOFISSLAM: a multi-object semantic slam system with front-view, inertial and surround-view sensors for indoor parking. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §I.
[31] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-net: cnns for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943. Cited by: §I, §III-A.
[32] F. Tang, Y. Wu, X. Hou, and H. Ling (2019) 3d mapping and 6d pose computation for real time augmented reality on cylindrical objects. IEEE Transactions on Circuits and Systems for Video Technology 30 (9), pp. 2887–2899. Cited by: §I.
[33] W. Xue, W. Xie, Y. Zhang, and S. Chen (2021) Stable linear structures and seam measurements for parallax image stitching. IEEE Transactions on Circuits and Systems for Video Technology 32 (1), pp. 253–261. Cited by: §I.
[34] J. Ye, S. Zhang, T. Huang, and Y. Rui (2019) CDbin: compact discriminative binary descriptor learned with efficient neural network. IEEE Transactions on Circuits and Systems for Video Technology 30 (3), pp. 862–874. Cited by: §I.
[35] N. Ye, C. Wang, H. Fan, and S. Liu (2021) Motion basis learning for unsupervised deep homography estimation with subspace projection. arXiv. Cited by: §II.
[36] J. Zhang, C. Wang, S. Liu, L. Jia, N. Ye, J. Wang, J. Zhou, and J. Sun (2020) Content-aware unsupervised deep homography estimation. In European Conference on Computer Vision, pp. 653–669. Cited by: TABLE I, §II, §IV-B, §IV-F.
[37] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155. Cited by: §IV-F.
[38] X. Zhao, J. Liu, X. Wu, W. Chen, F. Guo, and Z. Li (2021) Probabilistic spatial distribution prior based attentional keypoints matching network. IEEE Transactions on Circuits and Systems for Video Technology 32 (3), pp. 1313–1327. Cited by: §I.
[39] F. Zhou, H. B. Duh, and M. Billinghurst (2008) Trends in augmented reality tracking, interaction and display: a review of ten years of ismar. In IEEE International Symposium on Mixed and Augmented Reality, pp. 193–202. Cited by: §I.