RDD: Robust Feature Detector and Descriptor using Deformable Transformer

CVPR 2025

Gonglin Chen1,2, Tianwen Fu1,2, Haiwei Chen1,2, Wenbin Teng1,2, Hanyuan Xiao1,2, Yajie Zhao1,2

1Institute for Creative Technologies     2University of Southern California    


Figure 1. Our proposed method effectively performs both sparse and dense feature matching, referred to as RDD and RDD*, respectively, as shown in the top section. RDD demonstrates its ability to extract accurate keypoints and robust descriptors, enabling reliable matching even under significant scale and viewpoint variations, as illustrated in the bottom section.

Abstract

As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present Robust Deformable Detector (RDD), a novel and robust keypoint detector/descriptor leveraging the deformable transformer, which captures global context and geometric invariance through deformable self-attention mechanisms. Specifically, we observed that deformable attention focuses on key locations, effectively reducing the search space complexity and modeling the geometric invariance. Furthermore, we collected an Air-to-Ground dataset for training in addition to the standard MegaDepth dataset. Our proposed method outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks and is also capable of semi-dense matching. To ensure comprehensive evaluation, we introduce two challenging benchmarks: one emphasizing large viewpoint and scale variations, and the other being an Air-to-Ground benchmark, an evaluation setting that has recently been gaining popularity for 3D reconstruction across different altitudes.

Results

Relative Pose Estimation


Figure 2. Qualitative Results on Three Benchmarks. RDD* and RDD are qualitatively compared to DeDoDe-V2-G, ALIKED, and XFeat*. RDD and RDD* are more robust than DeDoDe-V2-G and ALIKED under challenging scenarios such as large scale and viewpoint changes. Red indicates an epipolar error beyond 1 × 10⁻⁴ (in normalized image coordinates).
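The coloring above relies on a per-match epipolar error computed in normalized image coordinates. A common choice for this error is the symmetric epipolar distance with respect to the essential matrix; the NumPy sketch below follows that convention, and the function name, the epsilon stabilizer, and the exact error variant are assumptions rather than the paper's evaluation code.

import numpy as np

def symmetric_epipolar_error(E, pts0, pts1):
    """Symmetric epipolar error for correspondences in normalized image coordinates.

    E    : (3, 3) essential matrix relating the two views.
    pts0 : (N, 2) keypoints in image 0 (normalized, i.e. with K^-1 applied).
    pts1 : (N, 2) corresponding keypoints in image 1.
    Returns an (N,) array; matches above a threshold such as 1e-4 would be drawn in red.
    """
    p0 = np.concatenate([pts0, np.ones((len(pts0), 1))], axis=1)  # homogeneous (N, 3)
    p1 = np.concatenate([pts1, np.ones((len(pts1), 1))], axis=1)  # homogeneous (N, 3)

    l1 = p0 @ E.T                      # epipolar lines in image 1, one per row
    l0 = p1 @ E                        # epipolar lines in image 0, one per row
    d = np.sum(p1 * l1, axis=1)        # algebraic residual x1^T E x0

    # Squared distance to both epipolar lines, made symmetric across the two images.
    return d ** 2 * (1.0 / (l1[:, 0] ** 2 + l1[:, 1] ** 2 + 1e-12)
                     + 1.0 / (l0[:, 0] ** 2 + l0[:, 1] ** 2 + 1e-12))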

MegaDepth

Table 1. SotA comparison on MegaDepth. Results are measured in AUC (higher is better). The top 4,096 features are used for all sparse matching methods. Best in bold, second best underlined.
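Here AUC refers to the area under the pose-accuracy curve: for each image pair the relative pose error (commonly the maximum of the rotation and translation angular errors) is computed, and the cumulative accuracy is integrated up to angular thresholds such as 5°, 10°, and 20°. The sketch below follows this common recipe; the thresholds and the trapezoidal integration are assumptions about standard practice, not code from the paper.

import numpy as np

def pose_auc(errors, thresholds=(5.0, 10.0, 20.0)):
    """Area under the cumulative pose-accuracy curve, one value per threshold (degrees).

    errors : per-pair pose errors, e.g. max(rotation error, translation error) in degrees.
    Returns a list of AUC values in [0, 1].
    """
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)     # cumulative accuracy
    errors = np.concatenate(([0.0], errors))                # start the curve at (0, 0)
    recall = np.concatenate(([0.0], recall))

    aucs = []
    for t in thresholds:
        idx = np.searchsorted(errors, t)
        e = np.concatenate((errors[:idx], [t]))             # clip the curve at t
        r = np.concatenate((recall[:idx], [recall[idx - 1] if idx > 0 else 0.0]))
        # Trapezoidal integration, normalized by the threshold.
        aucs.append(np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0) / t)
    return aucs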

Air-to-Ground

Table 2. SotA comparison on the Air-to-Ground benchmark. Keypoints and descriptors are matched using dual-softmax MNN. Results are measured in AUC (higher is better). Best in bold, second best underlined.
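Dual-softmax MNN matching scores every descriptor pair by the product of a row-wise and a column-wise softmax over the similarity matrix and keeps only mutual nearest neighbors above a confidence threshold. The PyTorch sketch below illustrates this standard procedure; the temperature, the threshold, and the function name are illustrative assumptions rather than the settings used in the paper.

import torch

def dual_softmax_mnn(desc0, desc1, temperature=0.1, threshold=0.0):
    """Match two sets of L2-normalized descriptors with dual-softmax + mutual NN.

    desc0 : (N, D) descriptors from image 0.
    desc1 : (M, D) descriptors from image 1.
    Returns (K, 2) index pairs and their (K,) confidence scores.
    """
    sim = desc0 @ desc1.t() / temperature            # (N, M) similarity matrix
    conf = sim.softmax(dim=1) * sim.softmax(dim=0)   # dual-softmax confidence

    # Mutual nearest neighbours: i's best match is j and j's best match is i.
    best1 = conf.argmax(dim=1)                       # (N,) best index in image 1
    best0 = conf.argmax(dim=0)                       # (M,) best index in image 0
    idx0 = torch.arange(conf.shape[0], device=conf.device)
    mutual = best0[best1] == idx0

    scores = conf[idx0, best1]
    keep = mutual & (scores > threshold)
    matches = torch.stack([idx0[keep], best1[keep]], dim=1)
    return matches, scores[keep]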


Table 3. More results on the Air-to-Ground benchmark. Results are measured in AUC (higher is better). Best in bold, second best underlined.

Method


An overview of our network architecture. The Descriptor Branch 𝔽_D and the Keypoint Branch 𝔽_K process an input image I ∈ ℝ^(H×W×3) independently. Descriptor Branch: four multi-scale feature maps {x_res^l}_{l=1}^L are extracted by passing I through a ResNet variant 𝔽_res. An additional feature map is obtained by applying a simple CNN to the last feature map, and all maps are fed to a transformer encoder 𝔽_e with positional embeddings. We up-sample all feature maps output by 𝔽_e to size H/s × W/s, where s = 4 is the patch size, and sum them to generate the dense descriptor map D. A CNN head 𝔽_m is applied to the descriptor map to estimate a matchability map M. Keypoint Branch: I passes through a simplified ResNet variant 𝔽_cnn to capture multi-scale features {x_cnn^l}_{l=1}^L. The features are up-sampled to size H × W and concatenated into an H × W × 64 feature map. A score map S is estimated by a CNN head 𝔽_s. Final sub-pixel keypoints are detected using DKD.
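To make the two-branch data flow above concrete, the following heavily simplified PyTorch sketch mirrors its structure. The stand-in modules and channel sizes are illustrative assumptions and not the released implementation: a dense transformer encoder replaces the deformable encoder 𝔽_e, positional embeddings and the DKD step are omitted, and only the stated patch size s = 4 and the 64-channel keypoint features follow the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorBranch(nn.Module):
    # F_D: multi-scale CNN features -> encoder (stand-in for the deformable
    # encoder F_e) -> dense descriptor map D and matchability map M at stride s = 4.
    def __init__(self, dim=128, num_levels=5):
        super().__init__()
        # Stand-in for the ResNet variant F_res (4 levels) plus the extra map
        # obtained from a simple CNN on the last level (5 levels in total).
        self.levels = nn.ModuleList(
            nn.Conv2d(3 if i == 0 else dim, dim, 3, stride=2, padding=1)
            for i in range(num_levels)
        )
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.matchability_head = nn.Conv2d(dim, 1, 1)   # F_m

    def forward(self, image):
        b, _, h, w = image.shape
        feats, x = [], image
        for conv in self.levels:                        # multi-scale maps x_res^l
            x = F.relu(conv(x))
            feats.append(x)
        summed = 0
        for f in feats:
            # Flatten to tokens, encode (positional embeddings omitted), restore
            # the spatial layout, upsample to H/4 x W/4 and sum across levels.
            tokens = self.encoder(f.flatten(2).transpose(1, 2))
            f = tokens.transpose(1, 2).reshape(b, -1, *f.shape[-2:])
            summed = summed + F.interpolate(
                f, size=(h // 4, w // 4), mode="bilinear", align_corners=False)
        descriptors = F.normalize(summed, dim=1)                      # D
        matchability = torch.sigmoid(self.matchability_head(summed))  # M
        return descriptors, matchability

class KeypointBranch(nn.Module):
    # F_K: a lightweight CNN F_cnn extracts multi-scale features, which are
    # upsampled to H x W and concatenated into a 64-channel map; a CNN head
    # F_s predicts the score map S. Sub-pixel keypoint extraction (DKD) is omitted.
    def __init__(self, num_levels=4, ch=16):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.Conv2d(3 if i == 0 else ch, ch, 3, stride=2, padding=1)
            for i in range(num_levels)
        )
        self.score_head = nn.Conv2d(num_levels * ch, 1, 1)  # F_s

    def forward(self, image):
        h, w = image.shape[-2:]
        feats, x = [], image
        for conv in self.levels:                        # multi-scale maps x_cnn^l
            x = F.relu(conv(x))
            feats.append(F.interpolate(
                x, size=(h, w), mode="bilinear", align_corners=False))
        return torch.sigmoid(self.score_head(torch.cat(feats, dim=1)))  # S

if __name__ == "__main__":
    img = torch.randn(1, 3, 64, 64)
    D, M = DescriptorBranch()(img)   # (1, 128, 16, 16), (1, 1, 16, 16)
    S = KeypointBranch()(img)        # (1, 1, 64, 64)

A dense encoder over every pixel of every level is what makes this sketch expensive; the deformable attention used in RDD instead attends to a small set of sampled key locations per query, which keeps multi-scale attention over high-resolution feature maps tractable.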

Datasets

We introduce one new training dataset, Air-to-Ground, and two new benchmarks to evaluate our proposed method. The first benchmark, MegaDepth-View, evaluates keypoint detectors and descriptors under large viewpoint and scale variations. The second is an Air-to-Ground benchmark, designed to evaluate keypoint detectors and descriptors across different altitudes. We also use the standard MegaDepth dataset for evaluation.


Figure 3. Example Pairs from MegaDepth-View and Air-to-Ground. The top section shows example pairs from the MegaDepth-View benchmark, which emphasizes large viewpoint shifts and scale differences. The bottom section presents example pairs from the Air-to-Ground dataset/benchmark, designed for the novel task of matching aerial images with ground images.

Citation

@inproceedings{gonglin2025rdd,
    title     = {RDD: Robust Feature Detector and Descriptor using Deformable Transformer},
    author    = {Chen, Gonglin and Fu, Tianwen and Chen, Haiwei and Teng, Wenbin and Xiao, Hanyuan and Zhao, Yajie},
    booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}

Acknowledgments

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number 140D0423C0075. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government. We would like to thank Yayue Chen for her help with visualization.