RDD: Robust Feature Detector and Descriptor using Deformable Transformer

CVPR 2025

Gonglin Chen1,2, Tianwen Fu1,2, Haiwei Chen1,2, Wenbin Teng1,2, Hanyuan Xiao1,2, Yajie Zhao1,2

1Institute for Creative Technologies     2University of Southern California    


Figure 1. Our proposed method effectively performs both sparse and dense feature matching, referred to as RDD and RDD*, respectively, as shown in the top section. RDD demonstrates its ability to extract accurate keypoints and robust descriptors, enabling reliable matching even under significant scale and viewpoint variations, as illustrated in the bottom section.

Abstract

As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present Robust Deformable Detector (RDD), a novel and robust keypoint detector/descriptor leveraging the deformable transformer, which captures global context and geometric invariance through deformable self-attention mechanisms. Specifically, we observed that deformable attention focuses on key locations, effectively reducing the search space complexity and modeling the geometric invariance. Furthermore, we collected an Air-to-Ground dataset for training in addition to the standard MegaDepth dataset. Our proposed method outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks and is also capable of semi-dense matching. To ensure comprehensive evaluation, we introduce two challenging benchmarks: one emphasizing large viewpoint and scale variations, and the other being an Air-to-Ground benchmark, an evaluation setting that has recently been gaining popularity for 3D reconstruction across different altitudes.

Results

Relative Pose Estimation


Figure 2. Qualitative Results on Three Benchmarks. RDD* and RDD are qualitatively compared to DeDoDe-V2-G, ALIKED, and XFeat*. RDD and RDD* are more robust than DeDoDe-V2-G and ALIKED under challenging scenarios such as large scale and viewpoint changes. Red indicates an epipolar error beyond 1 × 10⁻⁴ (in normalized image coordinates).
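The coloring above relies on a per-match epipolar error computed in normalized image coordinates. A common choice for this error is the symmetric epipolar distance with respect to the essential matrix; the NumPy sketch below follows that convention, and the function name, the epsilon stabilizer, and the exact error variant are assumptions rather than the paper's evaluation code.

import numpy as np

def symmetric_epipolar_error(E, pts0, pts1):
    """Symmetric epipolar error for correspondences in normalized image coordinates.

    E    : (3, 3) essential matrix relating the two views.
    pts0 : (N, 2) keypoints in image 0 (normalized, i.e. with K^-1 applied).
    pts1 : (N, 2) corresponding keypoints in image 1.
    Returns an (N,) array; matches above a threshold such as 1e-4 would be drawn in red.
    """
    p0 = np.concatenate([pts0, np.ones((len(pts0), 1))], axis=1)  # homogeneous (N, 3)
    p1 = np.concatenate([pts1, np.ones((len(pts1), 1))], axis=1)  # homogeneous (N, 3)

    l1 = p0 @ E.T                      # epipolar lines in image 1, one per row
    l0 = p1 @ E                        # epipolar lines in image 0, one per row
    d = np.sum(p1 * l1, axis=1)        # algebraic residual x1^T E x0

    # Squared distance to both epipolar lines, made symmetric across the two images.
    return d ** 2 * (1.0 / (l1[:, 0] ** 2 + l1[:, 1] ** 2 + 1e-12)
                     + 1.0 / (l0[:, 0] ** 2 + l0[:, 1] ** 2 + 1e-12))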

MegaDepth

Table 1. SotA comparison on MegaDepth. Results are measured in AUC (higher is better). The top 4,096 features are used for all sparse matching methods. Best in bold, second best underlined.
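Here AUC refers to the area under the pose-accuracy curve: for each image pair the relative pose error (commonly the maximum of the rotation and translation angular errors) is computed, and the cumulative accuracy is integrated up to angular thresholds such as 5°, 10°, and 20°. The sketch below follows this common recipe; the thresholds and the trapezoidal integration are assumptions about standard practice, not code from the paper.

import numpy as np

def pose_auc(errors, thresholds=(5.0, 10.0, 20.0)):
    """Area under the cumulative pose-accuracy curve, one value per threshold (degrees).

    errors : per-pair pose errors, e.g. max(rotation error, translation error) in degrees.
    Returns a list of AUC values in [0, 1].
    """
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)     # cumulative accuracy
    errors = np.concatenate(([0.0], errors))                # start the curve at (0, 0)
    recall = np.concatenate(([0.0], recall))

    aucs = []
    for t in thresholds:
        idx = np.searchsorted(errors, t)
        e = np.concatenate((errors[:idx], [t]))             # clip the curve at t
        r = np.concatenate((recall[:idx], [recall[idx - 1] if idx > 0 else 0.0]))
        # Trapezoidal integration, normalized by the threshold.
        aucs.append(np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0) / t)
    return aucs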

Air-to-Ground

Table 2. SotA comparison on the Air-to-Ground benchmark. Keypoints and descriptors are matched using dual-softmax MNN. Results are measured in AUC (higher is better). Best in bold, second best underlined.
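Dual-softmax MNN matching scores every descriptor pair by the product of a row-wise and a column-wise softmax over the similarity matrix and keeps only mutual nearest neighbors above a confidence threshold. The PyTorch sketch below illustrates this standard procedure; the temperature, the threshold, and the function name are illustrative assumptions rather than the settings used in the paper.

import torch

def dual_softmax_mnn(desc0, desc1, temperature=0.1, threshold=0.0):
    """Match two sets of L2-normalized descriptors with dual-softmax + mutual NN.

    desc0 : (N, D) descriptors from image 0.
    desc1 : (M, D) descriptors from image 1.
    Returns (K, 2) index pairs and their (K,) confidence scores.
    """
    sim = desc0 @ desc1.t() / temperature            # (N, M) similarity matrix
    conf = sim.softmax(dim=1) * sim.softmax(dim=0)   # dual-softmax confidence

    # Mutual nearest neighbours: i's best match is j and j's best match is i.
    best1 = conf.argmax(dim=1)                       # (N,) best index in image 1
    best0 = conf.argmax(dim=0)                       # (M,) best index in image 0
    idx0 = torch.arange(conf.shape[0], device=conf.device)
    mutual = best0[best1] == idx0

    scores = conf[idx0, best1]
    keep = mutual & (scores > threshold)
    matches = torch.stack([idx0[keep], best1[keep]], dim=1)
    return matches, scores[keep]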


Table 3. More results on the Air-to-Ground benchmark. Results are measured in AUC (higher is better). Best in bold, second best underlined.

Method


An overview of our network architecture. The Descriptor Branch 𝔽_D and the Keypoint Branch 𝔽_K process an input image I ∈ ℝ^(H×W×3) independently. Descriptor Branch: four multi-scale feature maps {x_res^l}_{l=1}^L are extracted by passing I through a ResNet variant 𝔽_res. An additional feature map is obtained by applying a simple CNN to the last feature map, and all maps are fed to a transformer encoder 𝔽_e with positional embeddings. We up-sample all feature maps output by 𝔽_e to size H/s × W/s, where s = 4 is the patch size, and sum them to generate the dense descriptor map D. A CNN head 𝔽_m is applied to the descriptor map to estimate a matchability map M. Keypoint Branch: I passes through a simplified ResNet variant 𝔽_cnn to capture multi-scale features {x_cnn^l}_{l=1}^L. The features are up-sampled to size H × W and concatenated into an H × W × 64 feature map. A score map S is estimated by a CNN head 𝔽_s. Final sub-pixel keypoints are detected using DKD.
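To make the two-branch data flow above concrete, the following heavily simplified PyTorch sketch mirrors its structure. The stand-in modules and channel sizes are illustrative assumptions and not the released implementation: a dense transformer encoder replaces the deformable encoder 𝔽_e, positional embeddings and the DKD step are omitted, and only the stated patch size s = 4 and the 64-channel keypoint features follow the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorBranch(nn.Module):
    # F_D: multi-scale CNN features -> encoder (stand-in for the deformable
    # encoder F_e) -> dense descriptor map D and matchability map M at stride s = 4.
    def __init__(self, dim=128, num_levels=5):
        super().__init__()
        # Stand-in for the ResNet variant F_res (4 levels) plus the extra map
        # obtained from a simple CNN on the last level (5 levels in total).
        self.levels = nn.ModuleList(
            nn.Conv2d(3 if i == 0 else dim, dim, 3, stride=2, padding=1)
            for i in range(num_levels)
        )
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.matchability_head = nn.Conv2d(dim, 1, 1)   # F_m

    def forward(self, image):
        b, _, h, w = image.shape
        feats, x = [], image
        for conv in self.levels:                        # multi-scale maps x_res^l
            x = F.relu(conv(x))
            feats.append(x)
        summed = 0
        for f in feats:
            # Flatten to tokens, encode (positional embeddings omitted), restore
            # the spatial layout, upsample to H/4 x W/4 and sum across levels.
            tokens = self.encoder(f.flatten(2).transpose(1, 2))
            f = tokens.transpose(1, 2).reshape(b, -1, *f.shape[-2:])
            summed = summed + F.interpolate(
                f, size=(h // 4, w // 4), mode="bilinear", align_corners=False)
        descriptors = F.normalize(summed, dim=1)                      # D
        matchability = torch.sigmoid(self.matchability_head(summed))  # M
        return descriptors, matchability

class KeypointBranch(nn.Module):
    # F_K: a lightweight CNN F_cnn extracts multi-scale features, which are
    # upsampled to H x W and concatenated into a 64-channel map; a CNN head
    # F_s predicts the score map S. Sub-pixel keypoint extraction (DKD) is omitted.
    def __init__(self, num_levels=4, ch=16):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.Conv2d(3 if i == 0 else ch, ch, 3, stride=2, padding=1)
            for i in range(num_levels)
        )
        self.score_head = nn.Conv2d(num_levels * ch, 1, 1)  # F_s

    def forward(self, image):
        h, w = image.shape[-2:]
        feats, x = [], image
        for conv in self.levels:                        # multi-scale maps x_cnn^l
            x = F.relu(conv(x))
            feats.append(F.interpolate(
                x, size=(h, w), mode="bilinear", align_corners=False))
        return torch.sigmoid(self.score_head(torch.cat(feats, dim=1)))  # S

if __name__ == "__main__":
    img = torch.randn(1, 3, 64, 64)
    D, M = DescriptorBranch()(img)   # (1, 128, 16, 16), (1, 1, 16, 16)
    S = KeypointBranch()(img)        # (1, 1, 64, 64)

A dense encoder over every pixel of every level is what makes this sketch expensive; the deformable attention used in RDD instead attends to a small set of sampled key locations per query, which keeps multi-scale attention over high-resolution feature maps tractable.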

Datasets

We introduce one new training dataset, Air-to-Ground, and two new benchmarks to evaluate our proposed method. The first benchmark, MegaDepth-View, evaluates keypoint detectors and descriptors under large viewpoint and scale variations. The second is an Air-to-Ground benchmark, designed to evaluate keypoint detectors and descriptors across different altitudes. We also use the standard MegaDepth dataset for evaluation.


Figure 3. Example Pairs from MegaDepth-View and Air-to-Ground. The top section shows example pairs from the MegaDepth-View benchmark, which emphasizes large viewpoint shifts and scale differences. The bottom section presents example pairs from the Air-to-Ground dataset/benchmark, designed for the novel task of matching aerial images with ground images.

Citation

@inproceedings{gonglin2025rdd,
    title     = {RDD: Robust Feature Detector and Descriptor using Deformable Transformer},
    author    = {Chen, Gonglin and Fu, Tianwen and Chen, Haiwei and Teng, Wenbin and Xiao, Hanyuan and Zhao, Yajie},
    booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}

Acknowledgments

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number 140D0423C0075. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government. We would like to thank Yayue Chen for her help with visualization.