SegMASt3R: Geometry Grounded Segment Matching

Advances in Neural Information Processing Systems (NeurIPS), 2025
Spotlight 🌟
¹IIIT Hyderabad, India      ²Heidelberg University, Germany      ³MBZUAI, UAE
* Denotes equal contribution

Abstract

Segment matching is an important intermediate task in computer vision that establishes correspondences between semantically or geometrically coherent regions across images. Unlike keypoint matching, which focuses on localized features, segment matching captures structured regions, offering greater robustness to occlusions, lighting variations, and viewpoint changes. In this paper, we leverage the spatial understanding of 3D foundation models to tackle wide-baseline segment matching, a challenging setting involving extreme viewpoint shifts. We propose an architecture that uses the inductive bias of these 3D foundation models to match segments across image pairs with up to 180° rotation. Extensive experiments show that our approach outperforms state-of-the-art methods, including the SAM2 video propagator and local feature matching methods, by up to 30% in AUPRC on the ScanNet++ and Replica datasets. We further demonstrate the benefits of the proposed model on relevant downstream tasks, including 3D instance segmentation and object-relative navigation.

Method

SegMASt3R Architecture

Pipeline Overview: A wide-baseline image pair is processed by a frozen MASt3R backbone to extract patch-level features. These features, together with masks from a parallel segmentation module (or ground-truth masks), are aggregated by the segment-feature head into per-segment descriptors, which are then matched across images via a differentiable optimal-transport layer to produce the final segment matches.
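For intuition, the segment-feature aggregation step can be viewed as masked average pooling of the backbone's dense feature map over each segment mask. Below is a minimal PyTorch sketch under that assumption; the function name, tensor shapes, and pooling choice are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pool_segment_features(patch_feats, seg_masks):
    """Aggregate dense features into per-segment descriptors.

    patch_feats: (H, W, D) features from the frozen MASt3R backbone,
                 assumed upsampled to image resolution.
    seg_masks:   (S, H, W) binary masks, one per segment.
    Returns:     (S, D) L2-normalized per-segment descriptors.
    """
    masks = seg_masks.flatten(1).float()               # (S, H*W)
    feats = patch_feats.flatten(0, 1)                  # (H*W, D)
    summed = masks @ feats                             # (S, D)
    area = masks.sum(dim=1, keepdim=True).clamp(min=1)
    desc = summed / area                               # masked average pooling
    return F.normalize(desc, dim=-1)
```

The resulting descriptors from both images then enter the differentiable optimal-transport matching layer (sketched under Technical Highlights below).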

Key Results

ScanNet++ Performance (AUPRC / R@1 / R@5)

| Method | 0°–45° | 45°–90° | 90°–135° | 135°–180° |
| --- | --- | --- | --- | --- |
| Local Feature Matching | | | | |
| SP-LG | 42.1 / 45.6 / 51.2 | 33.5 / 36.9 / 43.1 | 15.9 / 19.7 / 26.2 | 6.1 / 9.3 / 14.6 |
| GiM-DKM | 59.1 / 64.9 / 69.7 | 54.9 / 60.2 / 66.1 | 39.6 / 44.5 / 51.8 | 21.3 / 25.9 / 32.7 |
| RoMA | 61.6 / 68.7 / 73.5 | 58.9 / 66.4 / 73.0 | 47.4 / 56.1 / 65.5 | 30.0 / 39.5 / 49.7 |
| MASt3R (LFM) | 59.5 / 68.3 / 74.2 | 57.3 / 65.6 / 72.5 | 52.9 / 60.3 / 68.9 | 45.4 / 52.6 / 62.2 |
| Segment Matching | | | | |
| SAM2 | 61.9 / 64.6 / 67.5 | 46.6 / 50.1 / 54.0 | 27.9 / 32.5 / 37.2 | 17.0 / 21.6 / 25.4 |
| DINOv2 | 57.9 / 66.7 / 87.4 | 43.0 / 55.9 / 83.2 | 33.5 / 48.0 / 78.0 | 32.4 / 46.0 / 75.6 |
| SegVLAD | 44.2 / 58.6 / 81.4 | 32.1 / 49.5 / 76.5 | 23.2 / 42.2 / 70.5 | 20.0 / 39.6 / 66.8 |
| MASt3R (SegMatch) | 51.7 / 54.6 / 69.9 | 45.6 / 49.8 / 68.5 | 41.4 / 47.9 / 69.2 | 39.5 / 48.7 / 72.6 |
| SegMASt3R (Ours) | 92.8 / 93.6 / 98.0 | 91.1 / 92.2 / 97.6 | 88.0 / 89.5 / 96.8 | 83.6 / 85.9 / 95.9 |

SegMASt3R improves AUPRC and recall across all pose-bins on ScanNet++.
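For reference, R@k counts a segment in the first image as correctly matched when its ground-truth counterpart appears among its top-k scoring candidates in the second image. A minimal sketch of such a metric, assuming every query segment has a ground-truth match (the paper's exact evaluation protocol may differ):

```python
import torch

def recall_at_k(scores, gt_idx, k=5):
    """Fraction of query segments whose ground-truth match is in the top-k.

    scores: (S1, S2) segment-to-segment similarity matrix.
    gt_idx: (S1,) index in image 2 of the ground-truth match for each
            segment in image 1 (assumed to always exist here).
    """
    topk = scores.topk(k, dim=1).indices               # (S1, k)
    hits = (topk == gt_idx.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```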


Cross-Dataset Generalization on Replica

Each cell reports AUPRC / R@1 / R@5.

| Method | 0°–45° | 45°–90° | 90°–135° | 135°–180° |
| --- | --- | --- | --- | --- |
| Local Feature Matching | | | | |
| MASt3R (LFM) | 78.2 / 86.5 / 89.4 | 69.5 / 77.6 / 81.0 | 48.0 / 60.4 / 64.6 | 32.5 / 49.0 / 54.1 |
| Segment Matching | | | | |
| MASt3R (SegMatch) | 52.2 / 57.5 / 81.2 | 39.1 / 51.0 / 78.6 | 23.6 / 45.9 / 77.2 | 17.2 / 43.8 / 75.7 |
| SAM2 | 80.09 / 82.28 / 84.61 | 54.58 / 62.03 / 65.41 | 40.69 / 53.72 / 56.67 | 37.78 / 54.59 / 56.42 |
| DINOv2 | 55.85 / 74.25 / 96.55 | 31.12 / 59.64 / 92.84 | 21.71 / 57.68 / 92.33 | 17.29 / 59.28 / 89.64 |
| SegMASt3R (Ours) | 95.0 / 96.0 / 98.6 | 86.2 / 91.2 / 96.4 | 73.4 / 85.2 / 95.7 | 68.4 / 83.8 / 94.8 |

SegMASt3R maintains high AUPRC and recall across pose-bins under distribution shift.


Robustness to Noisy Masks (ScanNet / FastSAM & SAM2)

Each cell reports AUPRC / R@1 / R@5.

| Method | 0°–45° | 45°–90° | 90°–135° | 135°–180° | Overall |
| --- | --- | --- | --- | --- | --- |
| SAM2 (Video Prop) | 63.89 / 75.16 / 76.81 | 50.29 / 58.84 / 61.5 | 31.42 / 38.23 / 41.75 | 20.19 / 25.93 / 28.75 | 41.47 / 49.56 / 52.22 |
| MASt3R | 34.37 / 40.14 / 52.03 | 28.62 / 34.43 / 45.75 | 24.72 / 30.67 / 42.42 | 23.70 / 29.74 / 42.20 | 27.86 / 33.75 / 45.60 |
| DINOv2 | 39.20 / 45.63 / 61.09 | 25.96 / 32.96 / 48.83 | 19.91 / 27.01 / 41.73 | 17.37 / 24.55 / 37.84 | 25.62 / 32.54 / 47.38 |
| SegMASt3R (AMG / FastSAM) | 70.89 / 74.43 / 76.94 | 68.25 / 72.05 / 74.55 | 64.44 / 68.83 / 71.78 | 60.32 / 65.48 / 68.27 | 65.98 / 70.20 / 72.89 |

Even with noisy masks from FastSAM / SAM2, SegMASt3R maintains high precision and recall.


Outdoor Generalization: MapFree (IoU across pose-bins)

| Method | Train Dataset | Eval Dataset | Overall | 0°–45° | 45°–90° | 90°–135° | 135°–180° |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DINOv2 (off-the-shelf) | Multiple | MapFree | 84.4 | 85.4 | 85.2 | 83.5 | 83.8 |
| MASt3R (Vanilla) | Multiple | MapFree | 69.2 | 73.4 | 69.8 | 70.0 | 66.1 |
| SegMASt3R (SPP) | ScanNet++ | MapFree | 75.2 | 75.2 | 74.6 | 76.5 | 74.5 |
| SegMASt3R (SPP + Dustbin MF) | ScanNet++ | MapFree | 88.7 | 88.6 | 88.6 | 88.5 | 93.9 |
| SegMASt3R (MF) | MapFree | MapFree | 93.7 | 93.3 | 93.7 | 93.9 | 93.9 |

Evaluated with pseudo-GT masks from SAM2 propagation. Adding a dustbin parameter already improves indoor→outdoor transfer, while direct training on MapFree gives the best performance across all pose bins.

Qualitative Results

Wide-Baseline Segment Matching (ScanNet++)

Reference image pairs and SegMASt3R matches demonstrating robustness to 180° viewpoint changes and perceptual aliasing. Colors indicate matched segments; correct matches share a consistent color across views.

Wide-baseline matching example 1
Wide-baseline matching example 2
Wide-baseline matching example 3
Wide-baseline matching example 4

Comparison with SAM2 Video Propagator

Below are several examples comparing SAM2 two-frame propagation with SegMASt3R under extreme viewpoint changes. Each row shows: (1) the reference / query image pair, (2) SAM2 matches, (3) SegMASt3R matches. Note the cases of instance aliasing (multiple similar objects), where SAM2 fails while SegMASt3R associates instances correctly.

SAM2 vs SegMASt3R comparison 1
SAM2 vs SegMASt3R comparison 2
Additional SAM2 comparison
Additional SAM2 comparison 2
SAM2 comparison - chair matching
SAM2 comparison - wheelchair propagation

Navigation Demo Videos

Watch SegMASt3R guiding RoboHop on HM3D episodes: the left half of each video shows vanilla RoboHop, which uses the SuperPoint + LightGlue matcher; the right half uses SegMASt3R as the matcher.

nav_bed_20 — Success vs Failure (SegMASt3R succeeds)

nav_chair_8 — Fewer steps & stable localization

nav_chair_55 — Success vs Failure (SegMASt3R succeeds)

nav_sofa_9 — Fewer steps and stable behavior

Viewing guide: the top of each video shows the overhead / map view, the middle shows the segment visualization, and the bottom shows the egocentric RGB feed. The videos illustrate higher success rates and fewer steps when using SegMASt3R.

Downstream Applications

3D Instance Mapping (Replica) — AP / AP@50

Each cell reports AP / AP@50.

| Method | office0 | office1 | office2 | office3 | office4 | room0 | room1 | room2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ConceptGraphs (MobileSAM masks) | 11.84 / 28.43 | 20.31 / 43.79 | 8.63 / 22.82 | 8.07 / 22.83 | 9.46 / 24.73 | 12.23 / 34.34 | 5.83 / 12.96 | 7.83 / 23.82 |
| ConceptGraphs (GT masks) | 43.53 / 69.68 | 22.48 / 40.71 | 43.46 / 60.69 | 32.06 / 53.44 | 39.63 / 68.22 | 44.89 / 69.64 | 17.96 / 36.53 | 25.93 / 43.63 |
| SegMASt3R (Ours, GT masks) | 79.93 / 87.17 | 54.89 / 64.42 | 64.00 / 85.50 | 58.02 / 79.93 | 67.48 / 85.01 | 71.02 / 91.22 | 64.09 / 85.50 | 56.35 / 76.66 |

Key takeaways:

  • Large AP gains over MobileSAM-based ConceptGraphs when using SegMASt3R (GT-mask setting), especially on office scenes.
  • Robust to re-entry: geometry-aware matching plus 3D-IoU filtering preserves instance identity across long trajectories (see the sketch after this list).
  • Qualitative comparisons follow below (two representative examples).
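Before the qualitative examples, here is a rough sketch of the 3D-IoU filtering idea from the takeaways above, using hypothetical axis-aligned bounding boxes; the threshold and merge policy are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def aabb_iou(box_a, box_b):
    """IoU between two axis-aligned 3D boxes given as (min_xyz, max_xyz)."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter + 1e-9)

def merge_or_spawn(instances, new_box, iou_thresh=0.25):
    """Merge a detection into an existing 3D instance when their boxes
    overlap enough; otherwise register a new instance (handles re-entry)."""
    for inst_id, box in instances.items():
        if aabb_iou(box, new_box) >= iou_thresh:
            return inst_id
    new_id = max(instances, default=-1) + 1
    instances[new_id] = new_box
    return new_id
```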

Qualitative comparison: each example shows, from left to right, the ground-truth RGB, the ConceptGraphs map (MobileSAM masks), and the SegMASt3R map. Colors denote object instances. Note the reduced over-segmentation and the more consistent instance identities with SegMASt3R.

Ground Truth RGB

ConceptGraphs Baseline

SegMASt3R Result


Technical Highlights

Key Contributions

  • 3D Foundation Model Integration: Leverages MASt3R's spatial understanding for segment matching
  • Segment-Feature Head: Novel adapter that transforms patch-level features to segment-level descriptors
  • Differentiable Matching: Optimal transport-based matching with a learnable dustbin for unmatched segments (see the sketch after this list)
  • Wide-Baseline Robustness: Handles extreme viewpoint changes up to 180° rotation
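
The matching layer can be sketched as Sinkhorn normalization over a score matrix augmented with a learnable dustbin, in the spirit of SuperGlue-style matchers. A minimal PyTorch sketch; the iteration count, the uniform-marginal simplification, and all names are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SinkhornMatcher(nn.Module):
    """Differentiable matching with a learnable dustbin score that
    absorbs segments with no counterpart in the other image."""

    def __init__(self, n_iters=20):
        super().__init__()
        self.n_iters = n_iters
        self.dustbin = nn.Parameter(torch.tensor(1.0))  # learnable score

    def forward(self, scores):
        # scores: (S1, S2) similarities between per-segment descriptors.
        S1, S2 = scores.shape
        # Append a dustbin row and column so unmatched segments have
        # somewhere to send their probability mass.
        bins_col = self.dustbin.expand(S1, 1)
        bins_row = self.dustbin.expand(1, S2 + 1)
        log_p = torch.cat([torch.cat([scores, bins_col], dim=1),
                           bins_row], dim=0)
        # Sinkhorn in log space: alternate row/column normalizations
        # to approach a doubly stochastic transport plan.
        for _ in range(self.n_iters):
            log_p = log_p - log_p.logsumexp(dim=1, keepdim=True)
            log_p = log_p - log_p.logsumexp(dim=0, keepdim=True)
        return log_p.exp()  # (S1+1, S2+1); last row/column = dustbin
```

At inference, mutual nearest neighbors in the resulting transport plan (excluding the dustbin row and column) can be taken as the final segment matches.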

Training Details

  • Dataset: 860k image pairs from 140 ScanNet++ scenes
  • Training time: 22 hours on single RTX A6000 GPU
  • Inference speed: 0.579 seconds per image pair
  • Architecture: Frozen MASt3R backbone + trainable segment-feature head (see the sketch below)
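
A minimal sketch of this setup, with stand-in modules for the two components; the optimizer choice and learning rate are illustrative assumptions, not values reported above.

```python
import torch
import torch.nn as nn

backbone = nn.Identity()              # stand-in for the frozen MASt3R backbone
segment_head = nn.Linear(1024, 256)   # stand-in for the segment-feature head

# Freeze the backbone; only the segment-feature head receives gradients.
for p in backbone.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(segment_head.parameters(), lr=1e-4)
```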

Citation

@inproceedings{segmast3r2025,
  title={SegMASt3R: Geometry Grounded Segment Matching},
  author={Anonymous Authors},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}