
Native Segmentation Vision Transformers

Guillem Brasó, Aljoša Ošep, Laura Leal-Taixé
NVIDIA

SeNaTra is a segmentation-centric backbone architecture built around a spatial grouping layer. It generates high-quality segmentation masks directly through feature extraction, allowing hierarchical segmentation to emerge without mask supervision.

Figure: input image, local groups 1, local groups 2, final dense groups, and class activations.

Abstract


Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields hierarchical segmentation that arises natively in the feature extraction process, giving rise to our Segmentation Native Transformer (SeNaTra). We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks.

Overview


SeNaTra introduces a new family of hierarchical vision backbones, similar in structure to Swin Transformers, with four feature extraction and downsampling stages producing multi-scale features. The key innovation lies in replacing conventional downsampling layers with spatial grouping layers that leverage iterative attention mechanisms to learn content-dependent token assignments. These assignments preserve boundaries akin to superpixels and compose into a hierarchical image segmentation throughout the backbone. This segmentation can be reversed for upsampling, enabling native segmentation capabilities without additional heads or modules.
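To make the grouping mechanism concrete, the following is a minimal PyTorch sketch of a content-dependent spatial grouping layer in the spirit described above: a reduced set of group tokens iteratively cross-attends to the input tokens, and the resulting soft assignment matrix doubles as a reversible downsampling operator. All names, shapes, and hyperparameters are illustrative assumptions, not the exact SeNaTra implementation.

    # Illustrative sketch only: a content-dependent spatial grouping layer based on
    # iterative cross-attention between group tokens and input tokens.
    import torch
    import torch.nn as nn


    class SpatialGroupingLayer(nn.Module):
        """Assigns N input tokens to M < N group tokens via iterative attention.

        Returns the pooled group features (the reduced token set) and the soft
        assignment matrix, which can later be reversed for upsampling or mask
        prediction.
        """

        def __init__(self, dim: int, iters: int = 3):
            super().__init__()
            self.iters = iters
            self.scale = dim ** -0.5
            self.norm_tokens = nn.LayerNorm(dim)
            self.norm_groups = nn.LayerNorm(dim)
            self.to_q = nn.Linear(dim, dim)
            self.to_k = nn.Linear(dim, dim)
            self.to_v = nn.Linear(dim, dim)
            self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, tokens: torch.Tensor, init_groups: torch.Tensor):
            # tokens: (B, N, C) input tokens; init_groups: (B, M, C), e.g. a strided
            # subsample of the tokens, standing in for conventional downsampling.
            tokens = self.norm_tokens(tokens)
            k, v = self.to_k(tokens), self.to_v(tokens)
            groups = init_groups
            for _ in range(self.iters):
                q = self.to_q(self.norm_groups(groups))                     # (B, M, C)
                attn = torch.einsum("bmc,bnc->bmn", q, k) * self.scale      # group-token logits
                # Softmax over groups: each input token distributes itself among the
                # groups, which ties the assignment to image content and boundaries.
                assign = attn.softmax(dim=1)                                # (B, M, N)
                weights = assign / assign.sum(dim=-1, keepdim=True).clamp(min=1e-6)
                groups = groups + torch.einsum("bmn,bnc->bmc", weights, v)  # weighted pooling
                groups = groups + self.mlp(self.norm_groups(groups))
            return groups, assign


    # Toy usage: reduce a 56x56 token grid (3136 tokens) to 784 groups.
    x = torch.randn(2, 56 * 56, 96)
    init = x[:, ::4, :]                       # naive strided initialization of group tokens
    layer = SpatialGroupingLayer(dim=96)
    groups, assign = layer(x, init)
    print(groups.shape, assign.shape)         # (2, 784, 96) and (2, 784, 3136)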

Emerging Segmentation from ImageNet Pre-training


SeNaTra can be pre-trained without mask supervision, using either class labels or image-text pairs. In both cases, we observe the emergence of meaningful segmentation masks through pre-training. Below, we show the emerging pixel hierarchies learned exclusively during ImageNet pre-training with class-label supervision, along with their class activations:

Figure: input image, local groups 1, local groups 2, final dense groups, and class activations.
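For intuition on how the dense groups and class activations above could be read out, here is a small illustrative sketch: the soft assignments from consecutive grouping stages are chained to map every final group back to pixels, and per-group class logits are projected through that map. The function names and the plain linear classifier are assumptions for illustration, not the paper's exact procedure.

    # Illustrative sketch: composing stage-wise assignments into dense pixel groups
    # and projecting per-group class logits back to pixels.
    import torch


    def compose_assignments(assignments):
        # assignments: list of (B, M_s, N_s) soft assignments, one per grouping stage,
        # where the tokens of stage s are the groups of stage s-1 (N_s == M_{s-1}).
        dense = assignments[0]                                    # (B, M_1, N_pixels)
        for a in assignments[1:]:
            dense = torch.einsum("bmk,bkn->bmn", a, dense)        # (B, M_s, N_pixels)
        return dense


    def class_activations(group_feats, classifier_weight, dense_assign):
        # group_feats: (B, M, C); classifier_weight: (num_classes, C); dense_assign: (B, M, N)
        logits = group_feats @ classifier_weight.t()              # (B, M, num_classes)
        return torch.einsum("bmn,bmc->bnc", dense_assign, logits)  # (B, N, num_classes)


    # Toy example: two grouping stages over a 16x16 token grid.
    B, N, C, num_classes = 1, 16 * 16, 64, 10
    a1 = torch.rand(B, 64, N).softmax(dim=1)      # 256 tokens -> 64 groups
    a2 = torch.rand(B, 16, 64).softmax(dim=1)     # 64 groups  -> 16 groups
    dense = compose_assignments([a1, a2])         # (1, 16, 256): final groups over pixels
    cams = class_activations(torch.randn(B, 16, C), torch.randn(num_classes, C), dense)
    print(dense.shape, cams.shape)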

Text-supervised Zero-shot Semantic Segmentation


To perform zero-shot semantic segmentation, we prompt our SeNaTra model, trained on image-text pairs, with arbitrary class vocabularies specified as text queries. Thanks to SeNaTra's native segmentation capabilities, we obtain large improvements over state-of-the-art zero-shot methods. Below, we visualize example segmentations learned only from image-text pairs, without any mask supervision:

Figure: input image, local groups 1, local groups 2, dense groups, predicted masks, and ground-truth masks.
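As a rough illustration of the zero-shot procedure, the sketch below matches per-group visual features against text embeddings of the prompted class names and broadcasts the resulting per-group scores to pixels through the composed assignment matrix. How the text embeddings and the visual-to-text projection are produced is left abstract here; all tensor names and shapes are assumptions.

    # Illustrative sketch: zero-shot semantic segmentation from text queries.
    import torch
    import torch.nn.functional as F


    def zero_shot_segment(group_feats, dense_assign, text_embeds, h, w):
        # group_feats:  (B, M, C) final group features projected into the text embedding space
        # dense_assign: (B, M, H*W) composed pixel-to-group assignment
        # text_embeds:  (K, C) one embedding per prompted class name
        g = F.normalize(group_feats, dim=-1)
        t = F.normalize(text_embeds, dim=-1)
        group_logits = g @ t.t()                                   # (B, M, K) group-text similarity
        pixel_logits = torch.einsum("bmn,bmk->bnk", dense_assign, group_logits)
        return pixel_logits.argmax(dim=-1).view(-1, h, w)          # (B, H, W) class indices


    # Toy usage with random tensors standing in for model outputs and text embeddings.
    pred = zero_shot_segment(torch.randn(1, 16, 512),
                             torch.rand(1, 16, 64 * 64).softmax(dim=1),
                             torch.randn(5, 512), h=64, w=64)
    print(pred.shape)  # torch.Size([1, 64, 64])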

Native Backbone-level Segmentation


SeNaTra enables us to streamline panoptic and semantic segmentation models: instead of relying on external segmentation heads such as Mask2Former, we simply classify our backbone's final pixel groups. We refer to this approach as native segmentation. We show empirically that this design outperforms existing backbones paired with strong baseline heads, at a significantly reduced parameter and FLOP count. SeNaTra can also be used with existing heads, resulting in further improvements over existing backbones. Overall, SeNaTra introduces a new, minimalistic paradigm for segmentation model design.

Figure: segmentation with a Mask2Former head vs. native segmentation.
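Below is a minimal sketch of what such a native segmentation readout could look like, assuming the backbone exposes its final group features and the composed pixel-to-group assignment: each group is classified with a single linear layer, and its soft assignment serves as the predicted mask. This is an illustrative simplification; the matching and loss functions used for actual panoptic or semantic training are omitted.

    # Illustrative sketch: classifying the backbone's final pixel groups directly,
    # without an external segmentation head.
    import torch
    import torch.nn as nn


    class NativeSegmentationReadout(nn.Module):
        def __init__(self, dim: int, num_classes: int):
            super().__init__()
            self.classifier = nn.Linear(dim, num_classes + 1)   # +1 for a "no object" class

        def forward(self, group_feats, dense_assign, h, w):
            # group_feats: (B, M, C); dense_assign: (B, M, H*W)
            class_logits = self.classifier(group_feats)                      # (B, M, K+1)
            masks = dense_assign.reshape(*dense_assign.shape[:2], h, w)      # one soft mask per group
            # Semantic map: weight each group's mask by its class scores and sum over groups.
            sem = torch.einsum("bmhw,bmk->bkhw", masks, class_logits.softmax(-1)[..., :-1])
            return class_logits, masks, sem


    readout = NativeSegmentationReadout(dim=256, num_classes=150)
    logits, masks, sem = readout(torch.randn(1, 16, 256),
                                 torch.rand(1, 16, 64 * 64).softmax(dim=1), 64, 64)
    print(logits.shape, masks.shape, sem.shape)  # (1, 16, 151), (1, 16, 64, 64), (1, 150, 64, 64)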

Main Results


Zero-shot Semantic Segmentation

Zero-shot, text-supervised semantic segmentation. We compare our text-supervised zero-shot method to state-of-the-art methods on six datasets and report average mIoU across datasets where applicable. Top performers are shown in bold, runners-up are underlined, and postprocessing techniques (CRF, PAMR) are indicated. SeNaTra outperforms previous models trained from scratch by large margins, and even surpasses or performs on par with CLIP-based models on most datasets, despite being trained on an order of magnitude less data.

Downstream Segmentation Performance

Semantic Segmentation on ADE20k val
Panoptic Segmentation on COCO-panoptic val

Downstream semantic and panoptic segmentation after fine-tuning. We fine-tune ImageNet-pretrained SeNaTra models for downstream panoptic and semantic segmentation and evaluate both standalone native masks and masks obtained with an additional Mask2Former head. Our native masks outperform masks produced with strong existing heads such as Mask2Former, and our Mask2Former-based results surpass those of previous backbones.

Citation



    @article{braso2025native,
        title={{Native Segmentation Vision Transformers}},
        author={Brasó, Guillem and Ošep, Aljoša and Leal-Taixé, Laura},
        journal={arXiv preprint arXiv:2505.16993},
        year={2025}
    }

Paper


Native Segmentation Vision Transformers

Guillem Brasó, Aljoša Ošep, Laura Leal-Taixé


Acknowledgment


We are grateful to Tim Meinhardt for his feedback on the paper and his insightful comments.