SeNaTra is a segmentation-centric backbone architecture built around a spatial grouping layer. It generates high-quality segmentation masks directly during feature extraction, allowing hierarchical segmentation to emerge without any mask supervision.
Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields hierarchical segmentation that arises natively in the feature extraction process, giving rise to our coined Segmentation Native Transformer (SeNaTra). We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks.
SeNaTra introduces a new family of hierarchical vision backbones, similar in structure to Swin Transformers, with four feature extraction and downsampling stages producing multi-scale features. The key innovation lies in replacing conventional downsampling layers with spatial grouping layers that leverage iterative attention mechanisms to learn content-dependent token assignments. These assignments preserve boundaries akin to superpixels and compose into a hierarchical image segmentation throughout the backbone. This segmentation can be reversed for upsampling, enabling native segmentation capabilities without additional heads or modules.
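The sketch below illustrates one way such a grouping layer could be realized, assuming a slot-attention-style iterative cross-attention between a reduced set of group tokens (e.g., a strided subsample of the input tokens) and the full-resolution tokens. All names, dimensions, and the number of iterations are illustrative assumptions, not the released SeNaTra implementation:

```python
# Minimal sketch of a content-aware spatial grouping layer. Each token is
# softly assigned to one of M group tokens via iterative cross-attention,
# and group features are updated as assignment-weighted means of token
# features. Names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class SpatialGroupingLayer(nn.Module):
    def __init__(self, dim: int, num_iters: int = 3):
        super().__init__()
        self.num_iters = num_iters
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.norm_tokens = nn.LayerNorm(dim)
        self.norm_groups = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens: torch.Tensor, groups: torch.Tensor):
        """tokens: (B, N, C) full-resolution tokens.
        groups: (B, M, C) initial group tokens, e.g. a strided subsample of `tokens`.
        Returns updated group tokens (B, M, C) and soft assignments (B, N, M)."""
        tokens = self.norm_tokens(tokens)
        k = self.to_k(tokens)
        v = self.to_v(tokens)
        assign = None
        for _ in range(self.num_iters):
            q = self.to_q(self.norm_groups(groups))
            # Softmax over groups (last dim), so each token distributes its
            # mass across groups: a soft, content-dependent assignment.
            logits = torch.einsum('bnc,bmc->bnm', k, q) * self.scale
            assign = logits.softmax(dim=-1)                          # (B, N, M)
            # Update each group as the assignment-weighted mean of token values.
            denom = assign.sum(dim=1, keepdim=True).clamp(min=1e-6)  # (B, 1, M)
            pooled = torch.einsum('bnm,bnc->bmc', assign / denom, v)
            groups = groups + self.mlp(pooled)
        return groups, assign
```

In such a design, initializing the group tokens from a strided subsample of the input tokens would make the layer a drop-in replacement for a conventional downsampling step, while the returned soft assignments can be composed across stages into a pixel-to-group hierarchy.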
SeNaTra can be pre-trained without mask supervision, using either class labels or image-text pairs. In both cases, we observe the emergence of meaningful segmentation masks through pre-training. Below, we show the pixel hierarchies that emerge purely from ImageNet pre-training with class-label supervision, along with their class activations:
To perform zero-shot semantic segmentation, we prompt our SeNaTra model trained on image-text pairs with arbitrary class vocabularies specified as text queries. Thanks to SeNaTra's native segmentation capabilities, we obtain large improvements over state-of-the-art zero-shot methods. Below, we visualize example segmentations learned only from image-text pairs, without any mask supervision:
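As a sketch of how this could work, assume the final-stage group features live in the same embedding space as the text queries (e.g., via image-text contrastive pre-training) and that the backbone exposes the per-stage soft assignments; group-level similarities to the text embeddings are then propagated back to pixels through the assignment hierarchy. The function and tensor names below are illustrative assumptions, not the actual SeNaTra API:

```python
# Minimal sketch of text-prompted zero-shot segmentation with native grouping.
import torch
import torch.nn.functional as F


def zero_shot_segment(group_feats, text_embeds, assignments):
    """group_feats: (B, M, C) features of the final-stage groups.
    text_embeds: (K, C) embeddings of the K text queries (class names).
    assignments: list of per-stage soft assignments, finest to coarsest,
                 each of shape (B, N_in, N_out) with N_out < N_in.
    Returns per-pixel class predictions of shape (B, N_pixels)."""
    group_feats = F.normalize(group_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Cosine similarity between each final group and each text query.
    group_logits = group_feats @ text_embeds.t()            # (B, M, K)
    # Propagate group-level scores back to pixels through the hierarchy.
    pixel_logits = group_logits
    for assign in reversed(assignments):                    # coarse -> fine
        pixel_logits = torch.einsum('bnm,bmk->bnk', assign, pixel_logits)
    return pixel_logits.argmax(dim=-1)                      # (B, N_pixels)
```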
SeNaTra lets us streamline panoptic and semantic segmentation models: instead of relying on external segmentation heads such as Mask2Former, we simply classify our backbone's final pixel groups. We refer to this approach as native segmentation. We show empirically that this design outperforms existing backbones paired with strong segmentation heads, at a significantly reduced parameter and FLOP count. SeNaTra can also be combined with existing heads, yielding further improvements over prior backbones. Overall, SeNaTra introduces a new, minimalistic paradigm for segmentation model design.
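A minimal sketch of this native-segmentation design for semantic segmentation, assuming the backbone returns final group features together with the composed pixel-to-group assignment; the entire "head" is then a linear classifier over group features (the `backbone` interface and all names are hypothetical, not the released code):

```python
# Minimal sketch of a native semantic segmentation model: no mask head,
# just a linear classifier on the backbone's final pixel groups.
import torch
import torch.nn as nn


class NativeSemanticSegmenter(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        # The entire "head" is a linear classifier over final group features.
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, images: torch.Tensor):
        # Assumed backbone outputs: group_feats (B, M, C) and the composed
        # soft assignment of pixels to final groups, pixel_to_group (B, N, M).
        group_feats, pixel_to_group = self.backbone(images)
        group_logits = self.classifier(group_feats)               # (B, M, K)
        # Per-pixel logits are the group logits spread by the assignment.
        pixel_logits = torch.einsum('bnm,bmk->bnk', pixel_to_group, group_logits)
        return pixel_logits                                        # (B, N, K)
```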
Zero-shot, text-supervised semantic segmentation. We compare our text-supervised zero-shot method to state-of-the-art methods on six datasets and report average mIoU across datasets where applicable. Top performers are shown in bold, runners-up are underlined, and post-processing techniques (CRF, PAMR) are indicated. SeNaTra outperforms previous models trained from scratch by large margins, and even surpasses or performs on par with CLIP-based models on most datasets, despite being trained on an order of magnitude less data.
Downstream semantic and panoptic segmentation after fine-tuning. We fine-tune ImageNet-pretrained SeNaTra models for downstream panoptic and semantic segmentation and evaluate both the standalone native masks and masks obtained with an additional Mask2Former head. Our native masks outperform those produced by strong existing heads such as Mask2Former, and pairing SeNaTra with a Mask2Former head surpasses previous backbones further.
@article{braso2025native,
title={{Native Segmentation Vision Transformers}},
author={Brasó, Guillem and Ošep, Aljoša and Leal-Taixé, Laura},
journal={arXiv preprint arXiv:2505.16993},
year={2025}
}
We are grateful to Tim Meinhardt for his feedback on the paper and his insightful comments.