This repository is the official implementation of FrozenSeg introduced in the paper:
FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation
Open-vocabulary segmentation is challenging: it requires segmenting and recognizing objects from an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite strong performance, these methods still struggle to generate precise mask proposals for unseen categories and scenarios, which ultimately limits segmentation quality. To address this, we introduce FrozenSeg, a novel approach that integrates spatial knowledge from a localization foundation model (e.g., SAM) with semantic knowledge extracted from a ViL model (e.g., CLIP) in a synergistic framework. Using the ViL model's visual encoder as the feature backbone, we inject space-aware features into the learnable queries and the CLIP features within the transformer decoder. In addition, we devise a mask-proposal ensemble strategy to further improve recall and mask quality. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models and focus optimization solely on a lightweight transformer decoder for mask proposal generation, the performance bottleneck. Extensive experiments show that FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data and tested in a zero-shot manner.
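To make this design concrete, below is a minimal PyTorch sketch of the idea, not the official implementation: the `clip_visual_encoder` and `sam_encoder` stand-ins, module names, feature shapes, and hyperparameters are all illustrative assumptions. It shows the two frozen backbones, the injection of SAM's space-aware features into the learnable queries via cross-attention, and a lightweight trainable decoder producing mask proposals.

```python
# Illustrative-only sketch of the FrozenSeg design (not the official code).
# `clip_visual_encoder` / `sam_encoder` are hypothetical stand-ins assumed to
# return token features of shape (B, N, dim) and (B, M, dim) respectively.
import torch
import torch.nn as nn

class FrozenSegSketch(nn.Module):
    def __init__(self, clip_visual_encoder, sam_encoder, num_queries=100, dim=256):
        super().__init__()
        # Both foundation models stay frozen; only the decoder below trains.
        self.clip = clip_visual_encoder.eval()
        self.sam = sam_encoder.eval()
        for p in list(self.clip.parameters()) + list(self.sam.parameters()):
            p.requires_grad = False

        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Cross-attention injecting SAM's space-aware features into the queries.
        self.spatial_injector = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Lightweight transformer decoder over CLIP features (the trained part).
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=6,
        )
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, image):
        with torch.no_grad():                  # frozen backbones
            clip_feats = self.clip(image)      # (B, N, dim): semantic features
            sam_feats = self.sam(image)        # (B, M, dim): spatial features
        q = self.queries.unsqueeze(0).expand(image.shape[0], -1, -1)
        # Space-aware injection: queries attend to SAM features.
        q, _ = self.spatial_injector(q, sam_feats, sam_feats)
        q = self.decoder(q, clip_feats)
        # Mask proposals: query embeddings against per-location CLIP features.
        return torch.einsum("bqc,bnc->bqn", self.mask_head(q), clip_feats)
```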
See Installation Instructions.
See Preparing Datasets.
See Getting Started.
COCO is the training dataset; all other benchmarks (ADE20K/A-150, Cityscapes, Mapillary Vistas, BDD 100K, A-847, PC-459, PAS-21, LVIS) are evaluated zero-shot.

| Model | A-150 PQ | A-150 mAP | A-150 mIoU | A-150 FWIoU | Cityscapes PQ | Cityscapes mAP | Cityscapes mIoU | Mapillary Vistas PQ | Mapillary Vistas mIoU | BDD 100K PQ | BDD 100K mIoU | A-847 mIoU | A-847 FWIoU | PC-459 mIoU | PC-459 FWIoU | PAS-21 mIoU | PAS-21 FWIoU | LVIS APr | COCO PQ | COCO mAP | COCO mIoU | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FrozenSeg (ResNet50x64) | 23.1 | 13.5 | 30.7 | 56.6 | 45.2 | 28.9 | 56.0 | 18.1 | 27.7 | 12.9 | 46.2 | 11.8 | 52.8 | 18.7 | 60.1 | 82.3 | 92.1 | 23.5 | 55.7 | 47.4 | 65.4 | checkpoint |
| FrozenSeg (ConvNeXt-Large) | 25.9 | 16.4 | 34.4 | 59.9 | 45.8 | 28.4 | 56.8 | 18.5 | 27.3 | 19.3 | 52.3 | 14.8 | 51.4 | 19.7 | 60.2 | 82.5 | 92.1 | 25.6 | 56.2 | 47.3 | 65.5 | checkpoint |
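Since FrozenSeg builds on Detectron2 (see the acknowledgements below), the released checkpoints should be loadable through Detectron2's standard utilities. The following is a minimal sketch under that assumption; the config path and checkpoint filename are placeholders, not actual repository files.

```python
# Hypothetical loading sketch following Detectron2 conventions.
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.modeling import build_model

cfg = get_cfg()
# FrozenSeg likely extends the base config with project-specific keys, so the
# project's own add-config helper would need to be applied before this merge.
cfg.merge_from_file("configs/frozenseg_example.yaml")  # placeholder path
model = build_model(cfg)
DetectionCheckpointer(model).load("frozenseg_convnext_large.pth")  # placeholder
model.eval()
```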
If you use FrozenSeg in your research, please use the following BibTeX entry.
@misc{FrozenSeg,
  title={FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation},
  author={Xi Chen and Haosen Yang and Sheng Jin and Xiatian Zhu and Hongxun Yao},
  publisher={arXiv:5835590},
  year={2024}
}
This project builds upon Detectron2, Mask2Former, Segment Anything, OpenCLIP, and FC-CLIP.