chenxi52/FrozenSeg

# FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

This repository is the official implementation of FrozenSeg, introduced in the paper:

FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

## Abstract

Open-vocabulary segmentation is challenging: it requires segmenting and recognizing objects from an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite strong performance, these methods still face a fundamental challenge: generating precise mask proposals for unseen categories and scenarios, which ultimately limits segmentation quality. To address this, we introduce FrozenSeg, a novel approach designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge from a ViL model (e.g., CLIP) in a synergistic framework. Taking the ViL model's visual encoder as the feature backbone, we inject space-aware features into the learnable queries and CLIP features within the transformer decoder. In addition, we devise a mask proposal ensemble strategy that further improves recall and mask quality. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models and focus optimization solely on a lightweight transformer decoder for mask proposal generation, the performance bottleneck. Extensive experiments show that FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data and evaluated in a zero-shot manner.
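The core idea of injecting space-aware features into the learnable queries can be sketched as single-head cross-attention, where queries attend over flattened spatial features from the frozen SAM encoder. This is a minimal NumPy sketch of the mechanism, not the paper's actual implementation; the function name `inject_spatial_features` and the single-head, residual formulation are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inject_spatial_features(queries, sam_features):
    """Refine learnable queries with space-aware SAM features via
    single-head cross-attention (queries attend to spatial tokens).

    queries:      (N, C) learnable query embeddings
    sam_features: (HW, C) flattened spatial features from the frozen SAM encoder
    Returns:      (N, C) space-aware queries (residual update)
    """
    C = queries.shape[-1]
    # Attention weights over the HW spatial locations, one row per query.
    attn = softmax(queries @ sam_features.T / np.sqrt(C))  # (N, HW)
    # Residual injection of the attended spatial features into the queries.
    return queries + attn @ sam_features                   # (N, C)
```

In the real model the projections, multi-head attention, and normalization live inside the lightweight transformer decoder, which is the only component that is trained; both foundation models stay frozen.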

*Figure: FrozenSeg design.*

## Dependencies and Installation

See installation instructions.

## Getting Started

See Preparing Datasets.

See Getting Started.

## Models

| Method | A-150 PQ | A-150 mAP | A-150 mIoU | A-150 FWIoU | Cityscapes PQ | Cityscapes mAP | Cityscapes mIoU | Mapillary Vistas PQ | Mapillary Vistas mIoU | BDD100K PQ | BDD100K mIoU | A-847 mIoU | A-847 FWIoU | PC-459 mIoU | PC-459 FWIoU | PAS-21 mIoU | PAS-21 FWIoU | LVIS APr | COCO PQ | COCO mAP | COCO mIoU | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FrozenSeg (ResNet50x64) | 23.1 | 13.5 | 30.7 | 56.6 | 45.2 | 28.9 | 56.0 | 18.1 | 27.7 | 12.9 | 46.2 | 11.8 | 52.8 | 18.7 | 60.1 | 82.3 | 92.1 | 23.5 | 55.7 | 47.4 | 65.4 | checkpoint |
| FrozenSeg (ConvNeXt-Large) | 25.9 | 16.4 | 34.4 | 59.9 | 45.8 | 28.4 | 56.8 | 18.5 | 27.3 | 19.3 | 52.3 | 14.8 | 51.4 | 19.7 | 60.2 | 82.5 | 92.1 | 25.6 | 56.2 | 47.3 | 65.5 | checkpoint |

Models are trained on COCO panoptic data only (A-150 denotes ADE20K with 150 classes); all other benchmarks are evaluated zero-shot.
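The mask proposal ensemble mentioned in the abstract can be sketched as merging the decoder's proposals with SAM's and keeping only SAM masks that are not near-duplicates of an existing proposal. This is an illustrative IoU-based deduplication sketch; the function names and the threshold are assumptions, not the paper's exact strategy.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def ensemble_proposals(decoder_masks, sam_masks, iou_thresh=0.7):
    """Union the decoder's mask proposals with SAM's, dropping any SAM
    mask that near-duplicates an existing proposal (IoU >= threshold).
    Keeping the non-overlapping SAM masks is what raises recall."""
    merged = list(decoder_masks)
    for m in sam_masks:
        if all(mask_iou(m, d) < iou_thresh for d in merged):
            merged.append(m)
    return merged
```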

## Citing

If you use FrozenSeg in your research, please use the following BibTeX entry.

```bibtex
@misc{FrozenSeg,
  title={FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation},
  author={Xi Chen and Haosen Yang and Sheng Jin and Xiatian Zhu and Hongxun Yao},
  publisher={arXiv},
  year={2024}
}
```

## Acknowledgement

Detectron2, Mask2Former, Segment Anything, OpenCLIP and FC-CLIP.