IMPORTANT NOTE: Use the test_set.json file for evaluating performance.
We are releasing a Foundational FSOD challenge as part of the Workshop on Visual Perception and Learning in an Open World at CVPR 2024. We are accepting submissions till 7th June 2024!
Few-shot object detection (FSOD) benchmarks have advanced techniques for detecting new categories with limited annotations. Existing benchmarks repurpose well-established datasets like COCO by partitioning categories into base and novel classes for pre-training and fine-tuning, respectively. However, these benchmarks do not reflect how FSOD is deployed in practice. Rather than only pre-training on a small number of base categories, we argue that it is more practical to fine-tune a foundation model (e.g., a vision-language model (VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find that zero-shot inference from VLMs like GroundingDINO significantly outperforms the state-of-the-art (48.3 vs. 33.1 AP) on COCO. However, such zero-shot models can still be misaligned to target concepts of interest. For example, trailers on the web may be different from trailers in the context of autonomous vehicles. In this work, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external datasets and fine-tuned on K-shots per target class. Further, we note that current FSOD benchmarks are actually federated datasets containing exhaustive annotations for each category on a subset of the data. We leverage this insight to propose simple strategies for fine-tuning VLMs with federated losses. We demonstrate the effectiveness of our approach on LVIS and nuImages, improving over prior work by 5.9 AP.
See installation instructions.
Create models/ in the root directory and download the pre-trained model here.
python train_net.py --num-gpus 1 --config-file <config_path> --pred_all_class OUTPUT_DIR_PREFIX <root_output_dir>
- Naive Finetuning: configs/nuimages_cr/code_release_v2/naive_ft_shots10_seed_0.yaml
- FedLoss: configs/nuimages_cr/code_release_v2/fedloss_num_sample_cats_4_shots10_seed_0.yaml
- Inverse FedLoss: configs/nuimages_cr/code_release_v2/invfedloss_num_sample_cats_4_shots10_seed_0.yaml
- Pseudo-Negatives: configs/nuimages_cr/code_release_v2/pseudo_negatives_shots10_seed_0.yaml
- True Negatives Oracle: configs/nuimages_cr/code_release_v2/detfedloss_shots10_seed_0.yaml
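To run several of the configs listed above in sequence, the training command can be assembled programmatically. The following is a minimal sketch that only builds argument lists from the flags shown in the training command; the helper names `build_train_command` and `run_all` and the `outputs` directory are hypothetical and not part of the repository.

```python
import subprocess
import sys

CONFIG_DIR = "configs/nuimages_cr/code_release_v2"
CONFIGS = [
    "naive_ft_shots10_seed_0.yaml",
    "fedloss_num_sample_cats_4_shots10_seed_0.yaml",
    "invfedloss_num_sample_cats_4_shots10_seed_0.yaml",
    "pseudo_negatives_shots10_seed_0.yaml",
    "detfedloss_shots10_seed_0.yaml",
]


def build_train_command(config_path, output_dir, num_gpus=1):
    """Assemble the training command as an argument list (hypothetical helper)."""
    return [
        sys.executable, "train_net.py",
        "--num-gpus", str(num_gpus),
        "--config-file", config_path,
        "--pred_all_class",
        "OUTPUT_DIR_PREFIX", output_dir,
    ]


# "outputs" is an assumed root output directory; substitute your own.
commands = [build_train_command(f"{CONFIG_DIR}/{name}", "outputs") for name in CONFIGS]


def run_all(dry_run=True):
    """Print each command and, unless dry_run is set, launch it sequentially."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_all()` prints the five commands without starting training; `run_all(dry_run=False)` launches them one after another from the repository root.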
- DATASETS.TRAIN: Specify training split according to registered datasets.
- DATASETS.TEST: Specify testing split according to registered datasets.
- ROI_BOX_HEAD.FED_LOSS_NUM_CAT: Number of categories to be sampled for FedLoss.
- ROI_BOX_HEAD.USE_FED_LOSS: Flag to enable federated loss.
- ROI_BOX_HEAD.INVERSE_WEIGHTS: Flag to enable inverse frequency weights with federated loss.
- ROI_BOX_HEAD.ALL_ANN_FILE: Used in sampling strategy for FedLoss. If using with Pseudo-Negatives, use predictions from the teacher model.
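To make the sampling options above concrete, here is a small self-contained sketch of federated-loss category sampling. It is an illustration under stated assumptions, not the repository's implementation: the function name, the `power=0.5` exponent, and the rule that annotated categories are always kept are assumptions borrowed from common federated-loss setups.

```python
import random


def sample_fed_loss_categories(gt_classes, class_freq, num_sample_cats,
                               inverse_weights=False, power=0.5, rng=None):
    """Pick the categories that contribute to the classification loss.

    gt_classes: category ids annotated in the current batch (always kept).
    class_freq: per-category counts, indexed by category id.
    num_sample_cats: target number of categories (cf. FED_LOSS_NUM_CAT).
    inverse_weights: favor rare categories instead of frequent ones.
    """
    rng = rng or random.Random()
    positives = set(gt_classes)
    pool = [c for c in range(len(class_freq)) if c not in positives]

    # Sampling weight is freq**power, or its reciprocal when inverse_weights
    # is set. Zero-frequency categories keep weight 0 and are never sampled.
    weights = {}
    for c in pool:
        w = float(class_freq[c]) ** power
        weights[c] = (1.0 / w) if (inverse_weights and w > 0) else w

    chosen = set(positives)
    num_negatives = max(0, num_sample_cats - len(positives))
    # Weighted sampling without replacement, one category at a time.
    for _ in range(num_negatives):
        if not pool or sum(weights[c] for c in pool) <= 0:
            break
        pick = rng.choices(pool, weights=[weights[c] for c in pool], k=1)[0]
        chosen.add(pick)
        pool.remove(pick)
    return sorted(chosen)


if __name__ == "__main__":
    freq = [100, 10, 5, 1, 50]
    print(sample_fed_loss_categories([2], freq, 4, rng=random.Random(0)))
```

The loss would then be computed only over the returned categories, so unannotated categories that were not sampled are not penalized as negatives.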
- Train (or use an off-the-shelf pre-trained) teacher model.
- Make a new config to run inference on the FSOD trainset (e.g., set DATASETS.TEST to nuimages_fsod_train_seed_0_shots_10).
- Convert the generated predictions .pth file to the COCO format using tools/convert_preds_to_ann.py. Be sure to specify the confidence threshold to filter pseudo-labels. The predictions are saved in the directory corresponding to the trainset evaluation. See the sample command below:
python tools/convert_preds_to_ann.py --pred_path_train <path_trainset_eval_pth_file> --dataset_name nuimages_fsod_train_seed_0_shots_10 --conf_thresh 0.2
- Set ROI_BOX_HEAD.ALL_ANN_FILE to the generated predictions.
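The confidence-filtering step above can be pictured with a short sketch. This is not the code in tools/convert_preds_to_ann.py; it assumes predictions are already available as COCO-style detection results (dicts with `image_id`, `category_id`, `bbox` as `[x, y, width, height]`, and `score`), and the function name is hypothetical.

```python
def predictions_to_coco_annotations(predictions, conf_thresh=0.2):
    """Keep detections scoring at least conf_thresh and emit COCO-style annotations."""
    annotations = []
    for pred in predictions:
        if pred["score"] < conf_thresh:
            continue  # drop low-confidence pseudo-labels
        x, y, w, h = pred["bbox"]
        annotations.append({
            "id": len(annotations) + 1,  # annotation ids start at 1
            "image_id": pred["image_id"],
            "category_id": pred["category_id"],
            "bbox": [x, y, w, h],
            "area": w * h,
            "iscrowd": 0,
        })
    return annotations


if __name__ == "__main__":
    preds = [
        {"image_id": 1, "category_id": 3, "bbox": [10.0, 20.0, 30.0, 40.0], "score": 0.9},
        {"image_id": 1, "category_id": 5, "bbox": [0.0, 0.0, 5.0, 5.0], "score": 0.1},
    ]
    print(predictions_to_coco_annotations(preds, conf_thresh=0.2))
```

A higher threshold yields fewer but cleaner pseudo-labels; a lower one keeps more boxes at the cost of more false positives.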
python train_net.py --num-gpus 8 --config-file <config_path> --pred_all_class --eval-only MODEL.WEIGHTS <model_path> OUTPUT_DIR_PREFIX <root_output_dir>
- Code cleanup
- Release FSOD training files
- FIOD support and config
- FSOD Data split creation : nuImages along with new split
- Release trained model
- LVIS support in data and training models
We thank the authors of the following repositories for their open-source implementations which were used in building the current codebase:
If you find our paper and code repository useful, please cite us:
@article{madan2023revisiting,
title={Revisiting Few-Shot Object Detection with Vision-Language Models},
author={Madan, Anish and Peri, Neehar and Kong, Shu and Ramanan, Deva},
journal={arXiv preprint arXiv:2312.14494},
year={2023}
}