What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

By Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, and Xiaojuan Qi.

This is the official repository for our NeurIPS 2024 paper What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights.

What is this paper about

Severe data imbalance occurs naturally in web-scale vision-language datasets. Despite this, we find that CLIP pre-trained on such data is notably more robust to the imbalance than supervised learning, and is highly effective at learning generalizable representations.

To investigate the reasons behind this finding, we conduct controlled experiments on various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem in which only a subset of classes is present in each training step. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, none of which are accessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks.
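To make the "dynamic classification" idea above concrete, here is a minimal NumPy sketch of one way a supervised classifier could mimic it: the softmax is computed over the classes present in the batch plus a few randomly sampled negatives, rather than over the full label set. The function name `subsampled_cross_entropy` and the `vocab_size` parameter are hypothetical and chosen for illustration; this is a sketch under those assumptions, not necessarily how the code in exps_sup implements it.

```python
import numpy as np


def subsampled_cross_entropy(logits, labels, vocab_size, rng):
    """Cross-entropy over a random class subset that always contains the batch labels.

    Illustrative only (hypothetical helper, not the repository's API).
    logits: (batch, num_classes) float array; labels: (batch,) int array.
    """
    num_classes = logits.shape[1]
    # Classes that appear in this batch must stay in the vocabulary.
    present = np.unique(labels)
    # Fill the remaining slots with randomly drawn absent classes (negatives).
    absent = np.setdiff1d(np.arange(num_classes), present)
    n_extra = max(0, min(vocab_size - present.size, absent.size))
    extra = rng.choice(absent, size=n_extra, replace=False)
    vocab = np.sort(np.concatenate([present, extra]))
    # Map original labels to their positions inside the sub-vocabulary.
    remapped = np.searchsorted(vocab, labels)
    # Numerically stable log-softmax restricted to the sampled classes.
    sub = logits[:, vocab]
    sub = sub - sub.max(axis=1, keepdims=True)
    log_probs = sub - np.log(np.exp(sub).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(labels.size), remapped].mean()
    return loss, vocab
```

Because classes outside the sampled subset receive no gradient in a given step, frequent classes are not pushed down as negatives on every update, which is the intuition behind the implicit balancing described above.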

What is covered in this repo

Code for estimating concept frequency of image-text datasets → concept_freq_utils
Code for curating image-text dataset variants with controlled differences → data_utils
Code for training SL/CLIP models and reporting various per-class metrics → exps_sup and exps_clip
Code for training DINO variants and transferring to downstream tasks → exps_dino
Code for replicating analytical figures in the paper → fig_utils
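As a rough illustration of what estimating concept frequency can look like, the sketch below counts how many captions mention each concept name. It assumes simple case-insensitive whole-word matching and uses a hypothetical function name, `estimate_concept_frequency`; the actual matching rules in concept_freq_utils may well differ, so please refer to that subdirectory for the real procedure.

```python
import re
from collections import Counter


def estimate_concept_frequency(captions, concepts):
    """Count captions that mention each concept at least once.

    Illustrative only (hypothetical helper): matching is case-insensitive
    and restricted to whole words.
    """
    # Pre-compile one whole-word pattern per concept name.
    patterns = {
        c: re.compile(r"\b" + re.escape(c.lower()) + r"\b") for c in concepts
    }
    # Start every concept at zero so unseen concepts are still reported.
    counts = Counter({c: 0 for c in concepts})
    for caption in captions:
        text = caption.lower()
        for concept, pattern in patterns.items():
            if pattern.search(text):
                counts[concept] += 1
    return counts
```

Sorting the resulting counts in descending order gives the kind of long-tailed frequency curve the paper studies.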

How to use this repo

Clone the repo:

git clone https://github.com/CVMI-Lab/clip-beyond-tail.git && cd clip-beyond-tail
git submodule update --init --recursive  # Also necessary if we updated any submodules

Then please explore the subdirectories mentioned above for detailed instructions.

Citing this work

If you find this repo useful for your research, please consider citing our paper:

@inproceedings{wen2024generalization,
  title={What Makes {CLIP} More Robust to Long-Tailed Pre-Training Data? {A} Controlled Study for Transferable Insights},
  author={Wen, Xin and Zhao, Bingchen and Chen, Yilun and Pang, Jiangmiao and Qi, Xiaojuan},
  booktitle={Advances in Neural Information Processing Systems},
  volume={37},
  year={2024}
}

Acknowledgment

Our codebase is inspired by and builds upon several publicly available codebases. We express our gratitude to the authors of CLIP, open_clip, WaffleCLIP, MetaCLIP, imagenet-captions, eval-on-laion, detic, swav, dino, and ssl-transfer.

License

This project is licensed under the MIT License - see the LICENSE file for details.
