Recently, OpenAI released some of their larger CLIP models. In addition, OpenCLIP continues to release large models, which have proven to match or even outperform the OpenAI models.
Thanks to the compute provided by Stability.ai and laion.ai, we are happy to announce multilingual text encoders for these models, along with:
- Updated Inference & Training Code
- The Corresponding Machine-Translated Image Caption Dataset
- A PyPI Package (see the usage sketch below)
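For reference, here is a minimal usage sketch of the PyPI package. The module and checkpoint names below follow the released package and Huggingface model hub, but treat them as assumptions and check the repository for the current API; the matching image encoder is loaded separately from OpenAI CLIP or OpenCLIP.

```python
# pip install multilingual-clip
from multilingual_clip import pt_multilingual_clip
import transformers

model_name = 'M-CLIP/XLM-Roberta-Large-Vit-L-14'  # one of the released text encoders (assumed name)
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

texts = ['Three blind horses listening to Mozart.', 'Una casa roja junto al mar.']
embeddings = model.forward(texts, tokenizer)  # text embeddings in the CLIP image-text space
print(embeddings.shape)
```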
None of the M-CLIP models have been extensively evaluated yet, but testing them on Txt2Img retrieval on the human-translated MS-COCO dataset yields the following R@10 results:
Name | En | De | Es | Fr | Zh | It | Pl | Ko | Ru | Tr | Jp |
---|---|---|---|---|---|---|---|---|---|---|---|
OpenAI CLIP ViT-B/32 | 90.3 | - | - | - | - | - | - | - | - | - | - |
OpenAI CLIP ViT-L/14 | 91.8 | - | - | - | - | - | - | - | - | - | - |
OpenCLIP ViT-B/16+ | 94.3 | - | - | - | - | - | - | - | - | - | - |
LaBSE ViT-L/14 | 91.6 | 89.6 | 89.5 | 89.9 | 88.9 | 90.1 | 89.8 | 80.8 | 85.5 | 89.8 | 73.9 |
XLM-R Large ViT-B/32 | 91.8 | 88.7 | 89.1 | 89.4 | 89.3 | 89.8 | 91.4 | 82.1 | 86.1 | 88.8 | 81.0 |
XLM-R ViT-L/14 | 92.4 | 90.6 | 91.0 | 90.0 | 89.7 | 91.1 | 91.3 | 85.2 | 85.8 | 90.3 | 81.9 |
XLM-R Large ViT-B/16+ | 95.0 | 93.0 | 93.6 | 93.1 | 94.0 | 93.1 | 94.4 | 89.0 | 90.0 | 93.0 | 84.2 |
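The R@10 numbers report how often a caption's ground-truth image appears among the ten most similar images. Below is a minimal sketch of the metric, assuming one paired image per caption; the actual MS-COCO evaluation maps each of the five captions per image to the same image index.

```python
import torch
import torch.nn.functional as F

def recall_at_10(text_emb, image_emb):
    """Txt2Img R@10: fraction of captions whose paired image is among the 10
    most similar images by cosine similarity. Assumes caption i pairs with image i."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    sims = text_emb @ image_emb.T                       # (num_captions, num_images)
    top10 = sims.topk(10, dim=-1).indices               # 10 best image indices per caption
    targets = torch.arange(sims.size(0)).unsqueeze(1)   # ground-truth image index per caption
    return (top10 == targets).any(dim=-1).float().mean().item()
```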
To our surprise, using M-CLIP with XLM-RoBERTa Large outperforms the original English models even on English. Exactly why this is the case remains to be determined, and we plan to follow up with more extensive testing.
The ViT-L/14 model is integrated into clip retrieval, where you can test the retrieval capabilities of this multilingual encoder. It performs a search over 5 billion CLIP embeddings of the LAION-5B dataset, implemented with an efficient kNN index.
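As an illustration of this kind of retrieval (not the actual clip-retrieval backend, which uses a far larger approximate index), a small-scale kNN search with faiss might look like this:

```python
import faiss
import numpy as np

dim = 768                                                         # e.g. the ViT-L/14 embedding size
image_embeddings = np.random.rand(10_000, dim).astype('float32')  # placeholder image corpus
faiss.normalize_L2(image_embeddings)                              # L2-normalize so inner product = cosine

index = faiss.IndexFlatIP(dim)                                    # exact inner-product index
index.add(image_embeddings)

query = np.random.rand(1, dim).astype('float32')                  # a multilingual text embedding would go here
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)                             # the ten nearest images
```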
The training curves for these models can be found in the Weights and Biases report.
The English image captions were taken from the ViT-L-filtered captions of the CC3M+CC12M+SBU datasets, which are provided by the BLIP repository.
From these 14 million captions we sampled 7 million, divided them into 48 equally sized buckets, and translated each bucket into one of the 48 target languages. After translation we therefore still have a total of 7 million captions, with 7M/48 ≈ 145,833 captions per language (for example Dutch). The machine-translated captions are available at Huggingface.
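The sampling and bucketing step is straightforward; here is a rough sketch of the procedure (the file name and language list are placeholders, not the actual pipeline):

```python
import random

# Placeholder input: the 14M filtered English captions, one per line.
with open('blip_filtered_captions.txt') as f:
    captions = [line.strip() for line in f]

random.seed(0)
sampled = random.sample(captions, 7_000_000)

target_languages = ['de', 'es', 'fr', 'zh', 'ja', 'ko']  # stand-in; the real run used 48 languages
bucket_size = len(sampled) // len(target_languages)      # with 48 languages: 7M / 48 ≈ 145,833

buckets = {lang: sampled[i * bucket_size:(i + 1) * bucket_size]
           for i, lang in enumerate(target_languages)}
```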
Each translation was performed with the corresponding OPUS-MT model. For more information, see the machine translation instructions.
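Translating a bucket with one of these models can be done through the Hugging Face transformers Marian implementation; here is a minimal sketch for English-to-German (the exact models, batching, and settings we used are described in the linked instructions):

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-de'   # English -> German OPUS-MT model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

captions = ['A dog playing in the snow.', 'Two people riding bicycles along the beach.']
batch = tokenizer(captions, return_tensors='pt', padding=True, truncation=True)
generated = model.generate(**batch)
translations = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(translations)
```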
It should be noted that only translated captions were used during training; none of the original English captions were included. This means that the results for English (and for other languages not among the 48 target languages) are entirely due to transfer learning.
All released models used essentially the same hyperparameters. The details are available in the Weights and Biases project.
The following is a short list of the shared hyperparameters:
- Batch size of 2048 samples.
- Adam optimizer with a target learning rate of 10^-5 and a linear warmup schedule for 1k update steps.
- 5,000 randomly sampled validation samples.
All models were trained until the validation MSE loss had converged. For most models this took about 24 hours on 8 NVIDIA A100 GPUs. No early stopping was performed with respect to the image-text retrieval tasks.
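As a rough illustration of this setup, here is a minimal training-loop sketch using the hyperparameters above: the multilingual student encoder is trained to reproduce the frozen CLIP text encoder's embedding of the original English caption with an MSE loss. Names such as `student` and `loader` are placeholders, not the released training code.

```python
import torch

def train(student, loader, lr=1e-5, warmup_steps=1000, epochs=10):
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    # Linear warmup to the target learning rate over the first 1k updates, then constant.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    mse = torch.nn.MSELoss()

    for _ in range(epochs):
        for translated_captions, teacher_embeddings in loader:
            # teacher_embeddings: frozen CLIP text-encoder output for the English originals.
            predictions = student(translated_captions)
            loss = mse(predictions, teacher_embeddings)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
```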
In addition to the released models, we also ran experiments that yielded negative or inconclusive results. The training curves and specific settings for most of these additional experiments can be found in the Weights and Biases project.
The following is a summary of what we tried:
- Maximizing cosine similarity instead of minimizing the mean squared error: No noticeable performance difference.
- mBERT-base as encoder: Worse performance than LaBSE.
- USE-CML as encoder: Worse performance than LaBSE.
- Adding an additional tanh layer to XLM-R Large: No substantial performance difference, although it learned slightly faster at the start.
- Using the first ([CLS]-style) token as the sentence embedding instead of mean pooling for XLM-R Large: Significantly worse performance, perhaps due to the lack of a Next Sentence Prediction task in the RoBERTa pre-training. (See the mean-pooling sketch below.)
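For clarity, the mean pooling referred to above averages XLM-R's final-layer token embeddings while masking out padding; a minimal sketch (the function name is illustrative):

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    """Average the token embeddings of the last hidden state, ignoring padding,
    instead of taking only the first ([CLS]-style) token."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # zero out padded positions
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per sequence
    return summed / counts
```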