Batched multilingual caption generation using PaliGemma 3B! #7953
bghira started this conversation in Show and tell
Multilingual captioning with PaliGemma 3B
Motivation
The default code examples for the PaliGemma series are, I think, very fast but limited.
I wanted to see what these models are capable of, so I did a parameter sweep and tested various prompting strategies. The default code examples use `do_sample=False`, which greatly limits the versatility of the model.
One major strength that stood out was the ability of these models to translate their outputs.
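To make the sampling and translation points concrete, here is a minimal sketch of requesting captions in several languages in one batched call with sampling enabled. It is not the code from this post: the language list, the temperature and top-k values, and the helper names (`build_caption_prompts`, `sampling_kwargs`, `caption_image`) are illustrative assumptions. The `caption <lang>` prompt prefix follows the format documented for the mix checkpoints, and `model` and `processor` are assumed to be an already loaded `PaliGemmaForConditionalGeneration` and its matching `AutoProcessor`.

```python
from typing import Dict, List

# Assumed set of target languages; any code the checkpoint understands should work.
LANGUAGES = ["en", "es", "fr", "de", "ja"]


def build_caption_prompts(languages: List[str]) -> List[str]:
    """Build one 'caption <lang>' prompt per requested language."""
    return [f"caption {lang}" for lang in languages]


def sampling_kwargs(temperature: float = 0.7, top_k: int = 50,
                    max_new_tokens: int = 64) -> Dict[str, object]:
    """Generation settings with sampling on (values are illustrative, not tuned)."""
    return {
        "do_sample": True,
        "temperature": temperature,
        "top_k": top_k,
        "max_new_tokens": max_new_tokens,
    }


def caption_image(image, model, processor, languages=LANGUAGES) -> Dict[str, str]:
    """Caption one PIL image in every language with a single batched generate call."""
    import torch  # imported lazily so the helpers above work without torch installed

    prompts = build_caption_prompts(languages)
    # The same image is repeated once per prompt so the batch dimensions line up.
    inputs = processor(
        text=prompts,
        images=[image] * len(prompts),
        return_tensors="pt",
        padding="longest",
    ).to(model.device)
    prompt_len = inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        output = model.generate(**inputs, **sampling_kwargs())
    # Drop the prompt tokens and keep only the newly generated caption text.
    texts = processor.batch_decode(output[:, prompt_len:], skip_special_tokens=True)
    return dict(zip(languages, (text.strip() for text in texts)))
```

Because sampling is enabled, repeated calls on the same image can return different captions; switching `do_sample` back to `False` restores deterministic output.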
I've put together an example of batch inference for the `google/paligemma-3b-mix-224` model, which runs at the lower 224px resolution but takes about 9 seconds to produce 5 captions in various languages on an M3 Max with 128 GB of memory.
Usage example
This will scan any image subfolders in `/path/to/images` and write a parquet database to `/path/to/dataset/prefix.subfolder.parquet` for each subfolder.
It's a very basic example which doesn't reload the datasets if you close and re-run the file. However, it's a good starting point!
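The folder scan and per-subfolder output described above could be sketched roughly as follows. This is an assumption-laden outline rather than the script from this post: the extension list and the function names are hypothetical, and the output name is read as a directory plus a `prefix` joined to the subfolder name. Writing parquet through pandas also requires `pyarrow` or `fastparquet` to be installed.

```python
from pathlib import Path
from typing import Dict, List

# Assumed set of extensions to treat as images.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def find_image_subfolders(root: Path) -> Dict[str, List[Path]]:
    """Map each direct subfolder of root to its image files, skipping empty ones."""
    groups: Dict[str, List[Path]] = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        images = sorted(
            f for f in sub.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
        if images:
            groups[sub.name] = images
    return groups


def parquet_path(output_dir: Path, prefix: str, subfolder: str) -> Path:
    """Build the <output_dir>/<prefix>.<subfolder>.parquet destination."""
    return output_dir / f"{prefix}.{subfolder}.parquet"


def write_captions(rows: List[dict], destination: Path) -> None:
    """Write one row per caption record to a parquet file."""
    import pandas as pd  # imported lazily; needs pyarrow or fastparquet for parquet

    destination.parent.mkdir(parents=True, exist_ok=True)
    pd.DataFrame(rows).to_parquet(destination, index=False)
```

Each subfolder returned by `find_image_subfolders` would then be captioned and passed to `write_captions` with the path from `parquet_path`.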
Code
Test image
Results
Switching models
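Assuming the script exposes the checkpoint through a `--model_path` flag, as the options in this section suggest, a minimal sketch of that argument handling might look like the following. The default value and the `resolution_from_name` helper are assumptions added for illustration; the helper only reads the trailing number in the checkpoint name.

```python
import argparse
from typing import Optional

# Assumed default: the checkpoint this post's example was written for.
DEFAULT_MODEL = "google/paligemma-3b-mix-224"


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with a single --model_path option."""
    parser = argparse.ArgumentParser(description="Caption images with PaliGemma")
    parser.add_argument(
        "--model_path",
        default=DEFAULT_MODEL,
        help="Hugging Face model id or local path of the checkpoint to load",
    )
    return parser


def resolution_from_name(model_path: str) -> Optional[int]:
    """Return the trailing resolution in a checkpoint name (224, 448), if present."""
    suffix = model_path.rstrip("/").rsplit("-", 1)[-1]
    return int(suffix) if suffix.isdigit() else None


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.model_path, resolution_from_name(args.model_path))
```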
Other recommended models:
- `--model_path=google/paligemma-3b-mix-448` - same type of model, but higher resolution.
- `--model_path=google/paligemma-3b-pt-224` - base model, but versatile.
- `--model_path=google/paligemma-3b-pt-448` - higher resolution base model.
- `--model_path=google/paligemma-3b-ft-coco35l-448` - will only really output captions, but they match the COCO style.
Performance notes
If you need to go as fast as possible, remove the batched inputs and use a lower-resolution model.
Model quality
The finetuned task-specific models seem to be difficult to prompt and generally fail to reasonably caption images.