Batched multilingual caption generation using PaliGemma 3B! #7953

bghira · 2024-05-15T19:41:56Z

Multilingual captioning with PaliGemma 3B

Motivation

The default code examples for the PaliGemma series are very fast, but limited.

I wanted to see what these models were capable of, so I ran a parameter sweep and tested various prompting strategies. The default code examples use do_sample=False, which makes decoding greedy and greatly limits the versatility of the model.
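To make the effect of do_sample concrete, here is a toy, dependency-free sketch of greedy picking versus temperature / top-k / top-p sampling over a handful of made-up logits. It is a simplified illustration of the idea, not the transformers implementation; the function names and the logits are invented for this example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    # Roughly what do_sample=False does: always take the top-scoring token.
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    probs = softmax(logits, temperature)
    # Keep the top_k most likely tokens...
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_mass = sum(probs[i] for i in ranked)
    # ...then the smallest prefix whose renormalised mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / top_k_mass
        if cumulative >= top_p:
            break
    # Draw one token from what is left, weighted by probability.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
rng = random.Random(0)
picks = [sample_pick(toy_logits, rng=rng) for _ in range(200)]
print("greedy:", greedy_pick(toy_logits))
print("distinct sampled tokens:", sorted(set(picks)))
```

Greedy decoding returns the same token every time, while sampling spreads its choices over the tokens that survive the top-k and top-p filters. That is why repeated runs with do_sample=True can produce different captions for the same image.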

One major strength that stood out was the ability of these models to translate their outputs.

I've put together a batch inference example for the google/paligemma-3b-mix-224 model, which runs at the lower 224px resolution and takes about 9 seconds to produce 5 captions in various languages on an M3 Max 128G.

Usage example

python caption_with_gemma.py --input_dir /path/to/images --output_parquet /path/to/dataset/prefix

This will scan /path/to/images and any subfolders for .jpg and .png files, and write a Parquet file to /path/to/dataset/prefix.subfolder.parquet for each subfolder.
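The output naming can be sketched in isolation. The helper below mirrors the f-string used in the script; the function names are hypothetical and only exist for this illustration. It also shows one thing to watch for: a trailing slash on the directory gives an empty basename, which a normpath call would avoid.

```python
import os

def parquet_path_for(output_prefix, image_dir):
    # Same pattern as the script: <prefix>.<directory name>.parquet
    return f"{output_prefix}.{os.path.basename(image_dir)}.parquet"

def parquet_path_normalized(output_prefix, image_dir):
    # Assumption: stripping a trailing slash first is the desired behaviour.
    name = os.path.basename(os.path.normpath(image_dir))
    return f"{output_prefix}.{name}.parquet"

print(parquet_path_for("/path/to/dataset/prefix", "/path/to/images/cats"))
print(parquet_path_for("/path/to/dataset/prefix", "/path/to/images/cats/"))
print(parquet_path_normalized("/path/to/dataset/prefix", "/path/to/images/cats/"))
```

Because the script writes one file per directory it visits, the top-level input folder also gets its own file (for example prefix.images.parquet) holding any images that sit directly inside it.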

It's a very basic example: it doesn't pick up from existing output if you stop and re-run the script, so every run starts over. However, it's a good starting point!
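If resuming matters, one possible approach is to read the filenames already present in a previous Parquet file and skip them. This is only a sketch and not part of the script: both helper names are hypothetical, and it assumes the "filename" column that the script writes.

```python
import os

def load_done_filenames(parquet_path):
    """Return the filenames recorded by a previous run, or an empty set."""
    if not os.path.exists(parquet_path):
        return set()
    # Imported lazily: pandas is only needed when earlier output exists.
    import pandas as pd
    frame = pd.read_parquet(parquet_path, columns=["filename"])
    return set(frame["filename"])

def pending_files(filenames, done):
    # Keep the original order, dropping anything already captioned.
    return [name for name in filenames if name not in done]

done = load_done_filenames("/path/to/dataset/prefix.subfolder.parquet")
print(pending_files(["a.jpg", "b.png"], done))
```

A resumed run would also need to combine the old records with the new ones before writing, since to_parquet overwrites the file rather than appending to it.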

Code

import os
import logging
import argparse
import requests
from PIL import Image
from tqdm import tqdm
import pandas as pd
import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

logger = logging.getLogger("Captioner")

# Load the PaliGemma model and its processor.
# Note: --precision is parsed but not applied here; the model is always cast to
# float32 and stays on the default device (CPU).
def load_pali_gemma_model(args):
    model_id = args.model_path
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).to(torch.float32).eval()
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor

def generate_caption_with_pali_gemma(image_path, processor, model, query_strings, do_sample=True, temperature=0.7):
    # Accept either a URL or a local file path.
    if image_path.startswith(("http://", "https://")):
        image = Image.open(requests.get(image_path, stream=True).raw)
    else:
        image = Image.open(image_path)

    # Batch the prompts: the same image is paired with every query string.
    model_inputs = processor(text=query_strings, images=[image] * len(query_strings), return_tensors="pt")
    input_len = model_inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        # temperature, top_p and top_k only take effect when do_sample=True.
        generation = model.generate(
            **model_inputs,
            max_new_tokens=100,
            do_sample=do_sample,
            temperature=temperature,
            top_p=0.9,
            top_k=50,
        )
        # Drop the prompt tokens and decode only the newly generated text.
        outputs = []
        for _generation in generation:
            decoded = processor.decode(_generation[input_len:], skip_special_tokens=True)
            outputs.append(decoded)
    return outputs

def process_and_evaluate_image(args, image_path, model, processor):
    # One prompt per target language; the whole list is sent as a single batch.
    query_strings = [
        "caption en",
        "caption es",
        "caption hi",
        "caption de",
        "caption fr"
    ]

    # Returns a list with one caption per prompt, in the same order as query_strings.
    captions = generate_caption_with_pali_gemma(image_path, processor, model, query_strings, do_sample=True, temperature=0.9)
    print(f"Captions: {captions}\n")
    return captions

def process_directory(args, image_dir, output_parquet, model, processor):
    records = []
    # One Parquet file per directory: <output_parquet>.<directory name>.parquet
    parquet_path = f"{output_parquet}.{os.path.basename(image_dir)}.parquet"
    print(f"Parquet: {parquet_path}")
    for filename in tqdm(os.listdir(image_dir), desc="Processing Images"):
        full_filepath = os.path.join(image_dir, filename)
        if os.path.isdir(full_filepath):
            # Subdirectories are processed recursively and get their own Parquet file.
            logger.info(f"Found directory to traverse: {full_filepath}")
            process_directory(args, full_filepath, output_parquet, model, processor)
        elif filename.lower().endswith((".jpg", ".png")):
            try:
                logger.info(f"Attempting to load image: {filename}")
                captions = process_and_evaluate_image(args, full_filepath, model, processor)
                logger.info(f"Captions for {filename}: {captions}")

                # Store the encoded file contents so the image can be decoded again later
                # (raw pixel data alone does not record the image size or mode).
                with open(full_filepath, "rb") as img_file:
                    image_bytes = img_file.read()

                records.append({
                    "filename": filename,
                    "caption": captions,
                    "image": image_bytes
                })

            except Exception as e:
                import traceback
                logger.error(f"Error processing {filename}: {str(e)}, traceback: {traceback.format_exc()}")
                # A CUDA error usually leaves the process unusable, so stop here.
                if "CUDA error" in str(e):
                    import sys
                    sys.exit(1)

    df = pd.DataFrame(records)
    df.to_parquet(parquet_path, engine="pyarrow")
    logger.info(f"Processed Parquet file saved to {parquet_path}")

def parse_args():
    parser = argparse.ArgumentParser(description="Process images and generate captions.")
    parser.add_argument("--input_dir", type=str, required=True, help="Directory containing the images.")
    parser.add_argument("--output_parquet", type=str, required=True, help="Path to the output Parquet dataset.")
    parser.add_argument("--precision", type=str, choices=["bf16", "fp16"], default="fp16", help=("Precision for loading the model. Default: fp16"))
    parser.add_argument("--model_path", type=str, default="google/paligemma-3b-mix-224", help=("Model path to load. Default: google/paligemma-3b-mix-224"))

    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    logging.basicConfig(level=logging.INFO)

    model, processor = load_pali_gemma_model(args)
    process_directory(args, args.input_dir, args.output_parquet, model, processor)

if __name__ == "__main__":
    main()

Test image

Results

In the image there is a red flower to a plant and around the plant there are many leaves.
Una flor roja que está encima de una planta.
इस पेड़ का फूल है ।
Eine nahaufnahme eines roten blühenden baumes mit grünem laub.
Une fleur rouge à l' extérieur à côté de quelques feuilles vertes.
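The script returns the captions as a plain list in prompt order, so they can be paired back up with their language codes. A minimal sketch, assuming every prompt has the "caption <lang>" form used above; the helper name is hypothetical.

```python
def captions_by_language(query_strings, captions):
    # The generated captions come back in the same order as the prompts.
    if len(query_strings) != len(captions):
        raise ValueError("expected exactly one caption per prompt")
    # "caption en" -> "en"
    return {query.split()[-1]: caption.strip()
            for query, caption in zip(query_strings, captions)}

prompts = ["caption en", "caption es"]
outputs = [
    "In the image there is a red flower to a plant and around the plant there are many leaves.",
    "Una flor roja que está encima de una planta.",
]
labelled = captions_by_language(prompts, outputs)
print(labelled["es"])
```

Storing a mapping like this (or one column per language) instead of a bare list makes the Parquet output easier to filter by language later.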

Switching models

Other recommended models:

--model_path=google/paligemma-3b-mix-448 - same type of model, but higher resolution.
--model_path=google/paligemma-3b-pt-224 - base model, but versatile.
--model_path=google/paligemma-3b-pt-448 - higher-resolution base model.
--model_path=google/paligemma-3b-ft-coco35l-448 - will only really output captions, but they match the COCO style.

Performance notes

If you need to go as fast as possible, send a single prompt per image instead of the batch of five, and use a lower-resolution (224px) model.

Model quality

The finetuned task-specific models seem to be difficult to prompt and generally fail to reasonably caption images.
