This repository provides an implementation that combines the Blip Image Captioning model with a fine-tuned GPT-2 model for generating happy responses to the derived captions. The process involves utilizing the Salesforce Blip Image Captioning model to generate image captions and then passing those captions through the trained GPT-2 Happy Model.
To run the code in this repository, the following dependencies are required:
- Python 3.x
- Transformers library
- pandas
- numpy
- tqdm
- PIL (Python Imaging Library)
Please ensure that these dependencies are installed before proceeding.
The provided code includes preprocessing steps to prepare the data for training the GPT-2 Happy Model (a sketch of these helpers follows the list):

- Cleaning the data: the `clean` function reads a CSV file (`cleaned_hm.csv`) containing the happy moments data. It removes any rows with missing values in the 'cleaned_hm' column and drops duplicate entries. The cleaned data is saved as `happy_moments.csv`.
- Splitting the data: the `split` function reads the cleaned data from `happy_moments.csv` and splits it into training and test datasets. The data is randomly sampled, with 70% used for training and the remaining 30% for testing. The train and test datasets are saved as `train.csv` and `test.csv`, respectively.
- Tokenizing and saving the data: the `tokenize_save` function tokenizes the text data in the train and test datasets using the GPT-2 tokenizer. The tokenized data is saved as `train_mod.txt` and `test_mod.txt`.
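A minimal sketch of how these preprocessing helpers might look, based on the description above. The column name `cleaned_hm`, the file names, the random seed, and especially the way the tokenized text is serialized to `train_mod.txt`/`test_mod.txt` are assumptions; the repository's actual implementation may differ in these details.

```python
import pandas as pd
from transformers import AutoTokenizer


def clean():
    # Read the raw happy moments data, drop rows without a 'cleaned_hm'
    # value, remove duplicates, and save the cleaned result.
    df = pd.read_csv("cleaned_hm.csv")
    df = df.dropna(subset=["cleaned_hm"]).drop_duplicates()
    df.to_csv("happy_moments.csv", index=False)


def split():
    # Randomly sample 70% of the cleaned data for training and keep
    # the remaining 30% for testing.
    df = pd.read_csv("happy_moments.csv")
    train = df.sample(frac=0.7, random_state=42)  # seed is an assumption
    test = df.drop(train.index)
    train.to_csv("train.csv", index=False)
    test.to_csv("test.csv", index=False)


def tokenize_save():
    # Tokenize each happy moment with the GPT-2 tokenizer and write the
    # examples back out as text separated by the end-of-text token, so the
    # training script can build a TextDataset from the files.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    for src, dst in [("train.csv", "train_mod.txt"), ("test.csv", "test_mod.txt")]:
        texts = pd.read_csv(src)["cleaned_hm"].astype(str).tolist()
        with open(dst, "w", encoding="utf-8") as f:
            for text in texts:
                token_ids = tokenizer.encode(text)  # tokenize the example
                f.write(tokenizer.decode(token_ids) + tokenizer.eos_token + "\n")
```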
To train the GPT-2 Happy Model, follow these steps (see the sketch after this list):

- Load the necessary dependencies, including `AutoTokenizer`, `AutoModelWithLMHead`, `Trainer`, `TrainingArguments`, `DataCollatorForLanguageModeling`, and `TextDataset` from the Transformers library.
- Define the `load_dataset` function, which loads the tokenized train and test datasets using the GPT-2 tokenizer. It returns the train dataset, test dataset, and data collator required for language modeling.
- Define the `train_setup` function, which sets up the training configuration for the GPT-2 model. It initializes the GPT-2 model, loads the train and test datasets, and specifies the training arguments such as the output directory, batch sizes, logging steps, and more. It returns the trainer object.
- Instantiate the trainer by calling the `train_setup` function.
- Execute the training process by running `trainer.train()`. The model will be trained for the specified number of epochs using the provided training arguments.
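A hedged sketch of what `load_dataset` and `train_setup` could look like, following the description above. The block size, batch sizes, logging/save steps, epoch count, and output directory name are illustrative assumptions; note also that `AutoModelWithLMHead` and `TextDataset` are deprecated in recent Transformers releases in favour of `AutoModelForCausalLM` and the `datasets` library.

```python
from transformers import (
    AutoTokenizer,
    AutoModelWithLMHead,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    TextDataset,
)


def load_dataset(train_path, test_path, tokenizer):
    # Build TextDataset objects from the tokenized text files and a
    # collator that prepares causal-LM batches (mlm=False).
    train_dataset = TextDataset(tokenizer=tokenizer, file_path=train_path, block_size=128)
    test_dataset = TextDataset(tokenizer=tokenizer, file_path=test_path, block_size=128)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    return train_dataset, test_dataset, data_collator


def train_setup(train_path="train_mod.txt", test_path="test_mod.txt"):
    # Initialize GPT-2, load the datasets, and wrap everything in a Trainer.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelWithLMHead.from_pretrained("gpt2")
    train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)

    training_args = TrainingArguments(
        output_dir="./gpt2-happy",        # output directory (assumed name)
        num_train_epochs=3,               # illustrative value
        per_device_train_batch_size=8,    # illustrative value
        per_device_eval_batch_size=8,     # illustrative value
        logging_steps=100,                # illustrative value
        save_steps=500,
    )
    return Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )


trainer = train_setup()
trainer.train()
```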
After training the GPT-2 Happy Model, you can generate happy responses to derived captions using the Blip Image Captioning model. Here's how it works (a sketch of the pipeline follows the list):

- Load the necessary dependencies, including `BlipProcessor`, `BlipForConditionalGeneration`, `AutoTokenizer`, and `AutoModelWithLMHead`.
- Instantiate the `BlipProcessor` and `BlipForConditionalGeneration` models using the pre-trained weights from the "Salesforce/blip-image-captioning-large" checkpoint.
- Specify the URL of the image for which you want to generate captions.
- Open and convert the image from the URL to RGB format using `PIL.Image`.
- Perform conditional image captioning by providing a partial caption and tokenizing the image and text using the `BlipProcessor`. Generate the caption using the `BlipForConditionalGeneration` model's `generate` method.
- Pass the generated caption through the trained GPT-2 Happy Model to obtain a happy response. This involves tokenizing the caption using the GPT-2 tokenizer and using the GPT-2 model to generate the response.
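A hedged sketch of the captioning-plus-response pipeline described above. The image URL, the partial caption prompt, the path to the fine-tuned GPT-2 Happy Model checkpoint, and the generation settings are placeholders/assumptions.

```python
import requests
from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    AutoTokenizer,
    AutoModelWithLMHead,
)

# Load the BLIP captioning processor and model from the pre-trained checkpoint.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

# Fetch the image and convert it to RGB (the URL is a placeholder).
img_url = "https://example.com/image.jpg"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# Conditional image captioning: seed the caption with a partial prompt.
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")
out = blip_model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)

# Pass the caption through the fine-tuned GPT-2 Happy Model
# (the checkpoint directory name is an assumption).
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
happy_model = AutoModelWithLMHead.from_pretrained("./gpt2-happy")

input_ids = gpt2_tokenizer.encode(caption, return_tensors="pt")
output_ids = happy_model.generate(
    input_ids,
    max_length=60,
    do_sample=True,
    top_p=0.95,
    pad_token_id=gpt2_tokenizer.eos_token_id,
)
print(gpt2_tokenizer.decode(output_ids[0], skip_special_tokens=True))
```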
By combining the Blip Image Captioning model and the GPT-2 Happy Model, you can generate happy responses based on derived captions from images. This repository provides the necessary code and preprocessing steps to perform this task.
- J. Li, D. Li, C. Xiong, and S. Hoi, ‘BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation’. arXiv, 2022.