"A picture is worth a thousand words" is an idiom that conveys its meaning as 'Seeing something is better for learning than having it described'. But what if you see something(image/scene) for the first time and your brain can't analyze what is it?
Don't worry! An automatic AI model which generators caption or explain the scene is all you need to analyze smth you see for the first time.
This project is all about generating captions: features are extracted from the images, and the model predicts a caption for each one.
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. It consists of 328K images. The 2017 training/validation split is 118K/5K images, and the test set is a further subset of 41K images. The dataset has captioning annotations: natural-language descriptions of the images. You can download the dataset here: https://cocodataset.org/#download
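For reference, the caption annotations ship as a JSON file. A minimal sketch for grouping captions by image (assuming the standard 2017 annotation file name and extraction layout, which may differ from this repo's setup):

```python
import json
from collections import defaultdict

# Path is an assumption -- adjust to wherever the COCO annotations were extracted.
ANNOTATION_FILE = "annotations/captions_train2017.json"

with open(ANNOTATION_FILE, "r") as f:
    annotations = json.load(f)

# Group the natural-language captions by image id.
captions_by_image = defaultdict(list)
for ann in annotations["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

# Map image ids to file names so features can be matched to captions later.
image_files = {img["id"]: img["file_name"] for img in annotations["images"]}

print(f"{len(image_files)} images, "
      f"{sum(len(c) for c in captions_by_image.values())} captions")
```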
ResNet50, pretrained for image classification, is used as the encoder to extract image features.
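A minimal sketch of the feature-extraction step, assuming TensorFlow/Keras and the standard ImageNet-pretrained ResNet50 (input size and preprocessing here are assumptions and may differ from the actual pipeline):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# ResNet50 pretrained on ImageNet; the classification head is dropped so the
# network acts purely as a feature extractor. Keeping the 7x7x2048 output of
# the last convolutional block lets the attention mechanism attend over the
# 49 spatial locations.
feature_extractor = ResNet50(include_top=False, weights="imagenet")

def extract_features(image_path):
    """Return a (49, 2048) feature map for a single image."""
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)
    x = preprocess_input(x[np.newaxis, ...])                 # (1, 224, 224, 3)
    features = feature_extractor(x)                          # (1, 7, 7, 2048)
    return tf.reshape(features, (-1, features.shape[-1]))    # (49, 2048)

# Example usage (path is illustrative):
# feats = extract_features("train2017/000000000009.jpg")
```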
An attention-based mechanism is used for caption generation. I have used Bahdanau (additive) attention: at each decoding step the attention mechanism focuses on the relevant parts of the image, so the decoder uses only specific regions of the feature map.
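A minimal sketch of an additive (Bahdanau-style) attention layer in TensorFlow/Keras, shown for illustration; the layer names, unit sizes, and feature shapes are assumptions, not the exact implementation in this project:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive (Bahdanau-style) attention over the image feature map."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects image features
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # scores each spatial location

    def call(self, features, hidden):
        # features: (batch, 49, 2048) -- encoder feature map
        # hidden:   (batch, units)    -- current decoder state
        hidden_with_time_axis = tf.expand_dims(hidden, 1)         # (batch, 1, units)
        score = self.V(tf.nn.tanh(
            self.W1(features) + self.W2(hidden_with_time_axis)))  # (batch, 49, 1)
        attention_weights = tf.nn.softmax(score, axis=1)          # over the 49 locations
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights
```

The context vector is then concatenated with the current word embedding and fed to the decoder, while the attention weights show which image regions the model looked at for each generated word.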
This is a Phase 1 model; in later phases I will improve it using Transformer-based architectures or a GPT-3 model.