A TensorFlow 2.0 + Keras implementation of image captioning, trained on the MS COCO dataset.
Image captioning is the task of generating a sequence of words that describes an encoded image. It is done through the well-known sequence-to-sequence architecture: an encoder (usually a pretrained CNN model) encodes the image, then an RNN decoder outputs a word at each of its steps.
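To make that wiring concrete, here is a minimal TF 2 / Keras sketch of such an encoder-decoder (not the repo's exact code): the layer sizes are illustrative, and a GRU stands in for the repo's LSTM only because its single state vector can be initialised directly from the image features.

```python
import tensorflow as tf

# Illustrative sizes; the notebooks' configuration cell sets the real values.
VOCAB_SIZE, EMBED_DIM, UNITS = 5000, 256, 512

# Encoder: a pretrained CNN collapses the image into one feature vector.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = False
features = tf.keras.layers.Dense(UNITS, activation="relu")(base.output)
encoder = tf.keras.Model(base.input, features)

# Decoder: embedding + RNN that predicts the next word at every step,
# conditioned on the image via the RNN's initial state.
tokens = tf.keras.Input(shape=(None,), dtype="int32")   # caption token ids so far
img_feat = tf.keras.Input(shape=(UNITS,))               # encoder output
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)
x = tf.keras.layers.GRU(UNITS, return_sequences=True)(x, initial_state=img_feat)
logits = tf.keras.layers.Dense(VOCAB_SIZE)(x)           # one word distribution per step
decoder = tf.keras.Model([tokens, img_feat], logits)
```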
I found this Google documentation quite useful when coding both variants:
You can either run image_captioning_train.ipynb to train the whole model from scratch, with the option of changing the hyperparameters, or image_captioing.ipynb to caption any image by giving its path. If you use the latter, make sure you have 'Encoder.hdf5', 'Decoder.hdf5' and 'tokenizer.pickle' downloaded. The encoder and decoder can be found under Releases: for the model with attention, get the files tagged v2.0; otherwise, get the ones tagged v1.0.
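For prediction, loading those three files could look roughly like the sketch below. This is a hypothetical snippet, not the notebook's exact code: the file names match the release assets listed above, but models containing custom layers (e.g. the attention variant) may additionally need `custom_objects` when loading.

```python
import pickle
import tensorflow as tf

# Load the saved encoder/decoder weights from the release assets.
encoder = tf.keras.models.load_model("Encoder.hdf5", compile=False)
decoder = tf.keras.models.load_model("Decoder.hdf5", compile=False)

# Load the tokenizer that maps words to the ids the decoder was trained on.
with open("tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)

# From here, captioning is a loop: encode the image once, then repeatedly feed
# the decoder its own previous word until an end token (or a length limit) is hit.
```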
There are two variants of the model; both use the seq2seq architecture with teacher forcing. The difference is that one uses an attention mechanism while the other doesn't. You should go for the variant with attention in almost all cases, as it is faster, smaller and more accurate. The other variant can be treated as a tutorial for those seeking a basic understanding of how an image captioning model is implemented; this simpler architecture is then built upon by the variant with attention. Encoding is done with inception_v3. Feel free to change any of the parameters found in the configuration cell, such as LSTM_size, encoding_size, or even the number of samples the network trains on.
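The two moving parts worth spelling out are the attention mechanism and teacher forcing. Below is a sketch of standard additive (Bahdanau-style) attention over the spatial inception_v3 feature grid, in the spirit of the Google/TensorFlow tutorial referenced above; the class name, sizes and the `decoder_step` helper in the comments are illustrative assumptions, not the repo's exact code.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention over the flattened spatial grid of CNN features."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects image features
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # scalar score per spatial location

    def call(self, features, hidden):
        # features: (batch, locations, encoding_size), hidden: (batch, LSTM_size)
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        weights = tf.nn.softmax(scores, axis=1)               # where to look
        context = tf.reduce_sum(weights * features, axis=1)   # weighted feature sum
        return context, weights

# Teacher forcing (training only, sketched): feed the ground-truth previous word
# at every step instead of the decoder's own prediction.
#   for t in range(1, captions.shape[1]):
#       context, _ = attention(features, hidden)
#       logits, hidden = decoder_step(captions[:, t - 1], context, hidden)  # hypothetical helper
#       loss += loss_fn(captions[:, t], logits)
```

At prediction time there is no ground truth, so the decoder's own previous output is fed back in instead.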
You can get the notebooks uploaded here via Google Colab through these links:
- Without attention training notebook
- Without attention prediction notebook
- With attention training notebook