This repository contains the code for my MS thesis.
Fig. 1. Example output of our model. At test time, given a sentence S and an image I, the model is required to ground noun phrases in S, each with the most likely proposal among proposals extracted by an object detector on I.
The model is built with Python 3 and PyTorch.
We address the problem of phrase grounding, i.e. the task of locating the content of the image referenced by the sentence, by using weak supervision. Phrase grounding is a challenging problem that requires joint understanding of both visual and textual modalities, while being an important application in many field of study such as visual question answering, image retrieval and robotic navigation. We propose a simple model that leverages on concept similarity, i.e. the similarity between a concept in phrases and the proposal bounding boxes label. We apply such measure as a prior on our model prediction. Then the model is trained to maximize multimodal similarity between an image and a sentence describing that image, while minimizing instead the multimodal similarity between the image and a sentence not describing the image. Our experiments shows comparable performance with respect to State-of-the-Art works.
Model parameters and workflow are settable from the command line interface, run
python main.py --help
to show the help message.
Go to the requirement section in order to install all dependencies and see data section to prepare datasets.
Some examples:
- Train the model
python main.py \
--use-wandb \
--workflow train \
--log-level 10 \
--num-workers 8 \
--prefetch-factor 2 \
--suffix awesome \
--n-epochs 30 \
--n-box 100 \
--learning-rate 0.001 \
--use-spell-correction \
--word-embedding w2v \
--localization-strategy max \
--text-embedding-size 300 \
--text-recurrent-network-type lstm \
--text-semantic-size 1000 \
--text-semantic-num-layers 1 \
--image-embedding-size 2053 \
--image-projection-net mlp \
--image-projection-size 1000 \
--image-projection-hidden-layers 0 \
--apply-concept-similarity-strategy mean \
--apply-concept-similarity-weight 0.5
- Restore a trained model and test it
python main.py \
--use-wandb \
--workflow test \
--log-level 10 \
--num-workers 8 \
--prefetch-factor 2 \
--n-epochs 30 \
--n-box 100 \
--learning-rate 0.001 \
--use-spell-correction \
--word-embedding w2v \
--localization-strategy max \
--text-embedding-size 300 \
--text-recurrent-network-type lstm \
--text-semantic-size 1000 \
--text-semantic-num-layers 1 \
--image-embedding-size 2053 \
--image-projection-net mlp \
--image-projection-size 1000 \
--image-projection-hidden-layers 0 \
--apply-concept-similarity-strategy mean \
--apply-concept-similarity-weight 0.5 \
--restore /path/to/model_awesome_17.pth
conda env create -n weakvtg -f environment.yml
- Download spaCy resources
python -m spacy download en_core_web_sm
Note: the first execution may require some time for the downloading of GloVe or Word2Vec pretrained word embeddings.
Note: depending on your system, you may need to install PyTorch compiled for CPU-only.
In order to run the model, you need to download ReferIt and Flickr30k Entities datasets.
The model loads data from the data
folder, so I suggest creating two local
folder (e.g., referit_data
and flickr30k_data
) and symlink one of the
two: ln -s referit_data data
.
ReferIt
The ReferIt data folder MUST be structured as follows:
referit_data
|-- attributes_vocab.txt
|-- objects_vocab.txt
|-- refer
| |-- data
| |-- evaluation
| |-- external
| |-- LICENSE
| |-- Makefile
| |-- pyEvalDemo.ipynb
| |-- pyReferDemo.ipynb
| |-- README.md
| |-- refer.py
| |-- setup.py
| `-- test
`-- referit_raw
|-- out_bu
|-- out_ewiser
|-- preprocessed
|-- test.txt
|-- train.txt
|-- val.txt
|-- vocab.json
|-- vocab_yago.json
`-- yago_align.json
where refer
is the exact clone of ReferIt
while referit_raw
contains our preprocessed data.
Flickr30k Entities
The Flickr30k Entities data folder MUST be structured as follows:
flickr30k_data
|-- attributes_vocab.txt
|-- flickr30k
| |-- flickr30k_entities
| `-- flickr30k_images
|-- flickr30k_raw
| |-- out_bu
| |-- out_ewiser
| |-- out_ewiser_queries
| |-- preprocessed
| |-- preprocessed3
| |-- vocab.json
| |-- vocab_yago.json
| `-- yago_align.json
|-- objects_vocab.txt
`-- relations_vocab.txt
where flickr30k
folder Flickr's annotations and images. Annotations can be
found at Annotations.zip
on Flickr30k Entities repository, while images can be downloaded from
original Flickr30k dataset.
Preprocessed Data (both dataset)
The code used for data preprocessing can be found at VTKEL-solver (see the paper).
NOTE: at time of writing, the VTKEL-solver repository is not yet public. Please if you need our preprocessing pipeline please write me at luca.parolari23@gmail.com.
- master-thesis, thesis dissertation (LaTeX source code + artifacts).
- master-thesis-presentation, thesis presentation + talk (LaTex source code + artifacts).
- master-thesis-report, quasi-final thesis report (LaTeX source code + artifacts).
- master-thesis-log, contains scripts, notebooks, notes, todos and logs about the thesis.
Luca Parolari
- Email: luca.parolari23@gmail.com
- GitHub: @lparolari
- Telegram: @lparolari
Honor mention:
Davide Rigoni
- Email: davide.rigoni.2@phd.unipd.it
- GitHub: @drigoni
- Website: www.drigoni.it
MIT