neural-vqa

This is an experimental Torch implementation of the VIS+LSTM visual question answering model from the paper Exploring Models and Data for Image Question Answering by Mengye Ren, Ryan Kiros and Richard Zemel.
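As a rough illustration of the VIS+LSTM idea as described in the paper (the image is projected into the word-embedding space and fed to the LSTM as if it were the first word of the question, with a softmax over answers at the last timestep), here is a minimal pure-Python sketch. It is not the code in this repository, which is written in Lua/Torch. All function names, parameter names, the toy sizes and the random weights are assumptions for illustration, and biases on the image projection and output layer are omitted for brevity.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps the concatenation [x; h] to four stacked gates."""
    n = len(h)
    z = [a + bias for a, bias in zip(matvec(W, x + h), b)]
    i = [sigmoid(v) for v in z[:n]]            # input gate
    f = [sigmoid(v) for v in z[n:2 * n]]       # forget gate
    o = [sigmoid(v) for v in z[2 * n:3 * n]]   # output gate
    g = [math.tanh(v) for v in z[3 * n:]]      # candidate cell values
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

def random_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def init_params(rng, feat_dim, vocab_size, embed_dim, rnn_size, num_answers):
    """Hypothetical parameter layout with random, untrained weights."""
    return {
        "W_img": random_matrix(rng, embed_dim, feat_dim),
        "embed": random_matrix(rng, vocab_size, embed_dim),
        "W_lstm": random_matrix(rng, 4 * rnn_size, embed_dim + rnn_size),
        "b_lstm": [0.0] * (4 * rnn_size),
        "W_out": random_matrix(rng, num_answers, rnn_size),
    }

def vis_lstm_forward(image_feat, question_ids, p):
    """Return a probability distribution over candidate answers."""
    n = len(p["W_out"][0])
    h, c = [0.0] * n, [0.0] * n
    # The projected image acts as the first "word" of the sequence.
    inputs = [matvec(p["W_img"], image_feat)]
    inputs += [p["embed"][t] for t in question_ids]
    for x in inputs:
        h, c = lstm_step(x, h, c, p["W_lstm"], p["b_lstm"])
    logits = matvec(p["W_out"], h)
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes only; real fc7 features have 4096 dimensions.
rng = random.Random(0)
params = init_params(rng, feat_dim=8, vocab_size=10, embed_dim=4,
                     rnn_size=4, num_answers=3)
image = [rng.random() for _ in range(8)]
probs = vis_lstm_forward(image, [1, 5, 2], params)
```

With trained weights, the index of the largest entry in probs would be looked up in the answer vocabulary to produce the predicted answer.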

Setup

Requirements:

Torch
loadcaffe

Download the MSCOCO train+val images and VQA data using sh data/download_data.sh. Extract all the downloaded zip files inside the data folder.

unzip Annotations_Train_mscoco.zip
unzip Questions_Train_mscoco.zip
unzip train2014.zip

unzip Annotations_Val_mscoco.zip
unzip Questions_Val_mscoco.zip
unzip val2014.zip

If you have already downloaded them, copy the train2014 and val2014 image folders and the VQA JSON files into the data folder.

Download the VGG-19 Caffe model and prototxt using sh models/download_models.sh.

Known issues

To avoid memory issues with LuaJIT, install Torch with Lua 5.1 (TORCH_LUA_VERSION=LUA51 ./install.sh). More instructions here.
If working with plain Lua, luaffifb may be needed for loadcaffe, unless using pre-extracted fc7 features.

Usage

Extract image features

th extract_fc7.lua -split train
th extract_fc7.lua -split val

Options

batch_size: Batch size. Default is 10.
split: train/val. Default is train.
gpuid: 0-indexed id of GPU to use. Default is -1 = CPU.
proto_file: Path to the deploy.prototxt file for the VGG Caffe model. Default is models/VGG_ILSVRC_19_layers_deploy.prototxt.
model_file: Path to the .caffemodel file for the VGG Caffe model. Default is models/VGG_ILSVRC_19_layers.caffemodel.
data_dir: Data directory. Default is data.
feat_layer: Layer to extract features from. Default is fc7.
input_image_dir: Image directory. Default is data.

Training

th train.lua

Options

rnn_size: Size of LSTM internal state. Default is 512.
num_layers: Number of layers in the LSTM.
embedding_size: Size of word embeddings. Default is 512.
learning_rate: Learning rate. Default is 4e-4.
learning_rate_decay: Learning rate decay factor. Default is 0.95.
learning_rate_decay_after: In number of epochs, when to start decaying the learning rate. Default is 15.
alpha: Alpha for Adam. Default is 0.8.
beta: Beta for Adam. Default is 0.999.
epsilon: Denominator term for smoothing. Default is 1e-8.
batch_size: Batch size. Default is 64.
max_epochs: Number of full passes through the training data. Default is 15.
dropout: Dropout for regularization. Probability of dropping input. Default is 0.5.
init_from: Initialize network parameters from checkpoint at this path.
save_every: Number of iterations between checkpoints. Default is 1000.
train_fc7_file: Path to fc7 features of training set. Default is data/train_fc7.t7.
fc7_image_id_file: Path to fc7 image ids of training set. Default is data/train_fc7_image_id.t7.
val_fc7_file: Path to fc7 features of validation set. Default is data/val_fc7.t7.
val_fc7_image_id_file: Path to fc7 image ids of validation set. Default is data/val_fc7_image_id.t7.
data_dir: Data directory. Default is data.
checkpoint_dir: Checkpoint directory. Default is checkpoints.
savefile: Filename to save checkpoint to. Default is vqa.
gpuid: 0-indexed id of GPU to use. Default is -1 = CPU.
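To make the interaction between learning_rate, learning_rate_decay and learning_rate_decay_after concrete, here is a small Python sketch of one plausible schedule. It assumes the rate is held constant through epoch learning_rate_decay_after and is then multiplied by the decay factor once per epoch. That per-epoch convention and the function name learning_rate_at are assumptions for illustration; check train.lua for the exact behavior.

```python
def learning_rate_at(epoch, base_lr=4e-4, decay=0.95, decay_after=15):
    """Learning rate for a 1-indexed epoch under the assumed schedule.

    Assumption: no decay through epoch `decay_after`, then one
    multiplication by `decay` for every epoch past it.
    """
    if epoch <= decay_after:
        return base_lr
    return base_lr * decay ** (epoch - decay_after)

# Print the assumed schedule for a few epochs.
for epoch in (1, 15, 16, 20):
    print(epoch, learning_rate_at(epoch))
```

Under this assumption, a run with the default max_epochs of 15 would never reach the decay phase, so the decay options would only matter when training for more epochs.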

Testing

th predict.lua -checkpoint_file checkpoints/vqa_epoch23.26_0.4610.t7 -input_image_path data/train2014/COCO_train2014_000000405541.jpg -question 'What is the cat on?'

Options

checkpoint_file: Path to model checkpoint to initialize network parameters from.
input_image_path: Path to input image.
question: Question string.
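Questions that contain an apostrophe (for example "What color is the frisbee that's upside down?") are awkward to quote by hand in a shell. One way around this, sketched here in Python, is to build the argument list programmatically. The helper name build_predict_command is hypothetical; the three flags are the ones listed above.

```python
import shlex

def build_predict_command(checkpoint_file, input_image_path, question):
    """Assemble the argument list for predict.lua using the documented flags."""
    return [
        "th", "predict.lua",
        "-checkpoint_file", checkpoint_file,
        "-input_image_path", input_image_path,
        "-question", question,
    ]

cmd = build_predict_command(
    "checkpoints/vqa_epoch23.26_0.4610.t7",
    "data/train2014/COCO_train2014_000000405541.jpg",
    "What color is the frisbee that's upside down?",
)
# shlex.join produces a shell-safe string, useful for logging or copy-paste.
print(shlex.join(cmd))
```

On a machine with Torch installed, the list can be passed directly to subprocess.run(cmd, check=True), which bypasses shell quoting altogether.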

Sample predictions

Randomly sampled image-question pairs from the VQA test set, and answers predicted by the VIS+LSTM model.

Q: What animals are those? A: Sheep

Q: What color is the frisbee that's upside down? A: Red

Q: What is flying in the sky? A: Kite

Q: What color is court? A: Blue

Q: What is in the standing person's hands? A: Bat

Q: Are they riding horses both the same color? A: No

Q: What shape is the plate? A: Round

Q: Is the man wearing socks? A: Yes

Q: What is over the woman's left shoulder? A: Fork

Q: Where are the pink flowers? A: On wall

Implementation Details

Last hidden layer image features from VGG-19
Zero-padded question sequences for batched implementation
Training questions are filtered for top_n answers, top_n = 1000 by default (~87% coverage)
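The last two points can be sketched in a few lines of Python. This is an illustration of the general technique, not the repository's Lua code: the function names, the (question, answer) tuple layout, the use of 0 as the padding id and padding on the right are all assumptions.

```python
from collections import Counter

def filter_top_answers(examples, top_n=1000):
    """Keep only examples whose answer is among the top_n most frequent.

    `examples` is a list of (question, answer) pairs (assumed layout).
    Returns the kept examples and the fraction of examples retained.
    """
    counts = Counter(answer for _, answer in examples)
    keep = {answer for answer, _ in counts.most_common(top_n)}
    kept = [(q, a) for q, a in examples if a in keep]
    coverage = len(kept) / len(examples) if examples else 0.0
    return kept, coverage

def pad_questions(token_id_seqs, pad_id=0):
    """Pad token-id sequences with pad_id so they share one length.

    Padding on the right is an assumption; the padded batch can then be
    stored as a single rectangular tensor.
    """
    max_len = max((len(seq) for seq in token_id_seqs), default=0)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in token_id_seqs]

# Toy data: three distinct answers with different frequencies.
examples = [("q1", "yes"), ("q2", "yes"), ("q3", "yes"),
            ("q4", "no"), ("q5", "no"), ("q6", "2")]
kept, coverage = filter_top_answers(examples, top_n=2)
batch = pad_questions([[4, 7, 2], [9], [3, 5]])
```

The coverage value returned here is the same kind of statistic as the ~87% figure quoted above: the share of training questions whose answer survives the top_n cut.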

Pretrained model and data files

To reproduce results shown on this page or try your own image-question pairs, download the following and run predict.lua with the appropriate paths.

vqa_epoch23.26_0.4610.t7 (Serialized using Lua51) [GPU] [CPU]
answers_vocab.t7
questions_vocab.t7
data.t7

References

Exploring Models and Data for Image Question Answering, Ren et al., NIPS 2015
VQA: Visual Question Answering, Antol et al., ICCV 2015

License

MIT
