seq2seq_temporal_attention is a tool for automatic video captioning. This is an implementation of Generating Video Description using Sequence-to-sequence Model with Temporal Attention (PDF).
- Python 2 or Python 3
  To train a model, Python 2 is required.
- OpenCV
  Make sure that modules for video are included. If you encounter an error while extracting frames, you may find helpful information here: OpenCV video capture from file fails on Linux.
- Chainer
- youtube-dl
For requirements for Windows, read docs/requirements-windows.md.
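Before running the tool, it can help to confirm that the Python packages listed above are importable. The following is a minimal sketch, not part of the repository. It assumes Python 3 (`importlib.util.find_spec` is not available in Python 2) and assumes the usual import names `cv2`, `chainer`, and `youtube_dl`; the helper `check_requirements` is hypothetical. It only checks that the modules can be found, not that OpenCV was built with video support.

```python
import importlib.util

# Assumed import names: OpenCV is usually imported as cv2,
# Chainer as chainer, and youtube-dl as youtube_dl.
REQUIRED_MODULES = {
    "OpenCV": "cv2",
    "Chainer": "chainer",
    "youtube-dl": "youtube_dl",
}


def check_requirements(modules=REQUIRED_MODULES):
    """Return a dict mapping each requirement name to True if it is importable."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print("{:<12}{}".format(name, "found" if found else "MISSING"))
```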
To test out the tool, run example.sh.
It gives a caption for an excerpt of the video titled "playing wool ball with my cat : )".
Our models were trained on the Microsoft Video Description Dataset.
git clone git@github.com:aistairc/seq2seq_temporal_attention.git --recursive
./download.sh
./example.sh --gpu GPU_ID # It will generate a caption *a cat is playing with a toy*
Note: In most cases, setting GPU_ID to 0 will work.
If you want to run it without a GPU, set the parameter to -1.
This is an example command to train.
cd code
python chainer_seq2seq_att.py \
--mode train \
--gpu GPU_ID \
--batchsize 40 \
--dropout 0.3 \
--align ('dot'|'bilinear'|'concat'|'none') \
--feature feature_file_name \
output_folder
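In the command above, `('dot'|'bilinear'|'concat'|'none')` is placeholder notation: exactly one of the four values should be passed to `--align`. As a rough sketch of how the arguments fit together, the hypothetical helper below (`build_train_command` is not part of the repository) validates the alignment choice and assembles the argument list using only the flags shown above. The feature file and output folder names are the same placeholders as in the command.

```python
ALIGN_CHOICES = ("dot", "bilinear", "concat", "none")


def build_train_command(align, feature, output_folder,
                        gpu=-1, batchsize=40, dropout=0.3):
    """Assemble the training command as an argument list (hypothetical helper).

    gpu=-1 means "run without a GPU", as described in the note above.
    """
    if align not in ALIGN_CHOICES:
        raise ValueError("align must be one of {}".format(ALIGN_CHOICES))
    return [
        "python", "chainer_seq2seq_att.py",
        "--mode", "train",
        "--gpu", str(gpu),
        "--batchsize", str(batchsize),
        "--dropout", str(dropout),
        "--align", align,
        "--feature", feature,
        output_folder,
    ]


train_cmd = build_train_command("dot", "feature_file_name", "output_folder", gpu=0)
print(" ".join(train_cmd))
```

To launch training from Python instead of the shell, the list could be passed to `subprocess.check_call(train_cmd)` from inside the `code` directory.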
There are two modes for testing: test and test-batch.
The latter runs much faster, but it does not use beam search.
Be careful to specify the alignment model with --align. It has to match the one used to train your pre-trained model; otherwise the test will not work correctly.
cd code
python chainer_seq2seq_att.py \
--mode ('test'|'test-batch') \
--gpu GPU_ID \
--model path_to_model_file \
--align ('dot'|'bilinear'|'concat'|'none') \
--feature feature_file_name \
output_folder
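To illustrate why the choice between test and test-batch can affect the generated caption, the sketch below contrasts beam search with greedy decoding on a toy word model. This is not the repository's decoder: the probability table, the `<eos>` marker, and the function names are all made up for illustration, and it is only an assumption that test-batch behaves like the greedy variant.

```python
import math

EOS = "<eos>"

# Toy next-word probabilities keyed by the words generated so far.
# Purely illustrative; a real captioning model computes these from video features.
TOY_TABLE = {
    (): {"a": 0.5, "the": 0.4, EOS: 0.1},
    ("a",): {"cat": 0.4, "dog": 0.3, "toy": 0.3},
    ("the",): {"cat": 0.9, "dog": 0.1},
}


def next_probs(prefix):
    """Return next-word probabilities; unknown prefixes end the sentence."""
    return TOY_TABLE.get(prefix, {EOS: 1.0})


def greedy_decode(next_probs, max_len=5):
    """Pick the single most likely word at every step."""
    prefix, logp = (), 0.0
    for _ in range(max_len):
        probs = next_probs(prefix)
        token = max(probs, key=probs.get)
        logp += math.log(probs[token])
        if token == EOS:
            break
        prefix += (token,)
    return prefix, logp


def beam_search(next_probs, beam_width=2, max_len=5):
    """Keep the beam_width best partial sentences at every step."""
    beams = [((), 0.0)]  # unfinished hypotheses: (words, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for token, p in next_probs(prefix).items():
                score = logp + math.log(p)
                if token == EOS:
                    finished.append((prefix, score))
                else:
                    candidates.append((prefix + (token,), score))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])


greedy_words, greedy_logp = greedy_decode(next_probs)
beam_words, beam_logp = beam_search(next_probs, beam_width=2)
print("greedy:", " ".join(greedy_words), round(math.exp(greedy_logp), 2))
print("beam:  ", " ".join(beam_words), round(math.exp(beam_logp), 2))
```

In this toy table, greedy decoding commits to the most likely first word and ends up with a less probable sentence overall, while a beam of width 2 keeps the second-best first word alive and finds the more probable sentence. The cost is that beam search expands several hypotheses per step, which is consistent with the slower of the two modes being the one that uses it.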