
Modified PDVC with Semantic Alignment

Modified Implementation for End-to-End Dense Video Captioning with Parallel Decoding and Semantic Alignment [paper]

Original Implementation for End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021) [paper]

This repo supports:

  • two video captioning tasks: dense video captioning and video paragraph captioning
  • dataset: YouCook2
  • video features: TSN
  • visualization of the generated captions of your own videos


Updates

  • (2023.05.05) added a tuner network to the PDVC framework to semantically align input video features. Tuner .pth files can be found in ./model_files.
  • (2021.11.19) added code for running PDVC on raw videos and visualizing the generated captions (supports Chinese and other non-English languages)
  • (2021.11.19) added pretrained models with TSP features. They achieve 9.03 METEOR (2021) and 6.05 SODA_c, very competitive results on ActivityNet Captions without self-critical sequence training.
  • (2021.08.29) added TSN pretrained models and support for YouCook2

Introduction

The basis of this model is PDVC, a simple yet effective framework for end-to-end dense video captioning with parallel decoding that formulates dense caption generation as a set prediction task. We show that adding a tuner network which semantically aligns the input video features improves the overall performance of PDVC across all metrics. The tuner models have been trained exclusively on the YouCook2 dataset.

Architecture overview: pdvc.png
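
The exact tuner architectures (Linear, Conv1, Conv2) live in ./model_files and the training notebook; the snippet below is only a minimal sketch of the alignment idea, assuming a tuner that projects TSN video features toward the CLIP ViT-L/14@336px text-embedding space (768-d) and a cosine-similarity alignment loss. Names, dimensions, and the loss are illustrative assumptions, not the repo's actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearTuner(nn.Module):
    # Illustrative tuner: maps per-frame TSN video features (vid_dim is an
    # assumption) toward the CLIP text-embedding space (768-d for ViT-L/14@336px).
    def __init__(self, vid_dim=3072, clip_dim=768):
        super().__init__()
        self.proj = nn.Linear(vid_dim, clip_dim)

    def forward(self, vid_feats):           # (T, vid_dim)
        return self.proj(vid_feats)         # (T, clip_dim)

def alignment_loss(tuned_feats, clip_text_feat):
    # Cosine alignment between the pooled (mean over time) tuned video features
    # and the CLIP embedding of the matching caption.
    v = F.normalize(tuned_feats.mean(dim=0), dim=-1)
    t = F.normalize(clip_text_feat, dim=-1)
    return 1.0 - (v * t).sum()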

Installation

Environment: Linux, GCC >= 5.4, CUDA >= 9.2, Python >= 3.7, PyTorch >= 1.5.1

  1. Clone the repo
git clone --recursive https://github.com/ttengwang/PDVC.git
  2. Create a virtual environment with conda
conda create -n PDVC python=3.7
source activate PDVC
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
conda install ffmpeg
pip install -r requirement.txt
sudo apt-get install build-essential
sudo apt-get install ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo apt install nvidia-cuda-toolkit
  3. Compile the deformable attention layer (requires GCC >= 5.4).
cd pdvc/ops
sh make.sh
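
After compilation, a quick sanity check (not part of the repo, just a suggested snippet) is to confirm that PyTorch sees the GPU and that the compiled deformable-attention extension imports; the module path below assumes the layout of the original PDVC repo and should be run from the repo root.

import torch

# Confirm PyTorch and CUDA are set up before training.
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# Import the compiled deformable attention op; a failure here usually means
# make.sh did not build against the active CUDA toolkit.
from pdvc.ops.modules import MSDeformAttn  # noqa: F401
print("Deformable attention op imported successfully.")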

Running PDVC on Your Own Videos

A sample video captioned with the original PDVC is shown below. Instructions can be found here.

demo.gif

Training and Validation

Download Video Features

cd data/yc2/features
bash download_yc2_tsn_features.sh
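
To verify the download, a short check like the one below can confirm that the per-video TSN feature files load with the expected (num_frames, feature_dim) shape; the .npy extension and directory layout are assumptions about what the script unpacks, so adjust the glob pattern to match the actual contents.

import glob
import numpy as np

# Assumed location/format: one .npy feature file per YouCook2 video.
feature_files = sorted(glob.glob("data/yc2/features/**/*.npy", recursive=True))
print(f"Found {len(feature_files)} feature files")

for path in feature_files[:3]:
    feats = np.load(path)
    # Expect a (num_frames, feature_dim) array per video.
    print(path, feats.shape, feats.dtype)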

Dense Video Captioning

  1. Train and Eval PDVC
# Training

# Specify the model path and type on lines 32 and 230 of train.py and lines 24 and 247 of eval_utils.py

config_path=cfgs/yc2_tsn_pdvc.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID}
# The script will evaluate the model for every epoch. The results and logs are saved in `./save`.

# Evaluation

eval_folder=yc2_tsn_pdvc_baseline # specify the folder to be evaluated
eval_caption_file=data/yc2/captiondata/yc2_val.json
python eval.py --eval_folder ${eval_folder} --eval_caption_file ${eval_caption_file} --eval_transformer_input_type queries
# This script returns the SODA_c scores

eval_json=save/yc2_tsn_pdvc_baseline/2023-04-18-03-29-07_yc2_tsn_pdvc_v_2023-04-18-00-02-08_epoch19_num457_alpha1.0.json_rerank_alpha1.0_temp2.0.json
# Replace this with the json file in the save folder generated during training

python densevid_eval3/evaluate2018.py -v -s ${eval_json} -r data/yc2/captiondata/yc2_val.json
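
The results file written to ./save appears to follow the ActivityNet/YouCook2 dense-captioning submission layout (a top-level "results" dict mapping video ids to lists of {"sentence", "timestamp"} entries). Under that assumption, a small reader like the one below can help inspect predictions; it is a convenience snippet, not part of the repo.

import json

# Replace with the results json generated in your ./save folder.
eval_json = "save/yc2_tsn_pdvc_baseline/<your_results_file>.json"

with open(eval_json) as f:
    predictions = json.load(f)

# Assumed layout: {"results": {video_id: [{"sentence": ..., "timestamp": [start, end]}, ...]}}
for video_id, events in list(predictions["results"].items())[:2]:
    print(video_id)
    for event in events:
        start, end = event["timestamp"]
        print(f"  [{start:7.1f}s - {end:7.1f}s] {event['sentence']}")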

  2. Tuner Model Training Pipeline
# Extract caption features using CLIP

jupyter run Interacting_with_CLIP.ipynb
# Specify the path to the caption data in the notebook
# Current tuner architectures only support CLIP ViT-L/14@336px features

# Train the tuner

jupyter run IDL_Project_Tuner_Training.ipynb
# Specify the model to train by setting model = ....
# After training, save the .pth file under the name {model}_imp
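
The notebook handles the actual extraction; for reference, a minimal sketch of caption-feature extraction with the openai/CLIP package is shown below. The caption file, its {video_id: {"sentences": [...]}} structure, and the output filename are assumptions based on the YouCook2 data layout used elsewhere in this repo, and batching/truncation details may differ from the notebook.

import json
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
# The tuner expects ViT-L/14@336px text features (768-d).
model, _ = clip.load("ViT-L/14@336px", device=device)

# Assumed caption file and structure: {video_id: {"sentences": [...], ...}}
with open("data/yc2/captiondata/yc2_train.json") as f:
    captiondata = json.load(f)

caption_feats = {}
with torch.no_grad():
    for video_id, ann in captiondata.items():
        tokens = clip.tokenize(ann["sentences"], truncate=True).to(device)
        caption_feats[video_id] = model.encode_text(tokens).float().cpu()  # (num_captions, 768)

torch.save(caption_feats, "yc2_train_clip_text_feats.pth")  # placeholder output name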

Performance

Dense video captioning (with learnt proposals)

Model           | Features | config_path           | Url   | BLEU4       | METEOR      | CIDEr        | SODA_c
----------------|----------|-----------------------|-------|-------------|-------------|--------------|------------
Baseline PDVC   | TSN      | cfgs/yc2_tsn_pdvc.yml | model | 0.76 ± 0.05 | 4.39 ± 0.07 | 20.68 ± 0.21 | 4.47 ± 0.87
Linear          | TSN      | cfgs/yc2_tsn_pdvc.yml | model | 0.87 ± 0.06 | 4.74 ± 0.09 | 21.76 ± 0.04 | 4.45 ± 1.13
Conv1           | TSN      | cfgs/yc2_tsn_pdvc.yml | model | 0.90 ± 0.02 | 4.53 ± 0.07 | 22.32 ± 0.05 | 4.50 ± 1.48
Conv1 w/ Linear | TSN      | cfgs/yc2_tsn_pdvc.yml | model | 0.77 ± 0.15 | 4.48 ± 0.02 | 21.07 ± 0.92 | 4.47 ± 1.49
Conv2           | TSN      | cfgs/yc2_tsn_pdvc.yml | model | 0.40 ± 0.08 | 3.35 ± 0.01 | 14.34 ± 0.02 | 3.53 ± 0.71

Acknowledgement

The implementation of the Deformable Transformer is mainly based on Deformable DETR. The implementation of the captioning head is based on ImageCaptioning.pytorch. We thank the authors for their efforts.

The base framework for PDVC is located here. We encourage you to take a look at their repo.

@inproceedings{wang2021end,
  title={End-to-End Dense Video Captioning with Parallel Decoding},
  author={Wang, Teng and Zhang, Ruimao and Lu, Zhichao and Zheng, Feng and Cheng, Ran and Luo, Ping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6847--6857},
  year={2021}
}

About

11.785: Introduction to Deep Learning Final Project
