
Modified PDVC with Semantic Alignment

Modified Implementation for End-to-End Dense Video Captioning with Parallel Decoding and Semantic Alignment [paper]

Original Implementation for End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021) [paper]

This repo supports:

  • two video captioning tasks: dense video captioning and video paragraph captioning
  • dataset: YouCook2
  • video features: TSN
  • visualization of the generated captions of your own videos


Updates

  • (2023.05.05) added a tuner network to the PDVC framework to semantically align input video features. Tuner .pth files can be found in ./model_files.
  • (2021.11.19) added code for running PDVC on raw videos and visualizing the generated captions (supports Chinese and other non-English languages)
  • (2021.11.19) added pretrained models with TSP features. They achieve 9.03 METEOR (2021) and 6.05 SODA_c, very competitive results on ActivityNet Captions without self-critical sequence training.
  • (2021.08.29) added TSN pretrained models and support for YouCook2

Introduction

The basis of this model is PDVC, a simple yet effective framework for end-to-end dense video captioning with parallel decoding that formulates dense caption generation as a set prediction task. We show that adding a tuner network which semantically aligns the input video features improves the overall performance of PDVC across all metrics. The tuner models have been trained exclusively on the YouCook2 dataset.

Architecture overview: pdvc.png
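
The exact tuner architectures (Linear, Conv1, Conv2) live in ./model_files and the training notebook; the snippet below is only a minimal sketch of the alignment idea, assuming a tuner that projects TSN video features toward the CLIP ViT-L/14@336px text-embedding space (768-d) and a cosine-similarity alignment loss. Names, dimensions, and the loss are illustrative assumptions, not the repo's actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearTuner(nn.Module):
    # Illustrative tuner: maps per-frame TSN video features (vid_dim is an
    # assumption) toward the CLIP text-embedding space (768-d for ViT-L/14@336px).
    def __init__(self, vid_dim=3072, clip_dim=768):
        super().__init__()
        self.proj = nn.Linear(vid_dim, clip_dim)

    def forward(self, vid_feats):           # (T, vid_dim)
        return self.proj(vid_feats)         # (T, clip_dim)

def alignment_loss(tuned_feats, clip_text_feat):
    # Cosine alignment between the pooled (mean over time) tuned video features
    # and the CLIP embedding of the matching caption.
    v = F.normalize(tuned_feats.mean(dim=0), dim=-1)
    t = F.normalize(clip_text_feat, dim=-1)
    return 1.0 - (v * t).sum()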

Installation

Environment: Linux, GCC >= 5.4, CUDA >= 9.2, Python >= 3.7, PyTorch >= 1.5.1

  1. Clone the repo
git clone --recursive https://github.com/ttengwang/PDVC.git
  2. Create a virtual environment with conda
conda create -n PDVC python=3.7
source activate PDVC
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
conda install ffmpeg
pip install -r requirement.txt
sudo apt-get install build-essential
sudo apt-get install ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo apt install nvidia-cuda-toolkit
  3. Compile the deformable attention layer (requires GCC >= 5.4).
cd pdvc/ops
sh make.sh
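
After compilation, a quick sanity check (not part of the repo, just a suggested snippet) is to confirm that PyTorch sees the GPU and that the compiled deformable-attention extension imports; the module path below assumes the layout of the original PDVC repo and should be run from the repo root.

import torch

# Confirm PyTorch and CUDA are set up before training.
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# Import the compiled deformable attention op; a failure here usually means
# make.sh did not build against the active CUDA toolkit.
from pdvc.ops.modules import MSDeformAttn  # noqa: F401
print("Deformable attention op imported successfully.")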

Running PDVC on Your Own Videos

A sample video captioned with the original PDVC is shown below. Instructions can be found here.

demo.gif

Training and Validation

Download Video Features

cd data/yc2/features
bash download_yc2_tsn_features.sh
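
To verify the download, a short check like the one below can confirm that the per-video TSN feature files load with the expected (num_frames, feature_dim) shape; the .npy extension and directory layout are assumptions about what the script unpacks, so adjust the glob pattern to match the actual contents.

import glob
import numpy as np

# Assumed location/format: one .npy feature file per YouCook2 video.
feature_files = sorted(glob.glob("data/yc2/features/**/*.npy", recursive=True))
print(f"Found {len(feature_files)} feature files")

for path in feature_files[:3]:
    feats = np.load(path)
    # Expect a (num_frames, feature_dim) array per video.
    print(path, feats.shape, feats.dtype)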

Dense Video Captioning

  1. Train and Eval PDVC
# Training

# Specify the model path and type on lines 32 and 230 of train.py and lines 24 and 247 of eval_utils.py

config_path=cfgs/yc2_tsn_pdvc.yml
python train.py --cfg_path ${config_path} --gpu_id ${GPU_ID}
# The script will evaluate the model for every epoch. The results and logs are saved in `./save`.

# Evaluation

eval_folder=yc2_tsn_pdvc_baseline # specify the folder to be evaluated
eval_caption_file=data/yc2/captiondata/yc2_val.json
python eval.py --eval_folder ${eval_folder} --eval_caption_file ${eval_caption_file} --eval_transformer_input_type queries
# This script returns the SODA_c scores

eval_json=save/yc2_tsn_pdvc_baseline/2023-04-18-03-29-07_yc2_tsn_pdvc_v_2023-04-18-00-02-08_epoch19_num457_alpha1.0.json_rerank_alpha1.0_temp2.0.json
# Replace this with the json file in the save folder generated during training

python densevid_eval3/evaluate2018.py -v -s ${eval_json} -r data/yc2/captiondata/yc2_val.json
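
The results file written to ./save appears to follow the ActivityNet/YouCook2 dense-captioning submission layout (a top-level "results" dict mapping video ids to lists of {"sentence", "timestamp"} entries). Under that assumption, a small reader like the one below can help inspect predictions; it is a convenience snippet, not part of the repo.

import json

# Replace with the results json generated in your ./save folder.
eval_json = "save/yc2_tsn_pdvc_baseline/<your_results_file>.json"

with open(eval_json) as f:
    predictions = json.load(f)

# Assumed layout: {"results": {video_id: [{"sentence": ..., "timestamp": [start, end]}, ...]}}
for video_id, events in list(predictions["results"].items())[:2]:
    print(video_id)
    for event in events:
        start, end = event["timestamp"]
        print(f"  [{start:7.1f}s - {end:7.1f}s] {event['sentence']}")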

  2. Tuner Model Training Pipeline
# Extract caption features using CLIP

jupyter run Interacting_with_CLIP.ipynb
# Specify the path to the caption data in the notebook
# Current tuner architectures only support CLIP ViT-L/14@336px features

# Train the tuner

jupyter run IDL_Project_Tuner_Training.ipynb
# Specify the model to train by setting model = ....
# After training, save the .pth file under the name {model}_imp
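
The notebook handles the actual extraction; for reference, a minimal sketch of caption-feature extraction with the openai/CLIP package is shown below. The caption file, its {video_id: {"sentences": [...]}} structure, and the output filename are assumptions based on the YouCook2 data layout used elsewhere in this repo, and batching/truncation details may differ from the notebook.

import json
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
# The tuner expects ViT-L/14@336px text features (768-d).
model, _ = clip.load("ViT-L/14@336px", device=device)

# Assumed caption file and structure: {video_id: {"sentences": [...], ...}}
with open("data/yc2/captiondata/yc2_train.json") as f:
    captiondata = json.load(f)

caption_feats = {}
with torch.no_grad():
    for video_id, ann in captiondata.items():
        tokens = clip.tokenize(ann["sentences"], truncate=True).to(device)
        caption_feats[video_id] = model.encode_text(tokens).float().cpu()  # (num_captions, 768)

torch.save(caption_feats, "yc2_train_clip_text_feats.pth")  # placeholder output name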

Performance

Dense video captioning (with learnt proposals)

Model           | Features | config_path           | Url   | BLEU4       | METEOR      | CIDEr        | SODA_c
----------------|----------|-----------------------|-------|-------------|-------------|--------------|------------
Baseline PDVC   | TSN      | cfgs/yc2_tsn_pdvc.yml | model | 0.76 ± 0.05 | 4.39 ± 0.07 | 20.68 ± 0.21 | 4.47 ± 0.87
Linear          | TSN      | cfgs/yc2_tsn_pdvc.yml | model | 0.87 ± 0.06 | 4.74 ± 0.09 | 21.76 ± 0.04 | 4.45 ± 1.13
Conv1           | TSN      | cfgs/yc2_tsn_pdvc.yml | model | 0.90 ± 0.02 | 4.53 ± 0.07 | 22.32 ± 0.05 | 4.50 ± 1.48
Conv1 w/ Linear | TSN      | cfgs/yc2_tsn_pdvc.yml | model | 0.77 ± 0.15 | 4.48 ± 0.02 | 21.07 ± 0.92 | 4.47 ± 1.49
Conv2           | TSN      | cfgs/yc2_tsn_pdvc.yml | model | 0.40 ± 0.08 | 3.35 ± 0.01 | 14.34 ± 0.02 | 3.53 ± 0.71

Acknowledgement

The implementation of the Deformable Transformer is mainly based on Deformable DETR. The implementation of the captioning head is based on ImageCaptioning.pytorch. We thank the authors for their efforts.

The base framework for PDVC is located here. We encourage you to take a look at their repo.

@inproceedings{wang2021end,
  title={End-to-End Dense Video Captioning with Parallel Decoding},
  author={Wang, Teng and Zhang, Ruimao and Lu, Zhichao and Zheng, Feng and Cheng, Ran and Luo, Ping},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6847--6857},
  year={2021}
}

About

11.785: Introduction to Deep Learning Final Project
