Merge pull request #4 from tsaishien-chen/main
Push code
AliaksandrSiarohin authored Mar 2, 2024
2 parents d58f78c + b7206e3 commit bd9a8cd
Showing 386 changed files with 44,352 additions and 168 deletions.
95 changes: 93 additions & 2 deletions README.md
@@ -16,10 +16,101 @@ Ming-Hsuan Yang,
Sergey Tulyakov

<!-- [Arxiv Report](https://arxiv.org/abs/2307.04725) | [Project Page](https://snap-research.github.io/Panda-70M) -->
[![arXiv](https://img.shields.io/badge/arXiv-2312.00000-b31b1b.svg)](https://arxiv.org/abs/2312.00000)
[![arXiv](https://img.shields.io/badge/arXiv-2402.19479-b31b1b.svg)](https://arxiv.org/abs/2402.19479)
[![Project Page](https://img.shields.io/badge/Project-Website-green)](https://snap-research.github.io/Panda-70M)

*Code is coming soon!*
## Introduction
Panda-70M is a large-scale dataset with 70M high-quality video-caption pairs.
This repository has three sections:
- [Dataset Dataloading](./dataset_dataloading) includes the csv files listing the data of Panda-70M and the code to download the dataset.
- [Splitting](./splitting) includes the code to split a long video into multiple semantics-consistent short clips.
- [Captioning](./captioning) includes the proposed video captioning model trained on Panda-70M.

## Dataset
### Collection Pipeline
<p align="center" width="100%">
<a target="_blank"><img src="assets/collection_pipeline.gif" style="width: 100%; min-width: 200px; display: block; margin: auto;"></a>
</p>

### Download
| Split | Download | # Source Videos | # Samples | Video Duration | Storage Space|
|-----------------|----------|-----------------|-----------|----------------|--------------|
| Training (full) | [link](https://drive.google.com/file/d/1DeODUcdJCEfnTjJywM-ObmrlVg-wsvwz/view?usp=sharing) (2.01 GB) | 3,779,763 | 70,723,513 | 167 khrs | ~36 TB |
| Training (10M) | [link](https://drive.google.com/file/d/1Lrsb65HTJ2hS7Iuy6iPCmjoc3abbEcAX/view?usp=sharing) (381 MB) | 3,755,240 | 10,473,922 | 37.0 khrs | ~8.0 TB |
| Training (2M) | [link](https://drive.google.com/file/d/1jWTNGjb-hkKiPHXIbEA5CnFwjhA-Fq_Q/view?usp=sharing) (86.5 MB) | 800,000 | 2,400,000 | 7.56 khrs | ~1.6 TB |
| Validation | [link](https://drive.google.com/file/d/1cTCaC7oJ9ZMPSax6I4ZHvUT-lqxOktrX/view?usp=sharing) (803 KB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
| Testing | [link](https://drive.google.com/file/d/1ee227tHEO-DT8AkX7y2q6-bfAtUL-yMI/view?usp=sharing) (803 KB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |

More details can be found in the [Dataset Dataloading](./dataset_dataloading) section.
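
As a rough sanity check on the table above, the average clip length of each split follows from dividing the total duration by the number of samples. The sketch below copies the (rounded) figures from the table; the helper name `avg_clip_seconds` is hypothetical and not part of this repository.

```python
# Figures copied from the download table: (number of samples, total duration in hours).
SPLITS = {
    "Training (full)": (70_723_513, 167_000.0),
    "Training (10M)": (10_473_922, 37_000.0),
    "Training (2M)": (2_400_000, 7_560.0),
    "Validation": (6_000, 18.5),
    "Testing": (6_000, 18.5),
}


def avg_clip_seconds(num_samples: int, total_hours: float) -> float:
    """Average clip length in seconds (hypothetical helper, not from the repo)."""
    return total_hours * 3600.0 / num_samples


if __name__ == "__main__":
    for name, (samples, hours) in SPLITS.items():
        print(f"{name}: {avg_clip_seconds(samples, hours):.1f} s per clip")
```

Because the durations in the table are rounded, the results are estimates only; under these figures the full training split averages roughly 8.5 seconds per clip.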

## Demonstration
### Video-Caption Pairs in Panda-70M
<table class="center">
<tr>
<td width=33.3% style="border: none"><img src="./assets/aIPu1xGNbhc.49.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/AIyw1FO1aqs.57.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/Kb8ON0iCs38.97.gif"></td>
</tr>
<tr style="text-align: center;">
<td width=33.3% style="border: none">A rhino and a lion are fighting in the dirt.</td>
<td width=33.3% style="border: none">A person is holding a long haired dachshund in their arms.</td>
<td width=33.3% style="border: none">A rocket launches into space on the launch pad.</td>
</tr>
</table>

<table class="center">
<tr>
<td width=33.3% style="border: none"><img src="./assets/AvVDsFBc6bA.0.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/S-1NdEjjg7c.58.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/10Y6wIEuG00.62.gif"></td>
</tr>
<tr style="text-align: center;">
<td width=33.3% style="border: none">A person is kneading dough and putting jam on it.</td>
<td width=33.3% style="border: none">A little boy is playing with a basketball in the city.</td>
<td width=33.3% style="border: none">A 3d rendering of a zoo with animals and a train.</td>
</tr>
</table>

<table class="center">
<tr>
<td width=33.3% style="border: none"><img src="./assets/_uQs-YDb5VA.9.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/CgcadSRtAag.140.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/1NMpoAqzJfY.25.gif"></td>
</tr>
<tr style="text-align: center;">
<td width=33.3% style="border: none">A person in blue gloves is connecting an electrical supply to an injector.</td>
<td width=33.3% style="border: none">There is a beach with waves and rocks in the foreground, and a city skyline in the background.</td>
<td width=33.3% style="border: none">It is a rally car driving on a dirt road in the countryside, with people watching from the side of the road.</td>
</tr>
</table>

<sup>**We will remove video samples from our dataset / GitHub / project webpage upon request. Please contact tsaishienchen at gmail dot com to make a request.</sup>

Please check [here](https://snap-research.github.io/Panda-70M/more_samples) for more samples.

### Long Video Splitting and Captioning
https://github.com/tsaishien-chen/Panda-70M/assets/43384650/481b369a-122b-4571-a83e-416201ebd6c9

https://github.com/tsaishien-chen/Panda-70M/assets/43384650/fee5468d-815f-41a7-8202-bdb3b60fcac7

## License of Panda-70M

See [license](https://github.com/tsaishien-chen/Panda-70M/blob/main/LICENSE).
The video samples are collected from a publicly available dataset.
Users must follow [the related license](https://raw.githubusercontent.com/microsoft/XPretrain/main/hd-vila-100m/LICENSE) to use these video samples.

## Citation

If you find this project useful for your research, please cite our paper. :blush:

```bibtex
@article{chen2024panda70M,
title = {Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers},
author = {Chen, Tsai-Shien and Siarohin, Aliaksandr and Menapace, Willi and Deyneka, Ekaterina and Chao, Hsiang-wei and Jeon, Byung Eun and Fang, Yuwei and Lee, Hsin-Ying and Ren, Jian and Yang, Ming-Hsuan and Tulyakov, Sergey},
journal = {arXiv preprint arXiv:2402.19479},
year = {2024}
}
```

## Contact Information
**Tsai-Shien Chen**: [tsaishienchen@gmail.com](mailto:tsaishienchen@gmail.com)
Binary file added assets/10Y6wIEuG00.62.gif
Binary file added assets/1NMpoAqzJfY.25.gif
Binary file added assets/AIyw1FO1aqs.57.gif
Binary file added assets/AvVDsFBc6bA.0.gif
Binary file added assets/CgcadSRtAag.140.gif
Binary file added assets/Kb8ON0iCs38.97.gif
Binary file added assets/S-1NdEjjg7c.58.gif
Binary file added assets/_uQs-YDb5VA.9.gif
Binary file added assets/aIPu1xGNbhc.49.gif
Binary file added assets/collection_pipeline.gif
92 changes: 92 additions & 0 deletions captioning/README.md
@@ -0,0 +1,92 @@
# 🐼 Panda-70M: Video Captioning

## Introduction
We propose a video captioning model that generates a caption for a short video clip.
The model has a vision branch (green) and a text branch (blue), so captioning can draw on both video and text inputs.
We release the checkpoint trained on Panda-70M.
<p align="center" width="100%">
<a target="_blank"><img src="assets/architecture.png" style="width: 60%; min-width: 200px; display: block; margin: auto;"></a>
</p>

## Preparations
### Setup Repository and Environment
```
git clone https://github.com/tsaishien-chen/Panda-70M.git
cd Panda-70M/captioning
# create a conda environment
conda create --name panda70m_captioning python=3.9 -y
conda activate panda70m_captioning
pip install -r requirements.txt
# install a Java runtime (default-jre), which pycocoevalcap's tokenizer and METEOR scorer call
apt-get update -y
apt-get install -y default-jre
```
### Download Checkpoint
You can download the checkpoint manually [here](https://drive.google.com/file/d/1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5/view?usp=sharing) (3.82 GB) and move it to the `checkpoint` folder, or run:
```
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5" -O checkpoint/checkpoint_best.pth && rm -rf /tmp/cookies.txt
```
### Prepare Vicuna
- Please follow the [instructions](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md) from FastChat to install the **vicuna-7b-v0** weights.
- **[Note]** You need to apply the delta weights. After processing, move the weights to the `vicuna_weights/vicuna-7b-v0` folder; the file list should look like [this](https://github.com/tsaishien-chen/Panda-70M/blob/main/captioning/vicuna_weights/vicuna-7b-v0/README.md).
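
Before running the demo, it can help to confirm that both assets are where the evaluation config expects them. The following is a minimal sketch, assuming it is run from the `captioning` folder; the two relative paths are the ones referenced in `eval_configs/panda70M_eval.yaml`, while the helper `missing_assets` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Paths as referenced in eval_configs/panda70M_eval.yaml, relative to captioning/.
REQUIRED = {
    "checkpoint": Path("checkpoint/checkpoint_best.pth"),
    "vicuna": Path("vicuna_weights/vicuna-7b-v0"),
}


def missing_assets(root: Path) -> list:
    """Return the names of required assets absent under root (hypothetical helper)."""
    return [name for name, rel in REQUIRED.items() if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_assets(Path("."))
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Checkpoint and Vicuna weights found.")
```

This only checks that the paths exist; it does not verify that the Vicuna folder contains the expected files or that the delta weights were applied.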

## Quick Demo
```
python inference.py --video-list inputs/video_list.txt --prompt-list inputs/prompt_list.txt
```
The code captions the two test videos listed in `video_list.txt`, using the corresponding prompts in `prompt_list.txt` as extra text input. Here are some output examples:
<table class="center">
<tr style="line-height: 0">
<td width=30% style="border: none; text-align: center"><b>Input Video</b></td>
<td width=50% style="border: none; text-align: center"><b>Input Text</b></td>
<td width=20% style="border: none; text-align: center"><b>Output Caption</b></td>
</tr>
<tr>
<td width=30% style="border: none"><img src="assets/video1.gif" style="width:100%"></td>
<td width=50% style="border: none; text-align: center"><sup>
Some information about a video you will get:<br>
Transcription: Today we're gonna take a quick look at the 1966 Ford Mustang GT 289 v8 under the hood.<br>
Metadata: ['Old VS New - 1966 Ford Mustang GT & 2018 Ford Mustang | Just a Quick Look', 'Lets check out this beautiful 1966 Ford Mustang GT 289 in the showroom with the 2018 Ford Mustang!']<br>
Please look at the video and faithfully summarize it in one sentence.</sup></td>
<td width=20% style="border: none; text-align: center">A red mustang parked in a showroom with american flags hanging from the ceiling.</td>
</tr>
<tr>
<td width=30% style="border: none"><img src="assets/video2.gif" style="width:100%"></td>
<td width=50% style="border: none; text-align: center">Please faithfully summarize the following video in one sentence.</td>
<td width=20% style="border: none; text-align: center">An aerial view of a city with a river running through it.</td>
</tr>
</table>

<sup>**We will remove video samples from our dataset / GitHub / project webpage upon request. Please contact tsaishienchen at gmail dot com to make a request.</sup>

- **[Note]** You might get different outputs because LLM generation is stochastic.
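
The two input files are paired by position: `inference.py` reads one video path per line from the video list and splits the prompt list on blank lines, so a prompt may span several lines. The sketch below follows the same splitting rule. The function name `pair_inputs`, the whitespace stripping, and the length check are additions of this sketch and are not in the script.

```python
from typing import List, Optional, Tuple

# Default prompt used by inference.py when no prompt list is given.
DEFAULT_PROMPT = "Please faithfully summarize the following video in one sentence."


def pair_inputs(video_list_text: str,
                prompt_list_text: Optional[str] = None) -> List[Tuple[str, str]]:
    """Pair each video path with its prompt (hypothetical helper, not from the repo)."""
    videos = video_list_text.splitlines()
    if prompt_list_text is None:
        return [(video, DEFAULT_PROMPT) for video in videos]
    # Prompts are separated by a blank line; stripping is an addition of this sketch.
    prompts = [p.strip() for p in prompt_list_text.split("\n\n")]
    if len(prompts) != len(videos):
        raise ValueError(f"{len(videos)} videos but {len(prompts)} prompts")
    return list(zip(videos, prompts))


if __name__ == "__main__":
    videos_txt = "inputs/video1.mp4\ninputs/video2.mp4\n"
    prompts_txt = "Transcription: ...\nPlease summarize.\n\n" + DEFAULT_PROMPT + "\n"
    for video, prompt in pair_inputs(videos_txt, prompts_txt):
        print(video, "->", repr(prompt))
```

Checking the counts up front surfaces a mismatched prompt list early, before any video is loaded.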

## Evaluation
### Zero-shot Captioning Performance
| | BLEU-4 | ROUGE-L | METEOR | CIDEr | BertScore |
|------------|--------|---------|--------|-------|-----------|
| **MSRVTT** | 25.4% | 50.1% | 27.7% | 31.5% | 87.9% |
| **MSVD** | 32.8% | 61.2% | 35.3% | 49.2% | 90.2% |

- **[Note]** The results might not be reproduced exactly because LLM generation is stochastic; expect a deviation of about ±0.5%.

### Prepare Testing Data
- You can download the video samples here [[MSRVTT](https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip) / [MSVD](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/)] and move them to the `test_datasets/video_samples/MSRVTT` or `MSVD` folder.
- The caption annotations of the testing samples are already saved in the `test_datasets/anno_downstream` folder.

### Evaluation
```
# MSRVTT
python inference.py --video-list test_datasets/video_list/msrvtt_test.txt --output-json msrvtt_caption.json
python compute_results.py --predict-json msrvtt_caption.json --target-json test_datasets/anno_downstream/msrvtt_caption_test.json
# MSVD
python inference.py --video-list test_datasets/video_list/msvd_test.txt --output-json msvd_caption.json
python compute_results.py --predict-json msvd_caption.json --target-json test_datasets/anno_downstream/msvd_caption_test.json
```
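
For reference, the two JSON files have different shapes: `inference.py` writes a dictionary mapping each video file name to its caption, while `compute_results.py` reads the target file as a list of entries with a `video` key and a `caption` list. For BERTScore, each prediction is repeated once per reference so that candidates and references align one-to-one. The sketch below illustrates that alignment step only; the function name `flatten_for_bertscore` and the toy data are made up for illustration.

```python
from typing import Dict, List, Tuple


def flatten_for_bertscore(predictions: Dict[str, str],
                          targets: List[dict]) -> Tuple[List[str], List[str]]:
    """Repeat each prediction once per reference caption (hypothetical helper)."""
    candidates: List[str] = []
    references: List[str] = []
    for item in targets:
        captions = item["caption"]
        candidates += [predictions[item["video"]]] * len(captions)
        references += captions
    return candidates, references


if __name__ == "__main__":
    # Made-up toy data in the shapes the evaluation script reads.
    predictions = {"video1.mp4": "A red car parked in a showroom."}
    targets = [{"video": "video1.mp4",
                "caption": ["a car in a showroom", "a red mustang on display"]}]
    candidates, references = flatten_for_bertscore(predictions, targets)
    print(len(candidates), len(references))
```

One consequence of this layout is that a video with more reference captions contributes more pairs to the mean BERTScore.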

## Acknowledgements
The code for video captioning is built upon [Video-LLaMA](https://github.com/DAMO-NLP-SG/Video-LLaMA).
Thanks for sharing the great work!
Binary file added captioning/assets/architecture.png
Binary file added captioning/assets/video1.gif
Binary file added captioning/assets/video2.gif
1 change: 1 addition & 0 deletions captioning/checkpoint/README.md
@@ -0,0 +1 @@
Put the model checkpoint here
56 changes: 56 additions & 0 deletions captioning/compute_results.py
@@ -0,0 +1,56 @@
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from bert_score import score as bert_score_compute
from tqdm import tqdm
from collections import defaultdict
import pandas as pd
import argparse
import json


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Evaluation")
    parser.add_argument("--predict-json", required=True, help="prediction json file.")
    parser.add_argument("--target-json", required=True, help="ground truth json file.")
    args = parser.parse_args()

    # note: this rebinds `pd`, shadowing the (unused) pandas import above
    pd = json.load(open(args.predict_json))
    gt = json.load(open(args.target_json))
    pds = defaultdict(list)
    gts = defaultdict(list)
    pds_all = []
    gts_all = []

    for i, data in enumerate(gt):
        video, captions = data["video"], data["caption"]
        pds[i].append({"image_id":video, "caption":pd[video]})
        # repeat the prediction once per reference so the BERTScore inputs align
        pds_all += ([pd[video]]*len(captions))

        for caption in captions:
            gts[i].append({"image_id":video, "caption":caption})
        gts_all += captions

    tokenizer = PTBTokenizer()
    pds = tokenizer.tokenize(pds)
    gts = tokenizer.tokenize(gts)
    scorers = [(Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
               (Meteor(),"METEOR"),
               (Rouge(), "ROUGE_L"),
               (Cider(), "CIDEr")]

    eval_dict = {}
    for scorer, method in scorers:
        score, scores = scorer.compute_score(gts, pds)
        if scorer.method() == "Bleu":
            # Bleu returns scores for n = 1..4; keep only BLEU-4
            eval_dict["BLEU4"] = score[3]
        else:
            eval_dict[scorer.method()] = score

    # bert_score returns (precision, recall, F1); the F1 values are averaged
    _, _, score = bert_score_compute(pds_all, gts_all, lang='en', verbose=False)
    eval_dict["BERTScore"] = score.mean().item()

    for k, v in eval_dict.items():
        print("%s: %.2f%%"%(k, v*100))
39 changes: 39 additions & 0 deletions captioning/eval_configs/panda70M_eval.yaml
@@ -0,0 +1,39 @@
model:
  arch: video_llama
  model_type: pretrain_vicuna
  input_prompt: True
  ckpt: "checkpoint/checkpoint_best.pth"

  # Q-Former
  num_query_token: 32

  # Vicuna
  llama_model: "vicuna_weights/vicuna-7b-v0"

  # Branch
  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"
  num_video_query_token: 32
  num_text_query_token: 32
  input_vid2tex_query_embed: True
  detach_video_query_embed: True

  max_caption_len: 48
  max_prompt_len: 200
  start_sym: "<s>"
  end_sym: "</s>"

datasets:
  hdvila:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain
82 changes: 82 additions & 0 deletions captioning/inference.py
@@ -0,0 +1,82 @@
import glob
import argparse
import torch
import json
import os
from video_llama.common.config import Config
from video_llama.common.registry import registry
from video_llama.processors.video_processor import load_video
from transformers import StoppingCriteria, StoppingCriteriaList
from tqdm import tqdm


class DotDict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Inference")
    parser.add_argument("--cfg-path", default="eval_configs/panda70M_eval.yaml", help="path to configuration file.")
    parser.add_argument("--video-list", required=True, help="list of input videos.")
    parser.add_argument("--output-json", default=None, help="output json file. Leave none to print out the results.")
    parser.add_argument("--prompt-list", default=None, help="list of corresponding input prompts. Leave none if no prompt input.")
    args = parser.parse_args()
    cfg = Config(args)

    model_config = cfg.model_cfg
    model_cls = registry.get_model_class(model_config.arch)
    model = model_cls.from_config(model_config).to("cuda")
    model.eval()

    vis_processor_cfg = DotDict({"name":"alpro_video_eval", "n_frms":8, "image_size":224})
    vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)
    text_processor_cfg = DotDict({"name":"blip_caption", "max_words":100})
    text_processor = registry.get_processor_class(text_processor_cfg.name).from_config(text_processor_cfg)

    batch_size = 16

    # one video path per line; prompts are separated by a blank line
    videos = open(args.video_list, "r").read().splitlines()
    if args.prompt_list:
        prompts = open(args.prompt_list, "r").read().split("\n\n")

    results = {}
    for i in tqdm(range(0, len(videos), batch_size)):
        video_batch = []
        video_path_batch = []
        prompt_batch = []

        for j in range(i, min(i+batch_size, len(videos))):
            try:
                video_path = videos[j]
                video = load_video(video_path=video_path, n_frms=8, sampling ="uniform")
                video = vis_processor.transform(video)
                assert video.shape == torch.Size([3, 8, 224, 224])
            except Exception as e:
                # skip videos that fail to load or have an unexpected shape
                print(e)
                continue

            video_batch.append(video)
            video_path_batch.append(video_path.split('/')[-1])
            prompt_batch.append(prompts[j] if args.prompt_list else "Please faithfully summarize the following video in one sentence.")

        # every video in this batch failed: torch.stack would raise on an empty list
        if not video_batch:
            continue

        video_batch = torch.stack(video_batch).to("cuda")
        outputs = model.inference(video_batch, prompt_batch)

        # zip the prompts as well, so each printed prompt belongs to its own video
        # (the original indexed prompt_batch[j-i], which always picked the last index)
        for video_path, prompt, output in zip(video_path_batch, prompt_batch, outputs):
            output = output.capitalize()+"."
            if args.output_json:
                results[video_path] = output
            else:
                print("====="*20)
                print("[Input video]", video_path)
                print("[Input prompt]")
                print(prompt)
                print("[Output caption]", output)

    if args.output_json:
        results = json.dumps(results, indent = 4)
        with open(args.output_json, "w") as f:
            f.write(results)
6 changes: 6 additions & 0 deletions captioning/inputs/prompt_list.txt
@@ -0,0 +1,6 @@
Some information about a video you will get:
Transcription: Today we're gonna take a quick look at the 1966 Ford Mustang GT 289 v8 under the hood.
Metadata: ['Old VS New - 1966 Ford Mustang GT & 2018 Ford Mustang | Just a Quick Look', 'Lets check out this beautiful 1966 Ford Mustang GT 289 in the showroom with the 2018 Ford Mustang!']
Please look at the video and faithfully summarize it in one sentence.

Please faithfully summarize the following video in one sentence.
Binary file added captioning/inputs/video1.mp4
Binary file not shown.
Binary file added captioning/inputs/video2.mp4
Binary file not shown.
2 changes: 2 additions & 0 deletions captioning/inputs/video_list.txt
@@ -0,0 +1,2 @@
inputs/video1.mp4
inputs/video2.mp4