Merge pull request #4 from tsaishien-chen/main
Push code
Showing 386 changed files with 44,352 additions and 168 deletions.
@@ -0,0 +1,92 @@
# 🐼 Panda-70M: Video Captioning

## Introduction
We propose a video captioning model that generates a caption for a short video clip.
The model has a vision branch (green) and a textual branch (blue), so captioning can draw on both video and text inputs.
We release the checkpoint trained on Panda-70M.
<p align="center" width="100%">
<a target="_blank"><img src="assets/architecture.png" style="width: 60%; min-width: 200px; display: block; margin: auto;"></a>
</p>

## Preparations
### Set Up the Repository and Environment
```
git clone https://github.com/tsaishien-chen/Panda-70M.git
cd Panda-70M/captioning
# create a conda environment
conda create --name panda70m_captioning python=3.9 -y
conda activate panda70m_captioning
pip install -r requirements.txt
# install a Java runtime (required by the pycocoevalcap tokenizer and METEOR scorer)
apt-get update -y
apt-get install -y default-jre
```
### Download Checkpoint
You can manually download the file [here](https://drive.google.com/file/d/1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5/view?usp=sharing) (3.82GB) and move it to the `checkpoint` folder, or run:
```
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5" -O checkpoint/checkpoint_best.pth && rm -rf /tmp/cookies.txt
```
### Prepare Vicuna
- Please follow the [instructions](https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md) from FastChat to install the **vicuna-7b-v0** weights.
- **[Note]** You need to apply the delta weights. After processing, move the weights to the `vicuna_weights/vicuna-7b-v0` folder; the file list should look like [this](https://github.com/tsaishien-chen/Panda-70M/blob/main/captioning/vicuna_weights/vicuna-7b-v0/README.md).
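
Before running inference, it can help to confirm that the checkpoint and the Vicuna weights are where the steps above put them. The following is a minimal sketch, not part of the repository: `missing_paths` is a hypothetical helper, and the two paths are the folder names mentioned in this section.

```python
from pathlib import Path

# Paths mentioned in this README; adjust them if your layout differs.
REQUIRED_PATHS = [
    "checkpoint/checkpoint_best.pth",
    "vicuna_weights/vicuna-7b-v0",
]


def missing_paths(root=".", required=REQUIRED_PATHS):
    """Return the required paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]


# Run from the `captioning` directory; an empty list means both paths were found.
print(missing_paths() or "checkpoint and Vicuna weights found")
```

The check only tests for existence; it does not verify the checkpoint size or the Vicuna file list.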

## Quick Demo
```
python inference.py --video-list inputs/video_list.txt --prompt-list inputs/prompt_list.txt
```
This command captions the two test videos listed in `video_list.txt`, using the textual information in `prompt_list.txt` as extra input. Here are some output examples:
<table class="center">
<tr style="line-height: 0">
<td width=30% style="border: none; text-align: center"><b>Input Video</b></td>
<td width=50% style="border: none; text-align: center"><b>Input Text</b></td>
<td width=20% style="border: none; text-align: center"><b>Output Caption</b></td>
</tr>
<tr>
<td width=30% style="border: none"><img src="assets/video1.gif" style="width:100%"></td>
<td width=50% style="border: none; text-align: center"><sup>
Some information about a video you will get:<br>
Transcription: Today we're gonna take a quick look at the 1966 Ford Mustang GT 289 v8 under the hood.<br>
Metadata: ['Old VS New - 1966 Ford Mustang GT & 2018 Ford Mustang | Just a Quick Look', 'Lets check out this beautiful 1966 Ford Mustang GT 289 in the showroom with the 2018 Ford Mustang!']<br>
Please look at the video and faithfully summarize it in one sentence.</sup></td>
<td width=20% style="border: none; text-align: center">A red mustang parked in a showroom with american flags hanging from the ceiling.</td>
</tr>
<tr>
<td width=30% style="border: none"><img src="assets/video2.gif" style="width:100%"></td>
<td width=50% style="border: none; text-align: center">Please faithfully summarize the following video in one sentence.</td>
<td width=20% style="border: none; text-align: center">An aerial view of a city with a river running through it.</td>
</tr>
</table>

<sup>**We will remove the video samples from our dataset / GitHub / project webpage upon request. Please contact tsaishienchen at gmail dot com to make a request.</sup>

- **[Note]** You might get different outputs because LLM generation is stochastic.
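
The two input files use different separators: `inference.py` reads `video_list.txt` one path per line and splits `prompt_list.txt` on blank lines, so a single prompt may span several lines. Below is a minimal sketch of that parsing with an added length check. `parse_inputs` is a hypothetical helper, not a function in the repository, and unlike the script it strips trailing newlines before splitting.

```python
DEFAULT_PROMPT = "Please faithfully summarize the following video in one sentence."


def parse_inputs(video_text, prompt_text=None):
    """Pair each video path with its prompt (hypothetical helper)."""
    videos = video_text.splitlines()          # one video path per line
    if prompt_text is None:
        return [(v, DEFAULT_PROMPT) for v in videos]
    # Prompts are separated by one blank line; a prompt may be multi-line.
    prompts = prompt_text.strip("\n").split("\n\n")
    if len(prompts) != len(videos):
        raise ValueError(f"{len(videos)} videos but {len(prompts)} prompts")
    return list(zip(videos, prompts))


video_text = "inputs/video1.mp4\ninputs/video2.mp4\n"
prompt_text = (
    "Some information about a video you will get:\n"
    "Transcription: Today we're gonna take a quick look at the 1966 Ford Mustang GT 289 v8 under the hood.\n"
    "Please look at the video and faithfully summarize it in one sentence.\n"
    "\n"
    "Please faithfully summarize the following video in one sentence.\n"
)
for video, prompt in parse_inputs(video_text, prompt_text):
    print(video, "->", prompt.splitlines()[-1])
```

The length check matters because the script indexes prompts by video position, so a missing or extra blank line shifts every later prompt onto the wrong video.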

## Evaluation
### Zero-shot Captioning Performance
| Dataset    | BLEU-4 | ROUGE-L | METEOR | CIDEr | BERTScore |
|------------|--------|---------|--------|-------|-----------|
| **MSRVTT** | 25.4%  | 50.1%   | 27.7%  | 31.5% | 87.9%     |
| **MSVD**   | 32.8%  | 61.2%   | 35.3%  | 49.2% | 90.2%     |

- **[Note]** The results might not be perfectly reproduced because LLM generation is stochastic; they may deviate by ±0.5%.

### Prepare Testing Data
- You can download the video samples here [[MSRVTT](https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip) / [MSVD](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/)] and move them to the `test_datasets/video_samples/MSRVTT` or `test_datasets/video_samples/MSVD` folder.
- The caption annotations of the testing samples are already saved in the `test_datasets/anno_downstream` folder.

### Run Evaluation
```
# MSRVTT
python inference.py --video-list test_datasets/video_list/msrvtt_test.txt --output-json msrvtt_caption.json
python compute_results.py --predict-json msrvtt_caption.json --target-json test_datasets/anno_downstream/msrvtt_caption_test.json
# MSVD
python inference.py --video-list test_datasets/video_list/msvd_test.txt --output-json msvd_caption.json
python compute_results.py --predict-json msvd_caption.json --target-json test_datasets/anno_downstream/msvd_caption_test.json
```
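
`compute_results.py` looks up every ground-truth video in the prediction file, so a video that was skipped during inference would raise a `KeyError`. The sketch below checks coverage before scoring. It assumes the shapes that script reads (predictions as a `{video: caption}` dict, targets as a list of `{"video": ..., "caption": [...]}` records); `missing_predictions` is a hypothetical helper and the toy file names and captions are made up.

```python
import json


def missing_predictions(predictions, targets):
    """Return the ground-truth videos that have no predicted caption."""
    return [item["video"] for item in targets if item["video"] not in predictions]


# Toy data in the assumed shapes; real files would be read with json.load.
predictions = {"video0.mp4": "A man is driving a car."}
targets = [
    {"video": "video0.mp4", "caption": ["a man drives a car", "a person is driving"]},
    {"video": "video1.mp4", "caption": ["a dog runs on the beach"]},
]
print(json.dumps(missing_predictions(predictions, targets)))  # prints ["video1.mp4"]
```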

## Acknowledgements
The code for video captioning is built upon [Video-LLaMA](https://github.com/DAMO-NLP-SG/Video-LLaMA).
Thanks for sharing the great work!
@@ -0,0 +1 @@
Put the model checkpoint here
@@ -0,0 +1,56 @@
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from bert_score import score as bert_score_compute
from tqdm import tqdm
from collections import defaultdict
import pandas as pd
import argparse
import json


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Evaluation")
    parser.add_argument("--predict-json", required=True, help="prediction json file.")
    parser.add_argument("--target-json", required=True, help="ground truth json file.")
    args = parser.parse_args()

    # predictions: {video: caption}; ground truth: [{"video": ..., "caption": [...]}, ...]
    # (named `pred` so it does not shadow the pandas alias `pd`)
    with open(args.predict_json) as f:
        pred = json.load(f)
    with open(args.target_json) as f:
        gt = json.load(f)
    pds = defaultdict(list)
    gts = defaultdict(list)
    pds_all = []
    gts_all = []

    for i, data in enumerate(gt):
        video, captions = data["video"], data["caption"]
        pds[i].append({"image_id":video, "caption":pred[video]})
        # repeat the prediction once per reference so the flat lists stay aligned
        pds_all += ([pred[video]]*len(captions))

        for caption in captions:
            gts[i].append({"image_id":video, "caption":caption})
        gts_all += captions

    tokenizer = PTBTokenizer()
    pds = tokenizer.tokenize(pds)
    gts = tokenizer.tokenize(gts)
    scorers = [(Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
               (Meteor(),"METEOR"),
               (Rouge(), "ROUGE_L"),
               (Cider(), "CIDEr")]

    eval_dict = {}
    for scorer, method in scorers:
        score, scores = scorer.compute_score(gts, pds)
        if scorer.method() == "Bleu":
            # Bleu(4) returns BLEU-1..BLEU-4; keep only BLEU-4
            eval_dict["BLEU4"] = score[3]
        else:
            eval_dict[scorer.method()] = score

    # bert_score returns (precision, recall, F1); report the mean F1
    _, _, score = bert_score_compute(pds_all, gts_all, lang='en', verbose=False)
    eval_dict["BERTScore"] = score.mean().item()

    for k, v in eval_dict.items():
        print("%s: %.2f%%"%(k, v*100))
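
The script builds two views of the same data: per-video groups for the pycocoevalcap scorers, and flat candidate/reference lists for BERTScore in which each prediction is repeated once per reference caption. Here is a dependency-free sketch of that pairing on toy data. `build_pairs` is a hypothetical helper, and the video names and captions are made up for illustration.

```python
from collections import defaultdict


def build_pairs(predictions, targets):
    """Group captions per video and flatten them into aligned lists."""
    pds, gts = defaultdict(list), defaultdict(list)
    pds_all, gts_all = [], []
    for i, item in enumerate(targets):
        video, captions = item["video"], item["caption"]
        pds[i].append({"image_id": video, "caption": predictions[video]})
        gts[i].extend({"image_id": video, "caption": c} for c in captions)
        # One (candidate, reference) pair per reference caption.
        pds_all += [predictions[video]] * len(captions)
        gts_all += captions
    return pds, gts, pds_all, gts_all


toy_predictions = {"a.mp4": "A cat sleeps.", "b.mp4": "A car drives by."}
toy_targets = [
    {"video": "a.mp4", "caption": ["a cat is sleeping", "a cat naps on a sofa"]},
    {"video": "b.mp4", "caption": ["a car passes"]},
]
pds, gts, pds_all, gts_all = build_pairs(toy_predictions, toy_targets)
print(len(pds), len(pds_all), len(gts_all))  # prints 2 3 3
```

Keeping the flat lists the same length is what lets BERTScore compare them element by element.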
@@ -0,0 +1,39 @@
model:
  arch: video_llama
  model_type: pretrain_vicuna
  input_prompt: True
  ckpt: "checkpoint/checkpoint_best.pth"

  # Q-Former
  num_query_token: 32

  # Vicuna
  llama_model: "vicuna_weights/vicuna-7b-v0"

  # Branch
  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"
  num_video_query_token: 32
  num_text_query_token: 32
  input_vid2tex_query_embed: True
  detach_video_query_embed: True

  max_caption_len: 48
  max_prompt_len: 200
  start_sym: "<s>"
  end_sym: "</s>"

datasets:
  hdvila:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain
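
`inference.py` asserts that every processed clip has shape `[3, 8, 224, 224]`, which lines up with the `n_frms` and `image_size` values in this config. The sketch below spells out that relationship. It assumes a channels-first `(C, T, H, W)` layout with 3 colour channels; `expected_clip_shape` is a hypothetical helper, and the dict simply restates the YAML values.

```python
def expected_clip_shape(vis_cfg, channels=3):
    """Shape of one processed clip, assuming a (C, T, H, W) layout."""
    size = vis_cfg["image_size"]
    return (channels, vis_cfg["n_frms"], size, size)


# Values copied from the vis_processor section of the config above.
vis_cfg = {"name": "alpro_video_eval", "n_frms": 8, "image_size": 224}
print(expected_clip_shape(vis_cfg))  # prints (3, 8, 224, 224)
```

If `n_frms` or `image_size` is changed in the config, the hard-coded values in `inference.py` would presumably need to change with it.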
@@ -0,0 +1,82 @@
import glob
import argparse
import torch
import json
import os
from video_llama.common.config import Config
from video_llama.common.registry import registry
from video_llama.processors.video_processor import load_video
from transformers import StoppingCriteria, StoppingCriteriaList
from tqdm import tqdm


class DotDict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Inference")
    parser.add_argument("--cfg-path", default="eval_configs/panda70M_eval.yaml", help="path to configuration file.")
    parser.add_argument("--video-list", required=True, help="list of input videos.")
    parser.add_argument("--output-json", default=None, help="output json file. Leave none to print out the results.")
    parser.add_argument("--prompt-list", default=None, help="list of corresponding input prompts. Leave none if no prompt input.")
    args = parser.parse_args()
    cfg = Config(args)

    model_config = cfg.model_cfg
    model_cls = registry.get_model_class(model_config.arch)
    model = model_cls.from_config(model_config).to("cuda")
    model.eval()

    vis_processor_cfg = DotDict({"name":"alpro_video_eval", "n_frms":8, "image_size":224})
    vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)
    text_processor_cfg = DotDict({"name":"blip_caption", "max_words":100})
    text_processor = registry.get_processor_class(text_processor_cfg.name).from_config(text_processor_cfg)

    batch_size = 16

    # one video path per line; prompts are separated by a blank line
    with open(args.video_list, "r") as f:
        videos = f.read().splitlines()
    if args.prompt_list:
        with open(args.prompt_list, "r") as f:
            prompts = f.read().split("\n\n")

    results = {}
    for i in tqdm(range(0, len(videos), batch_size)):
        video_batch = []
        video_path_batch = []
        prompt_batch = []

        for j in range(i, min(i+batch_size, len(videos))):
            try:
                video_path = videos[j]
                video = load_video(video_path=video_path, n_frms=8, sampling ="uniform")
                video = vis_processor.transform(video)
                assert video.shape == torch.Size([3, 8, 224, 224])
            except Exception as e:
                # skip videos that fail to load or preprocess
                print(e)
                continue

            video_batch.append(video)
            video_path_batch.append(video_path.split('/')[-1])
            prompt_batch.append(prompts[j] if args.prompt_list else "Please faithfully summarize the following video in one sentence.")

        # nothing to caption if every video in this batch failed
        if not video_batch:
            continue

        video_batch = torch.stack(video_batch).to("cuda")
        outputs = model.inference(video_batch, prompt_batch)

        # iterate the three lists together so each caption is shown with its own prompt
        for video_path, prompt, output in zip(video_path_batch, prompt_batch, outputs):
            output = output.capitalize()+"."
            if args.output_json:
                results[video_path] = output
            else:
                print("====="*20)
                print("[Input video]", video_path)
                print("[Input prompt]")
                print(prompt)
                print("[Output caption]", output)

    if args.output_json:
        results = json.dumps(results, indent = 4)
        with open(args.output_json, "w") as f:
            f.write(results)
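
Because a video that fails to load is skipped, the paths, prompts, and outputs of a batch have to stay aligned by position within the batch rather than by the original index. Below is a self-contained sketch of that batching pattern. `run_batches`, `fake_load`, and `fake_caption` are stand-ins for illustration, not the repository's `load_video` or `model.inference`.

```python
def run_batches(videos, prompts, load, caption_batch, batch_size=16):
    """Caption videos in batches, skipping any that fail to load."""
    results = {}
    for start in range(0, len(videos), batch_size):
        clips, names, batch_prompts = [], [], []
        for idx in range(start, min(start + batch_size, len(videos))):
            try:
                clip = load(videos[idx])
            except Exception as err:
                print(err)
                continue
            # Append to all three lists together so positions stay aligned.
            clips.append(clip)
            names.append(videos[idx].split("/")[-1])
            batch_prompts.append(prompts[idx])
        if not clips:  # every video in this batch failed
            continue
        outputs = caption_batch(clips, batch_prompts)
        for name, output in zip(names, outputs):
            results[name] = output.capitalize() + "."
    return results


def fake_load(path):
    """Stand-in loader: fails for any path containing 'bad'."""
    if "bad" in path:
        raise IOError(f"cannot read {path}")
    return path


def fake_caption(clips, prompts):
    """Stand-in captioner: returns one string per clip."""
    return [f"caption for {clip}" for clip in clips]


videos = ["a/v1.mp4", "a/bad.mp4", "a/v2.mp4"]
prompts = ["p1", "p2", "p3"]
print(run_batches(videos, prompts, fake_load, fake_caption, batch_size=2))
```

As in the script, results are keyed by file name only, so two videos with the same name in different folders would overwrite each other.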
@@ -0,0 +1,6 @@
Some information about a video you will get:
Transcription: Today we're gonna take a quick look at the 1966 Ford Mustang GT 289 v8 under the hood.
Metadata: ['Old VS New - 1966 Ford Mustang GT & 2018 Ford Mustang | Just a Quick Look', 'Lets check out this beautiful 1966 Ford Mustang GT 289 in the showroom with the 2018 Ford Mustang!']
Please look at the video and faithfully summarize it in one sentence.

Please faithfully summarize the following video in one sentence.
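
The first prompt above follows a fixed layout: a header line, the transcription, the metadata written as a Python-style list, and the instruction. Below is a hedged sketch of assembling such a prompt. `build_prompt` is a hypothetical helper, and the source does not say how sensitive the model is to the exact wording.

```python
def build_prompt(transcription=None, metadata=None):
    """Assemble a prompt in the layout of the sample prompt list (hypothetical helper)."""
    if not transcription and not metadata:
        return "Please faithfully summarize the following video in one sentence."
    lines = ["Some information about a video you will get:"]
    if transcription:
        lines.append(f"Transcription: {transcription}")
    if metadata:
        # Rendered with Python's list repr, as in the sample prompt.
        lines.append(f"Metadata: {list(metadata)}")
    lines.append("Please look at the video and faithfully summarize it in one sentence.")
    return "\n".join(lines)


print(build_prompt("Today we look at a classic car.", ["Old VS New", "A quick look"]))
```

The result contains no blank lines, which matters because prompts in the list file are separated by blank lines.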
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,2 @@
inputs/video1.mp4
inputs/video2.mp4