**README.md**

*WORK IN PROGRESS ...*

The implementation of paper [**UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation**](https://arxiv.org/abs/2002.06353).

UniVL is a **video-language pretraining model**. It is designed with four modules and five objectives for both video-language understanding and generation tasks. It is also a flexible model for most multimodal downstream tasks, considering both efficiency and effectiveness.

# Preliminary
Execute the scripts below in the main folder first. This avoids *download conflicts* during distributed pretraining.
```
mkdir modules/bert-base-uncased
cd modules/bert-base-uncased/
...
pip install torch==1.7.1+cu92
pip install git+https://github.com/Maluuba/nlg-eval.git@master
```

# Pretrained Weight
```
mkdir -p ./weight
wget -P ./weight [TBD]
```

# Prepare for Evaluation
Download the data for the retrieval and caption (video-only input) tasks on YoucookII and MSRVTT.
## YoucookII
```
mkdir -p data
cd data
wget [TBD]
unzip youcookii.zip
cd ..
```
Note: you can find `youcookii_data.no_transcript.pickle` in the zip file, which is a version without transcripts. The transcript version will not be made publicly available due to possible legal issues. Thus, you need to replace `youcookii_data.pickle` with `youcookii_data.no_transcript.pickle` for the YoucookII retrieval task and the *caption with only video input* task. The S3D features can be found in `youcookii_videos_features.pickle`; they are extracted as one 1024-dimensional vector per second. More details can be found in [dataloaders](./dataloaders/README.md) and our paper.
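As a quick sanity check after unzipping, a sketch like the one below can verify the feature shape. The extraction path and the pickle layout (a dict mapping each video id to a `[seconds, 1024]` array) are assumptions based on the description above, not a documented API:
```
import pickle

import numpy as np

# Assumed path: adjust to wherever youcookii.zip was extracted.
with open("data/youcookii/youcookii_videos_features.pickle", "rb") as f:
    features = pickle.load(f)

# Inspect one entry; each video should yield one 1024-dim S3D vector per second.
video_id, feats = next(iter(features.items()))
print(video_id, np.asarray(feats).shape)  # expected: (num_seconds, 1024)
```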

## MSRVTT
```
mkdir -p data
cd data
wget [TBD]
unzip msrvtt.zip
cd ..
```

# Finetune on YoucookII and MSRVTT
## Retrieval

1. Run retrieval task on **YoucookII**
```
...
main_task_retrieval.py \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_youcook_retrieval --bert_model bert-base-uncased \
--do_lower_case --lr 3e-5 --max_words 48 --max_frames 48 \
--batch_size_val 64 --visual_num_hidden_layers 6 \
--datatype ${DATATYPE} --init_model ${INIT_MODEL}
```
The results (FT-Joint) are close to `R@1: 0.2269 - R@5: 0.5245 - R@10: 0.6586 - Median R: 5.0`

Add `--train_sim_after_cross` to train with the alignment approach (FT-Align).

The results (FT-Align) are close to `R@1: 0.2890 - R@5: 0.5760 - R@10: 0.7000 - Median R: 4.0`

2. Run retrieval task on **MSRVTT**
```
...
main_task_retrieval.py \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_retrieval --bert_model bert-base-uncased \
--do_lower_case --lr 5e-5 --max_words 48 --max_frames 48 \
--batch_size_val 64 --visual_num_hidden_layers 6 \
--datatype ${DATATYPE} --expand_msrvtt_sentences --init_model ${INIT_MODEL}
```
The results (FT-Joint) are close to
`R@1: 0.2720 - R@5: 0.5570 - R@10: 0.6870 - Median R: 4.0`

Add `--train_sim_after_cross` to train with the alignment approach (FT-Align).

## Caption
Run caption task on **YoucookII**

```
...
```
>The results are close to
```
BLEU_1: 0.4746, BLEU_2: 0.3355, BLEU_3: 0.2423, BLEU_4: 0.1779
METEOR: 0.2261, ROUGE_L: 0.4697, CIDEr: 1.8631
```

If using only video as input (`youcookii_data.no_transcript.pickle`),
>The results are close to
```
BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117
METEOR: 0.1769, ROUGE_L: 0.4049, CIDEr: 1.2725
```

Run caption task on **MSRVTT**

```
...
```

...

# License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

# Acknowledgments
Our code is based on [pytorch-transformers v0.4.0](https://github.com/huggingface/transformers/tree/v0.4.0) and [howto100m](https://github.com/antoine77340/howto100m). We thank the authors for their wonderful open-source efforts.

---

**dataloaders/README.md**
Data loaders for pretraining and downstream tasks (retrieval and caption).

## Preprocess on HowTo100M

For pretraining, you need to prepare three parts:

### 1. S3D features pretrained on HowTo100M

Download raw videos from the [HowTo100M webpage](https://www.di.ens.fr/willow/research/howto100m/) and extract [S3D (HowTo100M)](https://github.com/antoine77340/S3D_HowTo100M) features. You can refer to [VideoFeatureExtractor](https://github.com/ArrowLuo/VideoFeatureExtractor).

### 2. HowTo100M.csv
Note: this file is different from `HowTo100M_v1.csv` described in [README.txt](https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/README.txt).

The CSV has two columns: the first is the video id, and the second is the feature file, a sub-path that is appended to `--features_path` (see the pretrain part of the [README](../README.md)) to locate the `.npy` file when reading.

```
video_id,feature_file
Z8xhli297v8,Z8xhli297v8.npy
...
```
- `video_id`: used to match the caption or transcript
- `feature_file`: used to locate the feature file after joining with `--features_path`
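A dataloader can then resolve each feature roughly as in the sketch below (an illustration only, not the project's actual loader code; the `features_path` value is a made-up example):
```
import csv
import os

import numpy as np

features_path = "howto100m_features"  # whatever you pass as --features_path (example)

with open("HowTo100M.csv", newline="") as f:
    for row in csv.DictReader(f):
        # feature_file is a sub-path appended to --features_path.
        feats = np.load(os.path.join(features_path, row["feature_file"]))
        video_id = row["video_id"]  # used to match the caption or transcript
        break  # just sanity-check the first row
```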

### 3. caption.pickle
This pickle file is generated from `raw_caption.json` in `raw_caption.zip`, introduced in [README.txt](https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/README.txt).

The format of this file is:
```
{
'video_id 1':{
'start': array([0.08, 7.37, 15.05, ...], dtype=object),
'end': array([9.96, 16.98, 27.9, ...], dtype=object),
'text': array(['sentence 1 placeholder',
'sentence 2 placeholder',
'sentence 3 placeholder', ...], dtype=object)
},
...
}
```
Keep `start` as a sorted array.
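A minimal sketch for building `caption.pickle` from `raw_caption.json` could look like the following. It assumes `raw_caption.json` maps each video id to plain `start`/`end`/`text` lists; verify that structure against your copy of the file:
```
import json
import pickle

import numpy as np

with open("raw_caption.json") as f:
    raw = json.load(f)

data = {}
for video_id, cap in raw.items():
    # Sort segments by start time so that `start` stays a sorted array.
    order = sorted(range(len(cap["start"])), key=lambda i: cap["start"][i])
    data[video_id] = {
        "start": np.array([cap["start"][i] for i in order], dtype=object),
        "end": np.array([cap["end"][i] for i in order], dtype=object),
        "text": np.array([cap["text"][i] for i in order], dtype=object),
    }

with open("caption.pickle", "wb") as f:
    pickle.dump(data, f)
```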


## Preprocess on YoucookII
The S3D feature extraction is the same as for HowTo100M, introduced above.

### Generate youcookii_data.pickle
This file is generated from `youcookii_annotations_trainval.json`, which can be downloaded from [official webpage](http://youcook2.eecs.umich.edu/download).

The format of this file is similar to `caption.pickle` introduced above, but with one more key, `transcript`, which needs to be generated from speech by an external ASR tool:
```
{
'video_id 1':{
'start': array([0.08, 7.37, 15.05, ...], dtype=object),
'end': array([9.96, 16.98, 27.9, ...], dtype=object),
'text': array(['sentence 1 placeholder',
'sentence 2 placeholder',
'sentence 3 placeholder', ...], dtype=object),
'transcript': array(['transcript 1 placeholder',
'transcript 2 placeholder',
'transcript 3 placeholder', ...], dtype=object)
},
...
}
```
If you want to test the retrieval or caption tasks without transcripts, set `transcript` to `array(['NONE', 'NONE', 'NONE', ...], dtype=object)`.
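For instance, a small script along these lines can produce the transcript-free variant from a transcript version (the file locations are assumptions):
```
import pickle

import numpy as np

with open("youcookii_data.pickle", "rb") as f:
    data = pickle.load(f)

for video_id, item in data.items():
    # Blank out the transcript for the w/o-transcript retrieval and caption settings.
    item["transcript"] = np.array(["NONE"] * len(item["text"]), dtype=object)

with open("youcookii_data.no_transcript.pickle", "wb") as f:
    pickle.dump(data, f)
```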

## Format of csv
```
video_id,feature_file
Z8xhli297v8,Z8xhli297v8
...
```
Note: `video_id` and `feature_file` are the same here for consistency and historical compatibility. We use `feature_file` to look up the feature in the feature pickle.
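Writing such a CSV from a data pickle is then straightforward, as in this sketch (the file names are hypothetical):
```
import csv
import pickle

with open("youcookii_data.pickle", "rb") as f:
    data = pickle.load(f)

with open("youcookii_train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_id", "feature_file"])
    for video_id in data:
        # video_id and feature_file are identical by convention (see note above).
        writer.writerow([video_id, video_id])
```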

## Preprocess on MSRVTT
The S3D feature extraction is the same as for HowTo100M, introduced above.
The data can be downloaded at: [TBD]
