**README.md**

*WORK IN PROGRESS ...*

The implementation of paper [**UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation**](https://arxiv.org/abs/2002.06353).

UniVL is a **video-language pretraining model**. It is designed with four modules and five objectives for both video-language understanding and generation tasks. It is also a flexible model for most multimodal downstream tasks, considering both efficiency and effectiveness.

# Preliminary
Execute the scripts below in the main folder first. This avoids *download conflicts* during distributed pretraining.
```
mkdir modules/bert-base-uncased
cd modules/bert-base-uncased/
...
pip install torch==1.7.1+cu92
pip install git+https://github.com/Maluuba/nlg-eval.git@master
```

# Pretrained Weight
```
mkdir -p ./weight
wget -P ./weight [TBD]
```

# Prepare for Evaluation
Download the data for the retrieval and caption (video-only input) tasks on YoucookII and MSRVTT.
## YoucookII
```
mkdir -p data
cd data
wget [TBD]
unzip youcookii.zip
cd ..
```
Note: you can find `youcookii_data.no_transcript.pickle` in the zip file, which is a version without transcripts. The transcript version will not be made publicly available due to possible legal issues. Thus, you need to replace `youcookii_data.pickle` with `youcookii_data.no_transcript.pickle` for the YoucookII retrieval task and the *caption with only video input* task. The S3D features can be found in `youcookii_videos_features.pickle`; they are extracted as one 1024-dimensional vector per second. More details can be found in [dataloaders](./dataloaders/README.md) and our paper.
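As a quick sanity check after unzipping, a sketch like the one below can verify the feature shape. The extraction path and the pickle layout (a dict mapping each video id to a `[seconds, 1024]` array) are assumptions based on the description above, not a documented API:
```
import pickle

import numpy as np

# Assumed path: adjust to wherever youcookii.zip was extracted.
with open("data/youcookii/youcookii_videos_features.pickle", "rb") as f:
    features = pickle.load(f)

# Inspect one entry; each video should yield one 1024-dim S3D vector per second.
video_id, feats = next(iter(features.items()))
print(video_id, np.asarray(feats).shape)  # expected: (num_seconds, 1024)
```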

## MSRVTT
```
mkdir -p data
cd data
wget [TBD]
unzip msrvtt.zip
cd ..
```

# Finetune on YoucookII and MSRVTT
## Retrieval

1. Run retrieval task on **YoucookII**
```
...
main_task_retrieval.py \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_youcook_retrieval --bert_model bert-base-uncased \
--do_lower_case --lr 3e-5 --max_words 48 --max_frames 48 \
--batch_size_val 64 --visual_num_hidden_layers 6 \
--datatype ${DATATYPE} --init_model ${INIT_MODEL}
```
The results (FT-Joint) are close to `R@1: 0.2269 - R@5: 0.5245 - R@10: 0.6586 - Median R: 5.0`

Add `--train_sim_after_cross` to train with the alignment approach (FT-Align).

The results (FT-Align) are close to `R@1: 0.2890 - R@5: 0.5760 - R@10: 0.7000 - Median R: 4.0`

2. Run retrieval task on **MSRVTT**
```
...
main_task_retrieval.py \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_retrieval --bert_model bert-base-uncased \
--do_lower_case --lr 5e-5 --max_words 48 --max_frames 48 \
--batch_size_val 64 --visual_num_hidden_layers 6 \
--datatype ${DATATYPE} --expand_msrvtt_sentences --init_model ${INIT_MODEL}
```
The results (FT-Joint) are close to
`R@1: 0.2720 - R@5: 0.5570 - R@10: 0.6870 - Median R: 4.0`

Add `--train_sim_after_cross` to train with the alignment approach (FT-Align).

## Caption
Run caption task on **YoucookII**

```
...
```
>The results are close to
```
BLEU_1: 0.4746, BLEU_2: 0.3355, BLEU_3: 0.2423, BLEU_4: 0.1779
METEOR: 0.2261, ROUGE_L: 0.4697, CIDEr: 1.8631
```

If using only video as input (`youcookii_data.no_transcript.pickle`),
>The results are close to
```
BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117
METEOR: 0.1769, ROUGE_L: 0.4049, CIDEr: 1.2725
```

Run caption task on **MSRVTT**

```
...
```

...

# License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

# Acknowledgments
Our code is based on [pytorch-transformers v0.4.0](https://github.com/huggingface/transformers/tree/v0.4.0) and [howto100m](https://github.com/antoine77340/howto100m). We thank the authors for their wonderful open-source efforts.

---

**dataloaders/README.md**
Data loaders for pretraining and downstream tasks (retrieval and caption).

## Preprocess on HowTo100M

For pretraining, you need to prepare three parts:

### 1. S3D features pretrained on HowTo100M

Download raw videos from the [HowTo100M webpage](https://www.di.ens.fr/willow/research/howto100m/) and extract [S3D (HowTo100M)](https://github.com/antoine77340/S3D_HowTo100M) features. You can refer to [VideoFeatureExtractor](https://github.com/ArrowLuo/VideoFeatureExtractor).

### 2. HowTo100M.csv
Note: this file is different from `HowTo100M_v1.csv` described in [README.txt](https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/README.txt).

The CSV has two columns: the first is the video id, and the second is the feature file, a sub-path that is appended to `--features_path` (see the pretrain part of the [README](../README.md)) to locate the `.npy` file when reading.

```
video_id,feature_file
Z8xhli297v8,Z8xhli297v8.npy
...
```
- `video_id`: used to match the caption or transcript
- `feature_file`: used to locate the feature file after joining with `--features_path`
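A dataloader can then resolve each feature roughly as in the sketch below (an illustration only, not the project's actual loader code; the `features_path` value is a made-up example):
```
import csv
import os

import numpy as np

features_path = "howto100m_features"  # whatever you pass as --features_path (example)

with open("HowTo100M.csv", newline="") as f:
    for row in csv.DictReader(f):
        # feature_file is a sub-path appended to --features_path.
        feats = np.load(os.path.join(features_path, row["feature_file"]))
        video_id = row["video_id"]  # used to match the caption or transcript
        break  # just sanity-check the first row
```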

### 3. caption.pickle
This pickle file is generated from `raw_caption.json` in `raw_caption.zip`, introduced in [README.txt](https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/README.txt).

The format of this file is:
```
{
'video_id 1':{
'start': array([0.08, 7.37, 15.05, ...], dtype=object),
'end': array([9.96, 16.98, 27.9, ...], dtype=object),
'text': array(['sentence 1 placeholder',
'sentence 2 placeholder',
'sentence 3 placeholder', ...], dtype=object)
},
...
}
```
Keep `start` as a sorted array.
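A minimal sketch for building `caption.pickle` from `raw_caption.json` could look like the following. It assumes `raw_caption.json` maps each video id to plain `start`/`end`/`text` lists; verify that structure against your copy of the file:
```
import json
import pickle

import numpy as np

with open("raw_caption.json") as f:
    raw = json.load(f)

data = {}
for video_id, cap in raw.items():
    # Sort segments by start time so that `start` stays a sorted array.
    order = sorted(range(len(cap["start"])), key=lambda i: cap["start"][i])
    data[video_id] = {
        "start": np.array([cap["start"][i] for i in order], dtype=object),
        "end": np.array([cap["end"][i] for i in order], dtype=object),
        "text": np.array([cap["text"][i] for i in order], dtype=object),
    }

with open("caption.pickle", "wb") as f:
    pickle.dump(data, f)
```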


## Preprocess on YoucookII
The S3D feature extraction is the same as for HowTo100M, introduced above.

### Generate youcookii_data.pickle
This file is generated from `youcookii_annotations_trainval.json`, which can be downloaded from [official webpage](http://youcook2.eecs.umich.edu/download).

The format of this file is similar to `caption.pickle` introduced above, but with one more key, `transcript`, which needs to be generated from speech by an external ASR tool:
```
{
'video_id 1':{
'start': array([0.08, 7.37, 15.05, ...], dtype=object),
'end': array([9.96, 16.98, 27.9, ...], dtype=object),
'text': array(['sentence 1 placeholder',
'sentence 2 placeholder',
'sentence 3 placeholder', ...], dtype=object),
'transcript': array(['transcript 1 placeholder',
'transcript 2 placeholder',
'transcript 3 placeholder', ...], dtype=object)
},
...
}
```
If you want to test the retrieval or caption tasks without transcripts, set `transcript` to `array(['NONE', 'NONE', 'NONE', ...], dtype=object)`.
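For instance, a small script along these lines can produce the transcript-free variant from a transcript version (the file locations are assumptions):
```
import pickle

import numpy as np

with open("youcookii_data.pickle", "rb") as f:
    data = pickle.load(f)

for video_id, item in data.items():
    # Blank out the transcript for the w/o-transcript retrieval and caption settings.
    item["transcript"] = np.array(["NONE"] * len(item["text"]), dtype=object)

with open("youcookii_data.no_transcript.pickle", "wb") as f:
    pickle.dump(data, f)
```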

## Format of csv
```
video_id,feature_file
Z8xhli297v8,Z8xhli297v8
...
```
Note: `video_id` and `feature_file` are the same here for consistency and historical compatibility. We use `feature_file` to look up the feature in the feature pickle.
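Writing such a CSV from a data pickle is then straightforward, as in this sketch (the file names are hypothetical):
```
import csv
import pickle

with open("youcookii_data.pickle", "rb") as f:
    data = pickle.load(f)

with open("youcookii_train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_id", "feature_file"])
    for video_id in data:
        # video_id and feature_file are identical by convention (see note above).
        writer.writerow([video_id, video_id])
```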

## Preprocess on MSRVTT
The S3D feature extraction is the same as for HowTo100M, introduced above.
The data can be downloaded at: [TBD]
