This is the official repository of the NeurIPS 2023 paper VALOR.
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Yung-Hsuan Lai, Yen-Chun Chen, Yu-Chiang Frank Wang
- Ubuntu version: 20.04.6 LTS
- CUDA version: 11.4
- Testing GPU: NVIDIA GeForce RTX 3090
A conda environment named valor can be created and activated with:
conda env create -f environment.yaml
conda activate valor
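After activating the environment, you can quickly confirm it is usable (a minimal sanity check, assuming the environment includes PyTorch; the second printed value should be True if CUDA is set up correctly):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"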
We recommend working directly with our pre-extracted features for simplicity. If you still wish to run feature extraction and pseudo-label generation yourself, please follow the optional instructions below.
Video downloading (optional)
One may download the videos listed in a csv file with the following command. We provide an example.csv to demonstrate the process of downloading a video.
python ./utils/download_dataset.py --videos_saved_dir ./data/raw_videos --label_all_dataset ./data/example.csv
After downloading the videos to data/raw_videos/, one should extract video frames at 8 fps from each video with the following command. We show the frame extraction for the video downloaded with the above command.
python ./utils/extract_frames.py --video_path ./data/raw_videos --out_dir ./data/video_frames
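For reference, the extraction above is conceptually equivalent to decoding each video at 8 fps, e.g. with ffmpeg. The following is a minimal standalone sketch, not necessarily what extract_frames.py does internally; it assumes ffmpeg is installed, and the input file name and output naming pattern are illustrative only:
import subprocess
from pathlib import Path
video = Path("./data/raw_videos/example.mp4")  # illustrative file name
out_dir = Path("./data/video_frames") / video.stem
out_dir.mkdir(parents=True, exist_ok=True)
# Decode the video and write frames at 8 fps as numbered JPEGs.
subprocess.run(["ffmpeg", "-i", str(video), "-vf", "fps=8", str(out_dir / "%06d.jpg")], check=True)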
However, there are some potential issues when using these downloaded videos. For more details, please see here.
Please download the LLP dataset annotations (6 csv files) from AVVP-ECCV20 and put them in data/.
Please download audio features (VGGish), 2D visual features (ResNet152), and 3D visual features (ResNet (2+1)D) from AVVP-ECCV20 and put them in data/feats/.
Option 1. Please download visual features and segment-level pseudo labels from CLIP from this Google Drive link, put them in data/, and unzip the file with the following command:
unzip CLIP.zip
Option 2. The visual features and pseudo labels can also be generated with the following command if the video frames from all videos are prepared in data/video_frames/:
python ./utils/clip_preprocess.py --print_progress
For example, the features and pseudo labels of the video downloaded above can be generated in data/example_v_feats/ and data/example_v_labels/, respectively, with the following command:
python ./utils/clip_preprocess.py --label_all_dataset ./data/example.csv --pseudo_labels_saved_dir ./data/example_v_labels --visual_feats_saved_dir ./data/example_v_feats --print_progress
Option 1. Please download audio features and segment-level pseudo labels from CLAP from this Google Drive link, put them in data/, and unzip the file with the following command:
unzip CLAP.zip
Option 2. The audio features and pseudo labels can also be generated with the following command if raw videos are prepared in data/raw_videos/:
python ./utils/clap_preprocess.py --print_progress
For example, the features and pseudo labels of the video downloaded above can be generated in data/example_a_feats/ and data/example_a_labels/, respectively, with the following command:
python ./utils/clap_preprocess.py --label_all_dataset ./data/example.csv --pseudo_labels_saved_dir ./data/example_a_labels --audio_feats_saved_dir ./data/example_a_feats --print_progress
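If you generated the example features and pseudo labels above, you can confirm that outputs were written with a small check like the one below (a minimal sketch; the exact file names and formats inside these directories are produced by the preprocessing scripts and are not assumed here):
from pathlib import Path
for d in ["./data/example_v_feats", "./data/example_v_labels",
          "./data/example_a_feats", "./data/example_a_labels"]:
    files = list(Path(d).glob("*"))
    print(f"{d}: {len(files)} file(s)")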
Please make sure that the file structure is the same as the following.
File structure
data/
├── AVVP_dataset_full.csv
├── AVVP_eval_audio.csv
├── AVVP_eval_visual.csv
├── AVVP_test_pd.csv
├── AVVP_train.csv
├── AVVP_val_pd.csv
├── feats/
│   ├── r2plus1d_18/
│   ├── res152/
│   └── vggish/
├── CLIP/
│   ├── features/
│   └── segment_pseudo_labels/
├── CLAP/
│   ├── features/
│   └── segment_pseudo_labels/
├── video_frames/ (optional)
│   ├── 00BDwKBD5i8/
│   ├── 00fs8Gpipss/
│   └── ...
└── raw_videos/ (optional)
    ├── 00BDwKBD5i8.mp4
    ├── 00fs8Gpipss.mp4
    └── ...
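A quick way to verify this layout is the short check below (a minimal sketch; it only tests that the non-optional entries from the tree above exist under data/):
from pathlib import Path
root = Path("./data")
required = [
    "AVVP_dataset_full.csv", "AVVP_eval_audio.csv", "AVVP_eval_visual.csv",
    "AVVP_test_pd.csv", "AVVP_train.csv", "AVVP_val_pd.csv",
    "feats/r2plus1d_18", "feats/res152", "feats/vggish",
    "CLIP/features", "CLIP/segment_pseudo_labels",
    "CLAP/features", "CLAP/segment_pseudo_labels",
]
missing = [p for p in required if not (root / p).exists()]
print("File structure looks complete." if not missing else f"Missing: {missing}")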
Please download the trained models from this Google Drive link and put the models in their corresponding model directories.
File structure
models/
├── model_VALOR/
│   └── checkpoint_best.pt
├── model_VALOR+/
│   └── checkpoint_best.pt
└── model_VALOR++/
    └── checkpoint_best.pt
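To verify that a downloaded checkpoint loads correctly, you can inspect it with PyTorch (a minimal sketch, assuming the .pt files were saved with torch.save; the keys inside depend on the training code and are not assumed here):
import torch
# map_location="cpu" lets this run without a GPU; newer PyTorch versions
# may additionally require weights_only=False for non-tensor objects.
ckpt = torch.load("./models/model_VALOR/checkpoint_best.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))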
We provide some sample scripts for training VALOR, VALOR+, and VALOR++.
bash scripts/train_valor.sh
bash scripts/train_valor+.sh
bash scripts/train_valor++.sh
Note: If you are evaluating the trained models generated with the training scripts, change --model_name model_VALOR to --model_name model_VALOR_reproduce here.
bash scripts/test_valor.sh
bash scripts/test_valor+.sh
bash scripts/test_valor++.sh
We build the VALOR codebase heavily on the codebases of AVVP-ECCV20 and MM-Pyramid. We sincerely thank the authors for open-sourcing their code! We also thank CLIP and CLAP for open-sourcing their pre-trained models.
If you find this code useful for your research, please consider citing:
@inproceedings{lai2023modality,
  title={Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser},
  author={Yung-Hsuan Lai and Yen-Chun Chen and Yu-Chiang Frank Wang},
  booktitle={NeurIPS},
  year={2023}
}
This project is released under the MIT License.