The main model is composed of a pretrained convolutional encoder to extract audio features and a transformer decoder to generate captions. For more information, please refer to the corresponding DCASE task page.
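As a rough schematic (using generic `torch.nn` modules, not the baseline's actual classes, and with illustrative sizes), the encoder turns the waveform into a sequence of frame embeddings, and the decoder attends to them to predict caption tokens:

```python
import torch
from torch import nn

d_model, vocab_size = 256, 551  # d_model matches the baseline; other sizes are made up

frame_embs = torch.rand(1, 31, d_model)             # encoder output: (batch, frames, d_model)
prev_tokens = torch.randint(0, vocab_size, (1, 8))  # caption tokens generated so far

# The decoder cross-attends to the audio frames to score the next token
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=6,
)
token_embs = nn.Embedding(vocab_size, d_model)(prev_tokens)
logits = nn.Linear(d_model, vocab_size)(decoder(token_embs, frame_embs))
print(logits.shape)  # (1, 8, 551): next-token scores at each position
```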
This repository includes:
- An AAC model trained on the Clotho dataset
- Audio feature extraction using ConvNeXt
- A system reaching a 29.6% SPIDEr-FL score on Clotho-eval (development-testing)
- Detailed training characteristics (number of parameters, MACs, energy consumption...)
First, you need to create an environment with Python >= 3.11 and pip. You can use venv, conda, micromamba or another Python environment tool.
Here is an example with micromamba:

```bash
micromamba env create -n env_dcase24 python=3.11 pip -c defaults
micromamba activate env_dcase24
```
Then, you can clone this repository and install it:

```bash
git clone https://github.com/Labbeti/dcase2024-task6-baseline
cd dcase2024-task6-baseline
pip install -e .
pre-commit install
```
You also need to install Java >= 1.8 and <= 1.13 on your machine to compute AAC metrics. If needed, you can override the Java executable path with the environment variable `AAC_METRICS_JAVA_PATH`.
To download, extract and process the data, run:

```bash
dcase24t6-prepare
```

By default, the dataset is stored in the `./data` directory and requires approximately 33 GB of disk space.
To train the baseline model, run:

```bash
dcase24t6-train +expt=baseline
```
By default, the model and results are saved in the directory `./logs/SAVE_NAME`, where `SAVE_NAME` is the name of the script followed by the start date. Metrics are computed at the end of training with the best checkpoint.
To re-run evaluation on a trained model, resume from its save directory:

```bash
dcase24t6-test resume=./logs/SAVE_NAME
```
or specify each path separately:

```bash
dcase24t6-test resume=null model.checkpoint_path=./logs/SAVE_NAME/checkpoints/MODEL.ckpt tokenizer.path=./logs/SAVE_NAME/tokenizer.json
```
You need to replace `SAVE_NAME` with the save directory name and `MODEL` with the checkpoint filename.
If you want to load and test the baseline pretrained weights, point `resume` to the downloaded baseline checkpoint:

```bash
dcase24t6-test resume=~/.cache/torch/hub/checkpoints/dcase2024-task6-baseline
```
If you want to test the baseline model on a single file, you can use the `baseline_pipeline` function:
```python
import torch

from dcase24t6.nn.hub import baseline_pipeline

# 15 seconds of random noise at the Clotho sample rate, as a stand-in for a real recording
sr = 44100
audio = torch.rand(1, sr * 15)

# Downloads (if needed) and assembles the pretrained encoder, decoder and tokenizer
model = baseline_pipeline()

item = {"audio": audio, "sr": sr}
outputs = model(item)
candidate = outputs["candidates"][0]
print(candidate)
```
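To caption a real recording instead of random noise, you can load a file first, reusing the `model` from the snippet above. This assumes `torchaudio` is installed and that `example.wav` is a file you provide:

```python
import torchaudio

# Load a waveform as (channels, samples) together with its sample rate
audio, sr = torchaudio.load("example.wav")

outputs = model({"audio": audio, "sr": sr})
print(outputs["candidates"][0])
```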
The source code extensively uses PyTorch Lightning for training and Hydra for configuration. It is highly recommended to learn about them if you want to understand this code.
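If Hydra is new to you, the gist of its configuration handling can be seen with OmegaConf, the library it is built on. This self-contained snippet roughly mimics how a dotted CLI override such as `model.lr=1e-4` is merged into a base config; the keys here are illustrative:

```python
from omegaconf import OmegaConf

# A base config, standing in for the YAML files under src/conf
base = OmegaConf.create({"model": {"lr": 5e-4, "d_model": 256}})

# Hydra merges dotted CLI overrides in roughly this way
cfg = OmegaConf.merge(base, OmegaConf.from_dotlist(["model.lr=1e-4"]))
print(cfg.model.lr)  # 0.0001
```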
Data preparation (`dcase24t6-prepare`) has three main steps:
- Download external models (ConvNeXt, used to extract audio features)
- Download the Clotho dataset using `aac-datasets`
- Create HDF files containing each Clotho subset with preprocessed audio features using `torchoutil` (see the inspection sketch after this list)
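Since the HDF files are regular HDF5, you can check what was produced with `h5py`. The path below is an assumption; list your `./data` directory to find the actual filenames:

```python
import h5py

# Hypothetical filename; adapt it to the HDF files created by dcase24t6-prepare
with h5py.File("data/HDF/clotho_dev.hdf", "r") as f:
    for name, obj in f.items():
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
```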
Training follows the standard way to create a model with Lightning (a toy version of this loop is sketched below):
- Initialize the callbacks, tokenizer, datamodule and model.
- Fit the model on the specified datamodule.
- Evaluate the model using `aac-metrics`.
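If Lightning is unfamiliar, the loop above boils down to something like this self-contained toy example, with a dummy model and dataset rather than the baseline's actual classes (the optimizer settings only mirror the hyperparameter table below, for illustration):

```python
import torch
import lightning as L
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class ToyModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        # lr and weight decay taken from the hyperparameter table, purely as an example
        return torch.optim.AdamW(self.parameters(), lr=5e-4, weight_decay=2.0)

# 1. initialize, 2. fit; the baseline does the same with its own classes,
# then evaluates the best checkpoint with aac-metrics
model = ToyModule()
loader = DataLoader(TensorDataset(torch.rand(64, 4), torch.rand(64, 1)), batch_size=16)
trainer = L.Trainer(max_epochs=1, accumulate_grad_batches=8, gradient_clip_val=1)
trainer.fit(model, loader)
```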
The model outperforms previous baselines with a SPIDEr-FL score of 29.6% on the Clotho evaluation subset. The captioning model architecture is described in this paper and is called CNext-trans. The encoder part (ConvNeXt) is described in more detail in this paper.
The pretrained weights of the AAC model are available on Zenodo: the ConvNeXt encoder (BL_AC) and the Transformer decoder. Both are automatically downloaded during `dcase24t6-prepare`.
| Hyperparameter | Value | Option |
|---|---|---|
| Number of epochs | 400 | `trainer.max_epochs` |
| Batch size | 64 | `datamodule.batch_size` |
| Gradient accumulation | 8 | `trainer.accumulate_grad_batches` |
| Learning rate | 5e-4 | `model.lr` |
| Weight decay | 2 | `model.weight_decay` |
| Gradient clipping | 1 | `trainer.gradient_clip_val` |
| Beam size | 3 | `model.beam_size` |
| Model dimension size | 256 | `model.d_model` |
| Label smoothing | 0.2 | `model.label_smoothing` |
| Mixup alpha | 0.4 | `model.mixup_alpha` |
| Metric | Score on Clotho-eval |
|---|---|
| BLEU-1 | 0.5948 |
| BLEU-2 | 0.3924 |
| BLEU-3 | 0.2603 |
| BLEU-4 | 0.1695 |
| METEOR | 0.1897 |
| ROUGE-L | 0.3927 |
| CIDEr-D | 0.4619 |
| SPICE | 0.1335 |
| SPIDEr | 0.2977 |
| SPIDEr-FL | 0.2962 |
| SBERT-sim | 0.5059 |
| FER | 0.0038 |
| FENSE | 0.5040 |
| BERTScore | 0.9766 |
| Vocabulary (words) | 551 |
Here is also an estimate of the number of parameters and multiply-accumulate operations (MACs) during inference for the audio file "Santa Motor.wav":
| Name | Params (M) | MACs (G) |
|---|---|---|
| Encoder | 29.4 | 44.4 |
| Decoder | 11.9 | 4.3 |
| Total | 41.3 | 48.8 |
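The parameter column can be reproduced for any PyTorch module with a short helper; MACs require a dedicated profiler, which is not shown here:

```python
from torch import nn

def count_params(module: nn.Module) -> float:
    # Total number of parameters, in millions to match the table above
    return sum(p.numel() for p in module.parameters()) / 1e6

# Generic usage example on a plain torch module
print(count_params(nn.Linear(256, 551)))
```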
- Modify the model. The model class is located in `src/dcase24t6/models/trans_decoder.py`. It is recommended to create another class and conf to keep different model architectures separate. The loss is computed in the method called `training_step`. You can also modify the model architecture in the method called `setup`.
- Extract different audio features. For that, you can add a new pre-process function in `src/dcase24t6/pre_processes` and the related conf in `src/conf/pre_process` (a hypothetical sketch follows this list). Then, re-run `dcase24t6-prepare pre_process=YOUR_PROCESS download_clotho=false` to create new HDF files with your own features. To train a new model on these features, specify the required HDF files with `dcase24t6-train datamodule.train_hdfs=clotho_dev_YOUR_PROCESS.hdf datamodule.val_hdfs=... datamodule.test_hdfs=... datamodule.predict_hdfs=...`. Depending on the features extracted, some model parameters may need to be modified to handle them.
- Use it as a package. If you do not want to use the entire codebase but only parts of it, you can install it as a package using `pip install git+https://github.com/Labbeti/dcase2024-task6-baseline`. Then you will be able to import any object from the code, for example `from dcase24t6.models.trans_decoder import TransDecoderModel`. There are also several important dependencies that you can install separately: `aac-datasets` to download and load AAC datasets, `aac-metrics` to compute AAC metrics, and `torchoutil[extras]` to pack datasets into HDF files.
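As a purely hypothetical example of a custom pre-process, the function below computes log-mel features with `torchaudio`. The exact signature and output keys expected by `src/dcase24t6/pre_processes` may differ, so check an existing pre-process before adapting it:

```python
import torchaudio

def mel_pre_process(item: dict) -> dict:
    # Hypothetical pre-process: 64-band log-mel features from the {"audio", "sr"}
    # item convention used elsewhere in this README; the output key name is assumed
    melspec = torchaudio.transforms.MelSpectrogram(sample_rate=item["sr"], n_mels=64)
    features = melspec(item["audio"]).clamp(min=1e-5).log()
    return {**item, "features": features}
```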
- The code has been made for Ubuntu 20.04 and should work on more recent Ubuntu versions and other Linux-based distributions.
- The GPU used is an NVIDIA GeForce RTX 2080 Ti (11 GB VRAM). Training lasts approximately 2h30m in the default setting.
- In this code, Clotho subsets are named according to the Clotho convention, not the DCASE convention. See more information on this page.
- DCASE2023 Audio Captioning baseline
- DCASE2022 Audio Captioning baseline
- DCASE2021 Audio Captioning baseline
- DCASE2020 Audio Captioning baseline
- aac-datasets
- aac-metrics
Maintainer:
- Étienne Labbé "Labbeti": labbeti.pub@gmail.com