
dcase2024-task6-baseline

DCASE2024 Challenge Task 6 baseline system for Automated Audio Captioning (AAC)


The main model is composed of a pretrained convolutional encoder to extract audio features and a transformer decoder to generate captions. For more information, please refer to the corresponding DCASE task page.

This repository includes:

  • An AAC model trained on the Clotho dataset
  • Feature extraction using ConvNeXt
  • A system reaching a 29.6% SPIDEr-FL score on Clotho-eval (development-testing)
  • Output of detailed training characteristics (number of parameters, MACs, energy consumption...)

Installation

First, you need to create an environment that contains python>=3.11 and pip. You can use venv, conda, micromamba or any other Python environment tool.

Here is an example with micromamba:

micromamba env create -n env_dcase24 python=3.11 pip -c defaults
micromamba activate env_dcase24

Then, you can clone this repository and install it:

git clone https://github.com/Labbeti/dcase2024-task6-baseline
cd dcase2024-task6-baseline
pip install -e .
pre-commit install

You also need to install Java >= 1.8 and <= 1.13 on your machine to compute AAC metrics. If needed, you can override the Java executable path with the environment variable AAC_METRICS_JAVA_PATH, for example:
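export AAC_METRICS_JAVA_PATH=/usr/lib/jvm/java-11-openjdk-amd64/bin/java

(The path above is only an example; adjust it to your Java installation.)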

Usage

Download external data and models, and prepare the dataset

To download, extract and process data, you need to run:

dcase24t6-prepare

By default, the dataset is stored in the ./data directory. It requires approximately 33 GB of disk space.
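dcase24t6-prepare is configured with Hydra, so its options can be overridden from the command line. For example, to skip re-downloading Clotho on a later run (the download_clotho option is also used in the Tips section below):

dcase24t6-prepare download_clotho=false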

Train the default model

dcase24t6-train +expt=baseline

By default, the model and results are saved in the directory ./logs/SAVE_NAME, where SAVE_NAME is the script name followed by the start date. Metrics are computed at the end of training with the best checkpoint.
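Since training is also configured with Hydra, any option can be overridden from the command line. For example, using option names listed in the Main hyperparameters table below (the values here are only illustrative):

dcase24t6-train +expt=baseline model.lr=1e-4 trainer.max_epochs=100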

Test a pretrained model

dcase24t6-test resume=./logs/SAVE_NAME

or specify each path separately:

dcase24t6-test resume=null model.checkpoint_path=./logs/SAVE_NAME/checkpoints/MODEL.ckpt tokenizer.path=./logs/SAVE_NAME/tokenizer.json

You need to replace SAVE_NAME with the save directory name and MODEL with the checkpoint filename.

If you want to load and test the pretrained baseline weights, you can specify the baseline checkpoint path:

dcase24t6-test resume=~/.cache/torch/hub/checkpoints/dcase2024-task6-baseline

Inference on a file

If you want to test the baseline model on a single file, you can use the baseline_pipeline function:

import torch

from dcase24t6.nn.hub import baseline_pipeline

# 15 seconds of random audio at 44.1 kHz, with a batch dimension of 1
sr = 44100
audio = torch.rand(1, sr * 15)

model = baseline_pipeline()
item = {"audio": audio, "sr": sr}
outputs = model(item)
candidate = outputs["candidates"][0]

print(candidate)
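To caption a real recording instead of random audio, a minimal variant could look as follows. torchaudio is assumed to be installed here, and "my_audio.wav" is a placeholder path:

import torchaudio

from dcase24t6.nn.hub import baseline_pipeline

# torchaudio returns a (channels, samples) tensor and the sample rate.
audio, sr = torchaudio.load("my_audio.wav")
# Downmix to mono while keeping the channel dimension, to match the shape above.
audio = audio.mean(dim=0, keepdim=True)

model = baseline_pipeline()
outputs = model({"audio": audio, "sr": sr})
print(outputs["candidates"][0])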

Code overview

The source code extensively uses PyTorch Lightning for training and Hydra for configuration. It is highly recommended to learn about them if you want to understand this code.

Installation has three main steps:

  • Download external models (ConvNeXt, used to extract audio features)
  • Download the Clotho dataset using aac-datasets
  • Create HDF files containing each Clotho subset with preprocessed audio features, using torchoutil (see the inspection sketch below)
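If you want to check what dcase24t6-prepare produced, you can inspect one of the generated HDF files. The sketch below uses h5py (assumed to be installed), and the file path is a placeholder since the actual name depends on the pre-process used:

import h5py

# Placeholder path: adjust it to one of the files created under ./data.
with h5py.File("./data/clotho_dev.hdf", "r") as f:
    # Print the stored datasets (audio features, captions, metadata, ...) and their shapes.
    for name, dataset in f.items():
        print(name, getattr(dataset, "shape", None))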

Training follows the standard Lightning workflow:

  • Initialize the callbacks, tokenizer, datamodule and model.
  • Fit the model on the specified datamodule.
  • Evaluate the model using aac-metrics (see the example below).
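For reference, aac-metrics can also be called directly on a few captions. Here is a minimal example (the candidate and reference sentences below are made up; metrics such as METEOR and SPICE rely on the Java installation mentioned above):

from aac_metrics import evaluate

# One candidate caption and its list of reference captions (invented for illustration).
candidates = ["a man is speaking while rain is falling"]
mult_references = [[
    "a man speaks during a rain storm",
    "rain falls while someone is talking",
]]

corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)  # mapping from metric names to scores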

Model

The model outperforms previous baselines with a SPIDEr-FL score of 29.6% on the Clotho evaluation subset. The captioning model architecture, called CNext-trans, is described in this paper. The encoder part (ConvNeXt) is described in more detail in this paper.

The pretrained weights of the AAC model are available on Zenodo: ConvNeXt encoder (BL_AC), Transformer decoder. Both weights are automatically downloaded during dcase24t6-prepare.

Main hyperparameters

| Hyperparameter         | Value | Option                          |
|------------------------|-------|---------------------------------|
| Number of epochs       | 400   | trainer.max_epochs              |
| Batch size             | 64    | datamodule.batch_size           |
| Gradient accumulation  | 8     | trainer.accumulate_grad_batches |
| Learning rate          | 5e-4  | model.lr                        |
| Weight decay           | 2     | model.weight_decay              |
| Gradient clipping      | 1     | trainer.gradient_clip_val       |
| Beam size              | 3     | model.beam_size                 |
| Model dimension size   | 256   | model.d_model                   |
| Label smoothing        | 0.2   | model.label_smoothing           |
| Mixup alpha            | 0.4   | model.mixup_alpha               |

Detailed results

| Metric             | Score on Clotho-eval |
|--------------------|----------------------|
| BLEU-1             | 0.5948               |
| BLEU-2             | 0.3924               |
| BLEU-3             | 0.2603               |
| BLEU-4             | 0.1695               |
| METEOR             | 0.1897               |
| ROUGE-L            | 0.3927               |
| CIDEr-D            | 0.4619               |
| SPICE              | 0.1335               |
| SPIDEr             | 0.2977               |
| SPIDEr-FL          | 0.2962               |
| SBERT-sim          | 0.5059               |
| FER                | 0.0038               |
| FENSE              | 0.5040               |
| BERTScore          | 0.9766               |
| Vocabulary (words) | 551                  |

Here is also an estimate of the number of parameters and of the multiply-accumulate operations (MACs) during inference for the audio file "Santa Motor.wav":

| Name    | Params (M) | MACs (G) |
|---------|------------|----------|
| Encoder | 29.4       | 44.4     |
| Decoder | 11.9       | 4.3      |
| Total   | 41.3       | 48.8     |

Tips

  • Modify the model. The model class is located in src/dcase24t6/models/trans_decoder.py. It is recommended to create another class and conf to keep different model architectures separate. The loss is computed in the training_step method. You can also modify the model architecture in the setup method.

  • Extract different audio features. For that, you can add a new pre-process function in src/dcase24t6/pre_processes and the related conf in src/conf/pre_process. Then, re-run dcase24t6-prepare pre_process=YOUR_PROCESS download_clotho=false to create new HDF files with your own features. To train a new model on these features, specify the required HDF files with dcase24t6-train datamodule.train_hdfs=clotho_dev_YOUR_PROCESS.hdf datamodule.val_hdfs=... datamodule.test_hdfs=... datamodule.predict_hdfs=.... Depending on the extracted features, some model parameters may need to be modified to handle them.

  • Using as a package. If you do not want to use the entire codebase but only parts of it, you can install it as a package using:

pip install git+https://github.com/Labbeti/dcase2024-task6-baseline

Then you will be able to import any object from the code, for example from dcase24t6.models.trans_decoder import TransDecoderModel (a loading sketch is given after the list below). There are also several important dependencies that you can install separately:

  • aac-datasets to download and load AAC datasets,
  • aac-metrics to compute AAC metrics,
  • torchoutil[extras] to pack datasets to HDF files.
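As an illustration of the package usage mentioned above, here is a minimal, hedged sketch of loading a trained checkpoint: load_from_checkpoint is the standard PyTorch Lightning mechanism and is assumed to apply to this model class, and the checkpoint path is a placeholder pointing to a run produced by dcase24t6-train.

from dcase24t6.models.trans_decoder import TransDecoderModel

# Placeholder path: replace SAVE_NAME and MODEL as in the "Test a pretrained model" section.
model = TransDecoderModel.load_from_checkpoint("./logs/SAVE_NAME/checkpoints/MODEL.ckpt")
model.eval()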

Additional information

  • The code has been made for Ubuntu 20.04 and should work on more recent Ubuntu versions and other Linux-based distributions.
  • The GPU used is an NVIDIA GeForce RTX 2080 Ti (11 GB VRAM). Training lasts approximately 2h30m in the default setting.
  • In this code, Clotho subsets are named according to the Clotho convention, not the DCASE convention. See more information on this page.

Contact

Maintainer: