The main model is composed of a pretrained convolutional encoder to extract audio features and a transformer decoder to generate captions. For more information, please refer to the corresponding DCASE task page.
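As a rough schematic (using generic `torch.nn` modules, not the baseline's actual classes, and with illustrative sizes), the encoder turns the waveform into a sequence of frame embeddings, and the decoder attends to them to predict caption tokens:

```python
import torch
from torch import nn

d_model, vocab_size = 256, 551  # d_model matches the baseline; other sizes are made up

frame_embs = torch.rand(1, 31, d_model)             # encoder output: (batch, frames, d_model)
prev_tokens = torch.randint(0, vocab_size, (1, 8))  # caption tokens generated so far

# The decoder cross-attends to the audio frames to score the next token
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=6,
)
token_embs = nn.Embedding(vocab_size, d_model)(prev_tokens)
logits = nn.Linear(d_model, vocab_size)(decoder(token_embs, frame_embs))
print(logits.shape)  # (1, 8, 551): next-token scores at each position
```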
This repository includes:
- An AAC model trained on the Clotho dataset
- Audio feature extraction using ConvNeXt
- A system reaching a 29.6% SPIDEr-FL score on Clotho-eval (development-testing)
- Detailed training characteristics (number of parameters, MACs, energy consumption...)
First, you need to create an environment with Python >= 3.11 and pip. You can use venv, conda, micromamba or another Python environment tool.
Here is an example with micromamba:

```bash
micromamba env create -n env_dcase24 python=3.11 pip -c defaults
micromamba activate env_dcase24
```
Then, you can clone this repository and install it:

```bash
git clone https://github.com/Labbeti/dcase2024-task6-baseline
cd dcase2024-task6-baseline
pip install -e .
pre-commit install
```
You also need to install Java >= 1.8 and <= 1.13 on your machine to compute AAC metrics. If needed, you can override the Java executable path with the environment variable `AAC_METRICS_JAVA_PATH`.
To download, extract and process the data, run:

```bash
dcase24t6-prepare
```

By default, the dataset is stored in the `./data` directory and requires approximately 33 GB of disk space.
To train the baseline model, run:

```bash
dcase24t6-train +expt=baseline
```
By default, the model and results are saved in the directory `./logs/SAVE_NAME`, where `SAVE_NAME` is the name of the script followed by the start date. Metrics are computed at the end of training with the best checkpoint.
To re-run evaluation on a trained model, resume from its save directory:

```bash
dcase24t6-test resume=./logs/SAVE_NAME
```
or specify each path separately:

```bash
dcase24t6-test resume=null model.checkpoint_path=./logs/SAVE_NAME/checkpoints/MODEL.ckpt tokenizer.path=./logs/SAVE_NAME/tokenizer.json
```
You need to replace `SAVE_NAME` with the save directory name and `MODEL` with the checkpoint filename.
If you want to load and test the baseline pretrained weights, point `resume` to the downloaded baseline checkpoint:

```bash
dcase24t6-test resume=~/.cache/torch/hub/checkpoints/dcase2024-task6-baseline
```
If you want to test the baseline model on a single file, you can use the `baseline_pipeline` function:
```python
import torch

from dcase24t6.nn.hub import baseline_pipeline

# 15 seconds of random noise at the Clotho sample rate, as a stand-in for a real recording
sr = 44100
audio = torch.rand(1, sr * 15)

# Downloads (if needed) and assembles the pretrained encoder, decoder and tokenizer
model = baseline_pipeline()

item = {"audio": audio, "sr": sr}
outputs = model(item)
candidate = outputs["candidates"][0]
print(candidate)
```
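To caption a real recording instead of random noise, you can load a file first, reusing the `model` from the snippet above. This assumes `torchaudio` is installed and that `example.wav` is a file you provide:

```python
import torchaudio

# Load a waveform as (channels, samples) together with its sample rate
audio, sr = torchaudio.load("example.wav")

outputs = model({"audio": audio, "sr": sr})
print(outputs["candidates"][0])
```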
The source code extensively uses PyTorch Lightning for training and Hydra for configuration. It is highly recommended to learn about them if you want to understand this code.
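If Hydra is new to you, the gist of its configuration handling can be seen with OmegaConf, the library it is built on. This self-contained snippet roughly mimics how a dotted CLI override such as `model.lr=1e-4` is merged into a base config; the keys here are illustrative:

```python
from omegaconf import OmegaConf

# A base config, standing in for the YAML files under src/conf
base = OmegaConf.create({"model": {"lr": 5e-4, "d_model": 256}})

# Hydra merges dotted CLI overrides in roughly this way
cfg = OmegaConf.merge(base, OmegaConf.from_dotlist(["model.lr=1e-4"]))
print(cfg.model.lr)  # 0.0001
```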
Data preparation (`dcase24t6-prepare`) has three main steps:
- Download external models (ConvNeXt, used to extract audio features)
- Download the Clotho dataset using `aac-datasets`
- Create HDF files containing each Clotho subset with preprocessed audio features using `torchoutil` (see the inspection sketch after this list)
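Since the HDF files are regular HDF5, you can check what was produced with `h5py`. The path below is an assumption; list your `./data` directory to find the actual filenames:

```python
import h5py

# Hypothetical filename; adapt it to the HDF files created by dcase24t6-prepare
with h5py.File("data/HDF/clotho_dev.hdf", "r") as f:
    for name, obj in f.items():
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
```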
Training follows the standard way to create a model with Lightning (a toy version of this loop is sketched below):
- Initialize the callbacks, tokenizer, datamodule and model.
- Fit the model on the specified datamodule.
- Evaluate the model using `aac-metrics`.
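If Lightning is unfamiliar, the loop above boils down to something like this self-contained toy example, with a dummy model and dataset rather than the baseline's actual classes (the optimizer settings only mirror the hyperparameter table below, for illustration):

```python
import torch
import lightning as L
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class ToyModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        # lr and weight decay taken from the hyperparameter table, purely as an example
        return torch.optim.AdamW(self.parameters(), lr=5e-4, weight_decay=2.0)

# 1. initialize, 2. fit; the baseline does the same with its own classes,
# then evaluates the best checkpoint with aac-metrics
model = ToyModule()
loader = DataLoader(TensorDataset(torch.rand(64, 4), torch.rand(64, 1)), batch_size=16)
trainer = L.Trainer(max_epochs=1, accumulate_grad_batches=8, gradient_clip_val=1)
trainer.fit(model, loader)
```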
The model outperforms previous baselines with a SPIDEr-FL score of 29.6% on the Clotho evaluation subset. The captioning model architecture is described in this paper and is called CNext-trans. The encoder part (ConvNeXt) is described in more detail in this paper.
The pretrained weights of the AAC model are available on Zenodo: the ConvNeXt encoder (BL_AC) and the Transformer decoder. Both are automatically downloaded during `dcase24t6-prepare`.
| Hyperparameter | Value | Option |
|---|---|---|
| Number of epochs | 400 | `trainer.max_epochs` |
| Batch size | 64 | `datamodule.batch_size` |
| Gradient accumulation | 8 | `trainer.accumulate_grad_batches` |
| Learning rate | 5e-4 | `model.lr` |
| Weight decay | 2 | `model.weight_decay` |
| Gradient clipping | 1 | `trainer.gradient_clip_val` |
| Beam size | 3 | `model.beam_size` |
| Model dimension size | 256 | `model.d_model` |
| Label smoothing | 0.2 | `model.label_smoothing` |
| Mixup alpha | 0.4 | `model.mixup_alpha` |
| Metric | Score on Clotho-eval |
|---|---|
| BLEU-1 | 0.5948 |
| BLEU-2 | 0.3924 |
| BLEU-3 | 0.2603 |
| BLEU-4 | 0.1695 |
| METEOR | 0.1897 |
| ROUGE-L | 0.3927 |
| CIDEr-D | 0.4619 |
| SPICE | 0.1335 |
| SPIDEr | 0.2977 |
| SPIDEr-FL | 0.2962 |
| SBERT-sim | 0.5059 |
| FER | 0.0038 |
| FENSE | 0.5040 |
| BERTScore | 0.9766 |
| Vocabulary (words) | 551 |
Here is also an estimate of the number of parameters and multiply-accumulate operations (MACs) during inference for the audio file "Santa Motor.wav":
| Name | Params (M) | MACs (G) |
|---|---|---|
| Encoder | 29.4 | 44.4 |
| Decoder | 11.9 | 4.3 |
| Total | 41.3 | 48.8 |
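The parameter column can be reproduced for any PyTorch module with a short helper; MACs require a dedicated profiler, which is not shown here:

```python
from torch import nn

def count_params(module: nn.Module) -> float:
    # Total number of parameters, in millions to match the table above
    return sum(p.numel() for p in module.parameters()) / 1e6

# Generic usage example on a plain torch module
print(count_params(nn.Linear(256, 551)))
```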
- Modify the model. The model class is located in `src/dcase24t6/models/trans_decoder.py`. It is recommended to create another class and conf to keep different model architectures separate. The loss is computed in the method called `training_step`. You can also modify the model architecture in the method called `setup`.
- Extract different audio features. For that, you can add a new pre-process function in `src/dcase24t6/pre_processes` and the related conf in `src/conf/pre_process` (a hypothetical sketch follows this list). Then, re-run `dcase24t6-prepare pre_process=YOUR_PROCESS download_clotho=false` to create new HDF files with your own features. To train a new model on these features, specify the required HDF files with `dcase24t6-train datamodule.train_hdfs=clotho_dev_YOUR_PROCESS.hdf datamodule.val_hdfs=... datamodule.test_hdfs=... datamodule.predict_hdfs=...`. Depending on the features extracted, some model parameters may need to be modified to handle them.
- Use it as a package. If you do not want to use the entire codebase but only parts of it, you can install it as a package using `pip install git+https://github.com/Labbeti/dcase2024-task6-baseline`. Then you will be able to import any object from the code, for example `from dcase24t6.models.trans_decoder import TransDecoderModel`. There are also several important dependencies that you can install separately: `aac-datasets` to download and load AAC datasets, `aac-metrics` to compute AAC metrics, and `torchoutil[extras]` to pack datasets into HDF files.
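As a purely hypothetical example of a custom pre-process, the function below computes log-mel features with `torchaudio`. The exact signature and output keys expected by `src/dcase24t6/pre_processes` may differ, so check an existing pre-process before adapting it:

```python
import torchaudio

def mel_pre_process(item: dict) -> dict:
    # Hypothetical pre-process: 64-band log-mel features from the {"audio", "sr"}
    # item convention used elsewhere in this README; the output key name is assumed
    melspec = torchaudio.transforms.MelSpectrogram(sample_rate=item["sr"], n_mels=64)
    features = melspec(item["audio"]).clamp(min=1e-5).log()
    return {**item, "features": features}
```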
- The code has been made for Ubuntu 20.04 and should work on more recent Ubuntu versions and other Linux-based distributions.
- The GPU used is an NVIDIA GeForce RTX 2080 Ti (11 GB VRAM). Training lasts approximately 2h30m in the default setting.
- In this code, Clotho subsets are named according to the Clotho convention, not the DCASE convention. See more information on this page.
- DCASE2023 Audio Captioning baseline
- DCASE2022 Audio Captioning baseline
- DCASE2021 Audio Captioning baseline
- DCASE2020 Audio Captioning baseline
- aac-datasets
- aac-metrics
Maintainer:
- Étienne Labbé "Labbeti": labbeti.pub@gmail.com