Combining audio control and style transfer using latent diffusion

Official repository for Combining audio control and style transfer using latent diffusion by Nils Demerlé, Philippe Esling, Guillaume Doras and David Genova, accepted at ISMIR 2024 (paper link).

This diffusion-based generative model creates new audio by blending two inputs: one audio sample that sets the style or timbre, and another input (either audio or MIDI) that defines the structure over time. In this repository, you will find instructions to train your own model as well as model checkpoints trained on the two datasets presented in the paper.

We are currently working on a real-time implementation of this model called AFTER. You can already experiment with a real-time version of the model in MaxMSP on the official AFTER repository.

Model training

Prior to training, install the required dependencies using:

pip install -r "requirements.txt"

Training the model requires three steps: processing the dataset, training an autoencoder, and then training the diffusion model.

Dataset preparation

The training scripts expect the dataset as an LMDB database. To convert a folder of audio files:

python dataset/split_to_lmdb.py --input_path /path/to/audio_dataset --output_path /path/to/audio_dataset/out_lmdb

Or, to use Slakh with MIDI processing (after downloading Slakh2100 here):

python dataset/split_to_lmdb.py --input_path /path/to/slakh --output_path /path/to/slakh/out_lmdb_midi --slakh True
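To check that the resulting database was written, you can open it with the lmdb Python package and count the stored records. This is only a quick sanity check sketch; the key layout and serialization format are specific to this repository, so it merely confirms that entries exist:

```python
import lmdb

# Open the LMDB created by split_to_lmdb.py in read-only mode.
# The path is the value passed to --output_path above.
env = lmdb.open("/path/to/audio_dataset/out_lmdb", readonly=True, lock=False)
with env.begin() as txn:
    n_entries = txn.stat()["entries"]
print(f"LMDB database contains {n_entries} records")
env.close()
```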

Autoencoder training

python train_autoencoder.py --name my_autoencoder --db_path /path/to/lmdb --gpu #

where # is the index of the GPU to train on.

Once the autoencoder is trained, it must be exported to a TorchScript .pt file:

python export_autoencoder.py --name my_autoencoder --step ##
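Since the export produces a plain TorchScript module, it can be loaded and tested outside the training code. A minimal sketch, assuming the exported file is placed in ./pretrained and exposes encode and decode methods (the file name and tensor shapes below are illustrative, not fixed by the repository):

```python
import torch

# Load the TorchScript autoencoder exported by export_autoencoder.py.
autoencoder = torch.jit.load("./pretrained/my_autoencoder.pt").eval()

# Encode a dummy mono waveform and decode it back. Shapes are illustrative;
# the expected sample rate and channel layout depend on your training config.
waveform = torch.randn(1, 1, 131072)  # [batch, channels, samples]
with torch.no_grad():
    latents = autoencoder.encode(waveform)      # latent sequence used by the diffusion model
    reconstruction = autoencoder.decode(latents)
print(latents.shape, reconstruction.shape)
```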

It is also possible to skip this phase entirely and use a pretrained autoencoder such as EnCodec, wrapped in an nn.Module exposing encode and decode methods, as sketched below.
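The sketch below illustrates the expected wrapper interface using the encodec package. It is not code from this repository: it returns EnCodec's continuous pre-quantization embeddings, and whether those form a suitable latent space for the diffusion model is an assumption you should verify.

```python
import torch
import torch.nn as nn
from encodec import EncodecModel  # pip install encodec


class EncodecWrapper(nn.Module):
    """Wraps a pretrained EnCodec model behind encode/decode methods."""

    def __init__(self):
        super().__init__()
        self.model = EncodecModel.encodec_model_24khz()
        self.model.eval()

    @torch.no_grad()
    def encode(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: [batch, 1, samples] at 24 kHz.
        # Returns continuous embeddings [batch, latent_dim, frames].
        return self.model.encoder(waveform)

    @torch.no_grad()
    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: [batch, latent_dim, frames] -> waveform [batch, 1, samples].
        return self.model.decoder(latents)
```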

Diffusion model training

Diffusion model training is configured through gin configuration files.
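For reference, gin works by binding values to the arguments of functions marked @gin.configurable. The toy example below only illustrates the mechanism; the function and parameter names are hypothetical, so refer to the config files shipped with the repository for the real bindings.

```python
import gin


@gin.configurable
def build_trainer(learning_rate, batch_size):
    # Hypothetical training entry point; the actual configurable functions
    # and their parameters are defined in this repository's source.
    print(f"lr={learning_rate}, batch_size={batch_size}")


# Equivalent to a .gin file containing these two binding lines.
gin.parse_config("""
build_trainer.learning_rate = 1e-4
build_trainer.batch_size = 16
""")

build_trainer()  # arguments are supplied by the gin bindings
```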

To train the audio-to-audio model:

python train_diffusion.py --name my_audio_model --db_path /path/to/lmdb --config main --dataset_type waveform --gpu #

To train the MIDI-to-audio model:

python train_diffusion.py --name my_midi_audio_model --db_path /path/to/lmdb_midi --config midi --dataset_type midi --gpu #

Inference and pretrained models

Three pretrained models are currently available:

  1. Audio-to-audio transfer model trained on Slakh
  2. Audio-to-audio transfer model trained on multiple datasets (Maestro, URMP, Filobass, GuitarSet...)
  3. MIDI-to-audio model trained on Slakh

You can download the autoencoder and diffusion model checkpoints here. Make sure to copy the pretrained models into ./pretrained. The notebooks in ./notebooks demonstrate how to load a model and generate audio from MIDI and audio files.
