Tracking the state of the art and recent results (with bibliography) on sound AI topics and audio tasks. Feel free to create PRs for new results!
Inspired by wer_are_we and are_we_there_yet.
Sound AI, or audio analytics, focuses on analyzing and understanding audio signals captured by digital devices, with numerous applications in health & wellbeing, environmental sensing, urban living, and the creative sector. Covered tasks:
- Sound Event Classification
- Acoustic Scene Classification
- Audio Captioning
- Text to Audio Retrieval
- Audio to Text Retrieval
- Music Classification

### Sound Event Classification

#### AudioSet

Title | Notes | mAP | Paper | Code |
---|---|---|---|---|
BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers [ensemble] | 0.506 | chen22 | π |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [ensemble] | 0.496 | koutini22 | π |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules [ensemble] | 0.487 | chen2022 | π |
BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers | 0.486 | chen22 | π |
AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet [ensemble] | 0.485 | gong2021 | π |
Masked Autoencoders that Listen | extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms | 0.473 | huang2022 | π |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [non-ensemble] | 0.471 | koutini22 | π |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules [non-ensemble] | 0.471 | chen2022 | π |
AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet [non-ensemble] | 0.459 | gong2021 | π |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet | 0.439 | kong2019 | π |
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks | Conformer-based self-supervised learning | 0.415 | srivastava2022 |
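
The mAP column above is the macro-averaged mean average precision over the tag classes of a multi-label clip-tagging task. A minimal sketch of how such a number is typically computed, using scikit-learn with made-up scores and labels (the arrays below are purely illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical clip-level outputs: one score per class, and multi-hot labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.7],
                    [0.1, 0.8, 0.3],
                    [0.6, 0.7, 0.2],
                    [0.2, 0.1, 0.9]])

# Average precision per class, then the mean over classes (macro mAP).
ap_per_class = [average_precision_score(y_true[:, c], y_score[:, c])
                for c in range(y_true.shape[1])]
print("mAP:", float(np.mean(ap_per_class)))
```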

#### FSD50K

Title | Notes | mAP | Paper | Code |
---|---|---|---|---|
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST | 0.653 | koutini22 | π |
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | 0.649 | wu2022 | π |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 0.5859 | elizalde2022 | π |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 0.4308 | wu2021 | π |
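
Several entries above (CLAP, the LAION CLAP model, Wav2CLIP) are built on contrastive audio-text (or audio-image) pretraining. Below is a toy PyTorch sketch of the symmetric contrastive objective on a batch of paired embeddings; it illustrates the general recipe only, and the linear `audio_encoder`/`text_encoder`, dimensions, and temperature are placeholder assumptions rather than any paper's implementation.

```python
import torch
import torch.nn.functional as F

# Placeholder encoders; real systems use an audio network and a text transformer.
audio_encoder = torch.nn.Linear(128, 64)
text_encoder = torch.nn.Linear(300, 64)

audio_feats = torch.randn(8, 128)  # pooled audio features for a batch of clips
text_feats = torch.randn(8, 300)   # pooled features of the paired captions

a = F.normalize(audio_encoder(audio_feats), dim=-1)
t = F.normalize(text_encoder(text_feats), dim=-1)

# Scaled similarity matrix; 0.07 is a commonly used starting temperature.
logits = (a @ t.T) / 0.07

# Symmetric cross-entropy: each audio should match its own caption and vice versa.
targets = torch.arange(len(a))
loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
print(loss.item())
```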

#### ESC-50

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers | 98.1% | chen22 | π |
Masked Autoencoders that Listen | Image-based MAE for audio spectrograms | 97.4% | huang2022 | π |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | π |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST | 96.8% | koutini22 | π |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | π |
AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet | 95.70% | gong2021 | π |
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | A Transformer model pretrained w/ visual image supervision | 95.70% | zhao2022 | π |
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Multi-stage sequential learning with knowledge transfer from AudioSet | 94.10% | kumar2020 | |
Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications | CNN model pretrained on AudioSet | 92.32% | lopez-meyer2021 | |
Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks | Pretrained model with multi-channel features | 89.50% | kim2020 | π |
An Ensemble of Convolutional Neural Networks for Audio Classification | CNN ensemble with data augmentation | 88.65% | nanni2020 | π |
Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | CNN model (ACDNet) with potential compression | 87.1% | mohaimenuzzaman2021 | π |
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 85.95% | wu2021 | π |
AclNet: efficient end-to-end audio classification CNN | CNN with mixup and data augmentation | 85.65% | huang2018 | |
On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications | x-vector network with OpenL3 embeddings | 85.00% | wilkinghoff2020 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with Mel energies | 84.15% | tak2017 | |
Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | π |
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
Deep Multimodal Clustering for Unsupervised Audiovisual Learning | CNN + unsupervised audio-visual learning | 82.60% | hu2019 | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
Human accuracy | Crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | π |
Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
Learning Environmental Sounds with Multi-scale Convolutional Neural Network | Multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning | 76.90% | tokozume2017b | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) | 74.40% | tokozume2017b | |
Soundnet: Learning sound representations from unlabeled video | 8-layer CNN (raw audio) with transfer learning from unlabeled videos | 74.20% | aytar2016 | π |
Learning from Between-class Examples for Deep Sound Recognition | 18-layer CNN on raw waveforms (dai2016) + Between-Class learning | 73.30% | tokozume2017b | |
Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs) | 73.25% | tak2017 | |
Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, GoogLeNet on spectrograms (40 ms frame length) | 73.20% | boddapati2017 | π |
Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization | 72.40% | tokozume2017b | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of MFCC & TEO-GTCC with GMM | 72.25% | agrawal2017 | |
Learning environmental sounds with end-to-end convolutional neural network (EnvNet) | Combination of spectrogram and raw waveform CNN | 71.00% | tokozume2017a | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTCC with GMM | 68.85% | agrawal2017 | |
Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 68.70% | boddapati2017 | π |
Very Deep Convolutional Neural Networks for Raw Waveforms | 18-layer CNN on raw waveforms | 68.50% | dai2016, tokozume2017b | π |
Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, GoogLeNet on spectrograms (30 ms frame length) | 67.80% | boddapati2017 | π |
WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 100x model compression | 66.25% | jin2017 | |
Soundnet: Learning sound representations from unlabeled video | 5-layer CNN (raw audio) with transfer learning from unlabeled videos | 66.10% | aytar2016 | π |
WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 180x model compression | 65.80% | jin2017 | |
Soundnet: Learning sound representations from unlabeled video | 5-layer CNN trained on raw audio of ESC-50 only | 65.00% | aytar2016 | π |
Environmental Sound Classification with Convolutional Neural Networks - CNN baseline | CNN with 2 convolutional and 2 fully-connected layers, mel-spectrograms as input, vertical filters in the first layer | 64.50% | piczak2015b | π |
auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks | MLP classifier on features extracted with an RNN autoencoder | 64.30% | freitag2017 | π |
Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 63.20% | boddapati2017 | π |
Classifying environmental sounds using image recognition networks | CRNN | 60.30% | boddapati2017 | π |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 56.37% | huzaifah2017 | |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with square filters on wideband mel-STFT (median accuracy) | 54.00% | huzaifah2017 | |
Soundnet: Learning sound representations from unlabeled video | 8-layer CNN trained on raw audio of ESC-50 only | 51.10% | aytar2016 | π |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with square filters on wideband mel-STFT (median accuracy) | 50.87% | huzaifah2017 | |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 46.25% | huzaifah2017 | |
Baseline - random forest | Baseline ML approach (MFCC & ZCR + random forest) | 44.30% | piczak2015a | π |
Soundnet: Learning sound representations from unlabeled video | Convolutional autoencoder trained on unlabeled videos | 39.90% | aytar2016 | π |
Baseline - SVM | Baseline ML approach (MFCC & ZCR + SVM) | 39.60% | piczak2015a | π |
Baseline - k-NN | Baseline ML approach (MFCC & ZCR + k-NN) | 32.20% | piczak2015a | π |
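
Many of the ESC-50 entries above are CNNs operating on (log-)mel spectrograms. The sketch below shows that pipeline end to end in PyTorch/torchaudio as a toy stand-in in the spirit of the simple CNN baselines (piczak2015b-style); the layer sizes, mel parameters, and the `SmallCNN` name are illustrative assumptions, not a reproduction of any listed model.

```python
import torch
import torch.nn as nn
import torchaudio

# Log-mel front end; ESC-50 clips are 5 s at 44.1 kHz, 50 classes.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_fft=1024,
                                           hop_length=512, n_mels=60)
to_db = torchaudio.transforms.AmplitudeToDB()

class SmallCNN(nn.Module):
    def __init__(self, n_classes=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, n_classes)

    def forward(self, wav):                      # wav: (batch, samples)
        x = to_db(mel(wav)).unsqueeze(1)         # -> (batch, 1, mels, frames)
        return self.fc(self.conv(x).flatten(1))  # -> (batch, n_classes) logits

logits = SmallCNN()(torch.randn(4, 44100 * 5))   # 4 random 5-second clips
print(logits.shape)                              # torch.Size([4, 50])
```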

#### UrbanSound8K

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
AudioCLIP: Extending CLIP to Image, Text and Audio | incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset | 90.07% | guzhov2021 | π |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 87.96% | elizalde2022 | π |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 81.01% | wu2021 | π |
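
AudioCLIP, CLAP, and Wav2CLIP are all CLIP-style audio-text (or audio-image-text) models. Beyond the supervised accuracies listed above, such models support zero-shot classification by comparing an audio embedding against text-prompt embeddings. Here is a schematic sketch in which `embed_audio` and `embed_text` are hypothetical stand-ins for real encoders, not an actual API:

```python
import torch
import torch.nn.functional as F

labels = ["dog bark", "siren", "jackhammer", "children playing"]

def embed_text(prompts):            # stand-in for a real text encoder
    torch.manual_seed(0)
    return F.normalize(torch.randn(len(prompts), 64), dim=-1)

def embed_audio(waveform):          # stand-in for a real audio encoder
    return F.normalize(torch.randn(1, 64), dim=-1)

# Zero-shot prediction: embed prompts like "this is a sound of <label>",
# embed the clip, and pick the label with the highest cosine similarity.
text_emb = embed_text([f"this is a sound of {l}" for l in labels])
audio_emb = embed_audio(torch.randn(1, 44100))
scores = (audio_emb @ text_emb.T).squeeze(0)
print(labels[int(scores.argmax())])
```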

#### VocalSound

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 97.95% | elizalde2022 | π |
Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition | EfficientNetB0 | 90.5% | gong2022 | π |
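
gong2022 reports an EfficientNet-B0 classifier on this dataset. Below is a generic sketch of adapting torchvision's EfficientNet-B0 to single-channel spectrogram input with six output classes (VocalSound distinguishes six vocal sounds); the channel and input-shape handling here is an assumption for illustration, not the paper's training recipe.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

# Swap the stem to accept 1-channel spectrograms and resize the head to 6 classes.
model = efficientnet_b0(weights=None)
model.features[0][0] = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1, bias=False)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 6)

spec = torch.randn(2, 1, 128, 256)   # (batch, 1, mel bins, frames)
print(model(spec).shape)             # torch.Size([2, 6])
```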

#### VGG-Sound

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
Slow-Fast Auditory Streams For Audio Recognition | two-stream convolutional network for audio recognition | 54.4% | kazakos2022 | π |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 46.63% | wu2021 | π |
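
The top entry uses two parallel convolutional streams running at different temporal resolutions. The toy module below illustrates the general two-stream idea only (branches with different temporal strides, fused by concatenation); it is not the architecture from kazakos2022, and all sizes are made up.

```python
import torch
import torch.nn as nn

class TwoStreamAudioNet(nn.Module):
    """Toy two-branch CNN: a 'slow' branch with coarse temporal resolution and
    a 'fast' branch with fine temporal resolution, fused by concatenation."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.slow = nn.Sequential(nn.Conv2d(1, 16, 3, stride=(1, 4), padding=1),
                                  nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.fast = nn.Sequential(nn.Conv2d(1, 16, 3, stride=(1, 1), padding=1),
                                  nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, n_classes)

    def forward(self, spec):  # spec: (batch, 1, mel bins, frames)
        z = torch.cat([self.slow(spec).flatten(1), self.fast(spec).flatten(1)], dim=1)
        return self.fc(z)

print(TwoStreamAudioNet()(torch.randn(2, 1, 64, 400)).shape)  # torch.Size([2, 10])
```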

### Audio Captioning

#### AudioCaps

Title | Notes | SPIDEr | Paper | Code |
---|---|---|---|---|
Audio Captioning Transformer | Transformer network based on an encoder-decoder architecture | 0.426 | mei2021 | π |

#### Clotho

Title | Notes | SPIDEr | Paper | Code |
---|---|---|---|---|
WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information | two-branch audio encoder for learning temporal and local time-frequency information | 0.182 | tran2020 | π |
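
Both captioning tables report SPIDEr, which is simply the mean of the SPICE and CIDEr(-D) scores of the generated captions:

```python
def spider(spice: float, cider_d: float) -> float:
    """SPIDEr (Liu et al., 2017): the average of SPICE and CIDEr-D."""
    return 0.5 * (spice + cider_d)
```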

### Text to Audio Retrieval

#### AudioCaps

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 36.7 | wu2022 | π |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 36.1 | koepke2022 | π |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 49.45 | 34.69 | deshmukh2022 | π |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 33.9 | mei2022 | π |

#### Clotho

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 18.2 | wu2022 | π |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 27.12 | 16.75 | deshmukh2022 | π |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 14.4 | mei2022 | π |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 6.7 | koepke2022 | π |

### Audio to Text Retrieval

#### AudioCaps

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 46.8 | wu2022 | π |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 30.81 | 41.91 | deshmukh2022 | π |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 39.6 | mei2022 | π |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 39.6 | koepke2022 | π |

#### Clotho

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 25.7 | wu2022 | π |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 13.65 | 20.00 | deshmukh2022 | π |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 16.9 | mei2022 | π |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 7.2 | koepke2022 | π |
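
The retrieval tables above report Recall@1 and mAP@10 over ranked candidate lists. A small self-contained sketch of both metrics, under the simplifying assumption of exactly one relevant item per query; `ranks` holds made-up 1-based ranks of the ground-truth item:

```python
import numpy as np

def recall_at_1(ranks):
    """ranks[i] = 1-based rank of the correct item for query i."""
    ranks = np.asarray(ranks)
    return float((ranks == 1).mean())

def map_at_10(ranks):
    """mAP@10 with a single relevant item per query: precision at the hit's
    rank if it falls in the top 10, else 0."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.where(ranks <= 10, 1.0 / ranks, 0.0).mean())

ranks = [1, 3, 12, 1, 2]            # toy ranks of the ground-truth item
print(recall_at_1(ranks), map_at_10(ranks))
```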

### Music Classification

#### GTZAN Genres

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 91.3% | elizalde2022 | π |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [HEAR Challenge] | 88.3% | koutini22 | π |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet [HEAR Challenge] | 86.0% | kong2019 | π |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP [HEAR Challenge] | 74.8% | wu2021 | π |

#### GTZAN Music/Speech

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 100% | elizalde2022 | π |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet [HEAR Challenge] | 99.23% | kong2019 | π |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [HEAR Challenge] | 97.69% | koutini22 | π |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP [HEAR Challenge] | 94.55% | wu2021 | π |
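
Rows tagged [HEAR Challenge] follow the HEAR evaluation style: the pretrained model is frozen and used only as an embedding extractor, with a shallow classifier trained on top. A minimal sketch of that kind of probe with scikit-learn; the random `embeddings` are placeholders, and logistic regression stands in for the shallow downstream classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for clip-level embeddings produced by a frozen pretrained audio model.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 512))
labels = rng.integers(0, 2, size=200)            # e.g. music vs. speech

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # shallow probe on frozen features
print("probe accuracy:", clf.score(X_te, y_te))
```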

Abbreviations:
- SED: Sound Event Detection
- ASC: Acoustic Scene Classification