# Sound AI progress

Tracking the state of the art and recent results (bibliography) on sound AI topics and audio tasks. Feel free to create PRs for new results!

Inspired by wer_are_we and are_we_there_yet.

## Sound AI or Audio Analytics

Sound AI or Audio Analytics focuses on analyzing and understanding audio signals captured by digital devices, with numerous applications in health & wellbeing, environmental sensing, urban living, and the creative sector.

## Table of Contents

- [Sound Event Classification](#sound-event-classification)
- [Acoustic Scene Classification](#acoustic-scene-classification)
- [Audio Captioning](#audio-captioning)
- [Text to Audio Retrieval](#text-to-audio-retrieval)
- [Audio to Text Retrieval](#audio-to-text-retrieval)
- [Music Classification](#music-classification)
- [Glossary](#glossary)

## Sound Event Classification

### AudioSet

| Title | Notes | mAP | Paper | Code |
|---|---|---|---|---|
| BEATs: Audio Pre-Training with Acoustic Tokenizers | Iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers [ensemble] | 0.506 | chen22 | 📜 |
| PaSST: Efficient Training of Audio Transformers with Patchout | Drops out some of the input patches during training of AST [ensemble] | 0.496 | koutini22 | 📜 |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules [ensemble] | 0.487 | chen2022 | 📜 |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | Iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers | 0.486 | chen22 | 📜 |
| AST: Audio Spectrogram Transformer | Pure attention model pretrained on AudioSet [ensemble] | 0.485 | gong2021 | 📜 |
| Masked Autoencoders that Listen | Extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms | 0.473 | huang2022 | 📜 |
| PaSST: Efficient Training of Audio Transformers with Patchout | Drops out some of the input patches during training of AST [non-ensemble] | 0.471 | koutini22 | 📜 |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules [non-ensemble] | 0.471 | chen2022 | 📜 |
| AST: Audio Spectrogram Transformer | Pure attention model pretrained on AudioSet [non-ensemble] | 0.459 | gong2021 | 📜 |
| PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet | 0.439 | kong2019 | 📜 |
| Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks | Conformer-based self-supervised learning | 0.415 | srivastava2022 | |
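
The mAP figures above are macro-averaged: average precision is computed independently for each of AudioSet's 527 event classes and then averaged uniformly across classes. A minimal sketch of that computation with scikit-learn, using random placeholder labels and scores in place of real model outputs:

```python
# Macro-averaged mAP sketch for multi-label audio tagging (AudioSet-style).
# y_true and y_score are random placeholders, not real evaluation data.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_clips, n_classes = 1000, 527                          # AudioSet has 527 classes
y_true = rng.integers(0, 2, size=(n_clips, n_classes))  # multi-hot targets
y_score = rng.random((n_clips, n_classes))              # per-class model scores

# Average precision per class, then the unweighted mean over classes (mAP).
ap_per_class = average_precision_score(y_true, y_score, average=None)
print(f"mAP: {np.mean(ap_per_class):.3f}")
```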

### FSD50K

| Title | Notes | mAP | Paper | Code |
|---|---|---|---|---|
| PaSST: Efficient Training of Audio Transformers with Patchout | Drops out some of the input patches during training of AST | 0.653 | koutini22 | 📜 |
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on the LAION-Audio-630K collection with feature fusion and caption augmentation | 0.649 | wu2022 | 📜 |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 0.5859 | elizalde2022 | 📜 |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 0.4308 | wu2021 | 📜 |

### ESC50

| Title | Notes | Accuracy | Paper | Code |
|---|---|---|---|---|
| BEATs: Audio Pre-Training with Acoustic Tokenizers | Iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers | 98.10% | chen22 | 📜 |
| Masked Autoencoders that Listen | Image-based MAE for audio spectrograms | 97.40% | huang2022 | 📜 |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | 📜 |
| PaSST: Efficient Training of Audio Transformers with Patchout | Drops out some of the input patches during training of AST | 96.80% | koutini22 | 📜 |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | 📜 |
| AST: Audio Spectrogram Transformer | Pure attention model pretrained on AudioSet | 95.70% | gong2021 | 📜 |
| Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | Transformer model pretrained with visual image supervision | 95.70% | zhao2022 | 📜 |
| A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Multi-stage sequential learning with knowledge transfer from AudioSet | 94.10% | kumar2020 | |
| Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications | CNN model pretrained on AudioSet | 92.32% | lopez-meyer2021 | |
| Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks | Pretrained model with multi-channel features | 89.50% | kim2020 | 📜 |
| An Ensemble of Convolutional Neural Networks for Audio Classification | CNN ensemble with data augmentation | 88.65% | nanni2020 | 📜 |
| Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | CNN model (ACDNet) with potential compression | 87.10% | mohaimenuzzaman2021 | 📜 |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 85.95% | wu2021 | 📜 |
| AclNet: efficient end-to-end audio classification CNN | CNN with mixup and data augmentation | 85.65% | huang2018 | |
| On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications | x-vector network with OpenL3 embeddings | 85.00% | wilkinghoff2020 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with mel energies | 84.15% | tak2017 | |
| Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | 📜 |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
| Deep Multimodal Clustering for Unsupervised Audiovisual Learning | CNN + unsupervised audio-visual learning | 82.60% | hu2019 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
| 🎧 Human accuracy | Crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | 📜 |
| Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
| Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
| Learning Environmental Sounds with Multi-scale Convolutional Neural Network | Multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + batch normalization + Between-Class learning | 76.90% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) | 74.40% | tokozume2017b | |
| SoundNet: Learning Sound Representations from Unlabeled Video | 8-layer CNN (raw audio) with transfer learning from unlabeled videos | 74.20% | aytar2016 | 📜 |
| Learning from Between-class Examples for Deep Sound Recognition | 18-layer CNN on raw waveforms (dai2016) + Between-Class learning | 73.30% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs) | 73.25% | tak2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, GoogLeNet on spectrograms (40 ms frame length) | 73.20% | boddapati2017 | 📜 |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + batch normalization | 72.40% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of MFCC & TEO-GTCC with GMM | 72.25% | agrawal2017 | |
| Learning environmental sounds with end-to-end convolutional neural network (EnvNet) | Combination of spectrogram and raw waveform CNN | 71.00% | tokozume2017a | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTCC with GMM | 68.85% | agrawal2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 68.70% | boddapati2017 | 📜 |
| Very Deep Convolutional Neural Networks for Raw Waveforms | 18-layer CNN on raw waveforms | 68.50% | dai2016, tokozume2017b | 📜 |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, GoogLeNet on spectrograms (30 ms frame length) | 67.80% | boddapati2017 | 📜 |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 100x model compression | 66.25% | jin2017 | |
| SoundNet: Learning Sound Representations from Unlabeled Video | 5-layer CNN (raw audio) with transfer learning from unlabeled videos | 66.10% | aytar2016 | 📜 |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 180x model compression | 65.80% | jin2017 | |
| SoundNet: Learning Sound Representations from Unlabeled Video | 5-layer CNN trained on raw audio of ESC-50 only | 65.00% | aytar2016 | 📜 |
| 📊 Environmental Sound Classification with Convolutional Neural Networks - CNN baseline | CNN with 2 convolutional and 2 fully-connected layers, mel-spectrograms as input, vertical filters in the first layer | 64.50% | piczak2015b | 📜 |
| auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks | MLP classifier on features extracted with an RNN autoencoder | 64.30% | freitag2017 | 📜 |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 63.20% | boddapati2017 | 📜 |
| Classifying environmental sounds using image recognition networks | CRNN | 60.30% | boddapati2017 | 📜 |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 56.37% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with square filters on wideband mel-STFT (median accuracy) | 54.00% | huzaifah2017 | |
| SoundNet: Learning Sound Representations from Unlabeled Video | 8-layer CNN trained on raw audio of ESC-50 only | 51.10% | aytar2016 | 📜 |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with square filters on wideband mel-STFT (median accuracy) | 50.87% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 46.25% | huzaifah2017 | |
| 📊 Baseline - random forest | Baseline ML approach (MFCC & ZCR + random forest) | 44.30% | piczak2015a | 📜 |
| SoundNet: Learning Sound Representations from Unlabeled Video | Convolutional autoencoder trained on unlabeled videos | 39.90% | aytar2016 | 📜 |
| 📊 Baseline - SVM | Baseline ML approach (MFCC & ZCR + SVM) | 39.60% | piczak2015a | 📜 |
| 📊 Baseline - k-NN | Baseline ML approach (MFCC & ZCR + k-NN) | 32.20% | piczak2015a | 📜 |
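
ESC-50 numbers are conventionally reported as the mean accuracy over the dataset's five predefined cross-validation folds (2,000 clips, 400 per fold, 50 classes; see piczak2015a). A minimal sketch of that protocol, where `predict_fold` is a hypothetical placeholder for training on four folds and predicting the held-out one:

```python
# ESC-50 evaluation sketch: accuracy averaged over the 5 predefined folds.
# predict_fold is a hypothetical stand-in for a real train/predict step.
import numpy as np

rng = np.random.default_rng(0)

def predict_fold(fold: int) -> tuple[np.ndarray, np.ndarray]:
    """Placeholder: return (predictions, targets) for the held-out fold."""
    targets = rng.integers(0, 50, size=400)                  # 400 clips/fold, 50 classes
    noise = rng.integers(0, 50, size=400)
    preds = np.where(rng.random(400) < 0.8, targets, noise)  # toy ~80%-accurate model
    return preds, targets

fold_accuracies = []
for fold in range(1, 6):                                     # ESC-50 folds are numbered 1-5
    preds, targets = predict_fold(fold)
    fold_accuracies.append(np.mean(preds == targets))
print(f"accuracy: {np.mean(fold_accuracies):.2%}")           # the fold mean is what is reported
```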

### US8K

| Title | Notes | Accuracy | Paper | Code |
|---|---|---|---|---|
| AudioCLIP: Extending CLIP to Image, Text and Audio | Incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset | 90.07% | guzhov2021 | 📜 |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 87.96% | elizalde2022 | 📜 |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 81.01% | wu2021 | 📜 |

### VocalSound

| Title | Notes | Accuracy | Paper | Code |
|---|---|---|---|---|
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 97.95% | elizalde2022 | 📜 |
| Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition | EfficientNet-B0 | 90.50% | gong2022 | 📜 |

### VGGSound

| Title | Notes | Accuracy | Paper | Code |
|---|---|---|---|---|
| Slow-Fast Auditory Streams For Audio Recognition | Two-stream convolutional network for audio recognition | 54.40% | kazakos2022 | 📜 |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 46.63% | wu2021 | 📜 |

## Acoustic Scene Classification

## Audio Captioning
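
The captioning tables below report SPIDEr, the arithmetic mean of SPICE (a measure of semantic propositional content) and CIDEr (an n-gram consensus measure):

$$
\text{SPIDEr} = \tfrac{1}{2}\left(\text{SPICE} + \text{CIDEr}\right)
$$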

### AudioCaps

| Title | Notes | SPIDEr | Paper | Code |
|---|---|---|---|---|
| Audio Captioning Transformer | Transformer network based on an encoder-decoder architecture | 0.426 | mei2021 | 📜 |

### Clotho

| Title | Notes | SPIDEr | Paper | Code |
|---|---|---|---|---|
| WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information | Two-branch audio encoder for learning temporal and local time-frequency information | 0.182 | tran2020 | 📜 |

## Text to Audio Retrieval

### AudioCaps

| Title | Notes | mAP@10 | R@1 | Paper | Code |
|---|---|---|---|---|---|
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on the LAION-Audio-630K collection with feature fusion and caption augmentation | | 36.7 | wu2022 | 📜 |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 36.1 | koepke2022 | 📜 |
| Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 49.45 | 34.69 | deshmukh2022 | 📜 |
| On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 33.9 | mei2022 | 📜 |
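
For context on the metrics in these retrieval tables: each caption queries the audio collection, R@1 is the fraction of queries whose matching item is ranked first, and mAP@10 is the mean average precision computed over the top 10 ranks. A minimal sketch under the simplifying assumption of exactly one relevant audio clip per caption, with a random placeholder similarity matrix standing in for real model embeddings:

```python
# Text-to-audio retrieval metrics sketch: R@1 and mAP@10, assuming exactly
# one relevant audio per caption (audio i is the match for caption i).
import numpy as np

rng = np.random.default_rng(0)
n = 500
sim = rng.random((n, n)) + 0.5 * np.eye(n)   # toy caption-by-audio similarity scores

order = np.argsort(-sim, axis=1)             # audios ranked best-first per caption
# 0-based rank at which each caption's ground-truth audio appears.
ranks = np.argmax(order == np.arange(n)[:, None], axis=1)

r_at_1 = np.mean(ranks == 0)
# With a single relevant item, AP@10 is 1/(rank+1) if the match is in the
# top 10 and 0 otherwise; mAP@10 averages this over all queries.
ap_at_10 = np.where(ranks < 10, 1.0 / (ranks + 1), 0.0)
print(f"R@1: {r_at_1:.3f}  mAP@10: {np.mean(ap_at_10):.3f}")
```

The audio-to-text direction runs the same computation on the transposed similarity matrix; note that benchmarks with several captions per clip (e.g. AudioCaps) have more than one relevant item per query, which this sketch simplifies away.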

### Clotho

| Title | Notes | mAP@10 | R@1 | Paper | Code |
|---|---|---|---|---|---|
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on the LAION-Audio-630K collection with feature fusion and caption augmentation | | 18.2 | wu2022 | 📜 |
| Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 27.12 | 16.75 | deshmukh2022 | 📜 |
| On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 14.4 | mei2022 | 📜 |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 6.7 | koepke2022 | 📜 |

## Audio to Text Retrieval

### AudioCaps

| Title | Notes | mAP@10 | R@1 | Paper | Code |
|---|---|---|---|---|---|
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on the LAION-Audio-630K collection with feature fusion and caption augmentation | | 46.8 | wu2022 | 📜 |
| Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 30.81 | 41.91 | deshmukh2022 | 📜 |
| On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 39.6 | mei2022 | 📜 |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 39.6 | koepke2022 | 📜 |

### Clotho

| Title | Notes | mAP@10 | R@1 | Paper | Code |
|---|---|---|---|---|---|
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on the LAION-Audio-630K collection with feature fusion and caption augmentation | | 25.7 | wu2022 | 📜 |
| Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 13.65 | 20.00 | deshmukh2022 | 📜 |
| On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 16.9 | mei2022 | 📜 |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 7.2 | koepke2022 | 📜 |

## Music Classification

### GTZAN Genres

| Title | Notes | Accuracy | Paper | Code |
|---|---|---|---|---|
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 91.30% | elizalde2022 | 📜 |
| PaSST: Efficient Training of Audio Transformers with Patchout | Drops out some of the input patches during training of AST [HEAR Challenge] | 88.30% | koutini22 | 📜 |
| PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet [HEAR Challenge] | 86.00% | kong2019 | 📜 |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP [HEAR Challenge] | 74.80% | wu2021 | 📜 |

### GTZAN Music Speech

| Title | Notes | Accuracy | Paper | Code |
|---|---|---|---|---|
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 100.00% | elizalde2022 | 📜 |
| PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet [HEAR Challenge] | 99.23% | kong2019 | 📜 |
| PaSST: Efficient Training of Audio Transformers with Patchout | Drops out some of the input patches during training of AST [HEAR Challenge] | 97.69% | koutini22 | 📜 |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP [HEAR Challenge] | 94.55% | wu2021 | 📜 |

## Glossary

- SED: Sound Event Detection
- ASC: Acoustic Scene Classification