We track notable work in the audio domain, including speech, singing, and music.
- Moshi: a speech-text foundation model for real-time dialogue(2024.9), Alexandre Défossez et al. [PDF][Code]
- LLaMA-Omni: Seamless Speech Interaction with Large Language Models(2024.9), Qingkai Fang et al. [PDF][Code]
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming(2024.8), Zhifei Xie et al. [PDF][Code]
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities(2023.5), Dong Zhang et al. [PDF][Code]
- M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models(2023), Atin Sakkeer Hussain et al. [PDF]
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer(2023), Xiaofei Wang et al. [PDF]
- TANGO: Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model(2023), Deepanway Ghosal et al. [PDF]
- Diverse and Vivid Sound Generation from Text Descriptions(2023), Guangwei Li et al. [PDF]
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers(2023), Kai Shen et al. [PDF]
- AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models(2023), Yuancheng Wang et al. [PDF]
- Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos(2023), Kun Su et al. [PDF]
- FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model(2023), Ruiqing Xue et al. [PDF]
- VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (2023), Ziqiang Zhang et al. [PDF]
- Simple and Controllable Music Generation(2023), Jade Copet et al. [PDF]
- Efficient Neural Music Generation(2023), Max W. Y. Lam et al. [PDF]
- ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models(2023), Pengfei Zhu et al. [PDF]
- Noise2Music: Text-conditioned Music Generation with Diffusion Models(2023), Qingqing Huang et al. [PDF]
- Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision(2023), Eugene Kharitonov et al. [PDF]
- SingSong: Generating musical accompaniments from singing(2023), Chris Donahue et al. [PDF]
- MusicLM: Generating Music From Text(2023), Andrea Agostinelli et al. [PDF]
- InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt (2023), Dongchao Yang et al. [PDF]
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation(2023), Rongjie Huang et al. [PDF]
- AudioLDM: Text-to-Audio Generation with Latent Diffusion Models(2023), Haohe Liu et al. [PDF]
- Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion(2023), Flavio Schneider et al. [PDF]
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models(2023), Jiawei Huang et al. [PDF]
- ArchiSound: Audio Generation with Diffusion(2023), Flavio Schneider. [PDF]
- VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023), Chengyi Wang et al. [PDF]
- PromptTTS: Controllable Text-to-Speech with Text Descriptions(2022), Zhifang Guo et al. [PDF]
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation(2022), Dongchao Yang et al. [PDF]
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models(2023), Yunfei Chu et al. [PDF]
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation(2023), Dongchao Yang et al. [PDF]
- SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models(2023), Xin Zhang et al. [PDF]
- SoundStorm: Efficient Parallel Audio Generation(2023), Zalán Borsos et al. [PDF]
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head(2023), Rongjie Huang et al. [PDF]
- AudioPaLM: A Large Language Model That Can Speak and Listen(2023), Paul K. Rubenstein et al. [PDF]
- Pengi: An Audio Language Model for Audio Tasks(2023), Soham Deshmukh et al. [PDF]
- AudioLM: a Language Modeling Approach to Audio Generation(2022), Zalán Borsos et al. [PDF]
- vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations(2019), Alexei Baevski et al. [PDF]
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (2020), Alexei Baevski et al. [PDF]
- W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training (2021), Yu-An Chung et al. [PDF]
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (2021), Wei-Ning Hsu et al. [PDF]
- Data2vec: A general framework for self-supervised learning in speech, vision and language (2022), Alexei Baevski et al. [PDF]
- MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets (2022), Ziyang Ma et al. [PDF]
- ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers (2022), Kaizhi Qian et al. [PDF]
- Data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language (2022), Alexei Baevski et al. [PDF]
- MuLan: A Joint Embedding of Music Audio and Natural Language (2022), Qingqing Huang et al. [PDF]