About | Demo | Progress | Data Pipeline | Training | Ichigo-Whisper
Homebrewed early-fusion speech model and ASR model
Note
Update: December 30, 2024
- Released Ichigo-Whisper v0.1: a 22M-parameter quantizer built on Whisper Medium for Vietnamese and English.
- Open-source, optimized for low-resource languages, using discrete tokens for LLM integration and advanced speech understanding.
Warning
Ichigo and Ichigo-Whisper are open research experiments.
- Join us in the #research channel in Homebrew's Discord.
- We livestream training runs in #research-livestream.
Ichigo is an open, ongoing research experiment to extend a text-based LLM to have native "listening" ability. Think of it as an open-data, open-weight, on-device Siri.
It uses an early fusion technique inspired by Meta's Chameleon paper.
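As a rough illustration of what early fusion means at the input level, here is a minimal sketch of how a prompt can interleave discrete sound tokens with ordinary text. The <|sound_xxxx|>, <|sound_start|>, and <|sound_end|> tokens are the ones added to the tokenizer later in this README; the codebook indices below are made up, and a real pipeline would obtain them from the WhisperVQ speech tokenizer.

def sound_token_prompt(codebook_ids, instruction="Transcribe the audio."):
    # Render each discrete codebook index as a sound token, wrap the span in
    # <|sound_start|> / <|sound_end|>, and append the text instruction.
    sound_span = "".join(f"<|sound_{i:04d}|>" for i in codebook_ids)
    return f"<|sound_start|>{sound_span}<|sound_end|> {instruction}"

# Hypothetical codebook indices produced by the speech tokenizer
print(sound_token_prompt([12, 507, 3, 3, 128]))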
We build and train in public:
For instructions on how to self-host the Ichigo web UI demo using Docker, please visit: Ichigo demo. To try our demo, hosted on a single RTX 4090 GPU, go directly to: https://ichigo.homebrew.ltd
We offer code for users to create a web UI demo. Please follow the instructions below:
python -m venv demo
source demo/bin/activate
# First install all required packages
pip install --no-cache-dir -r ./demo/requirements.txt
Then run the command below to launch a Gradio demo locally. You can add the flags use-4bit or use-8bit for quantized inference:
python -m demo.app --host 0.0.0.0 --port 7860 --max-seq-len 1024
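For example, to run the same demo with 8-bit quantization (assuming the flag is exposed as --use-8bit; check demo/app.py for the exact argument name):
python -m demo.app --host 0.0.0.0 --port 7860 --max-seq-len 1024 --use-8bit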
You can also host a demo using vLLM for faster inference, but it does not support streaming output:
python -m demo.app_vllm
Alternatively, you can easily try our demo on Hugging Face 🤗
Latest Update
- 30 Dec: Ichigo Whisper is now available. It is a lightweight (22M parameters), open-source quantizer built on top of Whisper Medium, designed to optimize performance for low-resource languages while maintaining strong English capabilities. Unlike continuous embedding models, Ichigo Whisper compresses speech into discrete tokens, enabling seamless integration with large language models (LLMs) for advanced speech understanding.
View Full History
- 11 Nov: Ichigo v0.4 models are now available. This update introduces a unified training pipeline by consolidating Phases 2 and 3, with training data enhancements that include migrating speech noise and multi-turn data to Phase 2 and adding synthetic noise-augmented multi-turn conversations. Achieving an improved MMLU score of 64.63, the model now boasts stronger context handling, advanced noise management, and enhanced multi-turn capabilities for a more robust and responsive user experience.
- 22 Oct: Research Paper Release: We are pleased to announce the publication of our research paper detailing the development and technical innovations behind the Ichigo series. The full technical details, methodology, and experimental results are now available in our paper.
- 4 Oct: Ichigo v0.3 models are now available. Utilizing cleaner and improved data, our model has achieved an enhanced MMLU score of 63.79 and demonstrates stronger speech instruction-following capabilities, even in multi-turn interactions. Additionally, by incorporating noise-synthetic data, we have successfully trained the model to refuse to process non-speech audio inputs from users, further improving its functionality and user experience.
- 23 Aug: We're excited to share Ichigo-llama3.1-s-instruct-v0.2, our latest multimodal checkpoint, which improves speech understanding by enhancing the model's audio instruction-following capabilities through training on interleaved synthetic data.
- 17 Aug: We pre-trained our LLaMA 3.1 model on continuous speech data, tokenized using WhisperSpeechVQ. The final loss converged to approximately 1.9, resulting in our checkpoint: Ichigo-llama3.1-s-base-v0.2.
- 1 Aug: Identified a typo in the original training recipe that caused significant degradation (MMLU: 0.6 -> 0.2); proposed fixes.
- 30 July: Presented llama3-s progress at: AI Training: From PyTorch to GPU Clusters.
- 19 July: llama3-s-2024-07-19 understands synthetic voice with limited results.
- 1 July: llama3-s-2024-07-08 showed converging loss (1.7) with limited data.
For detailed information on synthetic generation, please refer to the Synthetic Generation Guide.
- First, clone the repo from GitHub:
git clone --recurse-submodules https://github.com/homebrewltd/ichigo.git
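If you have already cloned the repository without --recurse-submodules, the submodules can be fetched afterwards with standard git:
git submodule update --init --recursive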
- The folder structure is as follows:
Ichigo
├── demo                 # Gradio demo
├── images               # Project images and assets
├── inference            # Inference code
├── latency_testing      # Benchmarking code
├── scripts              # Gradio demo and utility scripts
├── synthetic_data       # Data generation and torch_compile debugging
└── external             # External dependencies
    ├── ichigo-whisper   # WhisperSpeech/ichigo-whisper submodule
    └── torchtune        # Training utilities submodule
- Install packages:
python -m venv torchtune
pip install torch torchvision torchao tensorboard
mkdir model_zoo
cd ./torchtune
pip install -e .
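Note: activate the virtual environment before running pip so the packages are installed into it (standard venv usage; this step is implicit in the snippet above):
source torchtune/bin/activate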
Log in to Hugging Face:
huggingface-cli login --token=<token>
Download the tokenizer.model and the required model into the ichigo/model_zoo directory using tune:
tune download homebrewltd/llama3.1-s-whispervq-init --output-dir ../model_zoo/llama3.1-s-whispervq-init --ignore-patterns "original/consolidated*"
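If you prefer to script the download, an equivalent call through the huggingface_hub Python API (a sketch, assuming the same repository and output directory as above) is:

from huggingface_hub import snapshot_download

# Download the initialized model repository, skipping the original consolidated weights
snapshot_download(
    repo_id="homebrewltd/llama3.1-s-whispervq-init",
    local_dir="../model_zoo/llama3.1-s-whispervq-init",
    ignore_patterns=["original/consolidated*"],
)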
[NOTE]: If you want to use a different base model, you can upload your own resized-embedding model to the Hugging Face Hub:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model whose embedding table will be resized with the new sound tokens
model_name = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add the discrete sound-token vocabulary and the span delimiters
sound_tokens = [f'<|sound_{num:04d}|>' for num in range(513)]
special_tokens = ["<|sound_start|>", "<|sound_end|>"]
num_added_tokens = tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
tokenizer.add_tokens(sound_tokens)
model.resize_token_embeddings(len(tokenizer))

model.push_to_hub("<your_hf>/Llama3.1-s-whispervq-init")
tokenizer.push_to_hub("<your_hf>/Llama3.1-s-whispervq-init")
- Pretraining, multi-GPU (1-8 GPUs supported):
tune run --nproc_per_node <no-gpu> full_finetune_fsdp2 --config recipes/configs/jan-llama3-1-s/pretrain/8B_full.yaml
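For example, a run on a single node with 8 GPUs would look like this (only the GPU count is substituted; the recipe and config are unchanged):
tune run --nproc_per_node 8 full_finetune_fsdp2 --config recipes/configs/jan-llama3-1-s/pretrain/8B_full.yaml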
[NOTE]: After training finishes, use this script to convert the checkpoint into a format that can be loaded by HF transformers:
import glob
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

# Folder containing the torchtune checkpoint shards
output_dir = "../model_zoo/llama3-1-s-base"
pt_to_merge = glob.glob(f"{output_dir}/hf_model_000*_1.pt")
state_dicts = [torch.load(p) for p in tqdm(pt_to_merge)]
merged_state_dicts = {k: v for d in state_dicts for k, v in d.items()}
torch.save(merged_state_dicts, f"{output_dir}/pytorch_model.bin")

model = AutoModelForCausalLM.from_pretrained(output_dir, torch_dtype=torch.bfloat16)
print(model)

tokenizer_path = "homebrewltd/llama3.1-s-whispervq-init"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
# Save the tokenizer locally, then push both model and tokenizer to the Hub
tokenizer.save_pretrained(output_dir)
model.push_to_hub("<your_hf>/Llama3.1-s-base")
tokenizer.push_to_hub("<your_hf>/Llama3.1-s-base")
- Instruction Tuning
Download the checkpoint from Hugging Face using tune, or use your local pretrained checkpoint located at model_zoo/llama3-1-s-base:
tune run --nproc_per_node <no-gpu> full_finetune_fsdp2 --config recipes/configs/jan-llama3-1-s/finetune/8B_full.yaml
Ichigo Whisper is a compact (22M-parameter), open-source speech tokenizer for the Whisper-medium model, designed to improve multilingual performance with minimal impact on its original English capabilities. Unlike models that output continuous embeddings, Ichigo Whisper compresses speech into discrete tokens, making it more compatible with large language models (LLMs) for immediate speech understanding.
This speech tokenizer has been trained on approximately 400 hours of English data and 1,000 hours of Vietnamese data.
Ichigo Whisper is a key component of the Ichigo v0.5 family.
For more details, please refer to our official Ichigo Whisper Repository.
@misc{chameleonteam2024chameleonmixedmodalearlyfusionfoundation,
title={Chameleon: Mixed-Modal Early-Fusion Foundation Models},
author={Chameleon Team},
year={2024},
eprint={2405.09818},
archivePrefix={arXiv},
primaryClass={cs.CL},
journal={arXiv preprint}
}
@misc{zhang2024adamminiusefewerlearning,
title={Adam-mini: Use Fewer Learning Rates To Gain More},
author={Yushun Zhang and Congliang Chen and Ziniu Li and Tian Ding and Chenwei Wu and Yinyu Ye and Zhi-Quan Luo and Ruoyu Sun},
year={2024},
eprint={2406.16793},
archivePrefix={arXiv},
primaryClass={cs.LG},
journal={arXiv preprint}
}
@misc{defossez2022highfi,
title={High Fidelity Neural Audio Compression},
author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
year={2022},
eprint={2210.13438},
archivePrefix={arXiv},
journal={arXiv preprint}
}
@misc{WhisperSpeech,
title={WhisperSpeech: An Open Source Text-to-Speech System Built by Inverting Whisper},
author={Collabora and LAION},
year={2024},
url={https://github.com/collabora/WhisperSpeech},
note={GitHub repository}
}
Ichigo and Ichigo-Whisper are open research projects. We're looking for collaborators, and will likely move towards crowdsourcing speech datasets in the future.
- Torchtune: The codebase we built upon
- Accelerate: Library for easy use of distributed training
- WhisperSpeech: Text-to-speech model for synthetic audio generation
- Encodec: High-fidelity neural audio codec for efficient audio compression
- Llama3: the family of models we built on, with amazing language capabilities!