Releases: homebrewltd/ichigo
First release of 🍓 Ichigo!
Model weights can be downloaded at:
Changelog: v0.2 vs v0.3
Overall Comparison
Phase | Aspect | v0.2 | v0.3 |
---|---|---|---|
Pretraining | Data Size | 2.42M | 3.87M |
Pretraining | Data Source | parler-tts/mls_eng_10k | facebook/multilingual_librispeech |
Pretraining | Synthetic Data Pipeline | WhisperVQ (old checkpoint: whisper-vq-stoks-medium-en+pl.model) tokenizes English-only audio. | The latest checkpoint, whisper-vq-stoks-v3-7lang.model, tokenizes audio in 8 languages. |
Pretraining | Epochs | 1 | 1 |
Pretraining | Global Batch Size | 480 | 480 |
Pretraining | Learning Rate | 2e-4 | 2e-4 |
Pretraining | Warmup Steps | 80 | 50 |
Pretraining | Weight Decay | 0.005 | 0.005 |
Pretraining | Max Length | 512 | 512 |
Pretraining | Precision | bf16 | bf16 |
Instruction Phase | Data Size | 929K | 1.89M + 165k (phase 3) |
Instruction Phase | Preprocessing | Rule-based removal of all hard-to-pronounce prompts. | Rule-based filtering of hard-to-pronounce prompts, plus rephrasing of certain LLM-generated responses to sound more natural and human-like. |
Instruction Phase | Synthetic Data Pipeline | Audio is generated with the old text-to-speech checkpoint t2s-small-yt.model, then tokenized with whisper-vq-stoks-medium-en+pl.model. | The t2s checkpoint is changed to t2s-v1.1-small-en+pl.model and the WhisperVQ checkpoint to whisper-vq-stoks-v3-7lang.model. |
Instruction Phase | Epochs | 5 | 1 |
Instruction Phase | Global Batch Size | 128 | 256 |
Instruction Phase | Gradient Accumulation Steps per Device | 1 | 8 |
Instruction Phase | Learning Rate | 1e-4 | 7e-5, and 1.5e-5 for phase 3 |
Instruction Phase | Warmup Steps | 80 | 73, and 8 for phase 3 |
Instruction Phase | Weight Decay | 0.005 | 0.005 |
Instruction Phase | Max Length | 1024 | 4096 |
Instruction Phase | Precision | bf16 | bf16 |
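
To put the batch-size and warmup figures in context, the sketch below estimates how many optimizer steps each run would take. It is a rough calculation, not the project's training code: it assumes every sample is one sequence (no packing) and that a partial final batch still counts as a step. The names `RUNS`, `optimizer_steps`, and `sequences_per_accumulation_step` are hypothetical helpers introduced for illustration.

```python
import math

# Figures copied from the comparison table. Treating every sample as one
# sequence (no packing) is an assumption of this sketch.
RUNS = {
    "pretrain v0.2": {"samples": 2_420_000, "global_batch": 480, "epochs": 1, "warmup": 80},
    "pretrain v0.3": {"samples": 3_870_000, "global_batch": 480, "epochs": 1, "warmup": 50},
    "instruct v0.2": {"samples": 929_000, "global_batch": 128, "epochs": 5, "warmup": 80},
    # Phase 3 (165k samples) is left out: the table lists no separate batch size for it.
    "instruct v0.3": {"samples": 1_890_000, "global_batch": 256, "epochs": 1, "warmup": 73},
}


def optimizer_steps(samples: int, global_batch: int, epochs: int) -> int:
    """Estimated optimizer steps, counting a partial final batch as one step."""
    return math.ceil(samples / global_batch) * epochs


def sequences_per_accumulation_step(global_batch: int, grad_acc_steps: int) -> int:
    """Sequences processed across all devices in one forward/backward pass."""
    if global_batch % grad_acc_steps != 0:
        raise ValueError("global batch must be divisible by gradient accumulation steps")
    return global_batch // grad_acc_steps


for name, run in RUNS.items():
    steps = optimizer_steps(run["samples"], run["global_batch"], run["epochs"])
    print(f"{name}: ~{steps} steps, warmup is {run['warmup'] / steps:.2%} of training")
```

For the v0.3 instruction phase, a global batch of 256 with 8 accumulation steps leaves 256 / 8 = 32 sequences per forward/backward pass across all devices. The changelog does not state the device count, so the per-device micro-batch cannot be derived from the table alone.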
Instruction Phase Data Task Types
Task Type | v0.2 | v0.3 |
---|---|---|
Speech Multiturn | None | 150k (mostly 2 turns; around 10k with >=4 turns) |
Speech QA | 679k samples | 1.332M samples |
Transcription | 250k samples (using a special token to denote a transcription task) | 400k samples (using 6 different prompts) |
Noise Audio | None | 8k samples (using Qwen2.5-72B to generate diverse synthetic answers for randomly generated sound tokens, with lengths matching the distribution of the Speech QA prompts) |
Text-only | None | 150k samples: 100k multiturn + 50k single-turn |
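
As a quick sanity check, the per-task counts can be summed and compared with the instruction-phase data sizes reported in the overall comparison. The sketch below only does the arithmetic on the published, rounded figures. Whether the 165k phase 3 samples are included in the task-type breakdown is not stated, so adding them to the reported v0.3 total is an assumption. `TASK_COUNTS_K`, `REPORTED_TOTAL_K`, and `total_k` are hypothetical names.

```python
# Per-task sample counts in thousands, copied from the task-type table.
TASK_COUNTS_K = {
    "v0.2": {"Speech QA": 679, "Transcription": 250},
    "v0.3": {
        "Speech Multiturn": 150,
        "Speech QA": 1332,
        "Transcription": 400,
        "Noise Audio": 8,
        "Text-only": 150,
    },
}

# Instruction-phase data sizes in thousands, from the overall comparison.
# Assumption: the v0.3 figure adds the 165k phase 3 samples to the 1.89M.
REPORTED_TOTAL_K = {"v0.2": 929, "v0.3": 1890 + 165}


def total_k(version: str) -> int:
    """Sum of all task-type counts (in thousands) for one release."""
    return sum(TASK_COUNTS_K[version].values())


for version in TASK_COUNTS_K:
    gap = REPORTED_TOTAL_K[version] - total_k(version)
    print(f"{version}: tasks sum to {total_k(version)}k, "
          f"reported {REPORTED_TOTAL_K[version]}k, gap {gap}k")
```

The v0.2 counts add up exactly to the reported 929k. The v0.3 counts sum to 2,040k against 2,055k reported, a 15k gap that the changelog does not explain; rounding in the published figures is one possible cause.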