🇨🇳中文 | 🌐English | 📖文档/Docs | ❓提问/Issues | 💬讨论/Discussions | ⚔️竞技场/Arena
This project is developed based on the Mixtral model released by Mistral.ai, which uses a Sparse Mixture of Experts (MoE) architecture. Large-scale unannotated Chinese data was used for incremental training, yielding the Chinese-Mixtral base model; further instruction fine-tuning produced the Chinese-Mixtral-Instruct instruction model. The model natively supports a 32K context (tested up to 128K), processes long texts effectively, and shows significant improvements in areas such as mathematical reasoning and code generation. When using llama.cpp for quantized inference, as little as 16GB of RAM (or VRAM) suffices.
Paper: [Cui and Yao, 2024] Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral [Blog (in Chinese)]
- 🚀 Open-sourced Chinese Mixtral base model, incrementally trained in Chinese on top of Mixtral-8x7B-v0.1
- 🚀 Open-sourced Chinese Mixtral-Instruct instruction model, further fine-tuned based on the Chinese Mixtral
- 🚀 Open-sourced pre-training scripts and fine-tuning scripts for instructions, enabling users to further train or fine-tune the model as needed
- 🚀 Tutorial for quick local deployment and quantization of large models using personal computer CPU/GPU
- 🚀 Supports 🤗transformers, llama.cpp, text-generation-webui, LangChain, privateGPT, vLLM and other Mixtral ecosystem components
Chinese LLaMA-2 & Alpaca-2 Large Models | Chinese LLaMA & Alpaca Large Models | Multimodal Chinese LLaMA & Alpaca Large Models | Multimodal VLE | Chinese MiniRBT | Chinese LERT | Chinese-English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge Distillation Tool TextBrewer | Model Pruning Tool TextPruner | Distillation and Pruning Integrated GRAIN
[Apr 30, 2024] Chinese-LLaMA-Alpaca-3 project introduces Llama-3-Chinese-8B and Llama-3-Chinese-8B-Instruct, based on Meta's Llama-3. Check: https://github.com/ymcui/Chinese-LLaMA-Alpaca-3
[Mar 27, 2024] Added 1-bit/2-bit/3-bit GGUF models: [🤗HF]. This project is now also listed on Synced's SOTA! model platform, welcome to follow: https://sota.jiqizhixin.com/project/chinese-mixtral
[Mar 26, 2024] Added an OpenAI-style API deployment method. See: 📚 v1.2 Release Notes
[Mar 5, 2024] Released pre-training and instruction fine-tuning scripts; the technical report is also available. See: 📚 v1.1 Release Notes
[Jan 29, 2024] 🚀 Official release of Chinese-Mixtral (Base Model), Chinese-Mixtral-Instruct (Instruction/Chat Model). For more details, see: 📚 v1.0 Release Notes
Chapter | Description |
---|---|
💁🏻♂️ Model Introduction | Brief introduction to the technical features of the models related to this project |
⏬ Model Download | Download address for Chinese Mixtral large model |
💻 Inference and Deployment | How to quantize the model and deploy it on a personal computer |
💯 Model Performance | Introduction to the model's performance in certain tasks |
📝 Training and Fine-tuning | How to train and fine-tune the Chinese Mixtral large model |
❓ Frequently Asked Questions | Responses to some common questions |
This project open-sources the Chinese Mixtral and Chinese Mixtral-Instruct models developed based on the Mixtral model, with the following main features:
Mixtral is a Sparse Mixture of Experts model. This model significantly differs from mainstream large models like LLaMA in several aspects:
- Each FFN layer contains 8 different "experts" (fully connected layers), with the best 2 activated based on gating values.
- Each token in the input sequence selects its experts independently, rather than the entire sequence sharing one set of experts.
- The actual parameter count is about 46.7B, with around 13B activated during inference.
Below is a structural diagram from the Mixtral paper:
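To make the routing concrete, below is a minimal PyTorch sketch of the top-2 expert selection described above. Dimensions, module names, and the plain-MLP experts are illustrative assumptions for readability, not Mixtral's actual implementation (whose experts use a gated SwiGLU MLP).

```python
# Minimal sketch of Mixtral-style top-2 expert routing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoeFFN(nn.Module):
    def __init__(self, hidden_size=4096, ffn_size=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        # Real Mixtral experts are gated SwiGLU MLPs; plain MLPs here for brevity.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, ffn_size),
                          nn.SiLU(),
                          nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (num_tokens, hidden_size)
        logits = self.gate(x)                    # each token routes independently
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):              # weighted sum of 2 expert outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```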
Unlike the Chinese-LLaMA-Alpaca and Chinese-LLaMA-Alpaca-2 projects, the Mixtral model natively supports a 32K context (tested up to 128K). Users can use a single model to solve various tasks of different lengths.
Here is a comparison of the models in this project and the recommended use cases. For chat interactions, please choose the Instruct version.
Comparison Item | Chinese Mixtral | Chinese Mixtral-Instruct |
---|---|---|
Model Type | Base Model | Instruction/Chat Model (akin to ChatGPT) |
Model Size | 8x7B (about 13B activated) | 8x7B (about 13B activated) |
Number of Experts | 8 (2 activated) | 8 (2 activated) |
Training Type | Causal-LM (CLM) | Instruction fine-tuning |
Training Method | QLoRA + Full emb/lm-head | QLoRA + Full emb/lm-head |
Based on Which Model | Original Mixtral-8x7B-v0.1 | Chinese Mixtral |
Training Corpus | Unannotated general corpus | Annotated instruction data |
Vocabulary Size | Original vocabulary, 32000 | Original vocabulary, 32000 |
Supported Context Length | 32K (tested up to 128K) | 32K (tested up to 128K) |
Input Template | Not required | Required to apply Mixtral-Instruct template |
Applicable Scenarios | Text continuation | QA, chat, etc. |
Three different types of models are provided below:
- Full Version Model: Can be used directly without any merging steps, recommended for users with sufficient network bandwidth.
- LoRA Version Model: Cannot be used alone; it must be merged with the original Mixtral-8x7B-v0.1 to obtain a full version model. Recommended for users with limited network bandwidth who already have the original Mixtral. For the merging method, please refer to: 💻 Model Merging Steps (a generic sketch also follows the download table below)
- GGUF Version Model: A GGUF quantized version model compatible with tools like llama.cpp, recommended for users who only need to perform inference deployment.
Model Name | Type | Setting | Full Version (87 GB) | LoRA Version (2.4 GB) | GGUF Version |
---|---|---|---|---|---|
Chinese-Mixtral | Base Model | 8x7B | [Baidu] [🤗HF] [🤖ModelScope] | [Baidu] [🤗HF] [🤖ModelScope] | [🤗HF] |
Chinese-Mixtral-Instruct | Instruction Model | 8x7B | [Baidu] [🤗HF] [🤖ModelScope] | [Baidu] [🤗HF] [🤖ModelScope] | [🤗HF] |
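For the LoRA version, the project's own merging steps are documented in the wiki linked above; as a rough illustration, a generic 🤗peft-based merge might look like the sketch below (local paths are placeholders).

```python
# Generic LoRA-merge sketch using 🤗 peft (the project ships its own merge
# script -- see the Model Merging Steps wiki; paths here are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "path/to/chinese-mixtral-lora")
model = model.merge_and_unload()          # fold LoRA deltas into base weights
model.save_pretrained("chinese-mixtral-merged")

# Vocabulary is unchanged (32000), so the original tokenizer can be reused.
tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
tok.save_pretrained("chinese-mixtral-merged")
```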
Note
If you are unable to access HF directly, consider using a mirror site (such as hf-mirror.com); please look up usage instructions yourself.
The related models in this project mainly support the following quantization, inference, and deployment methods, please refer to the respective tutorials for specific content.
Tool | Features | CPU | GPU | Quantization | GUI | API | vLLM | Tutorial |
---|---|---|---|---|---|---|---|---|
llama.cpp | Rich quantization options and efficient local inference | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | [link] |
🤗Transformers | Native transformers inference interface | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [link] |
OpenAI-compatible API | Server demo with an OpenAI-style API interface | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | [link] |
text-generation-webui | Frontend Web UI deployment method | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [link] |
LangChain | Open-source framework for large model applications suitable for secondary development | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | [link] |
privateGPT | Local multi-document Q&A framework | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | [link] |
LM Studio | Multi-platform chat software (with interface) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [link] |
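As a quick taste of the llama.cpp route, here is a minimal sketch using the llama-cpp-python bindings; the GGUF file name and parameter values are assumptions, and the tutorial linked in the table remains the authoritative guide.

```python
# Quick local-inference sketch with llama-cpp-python (file name and
# parameter values are illustrative; see the llama.cpp tutorial link above).
from llama_cpp import Llama

llm = Llama(
    model_path="chinese-mixtral-instruct.Q4_K.gguf",  # any GGUF quant from above
    n_ctx=32768,        # the model natively supports a 32K context
    n_gpu_layers=-1,    # offload all layers to GPU if available; 0 = CPU only
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "请介绍一下北京的名胜古迹"}],
)
print(out["choices"][0]["message"]["content"])
```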
To evaluate the effectiveness of the related models, this project conducted both generation-quality evaluation and objective evaluation (NLU-type), assessing the large models from different perspectives. Users are recommended to test on the tasks they care about and choose the models best suited to those tasks.
- Inspired by FastChat's Chatbot Arena, this project has launched an online model arena where users can browse and evaluate the quality of model responses. The arena provides metrics such as win rate and Elo rating, and the win rates of head-to-head model matchups can be viewed. ⚔️ Model Arena: http://llm-arena.ymcui.com
- The examples directory provides output samples from Chinese-Mixtral-Instruct and Chinese-Alpaca-2-13B, scored by GPT-4: Chinese-Mixtral-Instruct averages 8.20 and Chinese-Alpaca-2-13B averages 7.05. 📄 Output Sample Comparison: examples
C-Eval is a comprehensive Chinese foundation-model evaluation suite, where the validation and test sets contain 1.3K and 12.3K multiple-choice questions, respectively, covering 52 subjects. For C-Eval inference code, please refer to this project: 📖GitHub Wiki
Models | Type | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
---|---|---|---|---|---|
Chinese-Mixtral-Instruct | chat | 51.7 | 55.0 | 50.0 | 51.5 |
Chinese-Mixtral | base | 45.8 | 54.2 | 43.1 | 49.1 |
Mixtral-8x7B-Instruct-v0.1 | chat | 51.6 | 54.0 | 48.7 | 50.7 |
Mixtral-8x7B-v0.1 | base | 47.3 | 54.6 | 46.1 | 50.3 |
Chinese-Alpaca-2-13B | chat | 44.3 | 45.9 | 42.6 | 44.0 |
Chinese-LLaMA-2-13B | base | 40.6 | 42.7 | 38.0 | 41.6 |
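The project's actual C-Eval inference code lives in the wiki linked above; as background, a common way to score such multiple-choice items zero-shot is to compare the model's next-token probabilities for the option letters, sketched below (the HF repo id and prompt wording are assumptions).

```python
# Generic zero-shot multiple-choice scoring: compare next-token probabilities
# of the option letters. Not necessarily the project's wiki implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "hfl/chinese-mixtral-instruct"  # assumed repo id
tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.float16, device_map="auto")

def choose(question: str, options: dict[str, str]) -> str:
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\n答案："
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]        # distribution over next token
    # Take the last sub-token of each letter (tokenizers may prepend a space token).
    letter_ids = {k: tok(k, add_special_tokens=False).input_ids[-1] for k in options}
    return max(letter_ids, key=lambda k: logits[letter_ids[k]])

print(choose("中国的首都是哪座城市？", {"A": "上海", "B": "北京", "C": "广州", "D": "深圳"}))
```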
CMMLU is another comprehensive Chinese evaluation dataset specifically designed to assess the knowledge and reasoning ability of language models in Chinese contexts. It covers 67 topics from basic subjects to advanced professional levels, with a total of 11.5K multiple-choice questions. For CMMLU inference code, please refer to this project: 📖GitHub Wiki
Models | Type | Test (0-shot) | Test (5-shot) |
---|---|---|---|
Chinese-Mixtral-Instruct | chat | 50.0 | 53.0 |
Chinese-Mixtral | base | 42.5 | 51.0 |
Mixtral-8x7B-Instruct-v0.1 | chat | 48.2 | 51.6 |
Mixtral-8x7B-v0.1 | base | 44.3 | 51.6 |
Chinese-Alpaca-2-13B | chat | 43.2 | 45.5 |
Chinese-LLaMA-2-13B | base | 38.9 | 42.5 |
MMLU is an English evaluation dataset for assessing natural language understanding abilities. It is one of the main datasets used today for evaluating the capabilities of large models. The validation and test sets contain 1.5K and 14.1K multiple-choice questions, respectively, covering 57 subjects. For MMLU inference code, please refer to this project: 📖GitHub Wiki
Models | Type | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
---|---|---|---|---|---|
Chinese-Mixtral-Instruct | chat | 65.1 | 69.6 | 67.5 | 69.8 |
Chinese-Mixtral | base | 63.2 | 67.1 | 65.5 | 68.3 |
Mixtral-8x7B-Instruct-v0.1 | chat | 68.5 | 70.4 | 68.2 | 70.2 |
Mixtral-8x7B-v0.1 | base | 64.9 | 69.0 | 67.0 | 69.5 |
Chinese-Alpaca-2-13B | chat | 49.6 | 53.2 | 50.9 | 53.5 |
Chinese-LLaMA-2-13B | base | 46.8 | 50.0 | 46.6 | 51.8 |
LongBench is a benchmark for evaluating the long-text understanding abilities of large models. It consists of 6 categories and 20 different tasks, most of which have an average length of 5K-15K words, totaling about 4.75K test items. Below are the evaluation results of this project's model on these Chinese tasks (including coding tasks). For LongBench inference code, please refer to this project: 📖GitHub Wiki
Models | Single-doc QA | Multi-doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Task | Avg |
---|---|---|---|---|---|---|---|
Chinese-Mixtral-Instruct | 50.3 | 34.2 | 16.4 | 42.0 | 56.1 | 89.5 | 48.1 |
Chinese-Mixtral | 32.0 | 23.7 | 0.4 | 42.5 | 27.4 | 14.0 | 23.3 |
Mixtral-8x7B-Instruct-v0.1 | 56.5 | 35.7 | 15.4 | 46.0 | 63.6 | 98.0 | 52.5 |
Mixtral-8x7B-v0.1 | 35.5 | 9.5 | 16.4 | 46.5 | 57.2 | 83.5 | 41.4 |
Chinese-Alpaca-2-13B-16K | 47.9 | 26.7 | 13.0 | 22.3 | 46.6 | 21.5 | 29.7 |
Chinese-LLaMA-2-13B-16K | 36.7 | 17.7 | 3.1 | 29.8 | 13.8 | 3.0 | 17.3 |
Chinese-Alpaca-2-7B-64K | 44.7 | 28.1 | 14.4 | 39.0 | 44.6 | 5.0 | 29.3 |
Chinese-LLaMA-2-7B-64K | 27.2 | 16.4 | 6.5 | 33.0 | 7.8 | 5.0 | 16.0 |
The quantized versions of the Chinese-Mixtral model were benchmarked under llama.cpp; results are shown in the table below.
 | F16 | Q8_0 | Q6_K | Q5_K | Q5_0 | Q4_K | Q4_0 | Q3_K | IQ3_XXS | Q2_K | IQ2_XS | IQ2_XXS |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Size (GB) | 87.0 | 46.2 | 35.7 | 30.0 | 30.0 | 24.6 | 24.6 | 19.0 | 17.1 | 16.1 | 12.7 | 11.4 |
BPW | 16.0 | 8.50 | 6.57 | 5.69 | 5.52 | 4.87 | 4.53 | 3.86 | 3.14 | 2.96 | 2.34 | 2.10 |
PPL | - | 4.4076 | 4.4092 | 4.4192 | 4.4224 | 4.4488 | 4.4917 | 4.5545 | 4.5990 | 5.1846 | 6.9784 | 8.5981 |
M3 Max Speed | - | - | 36.0 | 36.9 | 35.7 | 31.2 | 27.8 | 37.6 | - | 29.1 | - | - |
A100 Speed | - | - | 29.9 | 22.6 | 20.5 | 21.7 | 17.1 | 21.7 | 20.6 | 20.3 | 23.7 | 22.5 |
Note
- Model Size: in GB
- BPW (Bits-Per-Weight): average bits per parameter; e.g., Q6_K stores an average of 6.57 bits per weight
- PPL (Perplexity): measured with a 4K context; lower is better
- Generation Speed: ms/token on Apple M3 Max (Metal) and NVIDIA A100 (40GB); lower is better
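As a sanity check on the BPW figures, bits-per-weight is just the file size in bits divided by the ~46.7B parameter count, assuming the sizes in the table are binary gigabytes (GiB):

```python
# Sanity-check BPW from file size: bits in the file divided by parameter count.
# Assumes the "Size (GB)" row is binary gigabytes (GiB) and ~46.7B parameters.
PARAMS = 46.7e9

def bpw(size_gib: float) -> float:
    return size_gib * 2**30 * 8 / PARAMS

print(f"{bpw(35.7):.2f}")   # Q6_K    -> ~6.57, matching the table
print(f"{bpw(11.4):.2f}")   # IQ2_XXS -> ~2.10
```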
Taking Chinese-Mixtral-Q4_0 as an example, the following figure shows the trend of PPL changes under different context lengths, with two different sets of pure text data selected. The experimental results indicate that the context length supported by the Mixtral model exceeds the nominal 32K, and it still performs well in contexts of 64K+ (tested up to 128K).
- Based on the original Mixtral model, incremental training was carried out using large-scale unlabeled data to obtain the Chinese-Mixtral base model.
- Training data was the same unannotated data used for the base models in the Chinese-LLaMA-Alpaca project, totaling about 20GB of plain text files.
- Training code and tutorial: 📖 Pre-training Scripts Wiki
- Based on Chinese-Mixtral, further fine-tuning was done using annotated instruction data to obtain the Chinese-Mixtral-Instruct instruction model.
- Training data consisted of instruction data from the Chinese-LLaMA-Alpaca-2 project, totaling about 5 million instruction samples.
- Training code and tutorial: 📖 Instruction Fine-Tuning Scripts Wiki
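The exact hyperparameters are in the wiki scripts above; conceptually, the "QLoRA + full emb/lm-head" setup named in the model comparison table might be configured as in the following sketch (all values here are placeholders, not the project's actual settings).

```python
# Sketch of a "QLoRA + full emb/lm-head" setup: base model loaded in 4-bit,
# LoRA adapters on attention projections, embed_tokens/lm_head trained in full.
# Hyperparameters are placeholders, not the project's settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", quantization_config=bnb, device_map="auto")

lora = LoraConfig(
    r=64, lora_alpha=128, lora_dropout=0.05,          # placeholder values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],      # trained in full
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```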
```
<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
```
Note: `<s>` and `</s>` are special tokens indicating the start and end of a sequence, while `[INST]` and `[/INST]` are ordinary strings.
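As an illustration, a small helper that assembles this template for multi-turn chat might look like the following (a hypothetical helper, not part of the project's code):

```python
# Assemble the instruction template above for multi-turn chat.
# If you tokenize the result, pass add_special_tokens=False so the tokenizer
# does not prepend a second <s> BOS token.
def build_prompt(history: list[tuple[str, str]], instruction: str) -> str:
    prompt = "<s>"
    for past_instruction, answer in history:
        prompt += f" [INST] {past_instruction} [/INST] {answer}</s>"
    prompt += f" [INST] {instruction} [/INST]"
    return prompt

print(build_prompt([("你好", "你好！很高兴为你服务。")], "请介绍一下Mixtral模型"))
```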
Please make sure to check the FAQ for existing solutions before raising an Issue. For specific questions and answers, refer to the project's 📖GitHub Wiki
Question 1: Will there be training with more data in the future? Will there be RLHF/DPO alignment?
Question 2: Why wasn't there an expansion of the Chinese vocabulary in this model?
Question 3: Is the downstream ecosystem of Mixtral supported?
```
@article{chinese-mixtral,
  title={Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral},
  author={Cui, Yiming and Yao, Xin},
  journal={arXiv preprint arXiv:2403.01851},
  url={https://arxiv.org/abs/2403.01851},
  year={2024}
}
```
This project is based on the Mixtral model released by Mistral.ai and developed accordingly. Please strictly adhere to Mixtral's open-source license agreement during use. If third-party codes are involved, please comply with the relevant open-source licenses. The accuracy of the content generated by the model may be affected by computational methods, random factors, and loss of quantization precision. Therefore, this project does not guarantee the accuracy of the model's output and will not bear responsibility for any losses arising from the use of related resources and output results. If the models from this project are used for commercial purposes, developers should comply with local laws and regulations to ensure the compliance of the model output content. This project does not bear responsibility for any products or services derived from its use.
If you have any questions, please submit them in the GitHub Issues. Please raise issues politely to build a harmonious discussion community.
- Before submitting an issue, please check if the FAQ can solve your problem, and it is also advisable to review past issues.
- When submitting an issue, please use the Issue template set by this project to help quickly identify specific problems.
- Duplicate or unrelated issues will be handled by stale-bot; thank you for your understanding.