Update README and add images #308

116 changes: 52 additions & 64 deletions README.md
@@ -287,82 +287,70 @@ If you're seeking to explore the diverse capabilities within Yi's thriving famil

### 📊 Base model performance

| Model | MMLU | CMMLU | C-Eval | GAOKAO | BBH | Common-sense Reasoning | Reading Comprehension | Math & Code |
| :------------ | :------: | :------: | :------: | :------: | :------: | :--------------------: | :-------------------: | :---------: |
| | 5-shot | 5-shot | 5-shot | 0-shot | 3-shot@1 | - | - | - |
| LLaMA2-34B | 62.6 | - | - | - | 44.1 | 69.9 | 68.0 | 26.0 |
| LLaMA2-70B | 68.9 | 53.3 | - | 49.8 | 51.2 | 71.9 | 69.4 | 36.8 |
| Baichuan2-13B | 59.2 | 62.0 | 58.1 | 54.3 | 48.8 | 64.3 | 62.4 | 23.0 |
| Qwen-14B | 66.3 | 71.0 | 72.1 | 62.5 | 53.4 | 73.3 | 72.5 | **39.8** |
| Skywork-13B | 62.1 | 61.8 | 60.6 | 68.1 | 41.7 | 72.4 | 61.4 | 24.9 |
| InternLM-20B | 62.1 | 59.0 | 58.8 | 45.5 | 52.5 | 78.3 | - | 30.4 |
| Aquila-34B | 67.8 | 71.4 | 63.1 | - | - | - | - | - |
| Falcon-180B | 70.4 | 58.0 | 57.8 | 59.0 | 54.0 | 77.3 | 68.8 | 34.0 |
| Yi-6B | 63.2 | 75.5 | 72.0 | 72.2 | 42.8 | 72.3 | 68.7 | 19.8 |
| Yi-6B-200K | 64.0 | 75.3 | 73.5 | 73.9 | 42.0 | 72.0 | 69.1 | 19.0 |
| **Yi-34B** | **76.3** | **83.7** | 81.4 | 82.8 | **54.3** | **80.1** | 76.4 | 37.1 |
| Yi-34B-200K | 76.1 | 83.6 | **81.9** | **83.4** | 52.7 | 79.7 | **76.6** | 36.3 |

While benchmarking open-source models, we have observed a disparity between the
results generated by our pipeline and those reported in public sources (e.g.,
OpenCompass). A closer investigation of this difference shows that various
models may employ different prompts, post-processing strategies, and sampling
techniques, potentially resulting in significant variations in the outcomes.
Our prompts and post-processing strategy remain consistent with the original
benchmarks, and greedy decoding is employed during evaluation without any
post-processing of the generated content. For scores that were not reported by
the original authors (including scores reported with different settings), we
attempt to obtain results with our own pipeline.
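
To make this setup concrete, the snippet below is a minimal sketch of greedy-decoding evaluation using Hugging Face `transformers`; the checkpoint name, prompt, and generation length are illustrative placeholders rather than the exact harness used to produce the scores above.

```python
# Minimal sketch of the greedy-decoding evaluation described above.
# The checkpoint name, prompt, and max_new_tokens are illustrative choices.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "01-ai/Yi-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

prompt = "Question: ...\nAnswer:"  # the benchmark's original prompt format is kept unchanged
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: do_sample=False takes the argmax token at every step,
# and the raw continuation is scored with no post-processing.
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```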

To evaluate the model's capability extensively, we adopted the methodology
outlined in Llama 2. Specifically, we included PIQA, SIQA, HellaSwag,
WinoGrande, ARC, OBQA, and CSQA to assess common-sense reasoning. SQuAD, QuAC,
and BoolQ were incorporated to evaluate reading comprehension. CSQA was tested
exclusively with a 7-shot setup, while all other tests were conducted with a
0-shot configuration. Additionally, we introduced GSM8K (8-shot@1), MATH
(4-shot@1), HumanEval (0-shot@1), and MBPP (3-shot@1) under the category
"Math & Code". Due to technical constraints, we did not test Falcon-180B on
QuAC and OBQA; its score is derived by averaging the scores on the remaining
tasks. Since the scores for these two tasks are generally lower than the
average, we believe that Falcon-180B's performance was not underestimated.
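
For the few-shot settings above (for example, GSM8K at 8-shot@1), prompts are built by prepending solved exemplars to the test question. The helper below is a hypothetical illustration of that format, not the exact template used for the reported numbers.

```python
# Hypothetical sketch of assembling an n-shot prompt (e.g., the 8-shot GSM8K
# setting). The exemplar texts and the template are placeholders.
def build_few_shot_prompt(exemplars, question):
    """Concatenate solved (question, answer) pairs, then append the test question."""
    parts = [f"Question: {q}\nAnswer: {a}\n" for q, a in exemplars]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

# For GSM8K (8-shot@1), eight solved training examples would be passed in.
exemplars = [
    ("If you have 3 apples and buy 2 more, how many do you have?",
     "3 + 2 = 5. The answer is 5."),
]
print(build_few_shot_prompt(exemplars, "A book costs 7 dollars. How much do 4 books cost?"))
```
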
![Base model performance heat map](images/1.jpg)

**Color Legend:**
- Green: high values.
- Yellow: values somewhat below the highest.
- Red: low values.

**Comprehensive Performance Evaluation of Yi-34B in Global Model Benchmarks**

- **Overall Performance in Global Benchmarks**: In broader global evaluations of large models, Yi-34B also performs outstandingly on key benchmark sets that reflect a model's comprehensive capabilities, such as MMLU (Massive Multitask Language Understanding) and BBH.

- **Yi-34B's Strengths in Diverse Domains**: It excels in MMLU, common-sense reasoning, reading comprehension, and other indicators, aligning closely with Hugging Face's evaluations.

- **Areas for Improvement**: However, like LLaMA2, the Yi series models lag slightly behind GPT models on the math and code evaluations GSM8K and MBPP. Zero-One Infinity's technical approach aims to preserve the model's general capabilities as much as possible during pre-training, so an excessive amount of math and code data was deliberately not incorporated.

- **Future Developments and Research**: The research team has previously explored mathematical reasoning in depth in work such as "MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning". Future open-source releases from Zero-One Infinity are planned to include models with continued training specialized for code and mathematical abilities.

### 📊 Chat model performance

| Model | MMLU | MMLU | CMMLU | CMMLU | C-Eval(val)<sup>*</sup> | C-Eval(val)<sup>*</sup> | Truthful QA | BBH | BBH | GSM8k | GSM8k |
| ----------------------- | --------- | --------- | --------- | --------- | ----------------------- | ----------------------- | ----------- | --------- | --------- | --------- | --------- |
| | 0-shot | 5-shot | 0-shot | 5-shot | 0-shot | 5-shot | 0-shot | 0-shot | 3-shot | 0-shot | 4-shot |
| LLaMA2-13B-Chat | 50.88 | 47.33 | 27.47 | 35.08 | 27.93 | 35.88 | 36.84 | 32.90 | 58.22 | 36.85 | 2.73 |
| LLaMA2-70B-Chat | 59.42 | 59.86 | 36.10 | 40.99 | 34.99 | 41.31 | 53.95 | 42.36 | 58.53 | 47.08 | 58.68 |
| Baichuan2-13B-Chat | 55.09 | 50.14 | 58.64 | 59.47 | 56.02 | 54.75 | 48.98 | 38.81 | 47.15 | 45.72 | 23.28 |
| Qwen-14B-Chat | 63.99 | 64.98 | 67.73 | 70.57 | 66.12 | 70.06 | 52.49 | 49.65 | 54.98 | 59.51 | 61.18 |
| InternLM-Chat-20B | 55.55 | 57.42 | 53.55 | 53.75 | 51.19 | 53.57 | 51.75 | 42.41 | 36.68 | 15.69 | 43.44 |
| AquilaChat2-34B v1.2 | 65.15 | 66.70 | 67.51 | 70.02 | **82.99** | **89.38** | **64.33** | 20.12 | 34.28 | 11.52 | 48.45 |
| Yi-6B-Chat | 58.24 | 60.99 | 69.44 | 74.71 | 68.80 | 74.22 | 50.58 | 39.70 | 47.15 | 38.44 | 44.88 |
| Yi-6B-Chat-8bits(GPTQ) | 58.29 | 60.96 | 69.21 | 74.69 | 69.17 | 73.85 | 49.85 | 40.35 | 47.26 | 39.42 | 44.88 |
| Yi-6B-Chat-4bits(AWQ) | 56.78 | 59.89 | 67.70 | 73.29 | 67.53 | 72.29 | 50.29 | 37.74 | 43.62 | 35.71 | 38.36 |
| Yi-34B-Chat | **67.62** | 73.46 | **79.11** | **81.34** | 77.04 | 78.53 | 62.43 | 51.41 | **71.74** | **71.65** | **75.97** |
| Yi-34B-Chat-8bits(GPTQ) | 66.24 | **73.69** | 79.05 | 81.23 | 76.82 | 78.97 | 61.84 | **52.08** | 70.97 | 70.74 | 75.74 |
| Yi-34B-Chat-4bits(AWQ) | 65.77 | 72.42 | 78.21 | 80.50 | 75.71 | 77.27 | 61.84 | 48.30 | 69.39 | 70.51 | 74.00 |

We evaluated various benchmarks using both zero-shot and few-shot methods, except for TruthfulQA. Generally, the zero-shot approach is more common for chat models. Our evaluation strategy involves generating responses while following instructions explicitly or implicitly (such as through few-shot examples), and then isolating the relevant answers from the generated text (a sketch of this extraction step is shown below). Some models are not well suited to producing output in the specific format required by the instructions of a few datasets, which leads to suboptimal results.
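
The sketch below illustrates this extraction step; the regular expressions are placeholders, and the actual rules follow each benchmark's original answer format.

```python
# Rough sketch of isolating the relevant answer from a free-form generation.
# The patterns below are illustrative, not the exact extraction rules.
import re

def extract_choice(generation):
    """Return the first standalone multiple-choice letter (A-D), if any."""
    match = re.search(r"\b([A-D])\b", generation)
    return match.group(1) if match else None

def extract_final_number(generation):
    """Return the last number in the response (e.g., a GSM8K final answer)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return numbers[-1] if numbers else None

print(extract_choice("The correct option is B because ..."))            # -> B
print(extract_final_number("... so the total cost is 1,234 dollars."))  # -> 1234
```
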
![Chat model performance heat map](images/2.jpg)

**Color Legend:**
- Green: high values.
- Yellow: values somewhat below the highest.
- Red: low values.

![Chat model performance heat map](images/2.2.jpg)

<strong>*</strong>: C-Eval results are evaluated on the validation datasets.

### 📊 Quantized chat model performance

We also provide 4-bit (AWQ) and 8-bit (GPTQ) quantized Yi chat models. Evaluation results on various benchmarks show that the quantized models incur negligible losses while reducing the memory footprint. After testing different configurations of prompts and generation lengths, we highly recommend following the guidelines in the memory footprint table below when selecting a device to run our models. A loading sketch for the quantized checkpoints follows.
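
As a minimal loading sketch (assuming the 4-bit AWQ weights are published on the Hugging Face Hub as `01-ai/Yi-34B-Chat-4bits`, that `autoawq` is installed so `transformers` can load AWQ checkpoints, and that the tokenizer ships a chat template):

```python
# Minimal loading sketch for a quantized Yi chat model.
# The checkpoint name and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "01-ai/Yi-34B-Chat-4bits"  # 4-bit AWQ; swap in the 8-bit GPTQ variant if preferred
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Introduce yourself in one sentence."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```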

| | batch=1 | batch=4 | batch=16 | batch=32 |
| ----------------------- | ------- | ------- | -------- | -------- |
| Yi-34B-Chat | 65GiB | 68GiB | 76GiB | >80GiB |
| Yi-34B-Chat-8bits(GPTQ) | 35GiB | 37GiB | 46GiB | 58GiB |
| Yi-34B-Chat-4bits(AWQ) | 19GiB | 20GiB | 30GiB | 40GiB |
| Yi-6B-Chat | 12GiB | 13GiB | 15GiB | 18GiB |
| Yi-6B-Chat-8bits(GPTQ) | 7GiB | 8GiB | 10GiB | 14GiB |
| Yi-6B-Chat-4bits(AWQ) | 4GiB | 5GiB | 7GiB | 10GiB |
![Quantized chat model memory footprint heat map](images/3.jpg)

**Color Legend:**
- Green: high values.
- Yellow: values somewhat below the highest.
- Red: low values.

Note: All the numbers in the table represent the minimum recommended memory for running models of the corresponding size.
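
As a quick sanity check, a hypothetical helper (not part of this repository) can compare available GPU memory against the batch=1 column of the table above before loading a model:

```python
# Hypothetical helper: compare available GPU memory with the table's minimums.
import torch

MIN_GIB_BATCH1 = {  # batch=1 column from the memory footprint table above
    "Yi-34B-Chat": 65,
    "Yi-34B-Chat-8bits(GPTQ)": 35,
    "Yi-34B-Chat-4bits(AWQ)": 19,
    "Yi-6B-Chat": 12,
    "Yi-6B-Chat-8bits(GPTQ)": 7,
    "Yi-6B-Chat-4bits(AWQ)": 4,
}

def fits_on_gpu(model_key, device_index=0):
    """Return True if the GPU's total memory meets the minimum for batch=1."""
    total_gib = torch.cuda.get_device_properties(device_index).total_memory / 1024**3
    return total_gib >= MIN_GIB_BATCH1[model_key]

if torch.cuda.is_available():
    print(fits_on_gpu("Yi-34B-Chat-4bits(AWQ)"))
```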

![Quantized chat model memory footprint heat map](images/3.2.jpg)

### ⛔️ Limitations of chat model

The released chat model has been trained exclusively with Supervised Fine-Tuning (SFT). Compared to other standard chat models, our model produces more diverse responses, making it suitable for various downstream tasks, such as creative scenarios. This diversity is also expected to increase the likelihood of generating higher-quality responses, which will be advantageous for subsequent Reinforcement Learning (RL) training.
Binary file added images/1.jpg
Binary file added images/2.2.jpg
Binary file added images/2.jpg
Binary file added images/3.2.jpg
Binary file added images/3.jpg