diff --git a/README.md b/README.md
index 50367c14..e7325e61 100644
--- a/README.md
+++ b/README.md
@@ -287,82 +287,70 @@ If you're seeking to explore the diverse capabilities within Yi's thriving famil
### 📊 Base model performance

-| Model | MMLU | CMMLU | C-Eval | GAOKAO | BBH | Common-sense Reasoning | Reading Comprehension | Math & Code |
-| :------------ | :------: | :------: | :------: | :------: | :------: | :--------------------: | :-------------------: | :---------: |
-| | 5-shot | 5-shot | 5-shot | 0-shot | 3-shot@1 | - | - | - |
-| LLaMA2-34B | 62.6 | - | - | - | 44.1 | 69.9 | 68.0 | 26.0 |
-| LLaMA2-70B | 68.9 | 53.3 | - | 49.8 | 51.2 | 71.9 | 69.4 | 36.8 |
-| Baichuan2-13B | 59.2 | 62.0 | 58.1 | 54.3 | 48.8 | 64.3 | 62.4 | 23.0 |
-| Qwen-14B | 66.3 | 71.0 | 72.1 | 62.5 | 53.4 | 73.3 | 72.5 | **39.8** |
-| Skywork-13B | 62.1 | 61.8 | 60.6 | 68.1 | 41.7 | 72.4 | 61.4 | 24.9 |
-| InternLM-20B | 62.1 | 59.0 | 58.8 | 45.5 | 52.5 | 78.3 | - | 30.4 |
-| Aquila-34B | 67.8 | 71.4 | 63.1 | - | - | - | - | - |
-| Falcon-180B | 70.4 | 58.0 | 57.8 | 59.0 | 54.0 | 77.3 | 68.8 | 34.0 |
-| Yi-6B | 63.2 | 75.5 | 72.0 | 72.2 | 42.8 | 72.3 | 68.7 | 19.8 |
-| Yi-6B-200K | 64.0 | 75.3 | 73.5 | 73.9 | 42.0 | 72.0 | 69.1 | 19.0 |
-| **Yi-34B** | **76.3** | **83.7** | 81.4 | 82.8 | **54.3** | **80.1** | 76.4 | 37.1 |
-| Yi-34B-200K | 76.1 | 83.6 | **81.9** | **83.4** | 52.7 | 79.7 | **76.6** | 36.3 |
-
-While benchmarking open-source models, we have observed a disparity between the
-results generated by our pipeline and those reported in public sources (e.g.
-OpenCompass). Upon conducting a more in-depth investigation of this difference,
-we have discovered that various models may employ different prompts,
-post-processing strategies, and sampling techniques, potentially resulting in
-significant variations in the outcomes. Our prompt and post-processing strategy
-remains consistent with the original benchmark, and greedy decoding is employed
-during evaluation without any post-processing for the generated content. For
-scores that were not reported by the original authors (including scores reported
-with different settings), we try to get results with our pipeline.
-
-To evaluate the model's capability extensively, we adopted the methodology
-outlined in Llama2. Specifically, we included PIQA, SIQA, HellaSwag, WinoGrande,
-ARC, OBQA, and CSQA to assess common sense reasoning. SquAD, QuAC, and BoolQ
-were incorporated to evaluate reading comprehension. CSQA was exclusively tested
-using a 7-shot setup, while all other tests were conducted with a 0-shot
-configuration. Additionally, we introduced GSM8K (8-shot@1), MATH (4-shot@1),
-HumanEval (0-shot@1), and MBPP (3-shot@1) under the category "Math & Code". Due
-to technical constraints, we did not test Falcon-180 on QuAC and OBQA; the score
-is derived by averaging the scores on the remaining tasks. Since the scores for
-these two tasks are generally lower than the average, we believe that
-Falcon-180B's performance was not underestimated.
+![Base model performance heat map](images/1.jpg)
+**Color Legend:**
+- Green: Represents high values.
+- Yellow: Indicates values somewhat lower than the highest.
+- Red: Represents low values.
+
+**Evaluation Methods and Findings**
+
+- **Disparity in Results**: While benchmarking open-source models, a disparity has been noted between results from our pipeline and those reported by public sources like OpenCompass.
+- **Investigation Findings**: A deeper investigation reveals that variations in prompts, post-processing strategies, and sampling techniques across models may lead to significant outcome differences.
+- **Consistency**: Our methodology aligns with the original benchmarks: consistent prompts and post-processing strategies are used, and greedy decoding is applied during evaluation without any post-processing of the generated content (a minimal sketch of this setup is shown at the end of this section).
+- **Efforts to Retrieve Unreported Scores**: For scores that were not reported by the original authors (including scores reported with different settings), we attempt to obtain results with our own pipeline.
+- **Extensive Model Evaluation**: To evaluate the model's capability extensively, we adopted the methodology outlined in Llama2. Specifically, we included PIQA, SIQA, HellaSwag, WinoGrande, ARC, OBQA, and CSQA to assess common-sense reasoning, and SquAD, QuAC, and BoolQ to evaluate reading comprehension.
+- **Special Configurations**: CSQA was tested exclusively with a 7-shot setup, while all other tests were conducted with a 0-shot configuration. Additionally, we introduced GSM8K (8-shot@1), MATH (4-shot@1), HumanEval (0-shot@1), and MBPP (3-shot@1) under the category "Math & Code".
+- **Falcon-180B Caveat**: Falcon-180B was not tested on QuAC and OBQA due to technical constraints; its score is the average over the remaining tasks. Since these two tasks generally score below the average, Falcon-180B's performance is unlikely to be underestimated.
+
+**Comprehensive Performance Evaluation of Yi-34B in Global Model Benchmarks**
+
+- **Overall Performance in Global Benchmarks**: In broader, large-scale model evaluations, Yi-34B also performs strongly on key benchmark suites that reflect a model's overall capabilities, such as MMLU (Massive Multitask Language Understanding) and BBH.
+
+- **Yi-34B's Strengths in Diverse Domains**: It excels in MMLU, common-sense reasoning, reading comprehension, and other metrics, aligning closely with Hugging Face's evaluations.
+
+- **Areas of Improvement**: However, like LLaMA2, the Yi series models trail GPT models slightly on the math and code evaluations GSM8K and MBPP. 01.AI's technical approach favors preserving the model's general capabilities as much as possible during pre-training, so an excessive amount of math and code data was deliberately not incorporated.
+
+- **Future Developments and Research**: The research team has previously explored mathematical reasoning in depth in works such as "MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning". Future open-source releases will include models with continued training specialized for code and mathematical abilities.
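+
+As a rough illustration of the evaluation setup described above (few-shot prompting, greedy decoding, and answer isolation without any other post-processing), a minimal sketch using Hugging Face `transformers` might look like the following. The checkpoint name (`01-ai/Yi-6B`), the prompt format, and the regex-based answer extraction are illustrative assumptions, not the exact pipeline behind the reported scores.
+
+```python
+# Minimal sketch of the described setup: build a few-shot prompt, generate
+# with greedy decoding, and isolate the relevant answer from the output.
+import re
+
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+MODEL_NAME = "01-ai/Yi-6B"  # any Yi base checkpoint
+
+tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
+)
+
+
+def build_few_shot_prompt(examples, question):
+    """Concatenate k solved examples followed by the target question."""
+    demos = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
+    return f"{demos}\n\nQuestion: {question}\nAnswer:"
+
+
+def greedy_answer(prompt, max_new_tokens=256):
+    """Generate with greedy decoding (no sampling) and extract the answer."""
+    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
+    completion = tokenizer.decode(
+        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
+    )
+    # Isolate the relevant answer span (here: the last number, GSM8K-style).
+    numbers = re.findall(r"-?\d+\.?\d*", completion)
+    return numbers[-1] if numbers else completion.strip()
+
+
+# Example call in an 8-shot style (demonstrations truncated to two for brevity).
+shots = [("2 + 2 = ?", "4"), ("10 - 3 = ?", "7")]
+print(greedy_answer(build_few_shot_prompt(shots, "15 + 27 = ?")))
+```
+
+The key points are that decoding is greedy (`do_sample=False`) and that only the isolated answer span is compared against the reference; the exact prompts and extraction rules differ per benchmark.
+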
### 📊 Chat model performance

-| Model | MMLU | MMLU | CMMLU | CMMLU | C-Eval(val)* | C-Eval(val)* | Truthful QA | BBH | BBH | GSM8k | GSM8k |
-| ----------------------- | --------- | --------- | --------- | --------- | ----------------------- | ----------------------- | ----------- | --------- | --------- | --------- | --------- |
-| | 0-shot | 5-shot | 0-shot | 5-shot | 0-shot | 5-shot | 0-shot | 0-shot | 3-shot | 0-shot | 4-shot |
-| LLaMA2-13B-Chat | 50.88 | 47.33 | 27.47 | 35.08 | 27.93 | 35.88 | 36.84 | 32.90 | 58.22 | 36.85 | 2.73 |
-| LLaMA2-70B-Chat | 59.42 | 59.86 | 36.10 | 40.99 | 34.99 | 41.31 | 53.95 | 42.36 | 58.53 | 47.08 | 58.68 |
-| Baichuan2-13B-Chat | 55.09 | 50.14 | 58.64 | 59.47 | 56.02 | 54.75 | 48.98 | 38.81 | 47.15 | 45.72 | 23.28 |
-| Qwen-14B-Chat | 63.99 | 64.98 | 67.73 | 70.57 | 66.12 | 70.06 | 52.49 | 49.65 | 54.98 | 59.51 | 61.18 |
-| InternLM-Chat-20B | 55.55 | 57.42 | 53.55 | 53.75 | 51.19 | 53.57 | 51.75 | 42.41 | 36.68 | 15.69 | 43.44 |
-| AquilaChat2-34B v1.2 | 65.15 | 66.70 | 67.51 | 70.02 | **82.99** | **89.38** | **64.33** | 20.12 | 34.28 | 11.52 | 48.45 |
-| Yi-6B-Chat | 58.24 | 60.99 | 69.44 | 74.71 | 68.80 | 74.22 | 50.58 | 39.70 | 47.15 | 38.44 | 44.88 |
-| Yi-6B-Chat-8bits(GPTQ) | 58.29 | 60.96 | 69.21 | 74.69 | 69.17 | 73.85 | 49.85 | 40.35 | 47.26 | 39.42 | 44.88 |
-| Yi-6B-Chat-4bits(AWQ) | 56.78 | 59.89 | 67.70 | 73.29 | 67.53 | 72.29 | 50.29 | 37.74 | 43.62 | 35.71 | 38.36 |
-| Yi-34B-Chat | **67.62** | 73.46 | **79.11** | **81.34** | 77.04 | 78.53 | 62.43 | 51.41 | **71.74** | **71.65** | **75.97** |
-| Yi-34B-Chat-8bits(GPTQ) | 66.24 | **73.69** | 79.05 | 81.23 | 76.82 | 78.97 | 61.84 | **52.08** | 70.97 | 70.74 | 75.74 |
-| Yi-34B-Chat-4bits(AWQ) | 65.77 | 72.42 | 78.21 | 80.50 | 75.71 | 77.27 | 61.84 | 48.30 | 69.39 | 70.51 | 74.00 |
-
-We evaluated various benchmarks using both zero-shot and few-shot methods, except for TruthfulQA. Generally, the zero-shot approach is more common in chat models. Our evaluation strategy involves generating responses while following instructions explicitly or implicitly (such as using few-shot examples). We then isolate relevant answers from the generated text. Some models are not well-suited to produce output in the specific format required by instructions in few datasets, which leads to suboptimal results.
+![Chat model performance heat map](images/2.jpg)
+**Color Legend:**
+- Green: Represents high values.
+- Yellow: Indicates values somewhat lower than the highest.
+- Red: Represents low values.
+
+![Chat model performance heat map](images/2.2.jpg)
+**Performance Evaluation**
+- **Evaluation Methods**: We evaluated various benchmarks using both zero-shot and few-shot methods, except for TruthfulQA.
+
+- **Zero-Shot vs. Few-Shot**: In chat models, the zero-shot approach is more commonly employed.
+
+- **Evaluation Strategy**: Our evaluation strategy involves generating responses while following instructions explicitly or implicitly (such as using few-shot examples). We then isolate the relevant answers from the generated text.
+
+- **Challenges Faced**: Some models are not well suited to producing output in the specific format required by the instructions of a few datasets, which leads to suboptimal results.

*: C-Eval results are evaluated on the validation datasets

### 📊 Quantized chat model performance

-We also provide both 4-bit (AWQ) and 8-bit (GPTQ) quantized Yi chat models. Evaluation results on various benchmarks have shown that the quantized models have negligible losses. Additionally, they reduce the memory footprint size. After testing different configurations of prompts and generation lengths, we highly recommend following the guidelines in the memory footprint table below when selecting a device to run our models.
+- **Quantized Models Offered**: We also provide both 4-bit (AWQ) and 8-bit (GPTQ) quantized Yi chat models.
+- **Performance Evaluation**: Evaluation results on various benchmarks have shown that the quantized models have negligible losses.
+- **Memory Footprint Reduction**: Additionally, they reduce the memory footprint size.
+- **Guidelines for Device Selection**: After testing different configurations of prompts and generation lengths, we highly recommend following the guidelines in the memory footprint table below when selecting a device to run our models.
-| | batch=1 | batch=4 | batch=16 | batch=32 |
-| ----------------------- | ------- | ------- | -------- | -------- |
-| Yi-34B-Chat | 65GiB | 68GiB | 76GiB | >80GiB |
-| Yi-34B-Chat-8bits(GPTQ) | 35GiB | 37GiB | 46GiB | 58GiB |
-| Yi-34B-Chat-4bits(AWQ) | 19GiB | 20GiB | 30GiB | 40GiB |
-| Yi-6B-Chat | 12GiB | 13GiB | 15GiB | 18GiB |
-| Yi-6B-Chat-8bits(GPTQ) | 7GiB | 8GiB | 10GiB | 14GiB |
-| Yi-6B-Chat-4bits(AWQ) | 4GiB | 5GiB | 7GiB | 10GiB |
+![Quantized chat model performance heat map](images/3.jpg)
+**Color Legend:**
+- Green: Represents high values.
+- Yellow: Indicates values somewhat lower than the highest.
+- Red: Represents low values.

Note: All the numbers in the table represent the minimum recommended memory for running models of the corresponding size.

+![Quantized chat model memory footprint](images/3.2.jpg)
+

### ⛔️ Limitations of chat model

The released chat model has undergone exclusive training using Supervised Fine-Tuning (SFT). Compared to other standard chat models, our model produces more diverse responses, making it suitable for various downstream tasks, such as creative scenarios. Furthermore, this diversity is expected to enhance the likelihood of generating higher quality responses, which will be advantageous for subsequent Reinforcement Learning (RL) training.

diff --git a/images/1.jpg b/images/1.jpg
new file mode 100644
index 00000000..a914e0dd
Binary files /dev/null and b/images/1.jpg differ
diff --git a/images/2.2.jpg b/images/2.2.jpg
new file mode 100644
index 00000000..1e7af5b6
Binary files /dev/null and b/images/2.2.jpg differ
diff --git a/images/2.jpg b/images/2.jpg
new file mode 100644
index 00000000..d755f19b
Binary files /dev/null and b/images/2.jpg differ
diff --git a/images/3.2.jpg b/images/3.2.jpg
new file mode 100644
index 00000000..8c8c40b1
Binary files /dev/null and b/images/3.2.jpg differ
diff --git a/images/3.jpg b/images/3.jpg
new file mode 100644
index 00000000..6fea3bd3
Binary files /dev/null and b/images/3.jpg differ
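
As a usage companion to the quantized chat models discussed in the section above, a minimal loading sketch with Hugging Face `transformers` might look like the following. This is a sketch under assumptions, not a verified recipe: it presumes the AWQ 4-bit and GPTQ 8-bit checkpoints are published under the names shown (e.g. `01-ai/Yi-34B-Chat-4bits`), that `autoawq` or `auto-gptq` is installed for the respective weight format, and that the tokenizer ships a chat template.

```python
# Minimal sketch (not a verified recipe) of loading a quantized Yi chat model.
# AWQ 4-bit weights require the `autoawq` package; GPTQ 8-bit weights require
# `auto-gptq`. Repository names are assumptions based on the model names above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "01-ai/Yi-34B-Chat-4bits"  # or "01-ai/Yi-34B-Chat-8bits"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",          # spread layers across the available GPUs
    torch_dtype=torch.float16,  # activations in fp16; weights stay quantized
)

# Chat-style generation through the tokenizer's chat template.
messages = [{"role": "user", "content": "Hi, who are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```

Per the memory footprint table above, the 4-bit Yi-34B-Chat needs roughly 19 GiB at batch size 1, versus about 65 GiB for the unquantized model, which is what makes single-GPU deployment feasible.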