diff --git a/README.md b/README.md
index 3286462..1fa86ea 100644
--- a/README.md
+++ b/README.md
@@ -17,46 +17,59 @@ This repository is the official implementation of [MMT-Bench](https://arxiv.org/
 > \* KY, FM and JW contribute equally.
 > \# WS (shaowenqi@pjlab.org.cn) and KZ (zhangkaipeng@pjlab.org.cn) are corresponding authors.
 
+## Introduction
+MMT-Bench is a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions drawn from multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding.
 ![overview](assets/overview.jpg)
 
+## Evaluation Results Overview
+- The closed-source proprietary model GPT-4o from OpenAI has taken the leading position on MMT-Bench, surpassing models such as InternVL-Chat, Qwen-VL-Plus, GPT-4V, and GeminiProVision. Note that the open-source InternVL-Chat, along with Qwen-VL-Max, closely follows GPT-4o.
+![overview](assets/overall_progress.png)
+
+- GPT-4o performs well in visual recognition and captioning and shows a marked improvement in visual perception over GPT-4V (20231106 & 20240409).
+![overview](assets/metatask_eval.png)
+
+
 ## 🏆 Leaderboard
 
 ### Full Set
 
-| Rank | Model | Overall |
-|------|-----------------------|---------|
-| 1 | InternVL-Chat-v1.2 | 63.4 |
-| 2 | Qwen-VL-Plus | 62.3 |
-| 3 | GPT-4V | 62.0 |
-| 4 | GeminiProVision | 61.6 |
-| 5 | LLaVA-NEXT-34B | 60.8 |
-| 6 | XComposer2 | 55.7 |
-| 7 | BLIP2 | 54.8 |
-| 8 | Yi-VL-34B | 54.2 |
-| 9 | Monkey-Chat | 53.4 |
-| 10 | DeepSeek-VL-7B | 53.2 |
-| 11 | Yi-VL-6B | 53.2 |
-| 12 | LLaVA-NEXT-13B | 53.0 |
-| 13 | TransCore-M | 52.7 |
-| 14 | QWen-VL-Chat | 52.5 |
-| 15 | Claude3V-Haiku | 52.2 |
-| 16 | XComposer | 52.1 |
-| 17 | mPLUG-Owl2 | 52.0 |
-| 18 | RBDash-v1-13B | 51.8 |
-| 19 | LLaVA-v1.5-13B | 51.7 |
-| 20 | CogVLM-Chat | 51.6 |
-| 21 | ShareGPT4V-7B | 51.5 |
-| 22 | LLaVA-NEXT-7B | 51.1 |
-| 23 | LLaVA-v1.5-13B-XTuner | 51.1 |
-| 24 | LLaVA-InternLM2-7B | 50.8 |
-| 25 | LLaVA-v1.5-7B-XTuner | 50.2 |
-| 26 | SharedCaptioner | 49.9 |
-| 27 | LLaVA-InternLM-7B | 49.7 |
-| 28 | LLaVA-v1.5-7B | 49.5 |
-| 29 | LLaMA-Adapter-v2-7B | 40.4 |
-| 30 | VisualGLM-6B | 38.6 |
-| 31 | Frequency Guess | 31.7 |
-| 32 | Random Guess | 28.5 |
+| Rank | Model | Score |
+|------|-----------------------------|-------|
+| 1 | GPT4o | 65.5 |
+| 2 | InternVL-Chat-v1.2-34B | 63.4 |
+| 3 | QwenVLMax | 62.4 |
+| 4 | Qwen-VL-Plus | 62.3 |
+| 5 | GeminiProVision | 61.6 |
+| 6 | GPT4V_20240409 | 61.1 |
+| 7 | LLaVA-NEXT-34B | 60.8 |
+| 8 | XComposer2 | 55.7 |
+| 9 | BLIP2 | 54.8 |
+| 10 | GPT4V_20231106 | 54.7 |
+| 11 | Yi-VL-34B | 54.2 |
+| 12 | Monkey-Chat | 53.4 |
+| 13 | DeepSeek-VL-7B | 53.2 |
+| 14 | Yi-VL-6B | 53.2 |
+| 15 | LLaVA-NEXT-13B | 53.0 |
+| 16 | TransCore-M | 52.7 |
+| 17 | QWen-VL-Chat | 52.5 |
+| 18 | Claude3V_Haiku | 52.2 |
+| 19 | XComposer | 52.1 |
+| 20 | mPLUG-Owl2 | 52.0 |
+| 21 | RBDash-v1-13B | 51.8 |
+| 22 | LLaVA-v1.5-13B | 51.7 |
+| 23 | CogVLM-Chat | 51.6 |
+| 24 | ShareGPT4V-7B | 51.5 |
+| 25 | LLaVA-NEXT-7B | 51.1 |
+| 26 | LLaVA-v1.5-13B-XTuner | 51.1 |
+| 27 | LLaVA-InternLM2-7B | 50.8 |
+| 28 | LLaVA-v1.5-7B-XTuner | 50.2 |
+| 29 | SharedCaptioner | 49.9 |
+| 30 | LLaVA-InternLM-7B | 49.7 |
+| 31 | LLaVA-v1.5-7B | 49.5 |
+| 32 | LLaMA-Adapter-v2-7B | 40.4 |
+| 33 | VisualGLM-6B | 38.6 |
+| 34 | Frequency Guess | 31.7 |
+| 35 | Random Guess | 28.5 |
 
 ### VAL Split
 
diff --git a/assets/metatask_eval.png b/assets/metatask_eval.png
new file mode 100644
index 0000000..a9e47b1
Binary files /dev/null and b/assets/metatask_eval.png differ
diff --git a/assets/overall_progress.png b/assets/overall_progress.png
new file mode 100644
index 0000000..27e15d7
Binary files /dev/null and b/assets/overall_progress.png differ