[Feature] Add conversion scripts for LLaVA-Llama-3-8B (#618)
* update

* update

* fix typo

* Update README.md

* Update README.md
LZHgrla authored Apr 28, 2024
1 parent 1cd3628 commit 81d66e6
Showing 5 changed files with 363 additions and 33 deletions.
2 changes: 1 addition & 1 deletion xtuner/configs/llava/README.md
@@ -48,7 +48,7 @@ NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_
NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2
```

## Model Convert (and Merge)
## Model Conversion (and Merge)

After training, we will obtain a set of weights (*i.e.*, `iter_xxx.pth`), which are not in the universal HuggingFace format. We first need to convert them.

146 changes: 115 additions & 31 deletions xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/README.md
@@ -6,11 +6,28 @@
<img src="https://github.com/InternLM/xtuner/assets/36994684/a157638c-3500-44ed-bfab-d8d8249f91bb" alt="Image" width="500" />
</div>

| Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar | Configs | Pretrained Projector Checkpoints | Fine-tuned LLaVA Checkpoints |
| :-------------------- | :---------------: | :---------------: | :---------: | :-------: | :------: | :-------: | :------------: | :-----------------: | :--: | :--: | :-----: | :------: | :----: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 | - | - | - |
| LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 | [Pretrain](./pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b) |
| LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 | [Pretrain](./pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py) / [Fine-tune](./finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1) |
| Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar | Configs |
| :-------------------- | :---------------: | :---------------: | :---------: | :-------: | :------: | :-------: | :------------: | :-----------------: | :--: | :--: | :-----: | :------: | :----: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 | - |
| LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 | [Pretrain](./pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) |
| LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 | [Pretrain](./pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py) / [Fine-tune](./finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py) |

## Resources

- LLaVA-Llama-3-8B-v1.1

- Official LLaVA format model (`xtuner/llava-llama-3-8b-v1_1-hf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-hf)
- HuggingFace LLaVA format model (`xtuner/llava-llama-3-8b-v1_1-transformers`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-transformers)
- XTuner LLaVA format model (`xtuner/llava-llama-3-8b-v1_1`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1)
- GGUF model (`xtuner/llava-llama-3-8b-v1_1-gguf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-gguf)
- Pretrained projector weights: 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-pretrain)

- LLaVA-Llama-3-8B

- Official LLaVA format model (`xtuner/llava-llama-3-8b-hf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-hf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-hf)
- HuggingFace LLaVA format model (`xtuner/llava-llama-3-8b-transformers`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-transformers) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-transformers)
- XTuner LLaVA format model (`xtuner/llava-llama-3-8b`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b)
- Pretrained projector weights: 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-pretrain)
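Any of the releases above can also be pulled down for offline use with the `huggingface_hub` Python API. The sketch below is illustrative rather than part of this repo; it assumes `huggingface_hub` is installed, and the repo id and target directory are just examples.

```python
# Minimal sketch: download one of the released checkpoints for local use.
# Assumes `pip install huggingface_hub`; swap repo_id for any release listed above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="xtuner/llava-llama-3-8b-v1_1-gguf",   # e.g., the GGUF release
    local_dir="./llava-llama-3-8b-v1_1-gguf",
)
print(f"Files downloaded to: {local_dir}")
```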

## Data Preparation

@@ -268,21 +285,23 @@ xtuner train llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretr
xtuner train llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune --deepspeed deepspeed_zero2 --seed 1024
```
## Model Convert (and Merge)
## Model Conversion (and Merge)
After training, we will obtain a set of weights (*i.e.*, `iter_xxx.pth`), which are not in the universal HuggingFace format. We first need to convert them.
### Step 0. Convert `.pth` file to LLaVA model in xtuner format ([xtuner/llava-llama-3-8b-v1_1](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1))
After training, we will obtain a set of weights (*i.e.*, `iter_xxx.pth`), which are not in the universal HuggingFace format. We first need to convert them to the LLaVA model in xtuner format.
```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_hf
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_xtuner
```
At this point, we have obtained the relevant model (LLM or the corresponding LoRA).
If you use the default configuration of LLaVA-Llama-3-8B, you will obtain the following file structure after converting.
It includes the full-finetuned LLM weights, projector weights, and LoRA weights of the visual encoder.
```
./iter_39620_hf
./iter_39620_xtuner
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
@@ -309,47 +328,112 @@ It includes the full-finetuned LLM weights, projector weights, and LoRA weights
    └── README.md
```
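As a quick sanity check on the converted output, the standalone projector weights can be inspected with the `safetensors` Python API. This is only a sketch: it assumes the default layout shown above and that `safetensors` and `torch` are installed.

```python
# Sketch: list the tensors inside the converted projector checkpoint.
# Assumes `pip install safetensors torch` and the directory layout shown above.
from safetensors.torch import load_file

state_dict = load_file("./iter_39620_xtuner/projector/model.safetensors")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```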
## Chat
We can achieve image-text question answering with the following command!
At this time, the LLaVA model in xtuner format can engage in conversation using `xtuner chat`, by
```bash
xtuner chat ./iter_39620_hf \
xtuner chat ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_hf \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--image $IMAGE_PATH
```
Here, `./iter_39620_hf` is the converted weight from the above step or our [LLaVA-Llama-3-8B](https://huggingface.co/xtuner/llava-llama-3-8b) and [LLaVA-Llama-3-8B-v1.1](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1) models.
and in MMBench evaluation, by
## Evaluation
Coming soon!
Now, we can use `xtuner mmbench` to conduct the [MMBench](https://mmbench.opencompass.org.cn/home) evaluation.
```bash
xtuner mmbench ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
```
1. Download the MMBench dataset with
Here, `$DATA_PATH` refers to one of the MMBench datasets. You can download the expected data by
```
```bash
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
```
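Before running the evaluation, it can help to peek at what `$DATA_PATH` points to. The snippet below is only a sketch (it assumes `pandas` is installed) and simply prints the column layout and size of one split.

```python
# Sketch: inspect an MMBench TSV split. Assumes `pip install pandas`.
import pandas as pd

df = pd.read_csv("MMBench_DEV_EN.tsv", sep="\t")
print(df.columns.tolist())  # column names vary by split; printing avoids guessing them here
print(f"{len(df)} samples")
```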
2. Evaluate models with
### Step 1. Merge ViT LoRA into the original ViT
Because LoRA fine-tuning is applied to the ViT during fine-tuning, we must first merge the LoRA weights into the ViT.
```bash
xtuner mmbench ./iter_39620_hf \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_hf \
--prompt-template llama3_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
xtuner convert merge openai/clip-vit-large-patch14-336 ./iter_39620_xtuner/visual_encoder_adapter ./iter_39620_visual_encoder --is-clip
```
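Under the hood, this merge is roughly equivalent to folding a PEFT LoRA adapter into the CLIP vision tower. The sketch below illustrates that idea in Python; it is not the repo's implementation, it assumes `transformers` and `peft` are installed, and the `xtuner convert merge` command above remains the supported path.

```python
# Rough sketch of the ViT LoRA merge: load the base CLIP vision encoder,
# attach the LoRA adapter produced in Step 0, and fold it into the base weights.
# Assumes `pip install transformers peft`; prefer the CLI above for real runs.
from peft import PeftModel
from transformers import CLIPImageProcessor, CLIPVisionModel

base = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
adapter = PeftModel.from_pretrained(base, "./iter_39620_xtuner/visual_encoder_adapter")
merged = adapter.merge_and_unload()  # folds the LoRA deltas into the base weights

merged.save_pretrained("./iter_39620_visual_encoder")
# Save an image processor alongside so the merged encoder directory is self-contained.
CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336").save_pretrained(
    "./iter_39620_visual_encoder"
)
```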
### Step 2. Convert LLaVA in xtuner format to official LLaVA format or HuggingFace LLaVA format
- The official LLaVA format follows the structure of the [liuhaotian/llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b) model.
- The HuggingFace LLaVA format follows the structure of the [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model.
#### To official LLaVA format ([xtuner/llava-llama-3-8b-v1_1-hf](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf))
We can use the following command to obtain the LLaVA model in the official LLaVA format.
```bash
python ./convert_xtuner_weights_to_llava.py --text_model_id ./iter_39620_xtuner --vision_model_id ./iter_39620_visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_llava
```
Here, the converted LLaVA model in official LLaVA format is saved to `./iter_39620_llava`.
```
./iter_39620_llava
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
```
#### To HuggingFace LLaVA format ([xtuner/llava-llama-3-8b-v1_1-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers))
We can use the following command to obtain the LLaVA model in the HuggingFace LLaVA format.
```bash
python ./convert_xtuner_weights_to_hf.py --text_model_id ./iter_39620_xtuner --vision_model_id ./iter_39620_visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_hf
```
Here, the converted LLaVA model in HuggingFace LLaVA format is saved to `./iter_39620_hf`.
```
./iter_39620_hf
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
```
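As a quick smoke test, the directory above can be loaded directly with `transformers`. The sketch below assumes a `transformers` version that provides `LlavaForConditionalGeneration`, plus `torch`, `accelerate` and `Pillow`; the prompt layout is an assumed Llama-3 chat template rather than something this repo prescribes.

```python
# Sketch: load the converted HuggingFace-LLaVA-format model and run one query.
# Assumes transformers with LlavaForConditionalGeneration, torch, accelerate, Pillow.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "./iter_39620_hf"  # or the released xtuner/llava-llama-3-8b-v1_1-transformers
model = LlavaForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# Assumed Llama-3-style prompt around the `<image>` placeholder.
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat is shown in this image?"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```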
## Chat
- XTuner LLaVA format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1#quickstart)
- Official LLaVA format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf#quickstart)
- HuggingFace LLaVA format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers#quickstart)
- GGUF format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf#quickstart)
Here, `$DATA_PATH` refers to one of the datasets downloaded as mentioned above, such as `MMBench_DEV_EN.tsv`. `./iter_39620_hf` is the converted weight from the above step or our released [LLaVA-Llama-3-8B](https://huggingface.co/xtuner/llava-llama-3-8b) and [LLaVA-Llama-3-8B-v1.1](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1) models.
## Deployment
After the evaluation completes, results on the dev splits are printed directly; for the test splits, you need to submit `mmbench_result.xlsx` to the official MMBench evaluation service to obtain the final accuracy.
[LMDeploy](https://github.com/InternLM/lmdeploy) now supports the deployment of official LLaVA format models (e.g., [xtuner/llava-llama-3-8b-v1_1-hf](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf)). For specifics, please refer to [here](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf#chat-by-lmdeploy).
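For a rough illustration, below is a hedged sketch of LMDeploy's Python `pipeline` API with such a model. It assumes a recent `lmdeploy` build with vision-language support; the image URL is only a placeholder, and the linked model card remains the authoritative reference.

```python
# Sketch: run the official-LLaVA-format release through LMDeploy's pipeline API.
# Assumes `pip install lmdeploy` with vision-language support enabled.
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("xtuner/llava-llama-3-8b-v1_1-hf")

# Any local path or URL works here; this COCO sample is just an example.
image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")
response = pipe(("Describe this image.", image))
print(response.text)
```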