[Feature] Add conversion scripts for LLaVA-Llama-3-8B (#618)
* update

* update

* fix typo

* Update README.md

* Update README.md
LZHgrla authored Apr 28, 2024
1 parent 1cd3628 commit 81d66e6
Showing 5 changed files with 363 additions and 33 deletions.
2 changes: 1 addition & 1 deletion xtuner/configs/llava/README.md
@@ -48,7 +48,7 @@ NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_
NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2
```

## Model Convert (and Merge)
## Model Conversion (and Merge)

After training, we will obtain a set of weights (*i.e.*, `iter_xxx.pth`), which are not in the universal HuggingFace format. We first need to convert them.

146 changes: 115 additions & 31 deletions xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/README.md
@@ -6,11 +6,28 @@
<img src="https://github.com/InternLM/xtuner/assets/36994684/a157638c-3500-44ed-bfab-d8d8249f91bb" alt="Image" width="500" />
</div>

| Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar | Configs | Pretrained Projector Checkpoints | Fine-tuned LLaVA Checkpoints |
| :-------------------- | :---------------: | :---------------: | :---------: | :-------: | :------: | :-------: | :------------: | :-----------------: | :--: | :--: | :-----: | :------: | :----: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 | - | - | - |
| LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 | [Pretrain](./pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b) |
| LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 | [Pretrain](./pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py) / [Fine-tune](./finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1) |
| Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar | Configs |
| :-------------------- | :---------------: | :---------------: | :---------: | :-------: | :------: | :-------: | :------------: | :-----------------: | :--: | :--: | :-----: | :------: | :----: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 | - |
| LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 | [Pretrain](./pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) |
| LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 | [Pretrain](./pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py) / [Fine-tune](./finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py) |

## Resources

- LLaVA-Llama-3-8B-v1.1

- Official LLaVA format model (`xtuner/llava-llama-3-8b-v1_1-hf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-hf)
- HuggingFace LLaVA format model (`xtuner/llava-llama-3-8b-v1_1-transformers`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-transformers)
- XTuner LLaVA format model (`xtuner/llava-llama-3-8b-v1_1`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1)
- GGUF model (`xtuner/llava-llama-3-8b-v1_1-gguf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-gguf)
- Pretrained projector weights: 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-pretrain)

- LLaVA-Llama-3-8B

- Official LLaVA format model (`xtuner/llava-llama-3-8b-hf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-hf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-hf)
- HuggingFace LLaVA format model (`xtuner/llava-llama-3-8b-transformers`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-transformers) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-transformers)
- XTuner LLaVA format model (`xtuner/llava-llama-3-8b`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b)
- Pretrained projector weights: 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-pretrain)
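Any of the releases above can also be pulled down for offline use with the `huggingface_hub` Python API. The sketch below is illustrative rather than part of this repo; it assumes `huggingface_hub` is installed, and the repo id and target directory are just examples.

```python
# Minimal sketch: download one of the released checkpoints for local use.
# Assumes `pip install huggingface_hub`; swap repo_id for any release listed above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="xtuner/llava-llama-3-8b-v1_1-gguf",   # e.g., the GGUF release
    local_dir="./llava-llama-3-8b-v1_1-gguf",
)
print(f"Files downloaded to: {local_dir}")
```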

## Data Preparation

@@ -268,21 +285,23 @@ xtuner train llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretr
xtuner train llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune --deepspeed deepspeed_zero2 --seed 1024
```
## Model Convert (and Merge)
## Model Conversion (and Merge)
After training, we will obtain a set of weights (*i.e.*, `iter_xxx.pth`), which are not in the universal HuggingFace format. We first need to convert them.
### Step 0. Convert `.pth` file to LLaVA model in xtuner format ([xtuner/llava-llama-3-8b-v1_1](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1))
After training, we will obtain a set of weights (*i.e.*, `iter_xxx.pth`), which are not in the universal HuggingFace format. We first need to convert them to the LLaVA model in xtuner format.
```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_hf
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_xtuner
```
At this point, we have obtained the relevant model (LLM or the corresponding LoRA).
If you use the default configuration of LLaVA-Llama-3-8B, you will obtain the following file structure after converting.
It includes the full-finetuned LLM weights, projector weights, and LoRA weights of the visual encoder.
```
./iter_39620_hf
./iter_39620_xtuner
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
@@ -309,47 +328,112 @@ It includes the full-finetuned LLM weights, projector weights, and LoRA weights
    └── README.md
```
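As a quick sanity check on the converted output, the standalone projector weights can be inspected with the `safetensors` Python API. This is only a sketch: it assumes the default layout shown above and that `safetensors` and `torch` are installed.

```python
# Sketch: list the tensors inside the converted projector checkpoint.
# Assumes `pip install safetensors torch` and the directory layout shown above.
from safetensors.torch import load_file

state_dict = load_file("./iter_39620_xtuner/projector/model.safetensors")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```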
## Chat
We can achieve image-text question answering with the following command!
At this time, the LLaVA model in xtuner format can engage in conversation using `xtuner chat`, by
```bash
xtuner chat ./iter_39620_hf \
xtuner chat ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_hf \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--image $IMAGE_PATH
```
Here, `./iter_39620_hf` is the converted weight from the above step or our [LLaVA-Llama-3-8B](https://huggingface.co/xtuner/llava-llama-3-8b) and [LLaVA-Llama-3-8B-v1.1](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1) models.
and in MMBench evaluation, by
## Evaluation
Coming soon!
Now, we can use `xtuner mmbench` to conduct the [MMBench](https://mmbench.opencompass.org.cn/home) evaluation.
```bash
xtuner mmbench ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
```
1. Download the MMBench dataset with
Here, `$DATA_PATH` refers to one of the MMBench datasets. You can download the expected data by
```
```bash
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
```
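Before running the evaluation, it can help to peek at what `$DATA_PATH` points to. The snippet below is only a sketch (it assumes `pandas` is installed) and simply prints the column layout and size of one split.

```python
# Sketch: inspect an MMBench TSV split. Assumes `pip install pandas`.
import pandas as pd

df = pd.read_csv("MMBench_DEV_EN.tsv", sep="\t")
print(df.columns.tolist())  # column names vary by split; printing avoids guessing them here
print(f"{len(df)} samples")
```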
2. Evaluate models with
### Step 1. Merge ViT LoRA into the original ViT
Because LoRA fine-tuning is applied to the ViT during fine-tuning, we must first merge the LoRA weights into the ViT.
```bash
xtuner mmbench ./iter_39620_hf \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_hf \
--prompt-template llama3_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
xtuner convert merge openai/clip-vit-large-patch14-336 ./iter_39620_xtuner/visual_encoder_adapter ./iter_39620_visual_encoder --is-clip
```
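Under the hood, this merge is roughly equivalent to folding a PEFT LoRA adapter into the CLIP vision tower. The sketch below illustrates that idea in Python; it is not the repo's implementation, it assumes `transformers` and `peft` are installed, and the `xtuner convert merge` command above remains the supported path.

```python
# Rough sketch of the ViT LoRA merge: load the base CLIP vision encoder,
# attach the LoRA adapter produced in Step 0, and fold it into the base weights.
# Assumes `pip install transformers peft`; prefer the CLI above for real runs.
from peft import PeftModel
from transformers import CLIPImageProcessor, CLIPVisionModel

base = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
adapter = PeftModel.from_pretrained(base, "./iter_39620_xtuner/visual_encoder_adapter")
merged = adapter.merge_and_unload()  # folds the LoRA deltas into the base weights

merged.save_pretrained("./iter_39620_visual_encoder")
# Save an image processor alongside so the merged encoder directory is self-contained.
CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336").save_pretrained(
    "./iter_39620_visual_encoder"
)
```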
### Step 2. Convert LLaVA in xtuner format to official LLaVA format or HuggingFace LLaVA format
- The official LLaVA format follows the structure of the [liuhaotian/llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b) model.
- The HuggingFace LLaVA format follows the structure of the [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model.
#### To official LLaVA format ([xtuner/llava-llama-3-8b-v1_1-hf](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf))
We can use the following command to obtain the LLaVA model in the official LLaVA format.
```bash
python ./convert_xtuner_weights_to_llava.py --text_model_id ./iter_39620_xtuner --vision_model_id ./iter_39620_visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_llava
```
Here, the converted LLaVA model in official LLaVA format is saved to `./iter_39620_llava`.
```
./iter_39620_llava
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
```
#### To HuggingFace LLaVA format ([xtuner/llava-llama-3-8b-v1_1-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers))
We can use the following command to obtain the LLaVA model in the HuggingFace LLaVA format.
```bash
python ./convert_xtuner_weights_to_hf.py --text_model_id ./iter_39620_xtuner --vision_model_id ./iter_39620_visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_hf
```
Here, the converted LLaVA model in HuggingFace LLaVA format is saved to `./iter_39620_hf`.
```
./iter_39620_hf
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
```
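As a quick smoke test, the directory above can be loaded directly with `transformers`. The sketch below assumes a `transformers` version that provides `LlavaForConditionalGeneration`, plus `torch`, `accelerate` and `Pillow`; the prompt layout is an assumed Llama-3 chat template rather than something this repo prescribes.

```python
# Sketch: load the converted HuggingFace-LLaVA-format model and run one query.
# Assumes transformers with LlavaForConditionalGeneration, torch, accelerate, Pillow.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "./iter_39620_hf"  # or the released xtuner/llava-llama-3-8b-v1_1-transformers
model = LlavaForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# Assumed Llama-3-style prompt around the `<image>` placeholder.
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat is shown in this image?"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```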
## Chat
- XTuner LLaVA format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1#quickstart)
- Official LLaVA format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf#quickstart)
- HuggingFace LLaVA format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers#quickstart)
- GGUF format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf#quickstart)
Here, `$DATA_PATH` refers to one of the datasets downloaded as mentioned above, such as `MMBench_DEV_EN.tsv`. `./iter_39620_hf` is the converted weight from the above step or our released [LLaVA-Llama-3-8B](https://huggingface.co/xtuner/llava-llama-3-8b) and [LLaVA-Llama-3-8B-v1.1](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1) models.
## Deployment
After the evaluation completes, results on the dev splits are printed directly; for the test splits, you need to submit `mmbench_result.xlsx` to the official MMBench evaluation service to obtain the final accuracy.
[LMDeploy](https://github.com/InternLM/lmdeploy) now supports the deployment of official LLaVA format models (e.g., [xtuner/llava-llama-3-8b-v1_1-hf](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf)). For specifics, please refer to [here](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf#chat-by-lmdeploy).
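For a rough illustration, below is a hedged sketch of LMDeploy's Python `pipeline` API with such a model. It assumes a recent `lmdeploy` build with vision-language support; the image URL is only a placeholder, and the linked model card remains the authoritative reference.

```python
# Sketch: run the official-LLaVA-format release through LMDeploy's pipeline API.
# Assumes `pip install lmdeploy` with vision-language support enabled.
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("xtuner/llava-llama-3-8b-v1_1-hf")

# Any local path or URL works here; this COCO sample is just an example.
image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")
response = pipe(("Describe this image.", image))
print(response.text)
```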