LlaVA in MLX #461
Conversation
Nice initiative. I am also attempting to port llava to MLX, but I have not made any progress yet. I am happy to help in any way that I can. Could you clarify the issue you mentioned regarding reconciling the supposed number of attention heads (32)? Could you provide more details on that?
@mzbac Thank you! Appreciate any help, or if this gives you a way to start a few steps ahead on your version of the port. I'll explain my observation below:

According to the model config, the base LLM is Vicuna 7b 1.5, a fine-tuned version of Llama 2. From that information, I used the config file of that model to fill out what the base LLM of LlaVA should be. However, when I do so, I get the following error for loading in the language model weights:

What this tells me is that the pre-trained weights have a shape of 4096 x 4096, whereas our implementation of Llama 2 uses the following:

Here 4096 (dim size) * 32 (num heads) = 131072, which leads me to believe there is something wrong either with how I saved the pre-trained weights, or with some mismatch between how Torch saved the attention heads and what we expect here. If I update the model config and set num_heads = 1, this size mismatch doesn't happen. It's possible this is then actually correct, but I haven't gotten anywhere on the forward pass & inference section of the port, which would quickly prove what's working or not.
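To make the mismatch concrete, it boils down to something like the following (shapes taken from the description above; everything here is illustrative, not code from the draft):

```python
# Illustrative only: the shape disagreement described above.
dim, n_heads = 4096, 32

# The HF-format checkpoint stores the attention projection as (4096, 4096) ...
checkpoint_shape = (4096, 4096)

# ... while a config that multiplies dim by the number of heads expects 131072 rows.
expected_shape = (n_heads * dim, dim)  # (131072, 4096)

print(checkpoint_shape == expected_shape)  # False -> the weight-loading error
```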
Yeah, there are two different formats of Llama models. One is the original PyTorch format released by Meta, and the other uses the HF format. Quickly checking the model link you shared, it seems like you are using the HF-format one. You could use the Llama model in mlx-lm to load the weights (it can load the HF-format model directly, without converting); otherwise there may be issues due to improper loading of the config and a mismatch between the default configuration and the actual weights used by the model.
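For reference, a minimal sketch of loading an HF-format Llama-family checkpoint with mlx-lm (the repo id is purely illustrative, not necessarily the exact model used in this port):

```python
from mlx_lm import load, generate

# mlx-lm resolves the HF config and weights directly, so no manual
# conversion step is needed before running the model.
model, tokenizer = load("lmsys/vicuna-7b-v1.5")  # illustrative repo id

print(generate(model, tokenizer, prompt="Hello", max_tokens=16))
```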
@mzbac that makes a ton of sense, thank you for the pointer; it looks like the weights now load in properly! A ton of this code will inevitably need to be cleaned up before merging, but next up in my mind is getting the processor and inference working. I'll share any updates I have on that when I can.

Edit: The vocab size difference is because LlaVA is an extension of the original Vicuna, but with additional tokens. This is a bit confusing to me as well. According to transformers:

Which agrees with what you are saying, but also according to Hugging Face, the vocab size is 32064 (https://huggingface.co/llava-hf/llava-1.5-7b-hf/blob/main/config.json#L37).
@mzbac tracking here; it looks like there's a decent chunk of tasks remaining. Not sure when I'll be able to get to them all, so for anyone reading this and curious, these are (I think) the remaining steps we'd need to do. For the forward pass:

Once that is done, this should be able to be converted to actual generation and not just a single output, i.e. port over model.generate. I haven't looked this code over yet, so I can't say what sub-tasks remain. And then there's the general housekeeping: deleting unused classes, finding speedups, etc. I'd imagine it's also valuable to port over the text and image processor like was done for CLIP; that's something else that'd be great to add in.
Yeah, I am currently working on making the CLIP model compatible with HF format and updating the model to support outputting hidden states. I think I can complete that by the end of this week. If you're okay with it, you can leave that task to me.
@mzbac I just looked at your CLIP PR and it looks great! Once that is merged I can sync my fork with main and update the forward pass for llava. I still need to do some more research on how the full forward pass is implemented to make sure nothing else like this "surprises" me; that may take some time, so if you (or anyone reading) also feel comfortable trying your hand at this, that's no problem.
Yeah, I started looking at Llava's forward pass and got some understanding of how it works. However, I will set up some tests to ensure there are no surprises in our implementation. I will share it with you once I have the setup.
@nkasmanoff, while I was trying to set up the test, I noticed that I had to refactor quite a bit of code in order to be able to run the test and load the model. After writing two tests, for image features and for merging image features with input ids into input embeddings, most of the LLava model was completed (the rest is just a normal forward pass for the Llama LLM). I ended up finishing the llava model forward pass. Here is the complete code (https://github.com/mzbac/mlx-examples/tree/llava/llava); let me know if you would like to sync it back to your branch.
I'm really excited about the progress on this implementation!
Could you point me to the code for that? Maybe there is a workaround using MLX still. |
Yeah, here is the merged image feature and input embedding part: https://github.com/mzbac/mlx-examples/blob/llava/llava/llava.py#L94-L131.
@mzbac looks very good to me! I haven't made any major changes on my end, so I think syncing your changes would be ideal. Happy to test those changes on my end too.
```python
        self,
        inputs: mx.array,
        cache=None,
        inputs_embeds=None,
```
Most of the implementation is copied from mlx-lm's llama, with only updates made to the forward pass to allow for directly passing inputs_embeds for the initial prompt evaluation.
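As a toy illustration of that change (not the actual code in this diff), the forward pass can branch on whether pre-merged embeddings are supplied:

```python
import mlx.core as mx
import mlx.nn as nn


class TinyDecoder(nn.Module):
    """Toy stand-in for the language model, showing the inputs_embeds branch."""

    def __init__(self, vocab_size=32, dims=16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, dims)
        self.proj = nn.Linear(dims, dims)          # stand-in for the decoder stack
        self.lm_head = nn.Linear(dims, vocab_size)

    def __call__(self, inputs: mx.array, cache=None, inputs_embeds=None):
        # Initial multimodal prompt: merged text+image embeddings come in directly.
        # Later decoding steps: embed the token ids as usual.
        h = self.embed_tokens(inputs) if inputs_embeds is None else inputs_embeds
        return self.lm_head(self.proj(h)), cache


model = TinyDecoder()
ids = mx.array([[1, 2, 3]])
logits, _ = model(ids)                                    # normal text path
merged = mx.random.normal((1, 5, 16))                     # e.g. merged text+image embeddings
logits_from_embeds, _ = model(ids, inputs_embeds=merged)  # prompt-evaluation path
```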
llava/llava.py
```python
# (module-level imports used by this method)
import logging

import mlx.core as mx
import numpy as np


def _merge_input_ids_with_image_features(
    self, image_features, inputs_embeds, input_ids
):
    image_features = np.array(image_features)
    inputs_embeds = np.array(inputs_embeds)
    input_ids = np.array(input_ids)

    _, num_image_patches, embed_dim = image_features.shape
    batch_size, sequence_length = input_ids.shape

    special_image_token_mask = input_ids == self.config.image_token_index
    num_special_image_tokens = np.sum(special_image_token_mask, axis=-1)

    # if no special image tokens are found, log a warning
    if np.all(num_special_image_tokens == 0):
        logging.warning(
            "No special image tokens found in the input. "
            "Please make sure to include <image> in your prompt."
        )

    # calculate the final sequence length: the original sequence length plus
    # the number of image patch embeddings to be inserted
    final_sequence_length = (
        np.max(num_special_image_tokens) * (num_image_patches - 1)
    ) + sequence_length

    non_image_indices = np.where(input_ids != self.config.image_token_index)

    new_token_positions = (
        np.cumsum((special_image_token_mask * (num_image_patches - 1) + 1), axis=-1)
        - 1
    )
    text_to_overwrite = new_token_positions[non_image_indices]

    final_embedding = np.zeros(
        (batch_size, final_sequence_length, embed_dim), dtype=inputs_embeds.dtype
    )

    final_embedding[non_image_indices[0], text_to_overwrite, :] = inputs_embeds[
        non_image_indices
    ]

    image_to_overwrite = np.all(final_embedding == 0, axis=-1)
    reshaped_image_features = image_features.reshape(-1, embed_dim)
    final_embedding[image_to_overwrite, :] = reshaped_image_features[
        : np.sum(image_to_overwrite)
    ]

    return mx.array(final_embedding)
```
@awni, this is the part where we have to use NumPy as a workaround because MLX doesn't support boolean indexing. Maybe you could give us some pointers on how we can implement this using MLX.
I have a couple questions about how this function is supposed to work:

- Shouldn't it be an error if the input text does not have the same number of `<image>` tokens as the image batch size?
- Can the input text have a batch size other than 1? If so, how does this work with the image batch? Do all the input images need to have the same number of `<image>` tokens?
I also have a couple of thoughts on how to change this to work better for MLX (and possibly be a lot simpler). Assuming the text has a batch size of 1:

- Get the positions of the `<image>` tags and split the text embeddings based on that.
- Split the image embeddings along the batch axis.
- Interleave the two sets and concatenate.

In this case it should be fine from an efficiency standpoint, since you typically have only one or two `<image>` tags. And I think it would be simpler. This would not be the most efficient approach if you had a lot of images and just a few embeddings per image.
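A rough sketch of that split/interleave idea in MLX (the function and argument names are illustrative, not code from this PR):

```python
import mlx.core as mx


def merge_text_and_images(input_ids, text_embeds, image_features, image_token_index):
    # input_ids: (1, seq_len); text_embeds: (1, seq_len, dim)
    # image_features: (num_images, num_patches, dim)
    ids = input_ids[0].tolist()
    image_positions = [i for i, t in enumerate(ids) if t == image_token_index]

    segments = []
    prev = 0
    for img_idx, pos in enumerate(image_positions):
        # text embeddings up to (but not including) this <image> token
        segments.append(text_embeds[:, prev:pos])
        # patch embeddings of the corresponding image, with a batch axis added
        segments.append(mx.expand_dims(image_features[img_idx], axis=0))
        prev = pos + 1
    # remaining text after the last <image> token
    segments.append(text_embeds[:, prev:])

    return mx.concatenate(segments, axis=1)
```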
1. You are right. I skipped the error checking for mismatched image tokens and image batches to simplify the implementation, but we could definitely add it. FYI: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava/modeling_llava.py#L333-L337
2. My understanding is that as long as the total number of images matches the total number of image tokens in a batch, it should work with batches, given the final embedding is batched. However, I have not done any batch testing.
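For reference, one way such a check could look, using the variable names from the function above (a sketch, not the exact transformers code):

```python
import numpy as np


def check_image_token_count(image_features, special_image_token_mask):
    """Sketch of the missing check: one <image> token per provided image."""
    num_images = image_features.shape[0]
    num_image_tokens = int(np.sum(special_image_token_mask))
    if num_image_tokens != num_images:
        raise ValueError(
            f"Got {num_images} image(s) but {num_image_tokens} <image> token(s); "
            "these must match."
        )
```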
Regarding 2, makes sense! Does the rest of the code work for batched text? E.g. does the generation part properly handle different text examples? If it's just using our MLX LM llama implementation (which it looks like it is), then the answer is probably no. I don't know if that simplifies much, though.
Correct me if I am wrong, but I thought the Llama example should be able to handle batched text generation. The only difference is that the initial prompt gets its embedding, with text and image, from the code above and goes directly to the LLM's encoder layers without going through the embedding/norm layer. Once that forward pass is completed, we get the next-token logits and the key-value cache, and the rest of the process is just a standard forward pass for the LLM.
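In rough MLX terms, that two-phase flow looks something like this (helper names such as `vision_tower`, `embed`, and `merge_embeddings` are assumptions for illustration, not the exact API in this PR):

```python
import mlx.core as mx


def generate(model, input_ids, pixel_values, max_tokens=32):
    # Phase 1: prompt evaluation. Build the merged text+image embeddings and
    # run them through the language model once to get logits and the KV cache.
    image_features = model.vision_tower(pixel_values)       # assumed helper
    text_embeds = model.language_model.embed(input_ids)     # assumed helper
    inputs_embeds = model.merge_embeddings(                  # assumed helper
        image_features, text_embeds, input_ids
    )
    logits, cache = model.language_model(
        input_ids, cache=None, inputs_embeds=inputs_embeds
    )
    token = mx.argmax(logits[:, -1, :], axis=-1)
    tokens = [token]

    # Phase 2: standard autoregressive decoding. From here on it is a plain
    # text-only forward pass that reuses the key-value cache.
    for _ in range(max_tokens - 1):
        logits, cache = model.language_model(
            mx.expand_dims(token, axis=1), cache=cache
        )
        token = mx.argmax(logits[:, -1, :], axis=-1)
        tokens.append(token)

    return mx.stack(tokens, axis=-1)
```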
@awni, I don't have permission to push the changes. Maybe either you or @nkasmanoff could update it.
😬 sorry. Yea happy to, or @nkasmanoff can give you temporary permissions to push to his fork.
@mzbac just gave you permission, let me know if that worked!
@nkasmanoff @awni I have pushed the changes. Please note that test.py will fail due to the fast_gelu-to-native-gelu change, but if you build the latest MLX locally, the test will pass.
Hmm, we should put the version in the requirements in that case. It should work with 0.5 and up.
Also, it looks like this is almost ready to go. I will review a bit more tonight, but from what I looked at, it looked quite nice already!
@awni this is great to hear! It depends on whether you think this is essential to this PR, but here are some ideas for additional features which are missing now and may be useful later on:

My opinion is we don't need any of these for the first example, but is there anything else to consider?
Thanks for the list @nkasmanoff; I also think we can leave most of it as follow-up, depending on which direction this example goes.
🚀
Really great addition, thank you @mzbac and @nkasmanoff !
@nkasmanoff @mzbac Great work! Do you plan to support Llava-1.6-hf too?
@nxphi47 I would like to add support for llava 1.6; however, could you create an issue for that? Comments on the closed PR may get lost.
* add: llava mlx first draft
* add: weights comparision
* add forward pass skeleton
* update: now imports weights correctly
* delete base
* latest
* adding config
* fix: use config
* add mlx config
* feat: add image processor for llava processor
* wip
* feat: llava working example
* chore: refactor generate script
* chore: clean up
* add: warning to user if no <image> token despite using one
* add: __call__ to LlavaModel
* add: call to LlavaModel
* update fp
* clean up var names
* update: native GeLU
* Cleanup
* update generate and readme
* remove todo comment
* rearrange tests
* fix example code
* nits in README
* update readme
* nit in readme
* nits in README
* chore(llava): refactor image embedding merging logic
* min mlx version
* nits in readmes
* fix cli prompt, some nits
* updates, slight simplify

---------

Co-authored-by: anchen <li.anchen.au@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>
1.6 seems to need an image_newline model parameter. It doesn't seem to be a straight drop-in replacement for 1.5.
Is v1.6 support abandoned, or are some of the newer MLX features helpful for getting it up and running?
Thanks @awni! Llava 1.6 (Next) is coming to MLX VLM soon, alongside a trainer. We currently support:
Hi, please see the following for what I've made so far for converting a Llava checkpoint (https://huggingface.co/llava-hf/llava-1.5-7b-hf) into an MLX implementation.
I chose this model because it was easy to load its accompanying implementation on HF and compare the architecture / model weights side by side.
There are a few other choices I've made to make this easier on the first pass, such as combining all of the safetensors shards into a single PyTorch file so that I could lean on some prior MLX examples implementations. Obviously this could be a bit restrictive as the LLaVA models get bigger and the file size exceeds RAM, but that is one of the several issues I wanted to point out in this draft that I can hopefully get some help fixing.
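For anyone following along, that shard-combining step amounts to roughly the following (file paths are illustrative, not the exact script in this draft):

```python
import glob

import torch
from safetensors.torch import load_file

# Load every safetensors shard of the HF checkpoint, merge the tensors into
# one dict, and dump it as a single PyTorch file.
weights = {}
for shard in sorted(glob.glob("llava-1.5-7b-hf/model-*.safetensors")):
    weights.update(load_file(shard))

torch.save(weights, "llava-1.5-7b-combined.pt")
```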
There are several TODOs in the repo showing these issues. Most urgently, I think, something I am confused by is how to reconcile the supposed number of attention heads (32) with what I'm saving from the downloaded weights.
I tried to outline my work in the notebook 'Local Llava' (which can be deleted in the official PR), which is where I'd suggest anyone able to help get started.
The other major TODO I have left is figuring out how to do the forward pass & model.generate that takes into account text and image inputs, but I figure that can wait until confirming the model is actually loaded correctly.
Improved documentation, README, and tests to come along afterwards.
Thank you in advance to anyone who can help!
Feel free to close this if another implementation comes along that is further ahead :-)
P.S.: Another nice-to-have for this would be the ability to use other multimodal variants that accept different base models, such as Phi and Mistral.