
Add LLaVA support, modify generate function #820

Open
wants to merge 3 commits into base: dev
Conversation

zazamrykh

This is a continuation of the #818 topic; the PR now targets the dev branch.
I've also decided it would be a good idea for the generate function to work with any input and return type. I decided it would be more readable and simpler if generate builds the sequence using embeddings as the input to the forward function, because strings and tokens can be cast to embeddings, but embeddings cannot be cast back to tokens or strings. So I made it possible to give any input (string, list of strings, tokens, embeddings) and get any output type from the function. Maybe there is something I didn't get right; for example, I'm not sure whether my implementation handles positional embeddings correctly.
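Roughly, the flow I have in mind looks like this (a simplified sketch, not the actual code in the PR; to_tokens / embed / forward / to_string stand in for whatever the real implementation calls):

```python
import torch

def generate(model, inp, return_type="str", max_new_tokens=10):
    """Simplified sketch: normalise any input to embeddings, generate greedily,
    then cast the result back to the requested return type."""
    # Strings and tokens can always be cast "down" to embeddings, so embeddings
    # are the common representation fed to forward().
    if isinstance(inp, (str, list)):
        tokens = model.to_tokens(inp)              # [batch, pos]
        embeds = model.embed(tokens)               # [batch, pos, d_model]
    elif inp.dtype in (torch.int32, torch.int64):
        tokens = inp
        embeds = model.embed(tokens)
    else:
        tokens = None                              # started from embeddings:
        embeds = inp                               # no tokens to recover

    for _ in range(max_new_tokens):
        # NOTE: positional embeddings are glossed over here, as mentioned above.
        logits = model.forward(embeds)             # [batch, pos, d_vocab]
        next_token = logits[:, -1].argmax(dim=-1)  # greedy, for brevity
        embeds = torch.cat([embeds, model.embed(next_token[:, None])], dim=1)
        if tokens is not None:
            tokens = torch.cat([tokens, next_token[:, None]], dim=1)

    # Embeddings cannot be cast back to tokens/strings, so those return types
    # are only available when the input itself was a string or tokens.
    if return_type == "embeds" or tokens is None:
        return embeds
    if return_type == "tokens":
        return tokens
    return model.to_string(tokens)
```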

I ran into a problem: the LLaVA processor (processor = AutoProcessor.from_pretrained(model_id, revision="a272c74")) does not have an apply_chat_template method at the transformers version pinned by the project, so the transformers library needs to be updated for this method to work.
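For reference, after updating transformers the call would look roughly like this (a sketch; the model id and conversation format here are illustrative and may differ from what the notebook actually uses):

```python
from transformers import AutoProcessor

model_id = "llava-hf/llava-1.5-7b-hf"  # example id, not necessarily the one used in the PR
processor = AutoProcessor.from_pretrained(model_id)

conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How many huskies are in the picture?"},
    ]},
]
# apply_chat_template is only available on recent transformers releases;
# the version currently pinned by the project does not have it.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
```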

I've written a test covering every combination of input and return types with the GPT-2 model, and it passes. The GPT-2 outputs look very poor, though maybe that is normal for GPT-2:
[screenshot: sample GPT-2 generation outputs]
I've also rerun the whole LLaVA.ipynb notebook and tested the other output types of the generate function, and everything worked pretty well.

  • New feature (non-breaking change which adds functionality)

  • I have commented my code, particularly in hard-to-understand areas

  • I have made corresponding changes to the documentation

  • My changes generate no new warnings

  • I have added tests that prove my fix is effective or that my feature works

@jojje

jojje commented Dec 27, 2024

@zazamrykh I saw the initial struggles with the branch selection (dev-3) and appreciate the effort you're making to get this PR in shape for dev. Will be incredibly useful to have transformerlens work also with llava (vision) models. Just wanted to show some early appreciation for the effort in the hope you'll stick it out and not abandon the effort midway due to "pulling your hair" in frustration, while not knowing if anyone else really cares. I care :)

@zazamrykh
Author

> @zazamrykh I saw the initial struggles with the branch selection (dev-3) and appreciate the effort you're making to get this PR in shape for dev. Will be incredibly useful to have transformerlens work also with llava (vision) models. Just wanted to show some early appreciation for the effort in the hope you'll stick it out and not abandon the effort midway due to "pulling your hair" in frustration, while not knowing if anyone else really cares. I care :)

I am very glad to hear such words of support :) Changes in one function trigger changes in others, so I've ended up changing some functions I initially hadn't planned to touch. Then I broke the generate function, then (I thought) I'd fixed it, but now some tests are not passing! I'll try to figure out why those tests fail. If you want to investigate VLMs now, I think you can take the last commit of the feature-llava-support branch from my forked repository, but I hope LLaVA support appears in TransformerLens soon!

@bryce13950
Collaborator

@zazamrykh I merged dev into your PR earlier today, and took a look at the changes. This is a pretty big change, so it may require a bit of testing. As for the current test failure, it very well might have something to do with a float being rounded down at some point. There are quite a few places where ints are now accepted as params in this change, so if you need some ideas of where to start looking, that may be a good place.

Also, I really like splitting out the insides of the generate function. Just make sure to add proper doc strings before this is merged, so that those functions are documented properly in the public docs.

If you need help wrapping anything up, don't hesitate to ask. I took a quick glance just to get an idea of the size of the change, but I have not done a detailed look at the changes. I normally wait until the tests are passing to do that, but I am happy to help resolve any outstanding tasks if need be.

@tomasruizt

tomasruizt commented Jan 6, 2025

Very interesting work @zazamrykh ! If I understand it correctly, you use a vision model LLaVa which is not supported by HookedTransformer to turn the prompt incl. image into embeddings. And then you pass the embeddings as an input to a model that is supported by HookedTransformer (llama-7b-hf). The two models have to fit together, right? I assume that llama-7b-hf is the decoder that LLaVa was trained with?

@zazamrykh
Author

> Very interesting work @zazamrykh ! If I understand it correctly, you use a vision model LLaVa which is not supported by HookedTransformer to turn the prompt incl. image into embeddings. And then you pass the embeddings as an input to a model that is supported by HookedTransformer (llama-7b-hf). The two models have to fit together, right? I assume that llama-7b-hf is the decoder that LLaVa was trained with?

Thank you for your support! Yes, you understand everything correctly. LLaVA contains a vision_tower (CLIP), a language_model (Vicuna), and a multi_modal_projector. So the Hugging Face LLaVA is required for two things:

  1. Transform the image into patch embeddings using llava.vision_tower (CLIP), project them into the language_model space, and merge them with the textual embeddings. This is implemented in the function get_llm_input_embeddings (third cell) of the demo: https://github.com/TransformerLensOrg/TransformerLens/blob/792ae653169251d8d1f5f2d46134fc9de4bd3fc6/demos/LLaVA.ipynb
  2. Load the Hugging Face LLaVA weights into a HookedTransformer and feed the merged embeddings into that model.

Thus, both models come from the same HF LLaVA. As I remember, LLaVA uses Vicuna (a fine-tuned LLaMA) with an expanded vocab size to accommodate special tokens. And thank you for your question, it forced me to think about additional changes which maybe should be added to TransformerLens :) A simplified sketch of the embedding merge from point 1 is below.
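Conceptually, get_llm_input_embeddings does something like this (simplified; the real code in the notebook selects a specific vision layer and splices the image embeddings in at the <image> placeholder position, which is omitted here):

```python
import torch

def get_llm_input_embeddings(llava, processor, image, prompt):
    """Conceptual sketch of building the embedding sequence the language model sees."""
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    # 1) Image -> patch embeddings via the CLIP vision tower, then project them
    #    into the language model's embedding space.
    vision_out = llava.vision_tower(inputs["pixel_values"])
    image_embeds = llava.multi_modal_projector(vision_out.last_hidden_state)

    # 2) Text tokens -> embeddings via the language model's input embedding table.
    text_embeds = llava.language_model.get_input_embeddings()(inputs["input_ids"])

    # 3) Union: simply concatenated here for illustration; the real implementation
    #    inserts image_embeds where the <image> token sits in the prompt.
    return torch.cat([image_embeds, text_embeds], dim=1)
```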

@zazamrykh
Author

@bryce13950

While answering the previous question I had some new thoughts. Again, for a VLM to work with TransformerLens we need to:

  1. Transform the image and text into a common HookedTransformer input, using the HF llava.vision_tower, llava.multi_modal_projector, the tokenizer (processor), and the embed and pos_embed (if needed) layers from the language_model. Quite a lot of machinery!
  2. Load the weights from the HF llava.language_model (Vicuna, i.e. fine-tuned LLaMA) into a HookedTransformer.

For point 1 we should use something like get_llm_input_embeddings from https://github.com/TransformerLensOrg/TransformerLens/blob/792ae653169251d8d1f5f2d46134fc9de4bd3fc6/demos/LLaVA.ipynb, and for point 2 the corresponding loading code from the same notebook.
In my opinion, it would be a good idea to make get_llm_input_embeddings and the model loading part of some class. Maybe we should inherit from HookedTransformer and implement a HookedVisualTransformer with a get_llm_input_embeddings method, and/or override the forward and generate functions to support image input and call get_llm_input_embeddings under the hood. Maybe get_llm_input_embeddings should also be renamed to simply get_embeddings or get_input_embeddings. @bryce13950, what do you think about it? If it sounds good to you too, should the changes I made in HookedTransformer.py stay in HookedTransformer, or should they be moved fully or partly into HookedVisualTransformer? And should inheritance or delegation (maybe making HookedTransformer an attribute of HookedVisualTransformer) be used in that case? You could implement it yourself if the number of changes is too big to handle, or you can explain to me how it is supposed to be. A rough skeleton of the inheritance option is sketched below.
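A purely illustrative sketch of that HookedVisualTransformer idea (names and signatures are not final, and the method bodies are stubs):

```python
from transformer_lens import HookedTransformer


class HookedVisualTransformer(HookedTransformer):
    """Illustrative skeleton only: names and signatures are not final."""

    def __init__(self, cfg, vision_tower, multi_modal_projector, processor, **kwargs):
        super().__init__(cfg, **kwargs)
        self.vision_tower = vision_tower
        self.multi_modal_projector = multi_modal_projector
        self.processor = processor

    def get_input_embeddings(self, text, images=None):
        # Build the combined text+image embedding sequence, as in the notebook's
        # get_llm_input_embeddings (vision tower -> projector -> merge with text).
        ...

    def generate(self, inp, images=None, **kwargs):
        # With images, cast everything to embeddings first, then reuse the
        # (now input-type-agnostic) HookedTransformer.generate unchanged.
        if images is not None:
            inp = self.get_input_embeddings(inp, images)
        return super().generate(inp, **kwargs)
```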

@bryce13950
Collaborator

@zazamrykh I am kind of wondering if it makes sense to move the vision specific functionality to a HookedVisionTransformer or something of the sort, similar to how there is a HookedDecoder, and HookedEncoderDecoder. Fundamentally, what you are doing is extremely similar to the existing Transformer, but there is a layer for managing the images, which is wholly unique to vision models. If we simply extended HookedTransformer, we could basically create new functions to actually use the model, without being restrained by the existing structure, while also ensuring that we are not overcomplicating the Transformer structure with vision specific functionality. That would kind of free you up to really do whatever you feel is best for the input, while still being able to rely on the existing Transformer implementation. Any thoughts on this approach?

@zazamrykh
Author

@bryce13950 So you propose to inherit from HookedTransformer and create a HookedVisionTransformer that can take image and text as input, return cached activations, and otherwise work like a HookedTransformer. I think that is good. But I think the changes I've made in HookedTransformer should remain, i.e. the changes in the generate function which allow any input type and any output type. If those changes were cut, generate would have to be overridden and there would be a lot of copy-paste in it.
And I have a question: can I somehow run the tests on GitHub's servers automatically before I push changes, or maybe even just the specific test that is currently failing? I don't have a powerful PC at the moment, and even when I did, waiting for a failing test took a long time.

@bryce13950
Collaborator

@zazamrykh I agree with your approach on this!
As for the question about GitHub, there really isn't too much that can be done for that. These sorts of tests just take a very long time by the very nature of what is being tested. However, I do have access to a pro Colab account that has been made available to people helping with TransformerLens. You are limited to notebooks, but you can find a way to test quite a bit with Colab. You are more than welcome to use that account if you would like. DM me on Slack if you want me to get you set up on it.

@tomasruizt

@zazamrykh Your notebook made me wonder whether there could be a better way to show multimodal attention patterns. I forked the circuitsvis library and prototyped it. The idea is that when we ask a question like "how many huskies?" the model should be attending to the relevant image tokens. Let me know if this is interesting :)

https://github.com/tomasruizt/CircuitsVis

[screenshot: prototype multimodal attention pattern visualization]
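On the TransformerLens side, the idea is essentially the standard circuitsvis attention call, just with placeholder labels for the image-patch positions (a toy example with random data, only to show the shape of the call):

```python
import torch
import circuitsvis as cv

# Toy example: a few text tokens followed by placeholder labels for image patches.
str_tokens = ["<s>", "How", " many", " huskies", "?"] + [f"<img_{i}>" for i in range(3)]
n_tokens = len(str_tokens)

# In practice this comes from the HookedTransformer cache, e.g. cache["pattern", layer];
# random values here just to keep the example self-contained.
pattern = torch.rand(8, n_tokens, n_tokens)  # [n_heads, dest_pos, src_pos]

cv.attention.attention_patterns(tokens=str_tokens, attention=pattern)
```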

@zazamrykh
Author

> @zazamrykh Your notebook made me wonder whether there could be a better way to show multimodal attention patterns. I forked the circuitsvis library and prototyped it. The idea is that when we ask a question like "how many huskies?" the model should be attending to the relevant image tokens. Let me know if this is interesting :)
>
> https://github.com/tomasruizt/CircuitsVis

That's undoubtedly interesting! Some time ago I saw an article with a similar attention visualisation between text and image tokens, but I can't find it now :( Your work looks great. I think that's difficult, but very useful for exploring multimodal models.
