Add LLaVA support, modify generate function #820
base: dev
Conversation
…parate functions get_pos_offset and get_residual
@zazamrykh I saw the initial struggles with the branch selection (dev-3) and appreciate the effort you're making to get this PR in shape for dev. It will be incredibly useful to have TransformerLens also work with LLaVA (vision) models. Just wanted to show some early appreciation for the effort, in the hope you'll stick it out and not abandon it midway from "pulling your hair" in frustration while not knowing if anyone else really cares. I care :)
I am very glad to hear such words of support :) Changes in one function invoke changes in others, so I've ended up changing some functions that I initially hadn't planned to touch. Then I broke the generate function, then I thought I had fixed it, but now some tests are not passing! I'll try to work out why the tests are failing. If you want to investigate VLMs now, I think you can use the last commit of the feature-llava-support branch from the forked repository, but I hope LLaVA support appears in TransformerLens soon!
@zazamrykh I merged dev into your PR earlier today and took a look at the changes. This is a pretty big change, so it may require a bit of testing. As for the current test failure, it very well might have something to do with a float being rounded down at some point. There are quite a few places where ints are now accepted as params in this change, so if you need some ideas of where to start looking, that may be a good place. Also, I really like splitting out the insides of the generate function. Just make sure to add proper docstrings before this is merged, so that those functions are documented properly in the public docs. If you need help wrapping anything up, don't hesitate to ask. I took a quick glance just to get an idea of the size of the change, but I have not done a detailed look at the changes. I normally wait until the tests are passing to do that, but I am happy to help resolve any outstanding tasks if need be.
Very interesting work @zazamrykh! If I understand it correctly, you use a vision model
Thank you for your support! Yes, you understand everything correctly. LLaVA contains a vision_tower (CLIP), a language_model (Vicuna) and a multi_modal_projector. So LLaVA from Hugging Face is required for 2 things:
Thus, the two models come from the same HF LLaVA. As I remember, LLaVA uses Vicuna (a fine-tuned LLaMA) with an expanded vocab size for the special token designations. And thank you for your question, it forced me to think about additional changes which maybe should be added to TransformerLens :)
While answering the previous question I got some new thoughts. Again, for a VLM to work with TransformerLens we need
For point 1) we should use something like gel_llm_input_embeddings from https://github.com/TransformerLensOrg/TransformerLens/blob/792ae653169251d8d1f5f2d46134fc9de4bd3fc6/demos/LLaVA.ipynb, and for point 2) the corresponding code from the same link.
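As a rough illustration of point 1), here is a hedged sketch (not the notebook's exact code; `get_image_embeddings` and the checkpoint id are assumptions) of how the HF LLaVA modules mentioned above could produce image embeddings in the language model's embedding space:

```python
# Hedged sketch: turn an image into embeddings that live in the language
# model's embedding space, using the HF LLaVA modules discussed above.
# The function name and checkpoint id are illustrative assumptions.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
hf_llava = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

def get_image_embeddings(image) -> torch.Tensor:
    """Run CLIP patch features through the multi-modal projector."""
    pixel_values = processor.image_processor(image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(hf_llava.device, hf_llava.dtype)
    vision_out = hf_llava.vision_tower(pixel_values, output_hidden_states=True)
    patch_features = vision_out.hidden_states[-2][:, 1:]  # LLaVA default: layer -2, drop CLS
    return hf_llava.multi_modal_projector(patch_features)

# These image embeddings can then be spliced with the text embeddings (from the
# language model's own embedding matrix) and passed to HookedTransformer.
```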
@zazamrykh I am kind of wondering if it makes sense to move the vision-specific functionality to a
@bryce13950 So, you propose inheriting from
@zazamrykh I agree with your approach on this! |
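For concreteness, a very rough sketch of the subclassing idea under discussion (the class name `HookedVLM` and its methods are hypothetical, not existing TransformerLens API):

```python
# Hypothetical sketch: keep HookedTransformer text-only and put the
# vision-specific pieces in a dedicated subclass. Names are assumptions.
import torch
from transformer_lens import HookedTransformer

class HookedVLM(HookedTransformer):
    """Hypothetical subclass that owns the vision tower and projector."""

    def set_vision_modules(self, vision_tower, projector):
        # Keep the HF vision tower and projector alongside the language model.
        self.vision_tower = vision_tower
        self.projector = projector

    def inputs_from_image_and_tokens(self, image_embeds, tokens):
        # Embed text with the language model's own embedding matrix and
        # prepend the projected image patches (real LLaVA splices them at the
        # <image> placeholder position instead of simply prepending).
        text_embeds = self.embed(tokens)
        return torch.cat([image_embeds, text_embeds], dim=1)
```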
@zazamrykh Your notebook made me wonder whether there could be a better way to show multimodal attention patterns. I forked the
That's undoubtedly interesting! Some time ago I saw an article with a similar attention visualisation between text and image tokens, but I can't find it now :( Your work looks great. I think that's difficult, but very useful for multimodal model exploration.
This is a continuation of #818. The PR is now made against the dev branch.
I've also decided it would be a good idea for the generate function to work with any input type and return type. I decided it would be more readable and simpler if generate built the sequence using embeddings as the input to the forward function, because strings and tokens can be cast to embeddings, but embeddings cannot be cast back to tokens or strings. So I made it possible to pass any input (string, list of strings, tokens, embeddings) and get any output type from the function. Maybe there is something I didn't do correctly; for example, I'm not sure whether my implementation handles positional embeddings properly.
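To make the intent concrete, here is a simplified, hedged sketch of that dispatch logic (greedy sampling, no KV cache, and not the exact code in this PR; how positional embeddings are added when starting from embeddings is the open question above):

```python
# Simplified sketch: normalise any input to embeddings, generate on embeddings,
# then convert back only if the caller asked for text or tokens.
import torch

def generate_any(model, inp, max_new_tokens: int = 20, return_type: str = "input"):
    # 1) Normalise every input form to embeddings.
    if isinstance(inp, (str, list)):               # string or list of strings
        tokens = model.to_tokens(inp)
        embeds, input_type = model.embed(tokens), "str"
    elif inp.dtype in (torch.int32, torch.int64):  # token ids
        tokens, embeds, input_type = inp, model.embed(inp), "tokens"
    else:                                          # already embeddings
        tokens, embeds, input_type = None, inp, "embeds"

    # 2) Generate greedily on embeddings. Note: start_at_layer=0 treats the
    #    input as the residual stream, so positional embeddings would need to
    #    be added here first; that is exactly the open question above.
    new_tokens = []
    for _ in range(max_new_tokens):
        logits = model(embeds, start_at_layer=0)
        next_token = logits[:, -1].argmax(dim=-1)
        new_tokens.append(next_token)
        embeds = torch.cat([embeds, model.embed(next_token[:, None])], dim=1)
    new_tokens = torch.stack(new_tokens, dim=1)

    # 3) Convert back to whatever the caller asked for.
    if return_type == "input":
        return_type = input_type
    if return_type == "str":
        return model.to_string(new_tokens)
    return new_tokens if return_type == "tokens" else embeds
```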
I've hit a problem: the LLaVA processor = AutoProcessor.from_pretrained(model_id, revision="a272c74") does not have the apply_chat_template method at the transformers version pinned by the project, so the transformers library needs to be updated for this method to work.
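For reference, this is the kind of call that fails on the pinned version (the checkpoint id and conversation content are illustrative; the revision hash is the one from the comment above, and the exact minimum transformers version needed is an assumption):

```python
from transformers import AutoProcessor

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id, revision="a272c74")

conversation = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "What is shown in this image?"}]},
]
# AttributeError on older transformers; works after upgrading the library.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
```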
I've made a test covering every combination of input and return types with the GPT-2 model, and it passes. The outputs of GPT-2 look quite poor, though maybe that is normal for GPT-2.
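A hedged sketch of that kind of round-trip check (not the actual test in this PR; it assumes the extended generate signature from this branch, where embeddings are accepted as input and `return_type` can be "embeds"):

```python
# Sketch: every input form crossed with every return type should at least run.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "The quick brown fox"
tokens = model.to_tokens(prompt)
embeds = model.embed(tokens)

for inp in (prompt, tokens, embeds):
    for return_type in ("str", "tokens", "embeds"):
        out = model.generate(inp, max_new_tokens=5, return_type=return_type)
        assert out is not None
```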
I've also rerun the whole LLaVA.ipynb notebook and additionally tested the other output types of the generate function, and everything worked pretty well.
New feature (non-breaking change which adds functionality)
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works