How to continue a conversation with more images? #68

Open

simonw opened this issue Sep 29, 2024 · 8 comments

simonw commented Sep 29, 2024

It's not clear to me from looking at the code if this library supports the following pattern:

prompt 1: IMAGE1 - describe this image

... first response

prompt 2: IMAGE2 - compare with this image

... second response

Is this something the library can or could do? I'm interested in being able to implement multi-step conversations where images might be attached to future messages.
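To make the shape concrete, something like this hypothetical flow (none of these calls exist in the library today; it's just an illustration of the pattern above):

# Hypothetical sketch only -- illustrates the multi-turn, multi-image
# pattern above; mlx-vlm has no chat/continuation API like this today.
history = []

# Turn 1: IMAGE1 attached to the first user message
history.append({"role": "user", "content": "Describe this image",
                "images": ["image1.png"]})
response_1 = chat(model, processor, history)  # hypothetical helper
history.append({"role": "assistant", "content": response_1})

# Turn 2: IMAGE2 attached to a later message in the same conversation
history.append({"role": "user", "content": "Compare with this image",
                "images": ["image2.png"]})
response_2 = chat(model, processor, history)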

Blaizzy (Owner) commented Sep 29, 2024

Not yet. It's one of the things I want to add next.

My focus at the moment is on the trainer and new models (Pixtral, Llama, and Molmo).

Blaizzy (Owner) commented Sep 29, 2024

It would be awesome if you could implement this

I would be more than happy to help, review and merge the PR🚀

mark-lord (Contributor) commented

+1, would love to see this implemented

Blaizzy (Owner) commented Oct 11, 2024

I think this will be easier and faster to do after I release prompt caching.

That way you only compute the KV cache for the last message.
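Roughly the idea (a sketch with hypothetical names; make_prompt_cache and the prompt_cache argument don't exist in mlx-vlm yet):

# Sketch of KV/prompt caching across chat turns -- hypothetical API,
# for illustration only; mlx-vlm does not expose this yet.
cache = make_prompt_cache(model)  # empty KV cache, reused across turns

# Turn 1: prefill the full prompt (text + IMAGE1) once; the KV state
# is stored in `cache`.
reply_1 = generate(model, processor, prompt_1, image_1, prompt_cache=cache)

# Turn 2: only the new message (text + IMAGE2) needs a forward pass;
# everything before it is served from `cache`.
reply_2 = generate(model, processor, prompt_2, image_2, prompt_cache=cache)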

Blaizzy (Owner) commented Oct 28, 2024

Hey guys,

I thought about it, and here is an example you could use to build this use case.

I will work on a more robust example, showcase the different models that support it, and add it as a chat CLI tool in the next release :)

The idea is to add the image tag only to the last user message in the conversation list, alongside the latest image.

import time

import mlx.core as mx
from mlx_vlm import load
from mlx_vlm.utils import generate_step, load_image

model_mlx, processor = load("mlx-community/idefics2-8b-4bit")


# Load the image (local path or URL)
url = "/path/to/your/image"
image = load_image(url)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": """The image shows a colorful chameleon sitting on a vibrant flower. The chameleon has a blue body with vibrant green and red stripes, and its eyes are wide open, giving it a curious and alert expression. The flower has a mix of pink, yellow, and red petals, adding to the vividness of the scene."""}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare this image to the previous one."},
            {"type": "image"} # used on the last user message in the list
        ]
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="np"
)

pixel_values = mx.array(inputs['pixel_values'])
input_ids = mx.array(inputs['input_ids'])
mask = mx.array(inputs['attention_mask'])

max_tokens = 1000
verbose = False # Set to True to stream the output

# Get the prompt tokens and the tokenizer
prompt_tokens = mx.array(processor.tokenizer.encode(text_prompt))
tokenizer = processor.tokenizer

# Initialize timing and detokenizer
tic = time.perf_counter()
detokenizer = processor.detokenizer
detokenizer.reset()

# Generate tokens
generator = generate_step(
    input_ids,
    model_mlx,
    pixel_values,
    mask,
    temperature=0.7,
)

prompt_time = 0
token_count = 0  # guard: stays 0 if the generator yields no tokens
for (token, prob), n in zip(generator, range(max_tokens)):

    if n == 0:
        prompt_time = time.perf_counter() - tic
        tic = time.perf_counter()

    if token == tokenizer.eos_token_id and n > 0:
        break

    detokenizer.add_token(token)

    if verbose:
        print(detokenizer.last_segment, end="", flush=True)

    token_count = n + 1

detokenizer.finalize()

if verbose:
    print(detokenizer.last_segment, flush=True)
    gen_time = time.perf_counter() - tic
    print("=" * 10)
    if token_count == 0:
        print("No tokens generated for this prompt")
    prompt_tps = prompt_tokens.size / prompt_time
    gen_tps = (token_count - 1) / gen_time

    print(f"Prompt: {prompt_tps:.3f} tokens-per-sec")
    print(f"Generation: {gen_tps:.3f} tokens-per-sec")

# Print the generated text
print(detokenizer.text)
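To keep the chat going, follow the same pattern: append the reply, move the image tag to the newest user message, and attach only the newest image. A sketch continuing the script above (the follow-up prompt and image path are placeholders; whether older images must be re-sent depends on the model's chat template):

# Continue the conversation: append the model's reply as an assistant turn.
conversation.append({
    "role": "assistant",
    "content": [{"type": "text", "text": detokenizer.text}],
})

# Strip the image tag from the previous user message so that only the
# newest user message carries one.
conversation[-2]["content"] = [
    c for c in conversation[-2]["content"] if c["type"] != "image"
]

# Add the next user turn with the new image tag.
new_image = load_image("/path/to/your/next/image")
conversation.append({
    "role": "user",
    "content": [
        {"type": "text", "text": "What changed between the two images?"},
        {"type": "image"},  # tag goes on the latest user message only
    ],
})

# Re-run the same preprocessing and generation loop with the new inputs.
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt], images=[new_image], padding=True, return_tensors="np"
)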

Blaizzy (Owner) commented Oct 28, 2024

Example output:

[Screenshot: example chat output, 2024-10-28]

simonw (Author) commented Oct 29, 2024

Looks like there's new code for chat in this branch: https://github.com/Blaizzy/mlx-vlm/tree/pc/video - e.g. 810fb53

Blaizzy (Owner) commented Oct 29, 2024

Yes, there is :)
