Refactor utils #1 #161
Running my harness, on your latest release:
# from mlx import version as mlx_version
from mlx_vlm import load, generate, version as vlm_version
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
import subprocess
import time
import psutil

# print("mlx version:", mlx_version())
print("mlx-vlm version:", vlm_version)

output = subprocess.check_output(
    ["/opt/homebrew/Caskroom/miniconda/base/envs/mlx/bin/huggingface-cli", "scan-cache"]
)
lines = output.decode("utf-8").split("\n")[2:-4]

for line in lines:
    print(80 * "v")
    model_path = line.split()[0]
    print("\033[1mRunning", model_path, "\033[0m")
    process = psutil.Process()
    mem_before = process.memory_info().rss
    try:
        # Load the model
        model, tokenizer = load(model_path)
        config = load_config(model_path)
    except Exception as e:
        print(f"Failed to load model at {model_path}: {e}")
        continue
    # Prepare input
    image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
    prompt = "Describe this image."
    # Apply chat template
    formatted_prompt = apply_chat_template(
        tokenizer, config, prompt, num_images=len(image)
    )
    # Generate output
    try:
        start_time = time.time()
        output = generate(model, tokenizer, image, formatted_prompt, max_tokens=500, verbose=True)
        end_time = time.time()
        print(output)
    except Exception as e:
        print(f"Failed to generate output for model at {model_path}: {e}")
        continue
    mem_after = process.memory_info().rss
    print(f"Output generated in {end_time - start_time:.2f}s")
    print(f"Memory used: {(mem_after - mem_before) / (1024 * 1024 * 1024):.2f} GB")
    print(80 * "^", end="\n\n")
I get:
I think that you default trust_remote_code to True (which I think should be more cautious, but it doesn't seem to prevent the prompts). But more fundamentally,
Facing the same problem above with 0.1.7
Hey @jrp2014 and @sachinraja13 Here are the changes that you need to make to your script:
Script ✅
# from mlx import version as mlx_version
from mlx_vlm import load, generate, __version__ as vlm_version
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
import mlx.core as mx
import subprocess
import time
import psutil

# print("mlx version:", mlx_version())
print("mlx-vlm version:", vlm_version)

for model_path in [
    "mlx-community/nanoLLaVA-1.5-4bit",
    "mlx-community/Phi-3.5-vision-instruct-4bit",
    "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "HuggingFaceTB/SmolVLM-Instruct",
    "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit",
    "mlx-community/idefics2-8b-4bit"
]:
    print(80 * "v")
    print("\033[1mRunning", model_path, "\033[0m")
    process = psutil.Process()
    mem_before = process.memory_info().rss
    try:
        # Load the model
        trust_remote_code = True
        model, tokenizer = load(model_path, trust_remote_code=trust_remote_code)
        config = load_config(model_path, trust_remote_code=trust_remote_code)
    except Exception as e:
        print(f"Failed to load model at {model_path}: {e}")
        continue
    # Prepare input
    image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
    prompt = "Describe this image."
    # Apply chat template
    formatted_prompt = apply_chat_template(
        tokenizer, config, prompt, num_images=len(image)
    )
    # Generate output
    try:
        start_time = time.time()
        output = generate(model, tokenizer, formatted_prompt, image, verbose=True, max_tokens=500)
        end_time = time.time()
        print(output)
    except Exception as e:
        print(f"Failed to generate output for model at {model_path}: {e}")
        continue
    mem_after = process.memory_info().rss
    print(f"Output generated in {end_time - start_time:.2f}s")
    print(f"Memory used: {(mem_after - mem_before) / (1024 * 1024 * 1024):.2f} GB")
    print(80 * "^", end="\n\n")
    del model, tokenizer
    mx.metal.clear_cache()
Output
mlx-vlm version: 0.1.7
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/nanoLLaVA-1.5-4bit
Fetching 11 files: 100%|██████████| 11/11 [00:50<00:00, 4.59s/it]
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 219701.64it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg']
Prompt: <|im_start|>system
Answer the questions.<|im_end|><|im_start|>user
<image>
Describe this image.<|im_end|><|im_start|>assistant
The image shows a close-up view of two cats lying down on a pink fabric surface. Both cats have a striped pattern on their fur, with the one on the left having a darker shade of brown, and the one on the right having a lighter shade of brown. They are positioned in such a way that the left cat is facing the camera, while the right cat is looking away. The cats are lying on their stomachs, and the fabric surface is slightly wrinkled. The image has a sepia tone, which gives it a vintage or antique look. There are no texts or other objects in the image. The style of the image is a straightforward, candid photograph, capturing a moment of relaxation for the cats.
==========
Prompt: 21 tokens, 75.857 tokens-per-sec
Generation: 146 tokens, 143.235 tokens-per-sec
Peak memory: 1.408 GB
The image shows a close-up view of two cats lying down on a pink fabric surface. Both cats have a striped pattern on their fur, with the one on the left having a darker shade of brown, and the one on the right having a lighter shade of brown. They are positioned in such a way that the left cat is facing the camera, while the right cat is looking away. The cats are lying on their stomachs, and the fabric surface is slightly wrinkled. The image has a sepia tone, which gives it a vintage or antique look. There are no texts or other objects in the image. The style of the image is a straightforward, candid photograph, capturing a moment of relaxation for the cats.
Output generated in 2.06s
Memory used: 0.78 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Phi-3.5-vision-instruct-4bit
Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 80530.64it/s]
/opt/homebrew/Caskroom/miniconda/base/envs/mlx_code/lib/python3.11/site-packages/transformers/models/auto/image_processing_auto.py:524: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 235194.62it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg']
Prompt: <|user|>
<|image_1|>Describe this image.<|end|>
<|assistant|>
The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid black cat. Both cats have their eyes closed, suggesting they are asleep. The couch has a pink cushion, and there are two remote controls on the couch.<|end|>
==========
Prompt: 771 tokens, 614.139 tokens-per-sec
Generation: 83 tokens, 33.288 tokens-per-sec
Peak memory: 3.704 GB
The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid black cat. Both cats have their eyes closed, suggesting they are asleep. The couch has a pink cushion, and there are two remote controls on the couch.<|end|>
Output generated in 4.51s
Memory used: 2.17 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Qwen2-VL-2B-Instruct-4bit
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 186037.68it/s]
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 254902.45it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg']
Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Describe this image.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant
The image shows two cats lying on a pink blanket. The cat on the left is striped with a mix of black, brown, and white, and it is lying on its side with its head resting on the blanket. The cat on the right is also striped, with a mix of black, brown, and white, and it is lying on its back with its head resting on the blanket as well. Both cats appear to be resting or sleeping, and there are two remote controls placed on the blanket next to them.
==========
Prompt: 416 tokens, 732.923 tokens-per-sec
Generation: 105 tokens, 169.277 tokens-per-sec
Peak memory: 3.704 GB
The image shows two cats lying on a pink blanket. The cat on the left is striped with a mix of black, brown, and white, and it is lying on its side with its head resting on the blanket. The cat on the right is also striped, with a mix of black, brown, and white, and it is lying on its back with its head resting on the blanket as well. Both cats appear to be resting or sleeping, and there are two remote controls placed on the blanket next to them.
Output generated in 1.94s
Memory used: 0.58 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running HuggingFaceTB/SmolVLM-Instruct
Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 151601.35it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len.
Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 181049.09it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg']
Prompt: <|im_start|>User:<image>Describe this image.<end_of_utterance>
Assistant:
Two cats are sleeping on a pink blanket.
==========
Prompt: 1195 tokens, 702.406 tokens-per-sec
Generation: 10 tokens, 78.259 tokens-per-sec
Peak memory: 6.007 GB
Two cats are sleeping on a pink blanket.
Output generated in 2.68s
Memory used: 4.22 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.2-11B-Vision-Instruct-4bit
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 145187.45it/s]
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 235929.60it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg']
Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>
Describe this image.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>
The image shows two cats lying on a pink blanket, with two remote controls placed nearby. The cats are positioned in a way that suggests they are watching something on a television, and the remote controls are likely used to control the TV.
* Two cats:
+ One cat is smaller and has a fluffy tail
+ The other cat is larger and has a more mottled coat
+ Both cats are lying on their sides, with their heads turned towards the TV
* Two remote controls:
+ One remote control is placed near the smaller cat
+ The other remote control is placed near the larger cat
+ Both remote controls have a similar design and are likely used to control the TV
* A pink blanket:
+ The blanket is a bright pink color
+ It appears to be made of a soft, plush material
+ The blanket is spread out on a surface, possibly a couch or a bed
Overall, the image suggests that the cats are enjoying a relaxing afternoon, watching something on TV and using the remote controls to control the program.
==========
Prompt: 15 tokens, 2.941 tokens-per-sec
Generation: 221 tokens, 6.223 tokens-per-sec
Peak memory: 16.252 GB
The image shows two cats lying on a pink blanket, with two remote controls placed nearby. The cats are positioned in a way that suggests they are watching something on a television, and the remote controls are likely used to control the TV.
* Two cats:
+ One cat is smaller and has a fluffy tail
+ The other cat is larger and has a more mottled coat
+ Both cats are lying on their sides, with their heads turned towards the TV
* Two remote controls:
+ One remote control is placed near the smaller cat
+ The other remote control is placed near the larger cat
+ Both remote controls have a similar design and are likely used to control the TV
* A pink blanket:
+ The blanket is a bright pink color
+ It appears to be made of a soft, plush material
+ The blanket is spread out on a surface, possibly a couch or a bed
Overall, the image suggests that the cats are enjoying a relaxing afternoon, watching something on TV and using the remote controls to control the program.
Output generated in 41.39s
Memory used: 3.97 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/idefics2-8b-4bit
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 229539.02it/s]
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 169622.59it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg']
Prompt: User: Describe this image.<image><end_of_utterance>
Assistant:
Two house cats are laying on a bed with a pink comforter. They are using remote controllers as toys.<end_of_utterance>
==========
Prompt: 79 tokens, 140.361 tokens-per-sec
Generation: 26 tokens, 44.843 tokens-per-sec
Peak memory: 16.252 GB
Two house cats are laying on a bed with a pink comforter. They are using remote controllers as toys.<end_of_utterance>
Output generated in 1.91s
Memory used: 0.75 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
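For anyone comparing models from a log like the one above, a small parser can collect the per-model statistics into rows. This is only a sketch: the `summarize` function and its field names are made up for illustration, and it assumes the log lines keep exactly the shape shown above ("Running <model>", "Prompt: N tokens, X tokens-per-sec", "Generation: ...", "Peak memory: X GB").

```python
import re

# Patterns assume the exact line shapes seen in the log above.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")  # bold/reset escapes around "Running ..."
RUN_RE = re.compile(r"^Running (\S+)")
STAT_RE = re.compile(r"^(Prompt|Generation): (\d+) tokens, ([\d.]+) tokens-per-sec$")
PEAK_RE = re.compile(r"^Peak memory: ([\d.]+) GB$")


def summarize(log_text):
    """Return one dict per model with token counts, speeds, and peak memory.

    Hypothetical helper, not part of mlx-vlm.
    """
    rows, current = [], None
    for raw in log_text.splitlines():
        line = ANSI_RE.sub("", raw).strip()
        if m := RUN_RE.match(line):
            # A "Running <model>" line starts a new record.
            current = {"model": m.group(1)}
            rows.append(current)
        elif current is None:
            continue  # ignore anything printed before the first model
        elif m := STAT_RE.match(line):
            # "Prompt: <text>" lines do not match, only the numeric stats do.
            key = m.group(1).lower()
            current[f"{key}_tokens"] = int(m.group(2))
            current[f"{key}_tps"] = float(m.group(3))
        elif m := PEAK_RE.match(line):
            current["peak_gb"] = float(m.group(1))
    return rows


sample = """\x1b[1mRunning mlx-community/nanoLLaVA-1.5-4bit \x1b[0m
Prompt: <|im_start|>system
Prompt: 21 tokens, 75.857 tokens-per-sec
Generation: 146 tokens, 143.235 tokens-per-sec
Peak memory: 1.408 GB
"""

for row in summarize(sample):
    print(row)
```

One thing to keep in mind when reading the numbers: in the log above the peak memory value repeats across consecutive models (3.704 GB for both Phi-3.5 and Qwen2-VL, 16.252 GB for both Llama-3.2 and idefics2), which suggests it is a running high-water mark for the whole process rather than a per-model figure.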
Note
@jrp2014 and @sachinraja13
This is great, thank you so much @Blaizzy!
My pleasure! Happy new year in advance 🚀
This PR is the first of many to clean up the code base and simplify it.
It will make it easier to add new features such as image/video feature caching, KV cache quantization, and more.
Closes #160
Closes #94
Closes #135
Closes #144