Can you please run:

---
Thanks @giladgd!

---
Try running with a smaller context size and see whether it runs better on your machine:

```
npx node-llama-cpp chat Llama-3.3-70B-Instruct-Q4_K_M.gguf --contextSize 10000
```

Also, try running with fewer threads:

```
npx node-llama-cpp chat Llama-3.3-70B-Instruct-Q4_K_M.gguf --threads 4
```

Please let me know whether any of these helped you.

It would also help me investigate this issue if you can run this command and share its output:

```
npx -y node-llama-cpp inspect measure Llama-3.3-70B-Instruct-Q4_K_M.gguf
```
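For anyone using the library API rather than the CLI, the same two knobs can be set when creating the context. This is a minimal sketch, assuming node-llama-cpp v3's `getLlama` / `loadModel` / `createContext` API and that `createContext` accepts `contextSize` and `threads` options; the model path and prompt are placeholders:

```typescript
import path from "node:path";
import {fileURLToPath} from "node:url";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();

// Placeholder path - point this at your local GGUF file
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "Llama-3.3-70B-Instruct-Q4_K_M.gguf")
});

// Rough equivalent of --contextSize and --threads on the CLI:
// cap the context size instead of letting it default to something larger,
// and limit the number of evaluation threads
const context = await model.createContext({
    contextSize: 10000,
    threads: 4
});

const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

console.log(await session.prompt("Hi there, how are you?"));
```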
---
Ahhh! Thanks so much @giladgd! The context size change fixed it :) I should have noticed that llama.cpp has a default context length of 4096, but even with 10000 it's already feeling much better. Thanks again :) Here's the output requested as well:

---
If I try to run a 4-bit quantized Llama 3.3 70B (https://huggingface.co/bartowski/Llama-3.3-70B-Instruct-GGUF/resolve/main/Llama-3.3-70B-Instruct-Q4_K_M.gguf), it seems to run MUCH slower through node-llama-cpp than it does through llama.cpp directly.

If you try:

Compared to:

The token generation speed doesn't necessarily seem that far off between the two, but the node version seems to lag my system significantly (with noticeable delays in Finder, etc.).

Does anyone know why that is?