Replies: 31 comments 109 replies
-
In version 2.8.1, the response is in 8 seconds, in the beta version it took more than 20 minutes (I got tired of waiting). The prompt was: Hi there, how are you? The example used was the simple chat example from each version. In the beta version, there is also an error: Cannot read properties of undefined (reading ‘disposed’) if I do not provide a contextSequence. |
Beta Was this translation helpful? Give feedback.
-
But what about the slowness for such simple prompts with the beta version? I’ve been waiting for over an hour to run another test. O código do teste é
|
Beta Was this translation helpful? Give feedback.
-
On 2.8 If I was sending large pieces of text or talked for too long it would break because it was not having more kv slots. Did not notice increase time of inference. PS: Using CPU only |
Beta Was this translation helpful? Give feedback.
-
prior to today, beta3 was able to load mixtral files. as of the latest update, it errors out with
I was mostly using: Can submit a bug report with sample code, but it happens simply during LlamaModel() instantiation |
Beta Was this translation helpful? Give feedback.
-
I'm trying to understand the difference in the handling of the batch size parameter between llama.cpp Hardware: Apple M2 Ultra tl;dr -- mixtral models appear to limit batch sizes to 512 unless you disable gpu layers. seems to be a llama.cpp bug. I can't work around this in node-llama-cpp by setting the batch size equal to the context size as a result. llama.cpp's testing with llama.cpp's
It looks like mixtral won't work with larger batch sizes and gpu. disabling gpu works, but is incredibly slow. testing with node-llama-cpp's
I would expect the two cases where context=4096, batch=512 to work same as llama.cpp |
Beta Was this translation helpful? Give feedback.
-
Hey @stewartoallen, I'm one of the maintainers at LangChain for the JS repo and I noticed Thanks again for adding this, and the rest of your work on this library! |
Beta Was this translation helpful? Give feedback.
-
I am on a Apple M1 Max 32GB. With Seems like its because its setting batchSize = contextSize by default, which I think my system cant handle. Setting an explicit smaller Works
Crashes
|
Beta Was this translation helpful? Give feedback.
-
Heya, looking great so far! One little request I have: when using "auto" for the chatWrapper type, it'd be great to have a public API to see which wrapper was chosen. I'm currently using |
Beta Was this translation helpful? Give feedback.
-
hi @giladgd have you already planned when will be possible to use Grammar and Functions together? |
Beta Was this translation helpful? Give feedback.
-
Is the loading of model synchronous? So far it seems the model gets loaded when a context instance is initiated, and it happens synchronously. Would it be possible to do this asynchronously? |
Beta Was this translation helpful? Give feedback.
-
the sample readme code no longer runs with the latest beta. looks like
|
Beta Was this translation helpful? Give feedback.
-
initialising a model swallows errors This code fails for me (its still something about the
|
Beta Was this translation helpful? Give feedback.
-
How is the chat formatting chosen for a given model when using LlamaChat? From my limited poking around the code, and some experiments with some models, it looks like its looking at the model name to estimate which chat syntax to use. When I run openhermes-2.5-mistral-7b for example, it seems to be using the syntax of Mistral Instruct, even though this model uses ChatML. More specifically, would it be possible to utilize tokenizer.chat_template from metadata to more accurately determine the formatting? |
Beta Was this translation helpful? Give feedback.
-
Trying out the beta and running into an issue with |
Beta Was this translation helpful? Give feedback.
-
@giladgd Hi, may I ask what this mean?
I'm reading code of ChatWrapper, and this is in the test of it. And seems I'm currently using a simple prompt template like this (not using ChatWrapper yet, as I'm still not sure how it works):
And it's genreated result is very bad on qwen1_5-32b-chat-q4_k_m.gguf, I'm not sure what's the problem, maybe llama.cpp requires we using special format like Seems special token is auto generated based on model? So we must use ChatWrapper to ensure this. And It generates some special character between Chinese character, I'm not sure if it is Update: I'm now using https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/openchat.jinja (which is said to compatable with qwen1.5 https://github.com/chujiezheng/chat_templates/blob/main/generation_configs/qwen2-chat.json ) But still get same messy result. const chatWrapper = new JinjaTemplateChatWrapper({
template: "{{ 'System: ' + systemPrompt if systemPrompt else '' }}{{ 'User: ' + userInput if userInput else '' }}",
...templates,
});
const session = new LlamaChatSession({
contextSequence: contextSequenceInstance,
autoDisposeSequence: false,
systemPrompt: sessionOptions?.systemPrompt,
chatWrapper,
});
await session.prompt(completionOptions.prompt, {
...completionOptions,
signal: abortController.signal,
onToken: (tokens) => {
if (modelInstance === undefined) {
abortController.abort();
runnerAbortControllers.delete(conversationID);
subscriber.next({ type: 'result', token: texts.disposed, id: conversationID });
subscriber.complete();
return;
}
updateTimeout();
subscriber.next({ type: 'result', token: modelInstance.detokenize(tokens), id: conversationID });
},
});
|
Beta Was this translation helpful? Give feedback.
-
Sorry if this has been asked before but do you have any plans of adding dynamic temperature into the beta? If not, is this something that can be contributed (and if so, do you have any recommendations)? I've been using it for quite a while now and its impact is noticeable, especially on creativity. Edit: Added two reference links Ref: https://github.com/ggerganov/llama.cpp/pull/4972/files |
Beta Was this translation helpful? Give feedback.
-
Transferred @nathanlesage's commend from #105 (comment):
|
Beta Was this translation helpful? Give feedback.
-
|
Beta Was this translation helpful? Give feedback.
-
Hello and thanks for your work Gilad! I've been implementing an OpenAI compatible API on top of the beta during the last weekends. Some feedback/questions I collected on the way Keeping cancelled completions in context Output issue on longer context / shift Custom stop generation triggers plus chat wrappers |
Beta Was this translation helpful? Give feedback.
-
what is the proper way to unload/reload sessions/context/context-sequences? trying to do the simple thing of pausing and resuming chat sessions. is it simply using get/set chat history on a chat session where chat sessions are mapped to context sequences? working memory is the primary constraint so I'm also trying to understand lifecycle management and the intent for each of these abstractions. |
Beta Was this translation helpful? Give feedback.
-
Hi! Great library, thanks for all the hard work. We have a few questions regarding the beta:
|
Beta Was this translation helpful? Give feedback.
-
Sorry in advance if this is not the right place to ask but Is there an active discord channel for this package/beta? |
Beta Was this translation helpful? Give feedback.
-
In
|
Beta Was this translation helpful? Give feedback.
-
hi @giladgd , I've met a problem when using |
Beta Was this translation helpful? Give feedback.
-
@giladgd how would I take advantage of continuous batching in node llama? Is it on by default if I make multiple async calls to |
Beta Was this translation helpful? Give feedback.
-
I'm having some trouble reconciling llama-server and node-llama-cpp outputs, what are the things I should be looking out for? The outputs of node-llama-cpp are the ones I want and make sense, llama-server on the other hand has two issues, see below. Using node-llama-cppI'm using Inputconst llama = await getLlama({
vramPadding: 0,
debug: false,
});
const model = await llama.loadModel({
modelPath: <model-path>,
gpuLayers: 33
});
const context = await model.createContext({
contextSize: 512,
seed: 9,
});
const contextSequence = context.getSequence();
const session = new LlamaChatSession({
contextSequence,
systemPrompt,
autoDisposeSequence: false,
});
const grammar = new LlamaJsonSchemaGrammar(this.llama, grammar);
const answer = await session.prompt(prompt, options);
console.log(answer) Output{ "feedback": "The response starts with 'Hello', which meets the criterion of saying hello. The rest of the text is irrelevant to this evaluation. Therefore, [Says hello] is True.","result": true } This is what I expect, yay! Using llama-serverThe server is started using this command: llama-server -m model/model.gguf --port 8080 --cont-batching --gpu-layers 33 --ctx-size 16384 --batch-size 16384 --parallel 32 --mlock Here's the version for llama-server:
InputHere's the payload, via {
"prompt": "### Response to evaluate:\n\nHello, world!\n\n### Score Rubrics:\n\n[Says hello]:\n\n- False: The response being evaluated does not meet the criterion described in the square brackets.\n- True: The response being evaluated does meet the criterion described in the square brackets.\n\n### Feedback:",
"temperature": 0,
"seed": 9,
"system_prompt": "You are a fair evaluation assistant tasked with providing clear, objective, self-consistent feedback based on a specific criterion.\n\nYou will be given a response to evaluate, a binary criterion to evaluate against, and (optionally) additional context to consider. You must provide feedback based on the given criterion and the response.\n\nPlease follow these guidelines:\n1. Write a detailed feedback that assess the quality of the response strictly based on the binary criterion.\n2. Your feedback should end by explicitly stating whether the criterion is met, explicitly using the words True or False.\n3. Keep your feedback concise and clear, do not repeat yourself and do not exceed 280 characters for the feedback.",
"json_schema": {
"type": "object",
"properties": {
"feedback": {
"type": "string"
},
"result": {
"type": "boolean"
}
}
}
}
### Output
```json
{
"content": "{\"feedback\": \"The response is a simple 'Hello, world!' message, which does not meet the criterion of saying hello. The response does not explicitly state 'hello' or any variation of it. Therefore, the criterion is not met. False.\"} ",
"id_slot": 0,
"stop": true,
"model": "model/model.gguf",
"tokens_predicted": 54,
"tokens_evaluated": 56,
"generation_settings": {
"n_ctx": 512,
"n_predict": -1,
"model": "model/model.gguf",
"seed": 9,
"temperature": 0.0,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"tfs_z": 1.0,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"penalty_prompt_tokens": [],
"use_penalty_prompt_tokens": false,
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"penalize_nl": false,
"stop": [],
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": false,
"logit_bias": [],
"n_probs": 0,
"min_keep": 0,
"grammar": "boolean ::= (\"true\" | \"false\") space\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nfeedback-kv ::= \"\\\"feedback\\\"\" space \":\" space string\nfeedback-rest ::= ( \",\" space result-kv )?\nresult-kv ::= \"\\\"result\\\"\" space \":\" space boolean\nroot ::= \"{\" space (feedback-kv feedback-rest | result-kv )? \"}\" space\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
"samplers": [
"top_k",
"tfs_z",
"typical_p",
"top_p",
"min_p",
"temperature"
]
},
"prompt": "### Response to evaluate:\n\nHello, world!\n\n### Score Rubrics:\n\n[Says hello]:\n\n- False: The response being evaluated does not meet the criterion described in the square brackets.\n- True: The response being evaluated does meet the criterion described in the square brackets.\n\n### Feedback:",
"truncated": false,
"stopped_eos": true,
"stopped_word": false,
"stopped_limit": false,
"stopping_word": "",
"tokens_cached": 109,
"timings": {
"prompt_n": 56,
"prompt_ms": 405.85,
"prompt_per_token_ms": 7.247321428571429,
"prompt_per_second": 137.98201305901193,
"predicted_n": 54,
"predicted_ms": 2668.323,
"predicted_per_token_ms": 49.41338888888889,
"predicted_per_second": 20.237430026274932
}
} Note that:
Do you know what might be causing this discrepancy? |
Beta Was this translation helpful? Give feedback.
-
Hey @giladgd I'd like to create a The prompt format is fairly simple: Also curious if this is the only adaptation necessary to use a model not already supported by this library or if any other work is necessary. |
Beta Was this translation helpful? Give feedback.
-
I was trying to use the new vulkan support. Here is the output of my
With a basic example I was getting.
I tried setting gpuLayers to 0 and 32 and still got the same issue. Here is the model I was attempting to use. If this is expected or if you need more info let me know. I didn't see anything in the documentation about having to set the contextSize. |
Beta Was this translation helpful? Give feedback.
-
with function calling #139 to get the same beahavior as this function call in python autogen https://github.com/scenaristeur/dady/blob/c239bdf9d8334e719730eb5b4f46ea3d844ca62b/llm/basic_functions_with%20results.py#L82 with many / optionals params ? what is the norme to define params ?
|
Beta Was this translation helpful? Give feedback.
-
I'm closing this thread as version 3 is now released. |
Beta Was this translation helpful? Give feedback.
-
Please share here any feedback you have for the beta of version 3.0
Beta Was this translation helpful? Give feedback.
All reactions