Which parts are missing to support doing advanced stuff like this chunking strategy? #392

Answered by giladgd
Madd0g asked this question in Q&A

Thanks for sharing; this is brilliant!

I recommend reading this documentation to get more background on how generation works. In essence, for a given sequence of tokens, the model generates a probability for each token in the vocabulary to be the next token. Generating a response to a prompt is an iterative process, but it's technically possible to generate the probabilities for all sequence lengths in parallel, which enables a massive speedup.

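To make that distinction concrete, here's a rough TypeScript sketch. It is not node-llama-cpp's actual API; `evaluateAllPositions` is a hypothetical stand-in for a single model forward pass that returns a next-token distribution for every position. It only illustrates why scoring a known sequence is much cheaper than generating it token by token:

```ts
type Token = number;

// Hypothetical: one forward pass over `tokens` that returns, for every
// position i, the model's probability distribution over the vocabulary
// for the token at position i + 1. Not part of node-llama-cpp's API.
type EvaluateAllPositions = (tokens: Token[]) => Promise<Float32Array[]>;

// Iterative generation: every new token requires another forward pass,
// so producing N tokens costs N passes.
async function generateIteratively(
    evaluate: EvaluateAllPositions,
    prompt: Token[],
    sampleNext: (probs: Float32Array) => Token,
    maxTokens: number
): Promise<Token[]> {
    const tokens = [...prompt];
    for (let i = 0; i < maxTokens; i++) {
        const perPosition = await evaluate(tokens);
        const nextTokenProbs = perPosition[perPosition.length - 1];
        tokens.push(sampleNext(nextTokenProbs));
    }
    return tokens;
}

// Scoring a known sequence (e.g. a candidate chunk) needs only a single
// pass: sum the log-probability the model assigned to the token that
// actually follows each position, using all positions from that one pass.
async function scoreSequence(
    evaluate: EvaluateAllPositions,
    tokens: Token[]
): Promise<number> {
    const perPosition = await evaluate(tokens);
    let totalLogProb = 0;
    for (let i = 0; i < tokens.length - 1; i++)
        totalLogProb += Math.log(perPosition[i][tokens[i + 1]]);
    return totalLogProb;
}
```

The second function is the parallel case described above: all per-position distributions come out of the same forward pass, so evaluating a whole sequence costs one pass instead of one pass per token.
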
I think it's worth having some API for these kinds of use cases in node-llama-cpp, but I don't think it's a good idea to expose the full logprobs and other related low-level APIs for this, as it would affect the perf…
