Replies: 22 comments
-
We explored the direction but ultimately decided against pursuing it. The decision was driven by the fact that most OpenAI-like solutions offer no control over individual decoding steps, which is exactly where code-completion-specific optimizations apply. For instance, handling long lists of stop words and applying grammar constraints at each decoding step becomes challenging. Revisiting this approach might be viable if a decoding-step-level API becomes widely adopted in the future.
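To make the point concrete, here is a minimal, self-contained sketch (hypothetical code, not Tabby's implementation) of what step-level control buys you: stop sequences can be checked after every decoding step so generation halts immediately, and a grammar constraint could hook in at the same point before each sampling step. A one-shot OpenAI-style completion endpoint exposes neither hook.

```rust
// Hypothetical sketch, not Tabby's code: a decoding loop with a per-step hook.
// `next_piece` stands in for one decoding step returning the next detokenized
// piece of text; a grammar constraint would similarly mask tokens before each step.
fn generate_with_stops<F>(mut next_piece: F, stop_words: &[&str], max_steps: usize) -> String
where
    F: FnMut() -> Option<&'static str>,
{
    let mut output = String::new();
    for _ in 0..max_steps {
        let Some(piece) = next_piece() else { break }; // end of stream
        output.push_str(piece);
        if let Some(s) = stop_words.iter().copied().find(|&s| output.ends_with(s)) {
            output.truncate(output.len() - s.len()); // drop the stop sequence itself
            break; // halt immediately instead of over-generating and truncating later
        }
    }
    output
}

fn main() {
    // A fake "model" emitting fixed pieces, just to exercise the loop.
    let mut pieces = ["fn add(a: i32, b: i32) -> i32 {", " a + b", " }", "\n", "\n", "fn main"].into_iter();
    let completion = generate_with_stops(|| pieces.next(), &["\n\n"], 64);
    print!("{completion}");
}
```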
-
Do you think it's so hard and complex that it isn't even worth formulating as a protocol? Something like a Tabby Inference Protocol: either an inference engine supports it, or it isn't compatible. If you pick a protocol that works well for your code, it shouldn't be hard for others to support it, and in the long run it may benefit Tabby's codebase as well.
-
:) It's not so much about complexity as it is about capability. With an interface like OpenAI's, we relinquish control over the intermediate decoding steps. Many optimizations that the current Tabby relies on cannot be easily implemented, for instance a lengthy stop-word list.
-
I see no problem implementing even a very long stop-word dictionary in my setup, including long lists of long stop words with efficient stopping/lookup. I've been through "a lot" with open models 😂
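For illustration only (not the commenter's actual setup, and not something Tabby prescribes): one common way to keep a long stop list cheap to check is to compile it into a single automaton, for example with the aho-corasick crate (1.x), so that matching cost per scanned byte no longer grows with the number of stop words. The stop words below are made up for the example.

```rust
use aho_corasick::AhoCorasick;

fn main() {
    // Example stop list; in practice this could be dozens of entries.
    let stop_words = ["\n\nfn ", "\n\nclass ", "\n\ndef ", "</s>", "<|endoftext|>"];
    // Compile the whole list into one automaton, once.
    let ac = AhoCorasick::new(stop_words).expect("valid patterns");

    let generated = "fn add(a: i32, b: i32) -> i32 { a + b }\n\nfn main() {";
    match ac.find(generated) {
        // Truncate at the first stop sequence found.
        Some(m) => println!("{}", &generated[..m.start()]),
        None => println!("{generated}"),
    }
}
```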
-
I can even provide support for extended methods, like forcing the response to be JSON, etc.
-
Thank you for the explanations
-
Continuing from #854. Currently Tabby supports llama.cpp bindings and HTTP bindings to vertex-ai and fastchat. Are there plans to support other bindings, like OpenAI endpoints via HTTP or a similar protocol? Thanks for the responses.
-
Hey @sundaraa-deshaw, #795 (comment) explains why we don't want an OpenAI-like HTTP interface.
-
Thanks. I was wondering if we could have a binding to the exllama[v2] inference engine, like how it is done for llama.cpp today?
-
That's possible - the trait is defined in https://github.com/TabbyML/tabby/blob/main/crates/tabby-inference/src/lib.rs. Could you share some of your findings on where exllama has an advantage over llama.cpp?
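For a rough sense of what a new backend binding involves (illustrative only; the real trait and its exact signature live in the lib.rs file linked above, and this sketch assumes the async-trait crate): implement the inference trait for an engine type and forward generation calls to it.

```rust
use async_trait::async_trait;

// Illustrative names only; check crates/tabby-inference/src/lib.rs for the
// actual trait and signature used by Tabby.
#[async_trait]
pub trait TextGeneration: Send + Sync {
    async fn generate(&self, prompt: &str, max_decoding_length: usize) -> String;
}

// Hypothetical backend wrapping an exllama(v2) engine, e.g. over FFI or a subprocess.
pub struct ExLlamaEngine;

#[async_trait]
impl TextGeneration for ExLlamaEngine {
    async fn generate(&self, prompt: &str, max_decoding_length: usize) -> String {
        // Forward the prompt to exllama and collect decoded text until a stop
        // condition or `max_decoding_length` is reached (omitted in this sketch).
        let _ = (prompt, max_decoding_length);
        String::new()
    }
}
```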
-
Thanks, are there plans to add such a binding? exllama turned out to be good for inference on GPU, compared to llama.cpp on CPU. The memory usage for a GPTQ-quantized model was 2-3x less than running the non-quantized model (Llama 13B) on llama.cpp on GPU.
-
Since Tabby seems to support Fastchat, would it be possible to support Ollama HTTP bindings? They have a decent list of integrations already. Ollama is also using llama.cpp under the hood.
-
Fastchat isn't supported; it's a part of the exploration mentioned in an earlier reply and was eventually abandoned due to the reasons discussed above (lack of control during decoding).
-
@wsxiaoys Thanks, and sorry. I was misled by the fastchat.rs file in the repo and thought it meant Fastchat was supported somehow.
-
No problem - it's not compiled into Tabby by default (it's behind a feature flag) and is left as a reference.
-
Hi, I followed the discussion but could not quite figure out what it means for me. I have Codellama running in the cloud and want to connect Tabby to it. Is there a way to do so, or do I have to use the Tabby server with a local GPU/CPU?
-
Hey @MehrCurry, the short answer is no. Tabby comes with its own inference stack. You could deploy Tabby onto a cloud GPU (we have several tutorials on this, e.g. https://tabby.tabbyml.com/docs/installation/hugging-face/).
-
@wsxiaoys it is doable to modify the stop words for each model file in Ollama. I'm only just learning about stop words now, and I only have a surface-level understanding of the TabbyML inference stack, so I'm not suggesting that the Ollama configuration is feature-complete enough to plug into the TabbyML stack. But it might be?
-
It looks like it's also possible to modify stop words on the fly using the Ollama API rather than just the model files.
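For concreteness, a small sketch of what that looks like (hypothetical client code; it assumes the reqwest crate with the blocking and json features plus serde_json, and an Ollama server on its default port): Ollama's /api/generate endpoint accepts per-request options, including a stop list, so the stop words don't have to be baked into the Modelfile.

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Model name and stop words are made up for the example.
    let body = json!({
        "model": "codellama:7b-code",
        "prompt": "def fib(n):",
        "stream": false,
        "options": {
            "temperature": 0.1,
            "stop": ["\n\ndef ", "\n\nclass ", "<EOT>"]
        }
    });

    // One-shot (non-streaming) completion request against a local Ollama server.
    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:11434/api/generate")
        .json(&body)
        .send()?
        .json()?;

    println!("{}", resp["response"]);
    Ok(())
}
```

Whether per-request stop lists are enough for Tabby's decoding-level needs is exactly the open question in this thread.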
-
It also looks like GBNF grammar support is in the works in Ollama. Are there any other dealbreakers beyond grammar/stop-word support?
-
If I understand correctly, Ollama is essentially just a wrapper around llama.cpp's server API, which, in turn, relies on llama.cpp's stop-words implementation. As far as I know, it operates in O(N) time complexity, where N equals the number of stop words. Feel free to give it a try and see how decoding performs with a stop list of approximately 20 entries. (Hint: it will be slow, as is any implementation that supports a dynamic stop-word list.)
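To illustrate roughly where the cost comes from (a simplified sketch, not the actual llama.cpp or Ollama code): with a dynamic stop list, every decoded token triggers a scan that checks each stop word both for a full match and for a partial match at the end of the output (so streaming can hold back text that might still become a stop sequence), so the per-step work grows with the number and length of the stop words.

```rust
// Simplified sketch of a per-step dynamic stop-word scan; names and logic are
// illustrative, not the llama.cpp server's actual implementation.
fn stop_scan(output: &str, stop_words: &[&str]) -> (Option<usize>, usize) {
    let mut full_match_at = None; // earliest byte offset of a completed stop word
    let mut held_back = 0;        // bytes to withhold from streaming output
    for sw in stop_words {
        if let Some(pos) = output.find(sw) {
            full_match_at = Some(full_match_at.map_or(pos, |p: usize| p.min(pos)));
        }
        // Longest prefix of `sw` that is a suffix of `output` (a partial match
        // that later tokens might still complete).
        for k in (1..=sw.len().min(output.len())).rev() {
            if output.ends_with(&sw[..k]) {
                held_back = held_back.max(k);
                break;
            }
        }
    }
    (full_match_at, held_back)
}

fn main() {
    let stop_words = ["\n\nfn ", "\n\nimpl ", "<|endoftext|>"];
    let output = "fn square(x: i64) -> i64 { x * x }\n";
    let (stop, hold) = stop_scan(output, &stop_words);
    println!("full match: {stop:?}, bytes held back: {hold}");
}
```

An automaton-based matcher (like the aho-corasick sketch earlier in the thread) avoids the per-stop-word factor, but that kind of control sits below what an OpenAI-style HTTP interface exposes.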
-
TabbyML supports vLLM already: https://tabby.tabbyml.com/docs/references/models-http-api/vllm/
-
Hi,
I currently use vLLM for other services, and I am deeply interested in connecting your extension with a vLLM server. Do you think it's possible? I'm using the OpenAI API format, so if there is a possibility to connect the extension to any server using the OpenAI API, that would be great.
Thanks for the awesome work!
Please reply with a 👍 if you want this feature.