
feat(vLLM): support async generator #746

Open · wants to merge 2 commits into dev

Conversation

@fengyizhu commented Sep 9, 2024:

I have provided a draft version of the vLLM integration, with the following key changes:

  1. Protocol changes: Aligned with OpenAI /v1/audio/speech.
  2. Removed the custom inference part, keeping the inference logic consistent between streaming and non-streaming.
  3. Addressed some potential inconsistencies caused by streaming inference.

These changes may have a significant impact. Feel free to leave comments to guide me in further improvements.
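
For orientation, a request to an endpoint aligned with OpenAI's /v1/audio/speech might look like the sketch below. The base URL, model name, and voice value are placeholders, and which of these fields the example server actually honors is an assumption, not something confirmed by this PR.

```python
# Hypothetical client call against an /v1/audio/speech-compatible endpoint.
# The URL, model name, and voice are placeholders; the field names follow the
# OpenAI spec and may differ from what examples/api actually implements.
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",   # placeholder local server
    json={
        "model": "chattts",                    # placeholder model name
        "input": "Hello from the streaming TTS API.",
        "voice": "default",                    # placeholder voice / speaker id
        "response_format": "wav",
    },
    stream=True,  # read the body incrementally as chunks arrive
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    for chunk in resp.iter_content(chunk_size=4096):
        f.write(chunk)
```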

github-actions bot changed the base branch from main to dev on Sep 9, 2024 02:09
@fumiama (Member) left a comment:

  • Please note that vLLM cannot and will never replace the original infer code.
  • You should open a new PR for the changes to examples/api rather than merging them together.

Resolved review threads: ChatTTS/model/gpt.py (2, outdated), ChatTTS/core.py (1).
@fumiama added the enhancement (New feature or request), algorithm (Algorithm improvements & issues), and performance (Running speed & quality) labels on Sep 9, 2024
@fumiama (Member) left a comment:

Please do not change the default behavior unless you must. If you have a reason for changing the behavior, please explain it.

Note: you shouldn't refer to the content of the vLLM code and port it into main, because the vLLM code was contributed by the community and hasn't been fully checked.

Resolved review threads: ChatTTS/core.py (7, outdated).
@IrisSally (Contributor) commented:

This version is incredibly powerful. I tested it on a 4090: the streaming API returns the first chunk in 65 ms, and the voice timbre can be fixed. It's so fast, so fast, that I suspected my computer was broken.

@ZaymeShaw (Contributor) commented:

Thanks for sharing. Can this version keep the same voice timbre as the original? This was mentioned in issue #640.

@fengyizhu (Author) commented Sep 9, 2024:

> Thanks for sharing. Can this version keep the same voice timbre as the original? This was mentioned in issue #640.

It is fully compatible in principle.

@LLongIsland commented:

> This version is incredibly powerful. I tested it on a 4090: the streaming API returns the first chunk in 65 ms, and the voice timbre can be fixed. It's so fast, so fast, that I suspected my computer was broken.

On a 4090, text that takes 3.3 s with compile=True takes 1.6 s with the main branch's vLLM acceleration, but the timbre can't be adjusted. With this PR's version the timbre can be adjusted, but it takes about 2.6 s.

@fumiama (Member) left a comment:

Let me say it again: your modification to examples/api concerns a different feature, so you SHOULD move those changes to ANOTHER PR. Thanks for your understanding.

ChatTTS/core.py (Outdated)

@@ -272,7 +271,7 @@ def _load(
     vq_config=asdict(self.config.dvae.vq),
     dim=self.config.dvae.decoder.idim,
     coef=coef,
-    device=device,
+    device=self.device,

@fumiama (Member): No need to change it.

ChatTTS/core.py (Outdated)

@@ -289,8 +288,8 @@ def _load(
     self.config.embed.num_text_tokens,
     self.config.embed.num_vq,
 )
-embed.from_pretrained(embed_path, device=device)
-self.embed = embed.to(device)
+embed.from_pretrained(embed_path, device=self.device)

@fumiama (Member): No need to change it.

ChatTTS/core.py (Outdated)

-embed.from_pretrained(embed_path, device=device)
-self.embed = embed.to(device)
+embed.from_pretrained(embed_path, device=self.device)
+self.embed = embed.to(self.device)

@fumiama (Member): No need to change it.

# Filter both rows and columns using slicing
yield new_wavs[:][:, keep_cols]
# Hack: check whether there are any silent segments; if so, take the last segment; otherwise, wait for another loop.
import librosa
@fumiama (Member):

We don't want to introduce librosa. Modifying the original code makes sense; e.g., replace

keep_cols = np.sum(new_wavs != 0, axis=0) > 0

with

# pseudo code without testing, just a hint
keep_cols = np.sum(abs(new_wavs) > 1e-6, axis=0) > 0
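
For illustration, here is a small self-contained comparison of the exact-zero check and the suggested threshold check; the new_wavs array is synthetic and merely stands in for a batch of waveform chunks:

```python
import numpy as np

# Synthetic batch of two waveform chunks; the third column is tiny numerical
# noise rather than exact zeros, as float audio often is after processing.
new_wavs = np.array([
    [0.10, 0.02,  1e-8, 0.0],
    [0.05, 0.00, -1e-9, 0.0],
])

# Original check: a column is kept if any sample is exactly non-zero,
# so columns containing only numerical noise survive.
keep_exact = np.sum(new_wavs != 0, axis=0) > 0             # [ True  True  True False]

# Suggested check: treat anything below a small threshold as silence.
keep_thresh = np.sum(np.abs(new_wavs) > 1e-6, axis=0) > 0  # [ True  True False False]

print(new_wavs[:, keep_thresh])  # keeps only the genuinely non-silent columns
```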

ChatTTS/core.py (Outdated)

None,
sample_params,
input_ids,
results_generator = gpt.llm.llm_engine.generate(

@fumiama (Member): It's not ideal to call into the inner method; you should modify the outer one, gpt.llm.generate.

),
]

emb = self.embed(input_ids, text_mask)

@fumiama (Member): If you move a line up, don't change the variable name when keeping the original one is fine; renaming makes it harder to see your changes.

ChatTTS/core.py (Outdated)

attentions=[],
)
else:
results_generator = gpt.generate(

@fumiama (Member): Same.

@fumiama (Member) left a comment:

I noticed that you had some trouble moving your modifications of examples/api and tools/audio into another PR 😂. Here are brief instructions:

  1. Add a new commit to this PR, removing examples/api/main.py and tools/audio/np.py.
  2. Create a new branch based on the dev branch of THIS MAIN REPO.
  3. Apply your modifications of examples/api/main.py and tools/audio/np.py on the new branch.
  4. Open a new PR from that branch to dev.

@fumiama changed the title from "feat:support async vllm generator" to "feat(vLLM): support async generator" on Sep 12, 2024
@fumiama mentioned this pull request on Sep 15, 2024
@niuzheng168 (Contributor) commented Sep 19, 2024:

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM parts in vLLM itself.
Supporting the model in vLLM would enable such features automatically, like streaming generation and continuous batching.

Also pinging @fumiama for this question.

@fumiama (Member) commented Sep 19, 2024:

> I am curious why we chose to move the vLLM code into this project instead of supporting the LLM parts in vLLM itself.
>
> Supporting the model in vLLM would enable such features automatically, like streaming generation and continuous batching.
>
> Also pinging @fumiama for this question.

This part of the code was contributed by the community; you can ask @ylzz1997 about that. I'm sorry, but I'm not familiar with vLLM 😂.

@ylzz1997 (Contributor) commented:

> I am curious why we chose to move the vLLM code into this project instead of supporting the LLM parts in vLLM itself. Supporting the model in vLLM would enable such features automatically, like streaming generation and continuous batching.
>
> Also pinging @fumiama for this question.

Because vLLM doesn't support some features:

  1. a custom lm-head
  2. a multi-codebook sampler (custom sampler)
  3. sampling without a tokenizer

In order for the sampler to support vLLM features such as continuous batching and streaming generation, the post-model parts (the lm_head and others) have to be packaged into the vLLM scheduler. This requires rewriting the vLLM part of the code.

That's all.
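
To make this concrete, below is a minimal, illustrative PyTorch sketch of a multi-codebook head with a per-codebook sampler. The class name, dimensions, and sampling strategy are assumptions for illustration rather than the project's actual implementation, but they show the kind of post-model stage that has to be packaged into the engine if continuous batching and streaming generation are to work.

```python
# Illustrative only: one shared hidden state, several lm-heads, one sampled
# token per codebook per step. Names and dimensions are placeholders.
import torch
import torch.nn as nn

class MultiCodebookHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_vq: int = 4):
        super().__init__()
        # one independent lm-head per codebook
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(num_vq)
        )

    @torch.no_grad()
    def sample(self, hidden: torch.Tensor, temperature: float = 0.7) -> torch.Tensor:
        # hidden: (batch, hidden_size) for the current decoding step
        tokens = []
        for head in self.heads:
            probs = torch.softmax(head(hidden) / temperature, dim=-1)
            tokens.append(torch.multinomial(probs, num_samples=1))
        # (batch, num_vq): one audio token per codebook per step
        return torch.cat(tokens, dim=-1)

# usage (shapes are illustrative):
# head = MultiCodebookHead(hidden_size=768, vocab_size=626)
# next_tokens = head.sample(backbone_hidden_state)   # -> (batch, 4)
```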

Labels: algorithm (Algorithm improvements & issues), enhancement (New feature or request), good first issue (Good for newcomers), performance (Running speed & quality)
7 participants