
[batch infer] Update batch inference template to use RayLLMBatch #346

Merged
merged 7 commits into main on Oct 22, 2024

Conversation

@rickyyx (Contributor) commented Sep 24, 2024

Update the current batch LLM inference template to use RayLLM-Batch.

@rickyyx (Contributor, Author) commented Sep 24, 2024

DON'T MERGE, and don't delete the branch.

```
    prompt_for_hugging_face_token,
)
```
RayLLM-Batch is a library for running batch inference for LLMs. It uses Ray Data for data processing and defines an easy, flexible interface for users to define their own workloads. In this tutorial, we implement a workload based on the [`CNNDailyMail`](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset, a collection of news articles, and summarize each article with our batch inference pipeline. We cover how to customize the workload in later sections.

Contributor:

same comment
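
For context on the workload described in the quoted tutorial text above, a minimal sketch of pulling the CNNDailyMail articles into Ray Data might look like the following. This is illustrative only, not the template's actual code; the `3.0.0` config name and the `article` column are taken from the Hugging Face dataset card.

```
import ray
from datasets import load_dataset

# Load the news articles from Hugging Face and hand them to Ray Data, which
# RayLLM-Batch uses for data processing.
hf_ds = load_dataset("abisee/cnn_dailymail", "3.0.0", split="test")
ds = ray.data.from_huggingface(hf_ds)

# Keep only the text to summarize; the batch inference pipeline would turn
# each article into a summarization prompt.
ds = ds.select_columns(["article"])
print(ds.take(1))
```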

@scottsun94 (Contributor) commented:

Re: the compute config change, the previous config seems to have had different variations of g5 and L4 instance types. I guess that helps with finding GPUs. I'm not sure whether this change will make it harder to find GPUs and slower to start the cluster.

@scottsun94 (Contributor) commented:

@shomilj to review the compute config.

@angelinalg if you want to take a look at the content.

Comment on lines +2 to +4:

```
name: head
# TODO(ricky): We need head node to have CUDA due to eager import from rayllm_batch now.
instance_type: g5.xlarge
```

Contributor:

We generally don't want to encourage the pattern of using GPU head nodes, since it leads to people running code on the head node that takes the workload down due to OOM (an antipattern).

Can you use a CPU head node instead, with scheduling disabled (see the basic serverless config that existing templates use), and wrap your code in an actor or something that runs on the workers?
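
For reference, a minimal sketch of that actor-wrapping pattern (illustrative only, not this PR's code; the `BatchInferenceRunner` name is made up): requesting a GPU in the actor's resource spec makes Ray place it on a GPU worker, so CUDA-dependent imports and inference never run on a CPU-only head node.

```
import ray

ray.init()

# Requesting num_gpus=1 forces Ray to schedule this actor on a GPU worker
# node, never on a CPU-only head node.
@ray.remote(num_gpus=1)
class BatchInferenceRunner:
    def run(self) -> str:
        # Heavy, CUDA-dependent imports and the actual inference code would
        # run here, inside the actor process on the worker. As a stand-in,
        # just report which GPU the actor was assigned.
        return f"running on GPU ids {ray.get_gpu_ids()}"

runner = BatchInferenceRunner.remote()
print(ray.get(runner.run.remote()))
```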

rickyyx (Author):

This makes sense. Instead of temporarily fixing this by wrapping the code in an actor, I think we will fix it by addressing the root cause soon, which is to make this code runnable on CPU itself (it should be; we're just not lazy-importing vLLM as of now).
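
A minimal sketch of that lazy-import fix (illustrative; `build_engine` and its argument are hypothetical, not rayllm_batch's actual API): the module stays importable on a CPU-only head node because the CUDA-dependent vLLM import only happens when an engine is actually constructed.

```
# No module-level `import vllm`, so importing this module (e.g. on the head
# node while building configs) does not require CUDA.

def build_engine(model_id: str):
    # Deferred on purpose: only the worker process that constructs the
    # engine needs vLLM (and CUDA) to be importable.
    from vllm import LLM

    return LLM(model=model_id)
```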

Co-authored-by: Huaiwei Sun <scottsun94@gmail.com>

For a Python script version of the code in this workspace template, refer to `main.py`.
<!-- TODO: add a link for the RayLLM-Batch API reference -->
This template shows you how to run batch inference for LLMs using RayLLM-Batch.

Contributor:

Will you be including/updating main.py in this PR as well?

rickyyx (Author):

Since we don't want to run `anyscale submit job` for now (the log outputs just aren't handled properly yet), I didn't add it.

I was actually planning to remove main.py, good catch. I'm open to keeping main.py as well; no strong preference.

Contributor:

Sounds good to remove it. Yeah, the main point of main.py was to run as a job, and also to allow the user to run the notebook code as one continuous script.

```
}
from rayllm_batch import init_engine_from_config

# Read the model configs from the path.
model_config_path = "configs/llama-3.1-8b-a10g.yaml"
```

Contributor:

Should we add code to auto-detect whether the user is on AWS or GCP, and choose the right config accordingly (e.g. the same logic as in `_on_gcp_cloud()`)?

rickyyx (Author):

Oh nice, that would actually be great. Let me do that.
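
A minimal sketch of what that auto-detection could look like (illustrative only: probing the GCE metadata server is one common way to detect GCP, and the L4 config filename is a hypothetical stand-in, not a file from this PR):

```
import requests


def _on_gcp_cloud(timeout_s: float = 0.5) -> bool:
    """Best-effort GCP detection by probing the GCE metadata server."""
    try:
        resp = requests.get(
            "http://metadata.google.internal/computeMetadata/v1/",
            headers={"Metadata-Flavor": "Google"},
            timeout=timeout_s,
        )
        return resp.ok
    except requests.RequestException:
        return False


# Pick the model config matching the accelerators available on this cloud.
# The GCP/L4 filename below is hypothetical.
model_config_path = (
    "configs/llama-3.1-8b-l4.yaml" if _on_gcp_cloud() else "configs/llama-3.1-8b-a10g.yaml"
)
```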

rickyyx and others added 2 commits on September 25, 2024:

Addressed comments and fixed GCP compute not found.

Co-authored-by: rickyx <rickyx@anysale.com>

@comaniac (Contributor) left a comment:

LGTM. We will have a follow-up to support a CPU head node by lazily importing vLLM.

@rickyyx merged commit 7aec451 into main on Oct 22, 2024. 2 checks passed.

@rickyyx (Contributor, Author) commented Oct 22, 2024

Don't delete the branch for now.
