[batch infer] Update batch inference template to use RayLLMBatch #346
Conversation
DON'T MERGE, and don't delete the branch.
templates/batch-llm/README.md
Outdated
```python
prompt_for_hugging_face_token,
)
```
RayLLM-Batch is a library for running batch inference for LLMs. It uses Ray Data for data processing and defines an easy, flexible interface for users to define their own workloads. In this tutorial, we will implement a workload based on the [`CNNDailyMail`](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset, a collection of news articles, and summarize each article with our batch inference pipeline. We will cover how to customize the workload in later sections.
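To make the excerpt above concrete, here is a rough, self-contained sketch of the same summarization idea written directly against Ray Data and vLLM rather than the RayLLM-Batch API; the model name, prompt, and tuning values are assumptions, not what the template ships.

```python
import numpy as np
import ray
from datasets import load_dataset
from vllm import LLM, SamplingParams

# Load a slice of CNN/DailyMail and turn it into a Ray Dataset.
hf_ds = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train[:1000]")
ds = ray.data.from_huggingface(hf_ds)

class Summarizer:
    def __init__(self):
        # One vLLM engine per actor replica; the model name is an assumption.
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
        self.params = SamplingParams(temperature=0.0, max_tokens=128)

    def __call__(self, batch: dict) -> dict:
        prompts = [
            f"Summarize the following article in a few sentences:\n\n{article}"
            for article in batch["article"]
        ]
        outputs = self.llm.generate(prompts, self.params)
        batch["summary"] = np.array([o.outputs[0].text for o in outputs])
        return batch

# Run one Summarizer replica per GPU worker over the dataset.
summaries = ds.map_batches(Summarizer, batch_size=32, num_gpus=1, concurrency=2)
summaries.show(3)
```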
same comment
Re: the compute config change, the previous config seems to have different variations of g5 and L4 instance types. I guess that helps with finding GPUs. Not sure if this change will make it harder to find a GPU and slower to start the cluster.
@shomilj to review the compute config. @angelinalg if you want to take a look at the content.
```yaml
name: head
# TODO(ricky): We need the head node to have CUDA due to the eager import from rayllm_batch for now.
instance_type: g5.xlarge
```
We generally don't want to encourage the pattern of using GPU head nodes, since it leads to people running code on head nodes that takes down the workload due to OOM (an antipattern).
Can you use a CPU head node instead, with scheduling disabled (see the basic serverless config that existing templates use), and wrap your code in an actor or something that runs on workers?
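For example, something along these lines (a minimal sketch of the actor pattern, not the template's actual code; the resource request and placeholder body are assumptions):

```python
import ray

# Requesting a GPU forces Ray to place this actor on a worker node, so the
# CUDA-dependent imports and the batch inference itself never run on a
# CPU-only head node that has scheduling disabled.
@ray.remote(num_gpus=1)
class BatchInferenceRunner:
    def run(self, model_config_path: str) -> None:
        # Import rayllm_batch lazily, inside the worker process.
        from rayllm_batch import init_engine_from_config  # noqa: F401
        ...  # build the workload and launch batch inference here

runner = BatchInferenceRunner.remote()
ray.get(runner.run.remote("configs/llama-3.1-8b-a10g.yaml"))
```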
This makes sense. Instead of temporarily fixing this by wrapping the code in an actor, I think we will fix it by addressing the root cause soon, which is to make this code runnable on CPU itself (which it should be; we are just not lazily importing vLLM as of now).
Co-authored-by: Huaiwei Sun <scottsun94@gmail.com>
For a Python script version of the code in this workspace template, refer to `main.py`.
<!-- TODO: add a link for the RayLLM-Batch API reference -->
This template shows you how to run batch inference for LLMs using RayLLM-Batch.
will you be including/updating `main.py` in this PR as well?
Since we don't want to run this as an Anyscale job submission for now (the log outputs are just not properly handled yet), I didn't add it.
I was actually planning to remove `main.py` - good catch. I am open to keeping `main.py` as well, no strong preference.
Sounds good to remove; yeah, the main point of `main.py` was to run as a job, and also to allow the user to run the notebook code as one continuous script.
templates/batch-llm/README.md
Outdated
```python
}
from rayllm_batch import init_engine_from_config
# Read the model configs from the path.
model_config_path = "configs/llama-3.1-8b-a10g.yaml"
```
Should we add code to auto-detect whether the user is on AWS or GCP and choose the right config accordingly? (e.g. the same logic as in `_on_gcp_cloud()`)
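Something along these lines could work (a rough sketch only; the detection approach and the GCP config filename are assumptions, not the existing `_on_gcp_cloud()` helper):

```python
import requests

def _on_gcp_cloud() -> bool:
    """Best-effort GCP detection via the instance metadata server."""
    try:
        resp = requests.get(
            "http://metadata.google.internal/computeMetadata/v1/instance/id",
            headers={"Metadata-Flavor": "Google"},
            timeout=1,
        )
        return resp.ok
    except requests.RequestException:
        return False

# Pick the model config that matches the accelerators available on this cloud.
model_config_path = (
    "configs/llama-3.1-8b-l4.yaml"         # hypothetical GCP (L4) config name
    if _on_gcp_cloud()
    else "configs/llama-3.1-8b-a10g.yaml"  # AWS (A10G) config from the template
)
```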
oh nice - that would be great actually. Let me do that.
Addressed comments and fixed the GCP compute config not being found.
Co-authored-by: rickyx <rickyx@anysale.com>
LGTM. We will have a follow-up to support a CPU head node by lazily importing vLLM.
Don't delete the branch for now.
Update the current batch LLM inference template to use RayLLM-Batch.