[batch infer] Update batch inference template to use RayLLMBatch #346

Merged · 7 commits · Oct 22, 2024

Changes from 5 commits
86 changes: 3 additions & 83 deletions configs/batch-llm/aws.yaml
@@ -1,84 +1,4 @@
 head_node_type:
-  name: head-node
-  instance_type: m5.2xlarge
-  resources:
-    cpu: 0
-worker_node_types:
-- name: worker-g5-xlarge-nvidia-a10-1
-  instance_type: g5.xlarge
-  resources:
-    custom_resources:
-      "accelerator_type:A10G": 1
-  min_workers: 0
-  max_workers: 4
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g5-2xlarge-nvidia-a10-1
-  instance_type: g5.2xlarge
-  resources:
-    custom_resources:
-      "accelerator_type:A10G": 1
-  min_workers: 0
-  max_workers: 4
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g5-4xlarge-nvidia-a10-1
-  instance_type: g5.4xlarge
-  resources:
-    custom_resources:
-      "accelerator_type:A10G": 1
-  min_workers: 0
-  max_workers: 4
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g5-8xlarge-nvidia-a10-1
-  instance_type: g5.8xlarge
-  resources:
-    custom_resources:
-      "accelerator_type:A10G": 1
-  min_workers: 0
-  max_workers: 4
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g5-12xlarge-nvidia-a10-4
-  instance_type: g5.12xlarge
-  resources:
-    custom_resources:
-      "accelerator_type:A10G": 1
-  min_workers: 0
-  max_workers: 1
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g5-16xlarge-nvidia-a10-1
-  instance_type: g5.16xlarge
-  resources:
-    custom_resources:
-      "accelerator_type:A10G": 1
-  min_workers: 0
-  max_workers: 4
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g5-24xlarge-nvidia-a10-4
-  instance_type: g5.24xlarge
-  resources:
-    custom_resources:
-      "accelerator_type:A10G": 1
-  min_workers: 0
-  max_workers: 1
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g5-48xlarge-nvidia-a10-8
-  instance_type: g5.48xlarge
-  resources:
-    custom_resources:
-      "accelerator_type:A10G": 1
-  min_workers: 0
-  max_workers: 1
-  use_spot: true
-  fallback_to_ondemand: true
-aws:
-  TagSpecifications:
-  - ResourceType: instance
-    Tags:
-    - Key: as-feature-multi-zone
-      Value: "true"
+  name: head
+  # TODO(ricky): We need head node to have CUDA due to eager import from rayllm_batch now.
+  instance_type: g5.xlarge
Comment on lines +2 to +4
Contributor
We generally don't want to encourage the pattern of using GPU head nodes, since it leads people to run code on the head node that takes down the workload due to OOM (an antipattern).

Can you use a CPU head node instead, with scheduling disabled (see the basic serverless config that existing templates use), and wrap your code in an actor or something similar that runs on the workers?
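
A minimal sketch of the pattern being suggested, assuming a cluster with a CPU-only head node and GPU workers; the function name, model, and prompts below are illustrative and not code from this PR:

```python
import ray

ray.init()

# GPU-dependent work runs inside a remote task that Ray schedules onto a GPU
# worker node, so the driver process on the CPU-only head node never imports
# CUDA-dependent packages.
@ray.remote(num_gpus=1)
def run_batch_inference(prompts: list[str]) -> list[str]:
    from vllm import LLM  # imported lazily, only on the GPU worker
    llm = LLM(model="facebook/opt-125m")  # placeholder model for illustration
    outputs = llm.generate(prompts)
    return [o.outputs[0].text for o in outputs]

results = ray.get(run_batch_inference.remote(["What is Ray?", "What is vLLM?"]))
print(results)
```

With this shape the head node only coordinates; a `cpu: 0` resource override on the head node type (as in the deleted configs above) would additionally keep Ray from scheduling any tasks there.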

Contributor Author
This makes sense. Instead of temporarily fixing this by wrapping the code in an actor, I think we will address the root cause soon, which is to make this code runnable on CPU itself (as it should be; we just aren't lazily importing vllm as of now).
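
A rough sketch of what the lazy-import fix being described might look like; the function name and module layout are assumptions, not the actual rayllm_batch code:

```python
# Hypothetical helper inside the batch-inference code path.

def build_engine(model: str):
    # Deferring the vllm import to call time means the module itself can be
    # imported on a CPU-only head node; only the code paths that actually
    # build an engine (and therefore run on GPU workers) need CUDA.
    from vllm import LLM  # lazy import instead of a module-level one
    return LLM(model=model)
```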

57 changes: 3 additions & 54 deletions configs/batch-llm/gcp.yaml
@@ -1,55 +1,4 @@
 head_node_type:
-  name: head-node
-  instance_type: n2-standard-8
-  resources:
-    cpu: 0
-worker_node_types:
-- name: worker-g2standard4-nvidia-l4-1
-  instance_type: g2-standard-4-nvidia-l4-1
-  resources:
-    custom_resources:
-      "accelerator_type:L4": 1
-  min_workers: 0
-  max_workers: 50
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g2standard8-nvidia-l4-1
-  instance_type: g2-standard-8-nvidia-l4-1
-  resources:
-    custom_resources:
-      "accelerator_type:L4": 1
-  min_workers: 0
-  max_workers: 50
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g2standard12-nvidia-l4-1
-  instance_type: g2-standard-12-nvidia-l4-1
-  resources:
-    custom_resources:
-      "accelerator_type:L4": 1
-  min_workers: 0
-  max_workers: 50
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g2standard16-nvidia-l4-1
-  instance_type: g2-standard-16-nvidia-l4-1
-  resources:
-    custom_resources:
-      "accelerator_type:L4": 1
-  min_workers: 0
-  max_workers: 50
-  use_spot: true
-  fallback_to_ondemand: true
-- name: worker-g2standard24-nvidia-l4-2
-  instance_type: g2-standard-24-nvidia-l4-2
-  resources:
-    custom_resources:
-      "accelerator_type:L4": 1
-  min_workers: 0
-  max_workers: 25
-  use_spot: true
-  fallback_to_ondemand: true
-gcp_advanced_configurations_json:
-  instance_properties:
-    labels:
-      as-feature-multi-zone: "true"
+  name: head
+  # TODO(ricky): We need head node to have CUDA due to eager import from rayllm_batch now.
+  instance_type: g2-standard-4-nvidia-l4-1