Commit
Merge remote-tracking branch 'origin' into scottjlee/text-embeddings
scottjlee committed Feb 21, 2024
2 parents e3b375e + 5768dfd commit 6a67e6a
Showing 93 changed files with 1,576 additions and 305 deletions.
14 changes: 14 additions & 0 deletions .github/workflows/pre-commit.yaml
@@ -0,0 +1,14 @@
name: pre-commit

on:
  pull_request:
  push:
    branches: [main]

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v3
      - uses: pre-commit/action@v3.0.1
20 changes: 20 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,20 @@
default_stages: [commit, push]
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
      - id: check-added-large-files
      - id: trailing-whitespace
        # README might be auto-generated
        exclude: templates/.+/README.md
      - id: end-of-file-fixer
        # README might be auto-generated
        exclude: templates/.+/README.md
  - repo: local
    hooks:
      - id: generate-readme
        name: Auto generate README.md from README.ipynb
        entry: ci/auto-generate-readme.sh
        language: script
        pass_filenames: false
        dependencies: [jupyter]
8 changes: 5 additions & 3 deletions README.md
@@ -6,6 +6,10 @@ These templates are a set of minimal examples & tutorials for customers to run o

If the template is generic to Ray & Ray libraries, please consider adding the template in: https://github.com/ray-project/ray/tree/master/doc/source/templates

To set up the environment:
1. Install pre-commit: `pip install pre-commit`
2. Install the git hook scripts: `pre-commit install`


To add a template:

@@ -23,10 +27,8 @@ To add a template:
Your template does not need to be a Jupyter notebook. It can also be presented as a
Python script.

All templates MUST have a `README.md` file.
All templates MUST have a `README.md` or a `README.ipynb` file.

2. Add your compute configuration under `configs/<your-template-name>` (for both AWS and GCE).

3. Update the product repo `backend/workspace-templates.yaml` to point to the new template added here after being merged.

4. Coming soon: update the integration tests in the product repo.
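
For step 2, a minimal compute config can simply mirror the existing entries under `configs/` — for example, the shape used by `configs/intro-workspaces/aws.yaml` in this commit (instance types and workers are up to the template):

```yaml
# Minimal sketch of configs/<your-template-name>/aws.yaml
head_node_type:
  name: head_node_type
  instance_type: m5.xlarge
worker_node_types: []
```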
31 changes: 31 additions & 0 deletions ci/auto-generate-readme.sh
@@ -0,0 +1,31 @@
#!/bin/bash

# Search for README.ipynb notebook files in the ../templates directory
notebook_files=$(find ../templates -name "README.ipynb")

# Loop through each notebook file
for notebook_file in $notebook_files; do
    # Convert the notebook to README.md in the same directory using nbconvert
    jupyter nbconvert --to markdown "$notebook_file" --output-dir "$(dirname "$notebook_file")"
done

# Define the repo prefix
REPO_PREFIX="https://raw.githubusercontent.com/anyscale/templates/main"

# Search for README.md files in the ../templates directory
readme_files=$(find ../templates -name "README.md")

# Loop through each README file
for readme_file in $readme_files; do
    # Extract the path of the directory containing the README file, relative to the repository root
    readme_dir=$(dirname "$readme_file" | sed "s|\.\./templates/||")

    # Check the operating system
    if [[ "$OSTYPE" == "darwin"* ]]; then
        # macOS (BSD sed requires an explicit backup-suffix argument)
        sed -i '' "s|<img src=\"\([^\"http://][^\":/][^\"].*\)\"|<img src=\"${REPO_PREFIX}/${readme_dir}/\1\"|g" "$readme_file"
    else
        # Assuming Linux (GNU sed)
        sed -i "s|<img src=\"\([^\"http://][^\":/][^\"].*\)\"|<img src=\"${REPO_PREFIX}/${readme_dir}/\1\"|g" "$readme_file"
    fi
done
2 changes: 1 addition & 1 deletion configs/anyscale-ray-101/aws.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: m5.2xlarge
min_workers: 0
max_workers: 2
use_spot: false
use_spot: false
2 changes: 1 addition & 1 deletion configs/anyscale-ray-101/gce.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: n2-standard-8
min_workers: 0
max_workers: 2
use_spot: false
use_spot: false
8 changes: 4 additions & 4 deletions configs/endpoints/gcp.yaml
@@ -87,7 +87,7 @@ worker_node_types:
max_workers: 100
use_spot: true
fallback_to_ondemand: true
- name: gpu-worker-a100-40g-1
- name: gpu-worker-a100-40g-1
instance_type: a2-highgpu-1g-nvidia-a100-40gb-1
resources:
cpu:
@@ -98,7 +98,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-2
- name: gpu-worker-a100-40g-2
instance_type: a2-highgpu-2g-nvidia-a100-40gb-2
resources:
cpu:
@@ -109,7 +109,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-4
- name: gpu-worker-a100-40g-4
instance_type: a2-highgpu-4g-nvidia-a100-40gb-4
resources:
cpu:
@@ -120,7 +120,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-8
- name: gpu-worker-a100-40g-8
instance_type: a2-highgpu-8g-nvidia-a100-40gb-8
resources:
cpu:
8 changes: 4 additions & 4 deletions configs/endpoints_v2/gcp.yaml
@@ -87,7 +87,7 @@ worker_node_types:
max_workers: 100
use_spot: true
fallback_to_ondemand: true
- name: gpu-worker-a100-40g-1
- name: gpu-worker-a100-40g-1
instance_type: a2-highgpu-1g-nvidia-a100-40gb-1
resources:
cpu:
@@ -98,7 +98,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-2
- name: gpu-worker-a100-40g-2
instance_type: a2-highgpu-2g-nvidia-a100-40gb-2
resources:
cpu:
@@ -109,7 +109,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-4
- name: gpu-worker-a100-40g-4
instance_type: a2-highgpu-4g-nvidia-a100-40gb-4
resources:
cpu:
@@ -120,7 +120,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-8
- name: gpu-worker-a100-40g-8
instance_type: a2-highgpu-8g-nvidia-a100-40gb-8
resources:
cpu:
2 changes: 1 addition & 1 deletion configs/fine-tune-GPTJ/aws.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: g4dn.4xlarge
min_workers: 0
max_workers: 15
use_spot: false
use_spot: false
2 changes: 1 addition & 1 deletion configs/fine-tune-GPTJ/gce.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: n1-standard-16-nvidia-t4-16gb-1
min_workers: 1
max_workers: 15
use_spot: false
use_spot: false
2 changes: 1 addition & 1 deletion configs/fine-tune-llama2/aws.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: g5.4xlarge
min_workers: 0
max_workers: 100
use_spot: false
use_spot: false
2 changes: 1 addition & 1 deletion configs/fine-tune-llama2/gce.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: g2-standard-16-nvidia-l4-1
min_workers: 0
max_workers: 100
use_spot: false
use_spot: false
4 changes: 4 additions & 0 deletions configs/fine-tune-llm/aws.yaml
@@ -0,0 +1,4 @@
head_node_type:
  name: head-node-type
  instance_type: m5.xlarge
worker_node_types: []
4 changes: 4 additions & 0 deletions configs/fine-tune-llm/gce.yaml
@@ -0,0 +1,4 @@
head_node_type:
  name: head_node_type
  instance_type: n1-standard-4
worker_node_types: []
1 change: 1 addition & 0 deletions configs/intro-workspaces/aws.yaml
@@ -1,3 +1,4 @@
head_node_type:
  name: head_node_type
  instance_type: m5.xlarge
worker_node_types: []
1 change: 1 addition & 0 deletions configs/intro-workspaces/gce.yaml
@@ -1,3 +1,4 @@
head_node_type:
  name: head_node_type
  instance_type: n2-standard-4
worker_node_types: []
8 changes: 4 additions & 4 deletions configs/serve-stable-diffusion-aica/gcp.yaml
@@ -87,7 +87,7 @@ worker_node_types:
max_workers: 100
use_spot: true
fallback_to_ondemand: true
- name: gpu-worker-a100-40g-1
- name: gpu-worker-a100-40g-1
instance_type: a2-highgpu-1g-nvidia-a100-40gb-1
resources:
cpu:
@@ -98,7 +98,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-2
- name: gpu-worker-a100-40g-2
instance_type: a2-highgpu-2g-nvidia-a100-40gb-2
resources:
cpu:
@@ -109,7 +109,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-4
- name: gpu-worker-a100-40g-4
instance_type: a2-highgpu-4g-nvidia-a100-40gb-4
resources:
cpu:
@@ -120,7 +120,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-8
- name: gpu-worker-a100-40g-8
instance_type: a2-highgpu-8g-nvidia-a100-40gb-8
resources:
cpu:
1 change: 1 addition & 0 deletions configs/serve-stable-diffusion/aws.yaml
@@ -7,3 +7,4 @@ worker_node_types:
min_workers: 0
max_workers: 100
use_spot: false
auto_select_worker_config: true
7 changes: 4 additions & 3 deletions configs/serve-stable-diffusion/gce.yaml
@@ -1,10 +1,11 @@
# n1-standard-8-nvidia-tesla-t4-1 --> 8 CPUs, 1 GPU
# n1-standard-8-nvidia-t4-16gb-1 --> 8 CPUs, 1 GPU
head_node_type:
name: head_node_type
instance_type: n1-standard-8-nvidia-tesla-t4-1
instance_type: n1-standard-8-nvidia-t4-16gb-1
worker_node_types:
- name: gpu_worker
instance_type: n1-standard-8-nvidia-tesla-t4-1
instance_type: n1-standard-8-nvidia-t4-16gb-1
min_workers: 0
max_workers: 100
use_spot: false
auto_select_worker_config: true
12 changes: 6 additions & 6 deletions templates/endpoints/AdvancedModelConfigs.md
@@ -5,10 +5,10 @@ Each model is defined by a YAML configuration file in the `models` directory.
## Modify an existing model

To modify an existing model, simply edit the YAML file for that model.
Each config file consists of three sections:
Each config file consists of three sections:

- `deployment_config`,
- `engine_config`,
- `deployment_config`,
- `engine_config`,
- `scaling_config`.

It's best to check out examples of existing models to see how they are configured.
@@ -24,7 +24,7 @@ and specifies how to [auto-scale the model](https://docs.ray.io/en/latest/serve/
* `max_concurrent_queries` - Maximum number of queries that a Ray Serve replica can process at a time. Additional queries are queued at the proxy.
* `target_num_ongoing_requests_per_replica` - Guides the auto-scaling behavior. If the average number of ongoing requests across replicas is above this number, Ray Serve attempts to scale up the number of replicas, and vice-versa for downscaling. We typically set this to ~40% of the `max_concurrent_queries`.
* `ray_actor_options` - Similar to the `resources_per_worker` configuration in the `scaling_config`. Refer to the `scaling_config` section for more guidance.
* `smoothing_factor` - The multiplicative factor to amplify or moderate each upscaling or downscaling decision. A value less than 1.0 will slow down the scaling decision made in each step. See [advanced auto-scaling guide](https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html#optional-define-how-the-system-reacts-to-changing-traffic) for more details.
* `smoothing_factor` - The multiplicative factor to amplify or moderate each upscaling or downscaling decision. A value less than 1.0 will slow down the scaling decision made in each step. See [advanced auto-scaling guide](https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html#optional-define-how-the-system-reacts-to-changing-traffic) for more details.
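
As a rough illustration (not one of the bundled configs), a `deployment_config` built from the fields above might look like the following; the nesting of the autoscaling fields assumes Ray Serve's `autoscaling_config`, and every value is only an example:

```yaml
deployment_config:
  max_concurrent_queries: 64
  ray_actor_options:
    resources:
      "accelerator_type:A10G": 0.001
  autoscaling_config:
    min_replicas: 1
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 24  # roughly 40% of max_concurrent_queries
    smoothing_factor: 0.6                        # <1.0 dampens each scaling decision
```

Treat these numbers as a starting point and tune them against the per-replica load you actually observe.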

## Engine config

@@ -36,7 +36,7 @@ RayLLM supports continuous batching, meaning incoming requests are processed as

* `model_id` is the ID that refers to the model in the RayLLM or OpenAI API.
* `type` is the type of inference engine. Only `VLLMEngine` is currently supported.
* `engine_kwargs` and `max_total_tokens` are configuration options for the inference engine (e.g. gpu memory utilization, quantization, max number of concurrent sequences). These options may vary depending on the hardware accelerator type and model size. We have tuned the parameters in the configuration files included in RayLLM for you to use as reference.
* `engine_kwargs` and `max_total_tokens` are configuration options for the inference engine (e.g. gpu memory utilization, quantization, max number of concurrent sequences). These options may vary depending on the hardware accelerator type and model size. We have tuned the parameters in the configuration files included in RayLLM for you to use as reference.
* `generation` contains configurations related to default generation parameters such as `prompt_format` and `stopping_sequences`.
* `hf_model_id` is the Hugging Face model ID. If not specified, defaults to `model_id`.
* `runtime_env` is a dictionary that contains Ray runtime environment configuration. It allows you to set per-model pip packages and environment variables. See [Ray documentation on Runtime Environments](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) for more information.
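
A hedged sketch of an `engine_config` assembled from the fields described above; the model ID is made up, and the specific `engine_kwargs` accepted depend on the vLLM version, so treat them as placeholders:

```yaml
engine_config:
  model_id: myorganization/my-model      # ID exposed through the OpenAI-compatible API
  hf_model_id: myorganization/my-model   # optional; defaults to model_id
  type: VLLMEngine
  engine_kwargs:
    gpu_memory_utilization: 0.9          # illustrative vLLM options
    max_num_seqs: 64
  max_total_tokens: 4096
  generation:
    stopping_sequences: []
  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: "<your-token>"   # hypothetical per-model environment variable
```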
@@ -51,7 +51,7 @@ Finally, the `scaling_config` section specifies what resources should be used to
* `num_gpus_per_worker` - Number of GPUs to be allocated per worker. This should always be 1.
* `num_cpus_per_worker` - Number of CPUs to be allocated per worker. Usually set to 8.
* `placement_strategy` - Ray supports different [placement strategies](https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html#placement-strategy) for guiding the physical distribution of workers. To ensure all workers are on the same node, use "STRICT_PACK".
* `resources_per_worker` - we use `resources_per_worker` to set [Ray custom resources](https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#id1) and place the models on specific node types. An example configuration of `resources_per_worker` involves setting `accelerator_type:L4` to 0.001 for a Llama-2-7b model to be deployed on an L4 GPU. This must always be set to 0.001. The `num_gpus_per_worker` configuration along with number of GPUs available on the node will determine the number of workers Ray schedules on the node. The supported accelerator types are: T4, L4, A10G, A100-40G and A100-80G.
* `resources_per_worker` - we use `resources_per_worker` to set [Ray custom resources](https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#id1) and place the models on specific node types. An example configuration of `resources_per_worker` involves setting `accelerator_type:L4` to 0.001 for a Llama-2-7b model to be deployed on an L4 GPU. This must always be set to 0.001. The `num_gpus_per_worker` configuration along with number of GPUs available on the node will determine the number of workers Ray schedules on the node. The supported accelerator types are: T4, L4, A10G, A100-40G and A100-80G.
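
Putting the `scaling_config` fields above together, a sketch for a single-L4 deployment might look like this (values are illustrative):

```yaml
scaling_config:
  num_workers: 1                     # workers scheduled for the model
  num_gpus_per_worker: 1             # always 1
  num_cpus_per_worker: 8
  placement_strategy: "STRICT_PACK"  # keep all workers on the same node
  resources_per_worker:
    "accelerator_type:L4": 0.001     # pins the model to L4 nodes
```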

## My deployment isn't starting/working correctly, how can I debug?

6 changes: 3 additions & 3 deletions templates/endpoints/CustomModels.md
@@ -1,11 +1,11 @@
# Adding a new model

RayLLM supports fine-tuned versions of models in the `models` directory as well as model architectures supported by [vLLM](https://docs.vllm.ai/en/latest/models/supported_models.html). You can either bring a model from HuggingFace or artifact storage like S3, GCS.
RayLLM supports fine-tuned versions of models in the `models` directory as well as model architectures supported by [vLLM](https://docs.vllm.ai/en/latest/models/supported_models.html). You can either bring a model from HuggingFace or artifact storage like S3, GCS.

## Configuring a new model

To add an entirely new model to the zoo, you will need to create a new YAML file.
This file should follow the naming convention
This file should follow the naming convention
`<organisation-name>--<model-name>-<model-parameters>-<extra-info>.yaml`. We recommend using one of the existing models as a template (ideally, one that is the same architecture and number of parameters as the model you are adding). The examples in the `models` directory should help you get started. You can look at the [Advanced Model Configs](./AdvancedModelConfigs.md) for more details on these configurations.

```yaml
@@ -75,7 +75,7 @@ scaling_config:

```

## Adding a private model
## Adding a private model

For loading a model from S3 or GCS, set `engine_config.s3_mirror_config.bucket_uri` or `engine_config.gcs_mirror_config.bucket_uri` to point to a folder containing your model and tokenizer files (`config.json`, `tokenizer_config.json`, `.bin`/`.safetensors` files, etc.) and set `engine_config.model_id` to any ID you desire in the `organization/model` format, eg. `myorganization/llama2-finetuned`. The model will be downloaded to a folder in the `<TRANSFORMERS_CACHE>/models--<organization-name>--<model-name>/snapshots/<HASH>` directory on each node in the cluster. `<HASH>` will be determined by the contents of `hash` file in the S3 folder, or default to `0000000000000000000000000000000000000000`. See the [HuggingFace transformers documentation](https://huggingface.co/docs/transformers/main/en/installation#cache-setup).
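
For example (a sketch with a made-up bucket path), loading a fine-tuned model from S3 per the description above:

```yaml
engine_config:
  model_id: myorganization/llama2-finetuned       # any organization/model ID you choose
  s3_mirror_config:
    bucket_uri: s3://my-bucket/llama2-finetuned/  # folder with config.json, tokenizer files, and weights
```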

4 changes: 2 additions & 2 deletions templates/endpoints/DeployFunctionCalling.md
@@ -21,11 +21,11 @@ For Example, you can see `models/mistral/mistralai--Mistral-7B-Instruct-v0.1_a10
enable_json_logits_processors: true
```

2. Set `standalone_function_calling_model: true` in top level configuration.
2. Set `standalone_function_calling_model: true` in top level configuration.
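
A hypothetical sketch combining the two settings in one model config; the source only states that `standalone_function_calling_model` is top-level and that `enable_json_logits_processors` goes in the model YAML, so the nesting under `engine_kwargs` here is an assumption:

```yaml
standalone_function_calling_model: true   # top-level flag from step 2
engine_config:
  engine_kwargs:
    enable_json_logits_processors: true   # assumed location; check the Mistral example config
```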

# Step 2 - Deploying & Querying Function calling model

`func_calling-serve.yaml` and `func_calling-query.py` are provided for you in this template.
`func_calling-serve.yaml` and `func_calling-query.py` are provided for you in this template.

In order to deploy a model in function calling mode you need to edit `func_calling-serve.yaml`:
Under `function_calling_models` add path to the model you want to use. You can add multiple model
4 changes: 2 additions & 2 deletions templates/endpoints/DeployLora.md
@@ -1,10 +1,10 @@
# Serving LoRA Models

We support serving multiple LoRA adapters with a common base model in the same request batch which allows you to serve a wide variety of use-cases without increasing hardware spend. In addition, we use Serve multiplexing to reduce the number of swaps for LoRA adapters. There is a slight latency overhead to serving a LoRA model compared to the base model, typically 10-20%.
We support serving multiple LoRA adapters with a common base model in the same request batch which allows you to serve a wide variety of use-cases without increasing hardware spend. In addition, we use Serve multiplexing to reduce the number of swaps for LoRA adapters. There is a slight latency overhead to serving a LoRA model compared to the base model, typically 10-20%.

# Setup LoRA Model Deployment

`lora-serve.yaml` and `lora-query.py` are provided for you in this template.
`lora-serve.yaml` and `lora-query.py` are provided for you in this template.

In order to deploy LoRA adapters you would need to update `lora-serve.yaml`:
1. `dynamic_lora_loading_path` - The LoRA checkpoints are loaded from the artifact storage path specified in `dynamic_lora_loading_path`. The path to the checkpoints must be in the following format: `{dynamic_lora_loading_path}/{base_model_id}:{suffix}:{id}`, e.g. `s3://my-bucket/my-lora-checkouts/meta-llama/Llama-2-7b-chat-hf:lora-model:1234`. The models can be loaded from any accessible AWS S3 or Google Cloud Storage bucket. You can use an existing bucket where you have the LoRA models or can upload the models to `$ANYSCALE_ARTIFACT_STORAGE` already provided by Anyscale Workspace. New models can be uploaded to the `dynamic_lora_loading_path` dynamically before or after the Serve application is launched.
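
As a sketch (surrounding structure omitted, bucket name invented), the setting and the checkpoint layout it expects:

```yaml
# In lora-serve.yaml:
dynamic_lora_loading_path: s3://my-bucket/my-lora-checkouts
# Checkpoints are then expected at paths like:
#   s3://my-bucket/my-lora-checkouts/meta-llama/Llama-2-7b-chat-hf:lora-model:1234
```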
Expand Down
4 changes: 2 additions & 2 deletions templates/endpoints/EmbeddingModels.md
@@ -4,7 +4,7 @@ We support serving embedding models available in HuggingFace as well as optimizi

# Setting up Model

See an example for serving embedding models in `embedding-serve.yaml`. Notably the `args` field in the yaml file needs to contain the `embedding_models` field. This field contains a list of YAML files (in the `models` directory) for the embedding models you want to deploy.
See an example for serving embedding models in `embedding-serve.yaml`. Notably the `args` field in the yaml file needs to contain the `embedding_models` field. This field contains a list of YAML files (in the `models` directory) for the embedding models you want to deploy.
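
A hypothetical sketch of the relevant part of `embedding-serve.yaml`; only the `args`/`embedding_models` relationship is taken from the description above, and the surrounding keys may differ in the actual file:

```yaml
args:
  embedding_models:
    - ./models/embedding_models/BAAI--bge-large-en-v1.5.yaml
```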

In order to deploy an embedding model run:
```shell
@@ -21,7 +21,7 @@ python embedding-query.py

# Optimizing Embedding Models

We support optimizing embedding models with ONNX. In order to enable this, set the flag under `engine_config` in your model yaml file. See `models/embedding_models\BAAI--bge-large-en-v1.5.yaml` for an example.
We support optimizing embedding models with ONNX. In order to enable this, set the flag under `engine_config` in your model yaml file. See `models/embedding_models\BAAI--bge-large-en-v1.5.yaml` for an example.

```shell
engine_config: