Commit
Merge remote-tracking branch 'origin' into scottjlee/text-embeddings
scottjlee committed Feb 21, 2024
2 parents e3b375e + 5768dfd commit 6a67e6a
Showing 93 changed files with 1,576 additions and 305 deletions.
14 changes: 14 additions & 0 deletions .github/workflows/pre-commit.yaml
@@ -0,0 +1,14 @@
name: pre-commit

on:
  pull_request:
  push:
    branches: [main]

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v3
      - uses: pre-commit/action@v3.0.1
20 changes: 20 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,20 @@
default_stages: [commit, push]
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
      - id: check-added-large-files
      - id: trailing-whitespace
        # README might be auto-generated
        exclude: templates/.+/README.md
      - id: end-of-file-fixer
        # README might be auto-generated
        exclude: templates/.+/README.md
  - repo: local
    hooks:
      - id: generate-readme
        name: Auto generate README.md from README.ipynb
        entry: ci/auto-generate-readme.sh
        language: script
        pass_filenames: false
        dependencies: [jupyter]
8 changes: 5 additions & 3 deletions README.md
@@ -6,6 +6,10 @@ These templates are a set of minimal examples & tutorials for customers to run o

If the template is generic to Ray & Ray libraries, please consider adding the template in: https://github.com/ray-project/ray/tree/master/doc/source/templates

To set up the environment:
1. Install pre-commit: `pip install pre-commit`
2. Install the git hook scripts: `pre-commit install`


To add a template:

@@ -23,10 +27,8 @@ To add a template:
Your template does not need to be a Jupyter notebook. It can also be presented as a
Python script.

All templates MUST have a `README.md` file.
All templates MUST have a `README.md` or a `README.ipynb` file.

2. Add your compute configuration under `configs/<your-template-name>` (for both AWS and GCE).

3. Update the product repo `backend/workspace-templates.yaml` to point to the new template added here after being merged.

4. Coming soon: update the integration tests in the product repo.
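
For step 2, a minimal compute config can simply mirror the existing entries under `configs/` — for example, the shape used by `configs/intro-workspaces/aws.yaml` in this commit (instance types and workers are up to the template):

```yaml
# Minimal sketch of configs/<your-template-name>/aws.yaml
head_node_type:
  name: head_node_type
  instance_type: m5.xlarge
worker_node_types: []
```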
31 changes: 31 additions & 0 deletions ci/auto-generate-readme.sh
@@ -0,0 +1,31 @@
#!/bin/bash

# Search for README.ipynb notebook files in the ../templates directory
notebook_files=$(find ../templates -name "README.ipynb")

# Loop through each notebook file
for notebook_file in $notebook_files; do
    # Convert the notebook to README.md in the same directory using nbconvert
    jupyter nbconvert --to markdown "$notebook_file" --output-dir "$(dirname "$notebook_file")"
done

# Define the repo prefix
REPO_PREFIX="https://raw.githubusercontent.com/anyscale/templates/main"

# Search for README.md files in the ../templates directory
readme_files=$(find ../templates -name "README.md")

# Loop through each README file
for readme_file in $readme_files; do
    # Extract the path of the directory containing the README file, relative to the repository root
    readme_dir=$(dirname "$readme_file" | sed "s|\.\./templates/||")

    # Check the operating system
    if [[ "$OSTYPE" == "darwin"* ]]; then
        # macOS (BSD sed requires an explicit backup-suffix argument)
        sed -i '' "s|<img src=\"\([^\"http://][^\":/][^\"].*\)\"|<img src=\"${REPO_PREFIX}/${readme_dir}/\1\"|g" "$readme_file"
    else
        # Assuming Linux (GNU sed)
        sed -i "s|<img src=\"\([^\"http://][^\":/][^\"].*\)\"|<img src=\"${REPO_PREFIX}/${readme_dir}/\1\"|g" "$readme_file"
    fi
done
2 changes: 1 addition & 1 deletion configs/anyscale-ray-101/aws.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: m5.2xlarge
min_workers: 0
max_workers: 2
use_spot: false
use_spot: false
2 changes: 1 addition & 1 deletion configs/anyscale-ray-101/gce.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: n2-standard-8
min_workers: 0
max_workers: 2
use_spot: false
use_spot: false
8 changes: 4 additions & 4 deletions configs/endpoints/gcp.yaml
@@ -87,7 +87,7 @@ worker_node_types:
max_workers: 100
use_spot: true
fallback_to_ondemand: true
- name: gpu-worker-a100-40g-1
- name: gpu-worker-a100-40g-1
instance_type: a2-highgpu-1g-nvidia-a100-40gb-1
resources:
cpu:
@@ -98,7 +98,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-2
- name: gpu-worker-a100-40g-2
instance_type: a2-highgpu-2g-nvidia-a100-40gb-2
resources:
cpu:
@@ -109,7 +109,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-4
- name: gpu-worker-a100-40g-4
instance_type: a2-highgpu-4g-nvidia-a100-40gb-4
resources:
cpu:
@@ -120,7 +120,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-8
- name: gpu-worker-a100-40g-8
instance_type: a2-highgpu-8g-nvidia-a100-40gb-8
resources:
cpu:
8 changes: 4 additions & 4 deletions configs/endpoints_v2/gcp.yaml
@@ -87,7 +87,7 @@ worker_node_types:
max_workers: 100
use_spot: true
fallback_to_ondemand: true
- name: gpu-worker-a100-40g-1
- name: gpu-worker-a100-40g-1
instance_type: a2-highgpu-1g-nvidia-a100-40gb-1
resources:
cpu:
@@ -98,7 +98,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-2
- name: gpu-worker-a100-40g-2
instance_type: a2-highgpu-2g-nvidia-a100-40gb-2
resources:
cpu:
@@ -109,7 +109,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-4
- name: gpu-worker-a100-40g-4
instance_type: a2-highgpu-4g-nvidia-a100-40gb-4
resources:
cpu:
@@ -120,7 +120,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-8
- name: gpu-worker-a100-40g-8
instance_type: a2-highgpu-8g-nvidia-a100-40gb-8
resources:
cpu:
2 changes: 1 addition & 1 deletion configs/fine-tune-GPTJ/aws.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: g4dn.4xlarge
min_workers: 0
max_workers: 15
use_spot: false
use_spot: false
2 changes: 1 addition & 1 deletion configs/fine-tune-GPTJ/gce.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: n1-standard-16-nvidia-t4-16gb-1
min_workers: 1
max_workers: 15
use_spot: false
use_spot: false
2 changes: 1 addition & 1 deletion configs/fine-tune-llama2/aws.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: g5.4xlarge
min_workers: 0
max_workers: 100
use_spot: false
use_spot: false
2 changes: 1 addition & 1 deletion configs/fine-tune-llama2/gce.yaml
@@ -7,4 +7,4 @@ worker_node_types:
instance_type: g2-standard-16-nvidia-l4-1
min_workers: 0
max_workers: 100
use_spot: false
use_spot: false
4 changes: 4 additions & 0 deletions configs/fine-tune-llm/aws.yaml
@@ -0,0 +1,4 @@
head_node_type:
  name: head-node-type
  instance_type: m5.xlarge
worker_node_types: []
4 changes: 4 additions & 0 deletions configs/fine-tune-llm/gce.yaml
@@ -0,0 +1,4 @@
head_node_type:
  name: head_node_type
  instance_type: n1-standard-4
worker_node_types: []
1 change: 1 addition & 0 deletions configs/intro-workspaces/aws.yaml
@@ -1,3 +1,4 @@
head_node_type:
  name: head_node_type
  instance_type: m5.xlarge
worker_node_types: []
1 change: 1 addition & 0 deletions configs/intro-workspaces/gce.yaml
@@ -1,3 +1,4 @@
head_node_type:
  name: head_node_type
  instance_type: n2-standard-4
worker_node_types: []
8 changes: 4 additions & 4 deletions configs/serve-stable-diffusion-aica/gcp.yaml
@@ -87,7 +87,7 @@ worker_node_types:
max_workers: 100
use_spot: true
fallback_to_ondemand: true
- name: gpu-worker-a100-40g-1
- name: gpu-worker-a100-40g-1
instance_type: a2-highgpu-1g-nvidia-a100-40gb-1
resources:
cpu:
@@ -98,7 +98,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-2
- name: gpu-worker-a100-40g-2
instance_type: a2-highgpu-2g-nvidia-a100-40gb-2
resources:
cpu:
@@ -109,7 +109,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-4
- name: gpu-worker-a100-40g-4
instance_type: a2-highgpu-4g-nvidia-a100-40gb-4
resources:
cpu:
@@ -120,7 +120,7 @@ worker_node_types:
"accelerator_type:A100-40G": 1
min_workers: 0
max_workers: 100
- name: gpu-worker-a100-40g-8
- name: gpu-worker-a100-40g-8
instance_type: a2-highgpu-8g-nvidia-a100-40gb-8
resources:
cpu:
1 change: 1 addition & 0 deletions configs/serve-stable-diffusion/aws.yaml
@@ -7,3 +7,4 @@ worker_node_types:
min_workers: 0
max_workers: 100
use_spot: false
auto_select_worker_config: true
7 changes: 4 additions & 3 deletions configs/serve-stable-diffusion/gce.yaml
@@ -1,10 +1,11 @@
# n1-standard-8-nvidia-tesla-t4-1 --> 8 CPUs, 1 GPU
# n1-standard-8-nvidia-t4-16gb-1 --> 8 CPUs, 1 GPU
head_node_type:
name: head_node_type
instance_type: n1-standard-8-nvidia-tesla-t4-1
instance_type: n1-standard-8-nvidia-t4-16gb-1
worker_node_types:
- name: gpu_worker
instance_type: n1-standard-8-nvidia-tesla-t4-1
instance_type: n1-standard-8-nvidia-t4-16gb-1
min_workers: 0
max_workers: 100
use_spot: false
auto_select_worker_config: true
12 changes: 6 additions & 6 deletions templates/endpoints/AdvancedModelConfigs.md
@@ -5,10 +5,10 @@ Each model is defined by a YAML configuration file in the `models` directory.
## Modify an existing model

To modify an existing model, simply edit the YAML file for that model.
Each config file consists of three sections:
Each config file consists of three sections:

- `deployment_config`,
- `engine_config`,
- `deployment_config`,
- `engine_config`,
- `scaling_config`.

It's best to check out examples of existing models to see how they are configured.
@@ -24,7 +24,7 @@ and specifies how to [auto-scale the model](https://docs.ray.io/en/latest/serve/
* `max_concurrent_queries` - Maximum number of queries that a Ray Serve replica can process at a time. Additional queries are queued at the proxy.
* `target_num_ongoing_requests_per_replica` - Guides the auto-scaling behavior. If the average number of ongoing requests across replicas is above this number, Ray Serve attempts to scale up the number of replicas, and vice-versa for downscaling. We typically set this to ~40% of the `max_concurrent_queries`.
* `ray_actor_options` - Similar to the `resources_per_worker` configuration in the `scaling_config`. Refer to the `scaling_config` section for more guidance.
* `smoothing_factor` - The multiplicative factor to amplify or moderate each upscaling or downscaling decision. A value less than 1.0 will slow down the scaling decision made in each step. See [advanced auto-scaling guide](https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html#optional-define-how-the-system-reacts-to-changing-traffic) for more details.
* `smoothing_factor` - The multiplicative factor to amplify or moderate each upscaling or downscaling decision. A value less than 1.0 will slow down the scaling decision made in each step. See [advanced auto-scaling guide](https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html#optional-define-how-the-system-reacts-to-changing-traffic) for more details.
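
As a rough illustration (not one of the bundled configs), a `deployment_config` built from the fields above might look like the following; the nesting of the autoscaling fields assumes Ray Serve's `autoscaling_config`, and every value is only an example:

```yaml
deployment_config:
  max_concurrent_queries: 64
  ray_actor_options:
    resources:
      "accelerator_type:A10G": 0.001
  autoscaling_config:
    min_replicas: 1
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 24  # roughly 40% of max_concurrent_queries
    smoothing_factor: 0.6                        # <1.0 dampens each scaling decision
```

Treat these numbers as a starting point and tune them against the per-replica load you actually observe.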

## Engine config

@@ -36,7 +36,7 @@ RayLLM supports continuous batching, meaning incoming requests are processed as

* `model_id` is the ID that refers to the model in the RayLLM or OpenAI API.
* `type` is the type of inference engine. Only `VLLMEngine` is currently supported.
* `engine_kwargs` and `max_total_tokens` are configuration options for the inference engine (e.g. gpu memory utilization, quantization, max number of concurrent sequences). These options may vary depending on the hardware accelerator type and model size. We have tuned the parameters in the configuration files included in RayLLM for you to use as reference.
* `engine_kwargs` and `max_total_tokens` are configuration options for the inference engine (e.g. gpu memory utilization, quantization, max number of concurrent sequences). These options may vary depending on the hardware accelerator type and model size. We have tuned the parameters in the configuration files included in RayLLM for you to use as reference.
* `generation` contains configurations related to default generation parameters such as `prompt_format` and `stopping_sequences`.
* `hf_model_id` is the Hugging Face model ID. If not specified, defaults to `model_id`.
* `runtime_env` is a dictionary that contains Ray runtime environment configuration. It allows you to set per-model pip packages and environment variables. See [Ray documentation on Runtime Environments](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) for more information.
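
A hedged sketch of an `engine_config` assembled from the fields described above; the model ID is made up, and the specific `engine_kwargs` accepted depend on the vLLM version, so treat them as placeholders:

```yaml
engine_config:
  model_id: myorganization/my-model      # ID exposed through the OpenAI-compatible API
  hf_model_id: myorganization/my-model   # optional; defaults to model_id
  type: VLLMEngine
  engine_kwargs:
    gpu_memory_utilization: 0.9          # illustrative vLLM options
    max_num_seqs: 64
  max_total_tokens: 4096
  generation:
    stopping_sequences: []
  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: "<your-token>"   # hypothetical per-model environment variable
```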
@@ -51,7 +51,7 @@ Finally, the `scaling_config` section specifies what resources should be used to
* `num_gpus_per_worker` - Number of GPUs to be allocated per worker. This should always be 1.
* `num_cpus_per_worker` - Number of CPUs to be allocated per worker. Usually set to 8.
* `placement_strategy` - Ray supports different [placement strategies](https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html#placement-strategy) for guiding the physical distribution of workers. To ensure all workers are on the same node, use "STRICT_PACK".
* `resources_per_worker` - we use `resources_per_worker` to set [Ray custom resources](https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#id1) and place the models on specific node types. An example configuration of `resources_per_worker` involves setting `accelerator_type:L4` to 0.001 for a Llama-2-7b model to be deployed on an L4 GPU. This must always be set to 0.001. The `num_gpus_per_worker` configuration along with number of GPUs available on the node will determine the number of workers Ray schedules on the node. The supported accelerator types are: T4, L4, A10G, A100-40G and A100-80G.
* `resources_per_worker` - we use `resources_per_worker` to set [Ray custom resources](https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#id1) and place the models on specific node types. An example configuration of `resources_per_worker` involves setting `accelerator_type:L4` to 0.001 for a Llama-2-7b model to be deployed on an L4 GPU. This must always be set to 0.001. The `num_gpus_per_worker` configuration along with number of GPUs available on the node will determine the number of workers Ray schedules on the node. The supported accelerator types are: T4, L4, A10G, A100-40G and A100-80G.
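
Putting the `scaling_config` fields above together, a sketch for a single-L4 deployment might look like this (values are illustrative):

```yaml
scaling_config:
  num_workers: 1                     # workers scheduled for the model
  num_gpus_per_worker: 1             # always 1
  num_cpus_per_worker: 8
  placement_strategy: "STRICT_PACK"  # keep all workers on the same node
  resources_per_worker:
    "accelerator_type:L4": 0.001     # pins the model to L4 nodes
```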

## My deployment isn't starting/working correctly, how can I debug?

6 changes: 3 additions & 3 deletions templates/endpoints/CustomModels.md
@@ -1,11 +1,11 @@
# Adding a new model

RayLLM supports fine-tuned versions of models in the `models` directory as well as model architectures supported by [vLLM](https://docs.vllm.ai/en/latest/models/supported_models.html). You can either bring a model from HuggingFace or artifact storage like S3, GCS.
RayLLM supports fine-tuned versions of models in the `models` directory as well as model architectures supported by [vLLM](https://docs.vllm.ai/en/latest/models/supported_models.html). You can either bring a model from HuggingFace or artifact storage like S3, GCS.

## Configuring a new model

To add an entirely new model to the zoo, you will need to create a new YAML file.
This file should follow the naming convention
This file should follow the naming convention
`<organisation-name>--<model-name>-<model-parameters>-<extra-info>.yaml`. We recommend using one of the existing models as a template (ideally, one that is the same architecture and number of parameters as the model you are adding). The examples in the `models` directory should help you get started. You can look at the [Advanced Model Configs](./AdvancedModelConfigs.md) for more details on these configurations.

```yaml
@@ -75,7 +75,7 @@ scaling_config:

```

## Adding a private model
## Adding a private model

For loading a model from S3 or GCS, set `engine_config.s3_mirror_config.bucket_uri` or `engine_config.gcs_mirror_config.bucket_uri` to point to a folder containing your model and tokenizer files (`config.json`, `tokenizer_config.json`, `.bin`/`.safetensors` files, etc.) and set `engine_config.model_id` to any ID you desire in the `organization/model` format, eg. `myorganization/llama2-finetuned`. The model will be downloaded to a folder in the `<TRANSFORMERS_CACHE>/models--<organization-name>--<model-name>/snapshots/<HASH>` directory on each node in the cluster. `<HASH>` will be determined by the contents of `hash` file in the S3 folder, or default to `0000000000000000000000000000000000000000`. See the [HuggingFace transformers documentation](https://huggingface.co/docs/transformers/main/en/installation#cache-setup).
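
For example (a sketch with a made-up bucket path), loading a fine-tuned model from S3 per the description above:

```yaml
engine_config:
  model_id: myorganization/llama2-finetuned       # any organization/model ID you choose
  s3_mirror_config:
    bucket_uri: s3://my-bucket/llama2-finetuned/  # folder with config.json, tokenizer files, and weights
```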

4 changes: 2 additions & 2 deletions templates/endpoints/DeployFunctionCalling.md
@@ -21,11 +21,11 @@ For Example, you can see `models/mistral/mistralai--Mistral-7B-Instruct-v0.1_a10
enable_json_logits_processors: true
```

2. Set `standalone_function_calling_model: true` in top level configuration.
2. Set `standalone_function_calling_model: true` in top level configuration.
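
A hypothetical sketch combining the two settings in one model config; the source only states that `standalone_function_calling_model` is top-level and that `enable_json_logits_processors` goes in the model YAML, so the nesting under `engine_kwargs` here is an assumption:

```yaml
standalone_function_calling_model: true   # top-level flag from step 2
engine_config:
  engine_kwargs:
    enable_json_logits_processors: true   # assumed location; check the Mistral example config
```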

# Step 2 - Deploying & Querying Function calling model

`func_calling-serve.yaml` and `func_calling-query.py` are provided for you in this template.
`func_calling-serve.yaml` and `func_calling-query.py` are provided for you in this template.

In order to deploy a model in function calling mode you need to edit `func_calling-serve.yaml`:
Under `function_calling_models` add path to the model you want to use. You can add multiple model
4 changes: 2 additions & 2 deletions templates/endpoints/DeployLora.md
@@ -1,10 +1,10 @@
# Serving LoRA Models

We support serving multiple LoRA adapters with a common base model in the same request batch which allows you to serve a wide variety of use-cases without increasing hardware spend. In addition, we use Serve multiplexing to reduce the number of swaps for LoRA adapters. There is a slight latency overhead to serving a LoRA model compared to the base model, typically 10-20%.
We support serving multiple LoRA adapters with a common base model in the same request batch which allows you to serve a wide variety of use-cases without increasing hardware spend. In addition, we use Serve multiplexing to reduce the number of swaps for LoRA adapters. There is a slight latency overhead to serving a LoRA model compared to the base model, typically 10-20%.

# Setup LoRA Model Deployment

`lora-serve.yaml` and `lora-query.py` are provided for you in this template.
`lora-serve.yaml` and `lora-query.py` are provided for you in this template.

In order to deploy LoRA adapters you would need to update `lora-serve.yaml`:
1. `dynamic_lora_loading_path` - The LoRA checkpoints are loaded from the artifact storage path specified in `dynamic_lora_loading_path`. The path to the checkpoints must be in the following format: `{dynamic_lora_loading_path}/{base_model_id}:{suffix}:{id}`, e.g. `s3://my-bucket/my-lora-checkouts/meta-llama/Llama-2-7b-chat-hf:lora-model:1234`. The models can be loaded from any accessible AWS S3 or Google Cloud Storage bucket. You can use an existing bucket where you have the LoRA models or can upload the models to `$ANYSCALE_ARTIFACT_STORAGE` already provided by Anyscale Workspace. New models can be uploaded to the `dynamic_lora_loading_path` dynamically before or after the Serve application is launched.
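
As a sketch (surrounding structure omitted, bucket name invented), the setting and the checkpoint layout it expects:

```yaml
# In lora-serve.yaml:
dynamic_lora_loading_path: s3://my-bucket/my-lora-checkouts
# Checkpoints are then expected at paths like:
#   s3://my-bucket/my-lora-checkouts/meta-llama/Llama-2-7b-chat-hf:lora-model:1234
```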
Expand Down
4 changes: 2 additions & 2 deletions templates/endpoints/EmbeddingModels.md
@@ -4,7 +4,7 @@ We support serving embedding models available in HuggingFace as well as optimizi

# Setting up Model

See an example for serving embedding models in `embedding-serve.yaml`. Notably the `args` field in the yaml file needs to contain the `embedding_models` field. This field contains a list of YAML files (in the `models` directory) for the embedding models you want to deploy.
See an example for serving embedding models in `embedding-serve.yaml`. Notably the `args` field in the yaml file needs to contain the `embedding_models` field. This field contains a list of YAML files (in the `models` directory) for the embedding models you want to deploy.
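
A hypothetical sketch of the relevant part of `embedding-serve.yaml`; only the `args`/`embedding_models` relationship is taken from the description above, and the surrounding keys may differ in the actual file:

```yaml
args:
  embedding_models:
    - ./models/embedding_models/BAAI--bge-large-en-v1.5.yaml
```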

In order to deploy an embedding model run:
```shell
@@ -21,7 +21,7 @@ python embedding-query.py

# Optimizing Embedding Models

We support optimizing embedding models with ONNX. In order to enable this, set the flag under `engine_config` in your model yaml file. See `models/embedding_models\BAAI--bge-large-en-v1.5.yaml` for an example.
We support optimizing embedding models with ONNX. In order to enable this, set the flag under `engine_config` in your model yaml file. See `models/embedding_models\BAAI--bge-large-en-v1.5.yaml` for an example.

```shell
engine_config: