
Merge branch 'main' into etang/finetuning-logging
SumanthRH authored Sep 18, 2024
2 parents c3962ad + 7c03379 commit c6c3b7b
Showing 25 changed files with 791 additions and 447 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/pre-commit.yaml
@@ -11,10 +11,12 @@ jobs:
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v3
with:
python-version: '3.9'

# Install pre-commit dependencies
- name: Install pre-commit
-run: pip install pre-commit jupyter
+run: pip install pre-commit==3.8.0 jupyter==1.1.1

# Run pre-commit hooks with verbose logging
- name: Run pre-commit
2 changes: 1 addition & 1 deletion templates/batch-llm/README.ipynb
@@ -423,7 +423,7 @@
"### Monitoring Dataset execution\n",
"We can use the Ray Dashboard to monitor the Dataset execution. In the Ray Dashboard tab, navigate to the Job page and open the \"Ray Data Overview\" section. Click on the link for the running job, and open the \"Ray Data Overview\" section to view the details of the batch inference execution:\n",
"\n",
-"<img src=\"assets/ray-data-jobs.png\" width=900px/>"
+"<img src=\"assets/ray-data-jobs.png\" width=900px />"
]
},
{
2 changes: 1 addition & 1 deletion templates/batch-llm/README.md
@@ -288,7 +288,7 @@ print(f"Batch inference result is written into {output_path}.")
### Monitoring Dataset execution
We can use the Ray Dashboard to monitor the Dataset execution. In the Ray Dashboard tab, navigate to the Job page and open the "Ray Data Overview" section. Click on the link for the running job, and open the "Ray Data Overview" section to view the details of the batch inference execution:

-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/batch-llm/assets/ray-data-jobs.png" width=900px/>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/batch-llm/assets/ray-data-jobs.png" width=900px />

### Handling GPU out-of-memory failures
If you run into CUDA out of memory, your batch size is likely too large. Decrease the batch size as described above.
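A simple recovery policy, sketched here as a hypothetical helper rather than anything this template ships, is to halve the batch size after each out-of-memory failure until a floor is reached:

```python
def next_batch_size_after_oom(batch_size, minimum=1):
    # Back off after a CUDA out-of-memory error: halve the batch size,
    # never going below `minimum`. Raise if already at the floor, since
    # retrying the same size would just fail again.
    if batch_size <= minimum:
        raise RuntimeError(
            f"Batch size already at minimum ({minimum}); cannot reduce further."
        )
    return max(batch_size // 2, minimum)
```

For example, a batch size of 64 would step down through 32, 16, 8, and so on until inference fits in GPU memory.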
607 changes: 477 additions & 130 deletions templates/e2e-llm-workflows/README.ipynb

Large diffs are not rendered by default.

375 changes: 197 additions & 178 deletions templates/e2e-llm-workflows/README.md

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion templates/e2e-llm-workflows/deploy/jobs/ft.yaml
@@ -1,5 +1,6 @@
name: e2e-llm-workflows
entrypoint: llmforge anyscale finetune configs/training/lora/llama-3-8b.yaml
-image_uri: localhost:5555/anyscale/llm-forge:0.5.3
+image_uri: localhost:5555/anyscale/llm-forge:0.5.4
requirements: []
max_retries: 1
excludes: ["assets"]
75 changes: 0 additions & 75 deletions templates/e2e-llm-workflows/src/ft.py

This file was deleted.

52 changes: 51 additions & 1 deletion templates/e2e-llm-workflows/src/utils.py
@@ -1,8 +1,19 @@
from google.cloud import storage
from contextlib import contextmanager
import os
from tempfile import TemporaryDirectory
import boto3
from urllib.parse import urlparse
from ray.data import Dataset


-def download_files_from_bucket(bucket, path, local_dir):
+def download_files_from_s3(s3_uri, local_dir):
parsed_uri = urlparse(s3_uri)
if parsed_uri.scheme != "s3":
raise ValueError(f"Expected S3 URI, got {s3_uri}")
bucket = parsed_uri.netloc
path = parsed_uri.path.lstrip("/")

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket, Prefix=path)
@@ -13,3 +24,42 @@ def download_files_from_bucket(bucket, path, local_dir):
os.makedirs(os.path.dirname(local_path), exist_ok=True)
s3.download_file(bucket, key, local_path)
print(f"Downloaded {key} to {local_path}")

def download_files_from_gcs(gcs_uri, local_dir):
parsed_uri = urlparse(gcs_uri)
if parsed_uri.scheme != "gs":
raise ValueError(f"Expected GCS URI, got {gcs_uri}")
bucket_name = parsed_uri.netloc
prefix = parsed_uri.path.lstrip("/")

storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)
for blob in blobs:
# Skip in case the blob is the root folder
if blob.name.rstrip("/") == prefix:
continue
local_path = os.path.join(local_dir, blob.name)
os.makedirs(os.path.dirname(local_path), exist_ok=True)
blob.download_to_filename(local_path)
print(f"Downloaded {blob.name} to {local_path}")

def download_files_from_remote(uri, local_dir):
parsed_uri = urlparse(uri)
if parsed_uri.scheme == "gs":
download_files_from_gcs(uri, local_dir)
elif parsed_uri.scheme == "s3":
download_files_from_s3(uri, local_dir)
else:
raise ValueError(f"Expected S3 or GCS URI, got {uri}")


@contextmanager
def get_dataset_file_path(dataset: Dataset):
"""Write a Ray `Dataset` to a single temporary JSON file on disk
and yield the path to the file."""
with TemporaryDirectory() as temp_path:
dataset.repartition(1).write_json(temp_path)
assert len(os.listdir(temp_path)) == 1, "The dataset should be written to a single file"
dataset_file_path = f"{temp_path}/{os.listdir(temp_path)[0]}"
yield dataset_file_path
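The new helpers in `utils.py` combine two small patterns worth noting: dispatch on the URI scheme (`s3://` vs. `gs://`) and a context manager that materializes data into a temporary file whose lifetime is tied to the `with` block. A minimal, self-contained sketch of both, using plain JSON rows instead of a Ray `Dataset` and a hypothetical `classify_remote_uri` in place of the real downloaders:

```python
import json
import os
from contextlib import contextmanager
from tempfile import TemporaryDirectory
from urllib.parse import urlparse


def classify_remote_uri(uri):
    # Mirrors download_files_from_remote's routing: pick a backend from
    # the URI scheme, rejecting anything that is not s3:// or gs://.
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        return "s3"
    if scheme == "gs":
        return "gcs"
    raise ValueError(f"Expected S3 or GCS URI, got {uri}")


@contextmanager
def rows_as_json_file(rows):
    # Analogous to get_dataset_file_path: write the rows to a single
    # JSON-lines file in a temporary directory and yield its path.
    # The directory (and the file) are deleted when the context exits.
    with TemporaryDirectory() as temp_path:
        file_path = os.path.join(temp_path, "data.jsonl")
        with open(file_path, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        yield file_path
```

Downstream code gets an ordinary file path inside the `with` block, and nothing leaks afterwards: for instance, `classify_remote_uri("gs://my-bucket/data")` returns `"gcs"`, and the path yielded by `rows_as_json_file` no longer exists once the block ends.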
6 changes: 3 additions & 3 deletions templates/endpoints_v2/README.ipynb
@@ -186,11 +186,11 @@
"source": [
"After the command runs, click the deploy notification (or navigate to ``Home > Services``) to access the Service UI:\n",
"\n",
-"<img src=\"assets/service-notify.png\" width=500px/>\n",
+"<img src=\"assets/service-notify.png\" width=500px />\n",
"\n",
"Navigate to the Service UI and wait for the service to reach \"Active\". It will begin in \"Starting\" state:\n",
"\n",
-"<img src=\"assets/service-starting.png\" width=600px/>"
+"<img src=\"assets/service-starting.png\" width=600px />"
]
},
{
@@ -204,7 +204,7 @@
"\n",
"You can also find this information by clicking the \"Query\" button in the Service UI.\n",
"\n",
-"<img src=\"assets/service-query.png\" width=600px/>"
+"<img src=\"assets/service-query.png\" width=600px />"
]
},
{
6 changes: 3 additions & 3 deletions templates/endpoints_v2/README.md
@@ -123,11 +123,11 @@ To deploy an application with one model as an Anyscale Service, update the file

After the command runs, click the deploy notification (or navigate to ``Home > Services``) to access the Service UI:

-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/endpoints_v2/assets/service-notify.png" width=500px/>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/endpoints_v2/assets/service-notify.png" width=500px />

Navigate to the Service UI and wait for the service to reach "Active". It will begin in "Starting" state:

-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/endpoints_v2/assets/service-starting.png" width=600px/>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/endpoints_v2/assets/service-starting.png" width=600px />


## Step 5 - Query the service endpoint
@@ -136,7 +136,7 @@ The above command should print something like `(anyscale +2.9s) curl -H 'Authori

You can also find this information by clicking the "Query" button in the Service UI.

-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/endpoints_v2/assets/service-query.png" width=600px/>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/endpoints_v2/assets/service-query.png" width=600px />


```python
4 changes: 2 additions & 2 deletions templates/intro-batch-inference/README.ipynb
@@ -90,7 +90,7 @@
"\n",
"In the Ray Dashboard tab, navigate to the Job page and open the \"Ray Data Overview\" section to view the details of the batch inference execution:\n",
"\n",
-"<img src=\"assets/ray-data-job.png\" width=800px/>\n",
+"<img src=\"assets/ray-data-job.png\" width=800px />\n",
"\n"
]
},
@@ -134,7 +134,7 @@
"\n",
"The remaining is the same as in the code we ran above. To test this out, first make sure to either enable *Auto-select worker nodes* or configure your workspace cluster to have GPU worker nodes:\n",
"\n",
-"<img src=\"assets/ray-data-gpu.png\" width=300px/>\n",
+"<img src=\"assets/ray-data-gpu.png\" width=300px />\n",
"\n",
"Run the below cell to test out the new code using GPUs:"
]
4 changes: 2 additions & 2 deletions templates/intro-batch-inference/README.md
@@ -70,7 +70,7 @@ Note that above we called ``ds.show()`` in order to print the results to the con

In the Ray Dashboard tab, navigate to the Job page and open the "Ray Data Overview" section to view the details of the batch inference execution:

-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-batch-inference/assets/ray-data-job.png" width=800px/>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-batch-inference/assets/ray-data-job.png" width=800px />



@@ -101,7 +101,7 @@ To use GPUs for inference, make the following changes to your code:

The remaining is the same as in the code we ran above. To test this out, first make sure to either enable *Auto-select worker nodes* or configure your workspace cluster to have GPU worker nodes:

-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-batch-inference/assets/ray-data-gpu.png" width=300px/>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-batch-inference/assets/ray-data-gpu.png" width=300px />

Run the below cell to test out the new code using GPUs:

2 changes: 1 addition & 1 deletion templates/intro-jobs/README.ipynb
@@ -92,7 +92,7 @@
"\n",
"You should see the job state and its output on the overview page.\n",
"\n",
-"<img src=\"assets/anyscale-job.png\" height=400px>"
+"<img src=\"assets/anyscale-job.png\" height=400px />"
]
},
{
2 changes: 1 addition & 1 deletion templates/intro-jobs/README.md
@@ -59,7 +59,7 @@ You can view active and historical job runs at (`Home > Jobs`). Click into the j

You should see the job state and its output on the overview page.

-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-jobs/assets/anyscale-job.png" height=400px>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-jobs/assets/anyscale-job.png" height=400px />

## Submitting a Job programmatically

8 changes: 4 additions & 4 deletions templates/intro-services/README.ipynb
@@ -234,11 +234,11 @@
"\n",
"By clicking on the **Running** service, you can view the status of deployments and how many replicas each contains. For example, your `FastAPIDeployment` has `1` replica.\n",
"\n",
-"<img src=\"assets/service-overview.png\" height=400px>\n",
+"<img src=\"assets/service-overview.png\" height=400px />\n",
"\n",
"In the Logs, you can search for the message “Handling request!” to view each request for easier debugging.\n",
"\n",
-"<img src=\"assets/service-logs.png\" height=400px>\n"
+"<img src=\"assets/service-logs.png\" height=400px />\n"
]
},
{
@@ -275,7 +275,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"<img src=\"assets/service-replicas.png\" height=400px>\n",
+"<img src=\"assets/service-replicas.png\" height=400px />\n",
"\n",
"**Note**: This approach is a way to quickly modify scale for this example. As a best practice in production, define [autoscaling behavior](https://docs.anyscale.com/platform/services/scale-a-service#autoscaling) in the [ServiceConfig](https://docs.anyscale.com/reference/service-api#serviceconfig) contained in a `config.yaml` file. The number of worker nodes that Anyscale launches dynamically scales up and down in response to traffic and is scoped by the overall cluster compute config you define.\n"
]
@@ -304,7 +304,7 @@
"source": [
"In the service overview page, you can monitor the status of the update and see Ray Serve shut down the previous cluster.\n",
"\n",
-"<img src=\"assets/service-rollout.png\" height=400px>\n",
+"<img src=\"assets/service-rollout.png\" height=400px />\n",
"\n",
"**Note**: Using this command triggers an automatic rollout which gradually shifts traffic from the previous cluster, or primary version, to the incoming cluster, or canary version. To learn more about configuring rollout behavior, see [Update a service](https://docs.anyscale.com/platform/services/update-a-service).\n"
]
8 changes: 4 additions & 4 deletions templates/intro-services/README.md
@@ -160,11 +160,11 @@ To view the service, navigate to 🏠 **> Services > `my_service`**. On this pag

By clicking on the **Running** service, you can view the status of deployments and how many replicas each contains. For example, your `FastAPIDeployment` has `1` replica.

-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-services/assets/service-overview.png" height=400px>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-services/assets/service-overview.png" height=400px />

In the Logs, you can search for the message “Handling request!” to view each request for easier debugging.

-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-services/assets/service-logs.png" height=400px>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-services/assets/service-logs.png" height=400px />


### Configure scaling
@@ -193,7 +193,7 @@ my_app = FastAPIDeployment.bind()
```


-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-services/assets/service-replicas.png" height=400px>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-services/assets/service-replicas.png" height=400px />

**Note**: This approach is a way to quickly modify scale for this example. As a best practice in production, define [autoscaling behavior](https://docs.anyscale.com/platform/services/scale-a-service#autoscaling) in the [ServiceConfig](https://docs.anyscale.com/reference/service-api#serviceconfig) contained in a `config.yaml` file. The number of worker nodes that Anyscale launches dynamically scales up and down in response to traffic and is scoped by the overall cluster compute config you define.

@@ -210,7 +210,7 @@ To deploy the update, execute the following command to trigger a staged rollout

In the service overview page, you can monitor the status of the update and see Ray Serve shut down the previous cluster.

-<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-services/assets/service-rollout.png" height=400px>
+<img src="https://raw.githubusercontent.com/anyscale/templates/main/templates/intro-services/assets/service-rollout.png" height=400px />

**Note**: Using this command triggers an automatic rollout which gradually shifts traffic from the previous cluster, or primary version, to the incoming cluster, or canary version. To learn more about configuring rollout behavior, see [Update a service](https://docs.anyscale.com/platform/services/update-a-service).

12 changes: 6 additions & 6 deletions templates/intro-tune/README.ipynb
@@ -47,7 +47,7 @@
"\n",
"You should see during the run a table of the trials created by Tune. One trial is created for each individual value of `x` in the grid sweep. The table shows where the trial was run in the cluster, how long the trial took, and reported metrics:\n",
"\n",
-"<img src=\"assets/tune-status.png\" width=800px/>\n",
+"<img src=\"assets/tune-status.png\" width=800px />\n",
"\n",
"On completion, it returns a `ResultGrid` object that captures the experiment results. This includes the reported trial metrics, the path where trial results are saved:\n",
"\n",
@@ -73,7 +73,7 @@
"\n",
"To view the stdout and stderr of the trial, use the ``Logs`` tab in the Workspace UI. Navigate to the log page and search for \"hello\", and you'll be able to see the logs printed for each trial run in the cluster:\n",
"\n",
-"<img src=\"assets/tune-logs.png\" width=800px/>\n",
+"<img src=\"assets/tune-logs.png\" width=800px />\n",
"\n",
"Tune also saves a number of input and output metadata files for each trial to storage, you can view them by querying the returned result object:\n",
"- ``params.json``: The input parameters of the trial\n",
@@ -258,7 +258,7 @@
"source": [
"During and after the execution, Tune reports a table of current trial status and reported accuracy. You can find the configuration that achieves the highest accuracy on the validation set:\n",
"\n",
-"<img src=\"assets/tune-output.png\" width=600px/>\n"
+"<img src=\"assets/tune-output.png\" width=600px />\n"
]
},
{
@@ -292,15 +292,15 @@
"\n",
"First, let's view the run in the Jobs sub-tab and click through to into the job view. Here, you can see an overview of the job, and the status of the individual actors Tune has launched to parallelize the job:\n",
"\n",
-"<img src=\"assets/tune-jobs-1.png\" width=800px/>\n",
+"<img src=\"assets/tune-jobs-1.png\" width=800px />\n",
"\n",
"You can further click through to the actors sub-page and view the status of individual running actors. Inspect trial logs, CPU profiles, and memory profiles using this page:\n",
"\n",
-"<img src=\"assets/tune-jobs-2.png\" width=800px/>\n",
+"<img src=\"assets/tune-jobs-2.png\" width=800px />\n",
"\n",
"Finally, we can observe the holistic execution of the job in the cluster in the Metrics sub-tab. When running the above job on a 36-CPU cluster, we can see that Tune was able to launch ~16 concurrent actors for trial execution, with each actor assigned 2 CPU slots as configured:\n",
"\n",
-"<img src=\"assets/tune-metrics.png\" width=800px/>\n"
+"<img src=\"assets/tune-metrics.png\" width=800px />\n"
]
},
{

0 comments on commit c6c3b7b
