From c5c645bc7632bce3de503a3ec130e23b0a97a385 Mon Sep 17 00:00:00 2001 From: inardini Date: Tue, 6 Feb 2024 15:43:52 +0000 Subject: [PATCH] add notebook and reference --- gemini/README.md | 9 + .../evaluate_gemini_with_autosxs.ipynb | 932 ++++++++++++++++++ 2 files changed, 941 insertions(+) create mode 100644 gemini/evaluation/evaluate_gemini_with_autosxs.ipynb diff --git a/gemini/README.md b/gemini/README.md index 0c0b3c487b..0b596a0817 100644 --- a/gemini/README.md +++ b/gemini/README.md @@ -98,5 +98,14 @@ The notebooks and samples in this folder focus on using the **Vertex AI SDK for Learning resources (e.g. blogs, Youtube playlists) about Generative AI on Google Cloud Resources (e.g. videos, blogposts, learning paths) + + + +
+ RESOURCES.md + + Learn how to evaluate Gemini with Vertex AI Model Evaluation for GenAI + Sample notebooks + diff --git a/gemini/evaluation/evaluate_gemini_with_autosxs.ipynb b/gemini/evaluation/evaluate_gemini_with_autosxs.ipynb new file mode 100644 index 0000000000..c12356fdcc --- /dev/null +++ b/gemini/evaluation/evaluate_gemini_with_autosxs.ipynb @@ -0,0 +1,932 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ur8xi4C7S06n" + }, + "outputs": [], + "source": [ + "# Copyright 2023 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JAPoU8Sm5E6e" + }, + "source": [ + "# Evaluate LLMs with AutoSxS Model Eval\n", + "\n", + "\n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \"Colab Run in Colab\n", + " \n", + " \n", + " \n", + " \"GitHub\n", + " View on GitHub\n", + " \n", + " \n", + " \n", + " \"Vertex\n", + " Open in Vertex AI Workbench\n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "24743cf4a1e1" + }, + "source": [ + "**_NOTE_**: This notebook has been tested in the following environment:\n", + "\n", + "* Python version = 3.9" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvgnzT1CKxrO" + }, + "source": [ + "## Overview\n", + "\n", + "Vertex AI Model Evaluation AutoSxS is an LLM evaluation tool, which enables users to compare the performance of Google-first-party and Third-party LLMs.\n", + "\n", + "As part of preview release, AutoSxS only support comparing models for `summarization` and `question answering` tasks according to some predefined criteria. Check out the [documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/models/side-by-side-eval#autosxs) to know more.\n", + "\n", + "\n", + "This tutorial demostrates how to use Vertex AI Model Evaluation AutoSxS for comparing models and check human alignment in a summarization task." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d975e698c9a4" + }, + "source": [ + "### Objective\n", + "\n", + "In this tutorial, you learn how to use Vertex AI Model Evaluation AutoSxS to compare two LLMs predictions (one of the model is Gemini-Pro) in a summarization task.\n", + "\n", + "This tutorial uses the following Google Cloud ML services and resources:\n", + "\n", + "- Vertex AI Model Evaluation\n", + "\n", + "The steps performed include:\n", + "\n", + "- Read evaluation data\n", + "- Define the AutoSxS model evaluation pipeline\n", + "- Run the evaluation pipeline job\n", + "- Check the judgments, aggregated metrics and human alignment" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "08d289fa873f" + }, + "source": [ + "### Dataset\n", + "\n", + "The dataset is a modified sample of the [XSum](https://huggingface.co/datasets/EdinburghNLP/xsum) dataset for evaluation of abstractive single-document summarization systems." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aed92deeb4a0" + }, + "source": [ + "### Costs\n", + "\n", + "This tutorial uses billable components of Google Cloud:\n", + "\n", + "* Vertex AI\n", + "* Cloud Storage\n", + "\n", + "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i7EUnXsZhAGF" + }, + "source": [ + "## Installation\n", + "\n", + "Install the following packages required to execute this notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kc4WxYmLSBW5" + }, + "outputs": [], + "source": [ + "! pip3 install --user --upgrade --quiet google-cloud-aiplatform" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "58707a750154" + }, + "source": [ + "### Colab only: Uncomment the following cell to restart the kernel." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "f200f10a1da3" + }, + "outputs": [], + "source": [ + "# import IPython\n", + "\n", + "# app = IPython.Application.instance()\n", + "# app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BF1j6f9HApxa" + }, + "source": [ + "## Before you begin\n", + "\n", + "### Set up your Google Cloud project\n", + "\n", + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", + "\n", + "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", + "\n", + "3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n", + "\n", + "4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WReHDGG5g0XY" + }, + "source": [ + "#### Set your project ID\n", + "\n", + "**If you don't know your project ID**, try the following:\n", + "* Run `gcloud config list`.\n", + "* Run `gcloud projects list`.\n", + "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oM1iC_MfAts1" + }, + "outputs": [], + "source": [ + "PROJECT_ID = \"your-project-id\" # @param {type:\"string\"}\n", + "\n", + "# Set the project id\n", + "! gcloud config set project {PROJECT_ID}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "region" + }, + "source": [ + "#### Region\n", + "\n", + "You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bTeyVM5oRObw" + }, + "outputs": [], + "source": [ + "REGION = \"us-central1\" # @param {type: \"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "timestamp" + }, + "source": [ + "### UUID\n", + "\n", + "We define a UUID generation function to avoid resource name collisions on resources created within the notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "84Vdv7R-QEH6" + }, + "outputs": [], + "source": [ + "import random\n", + "import string\n", + "\n", + "\n", + "def generate_uuid(length: int = 8) -> str:\n", + " \"\"\"Generate a uuid of a specifed length (default=8).\"\"\"\n", + " return \"\".join(random.choices(string.ascii_lowercase + string.digits, k=length))\n", + "\n", + "\n", + "UUID = generate_uuid()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sBCra4QMA2wR" + }, + "source": [ + "### Authenticate your Google Cloud account\n", + "\n", + "Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "74ccc9e52986" + }, + "source": [ + "**1. Vertex AI Workbench**\n", + "* Do nothing as you are already authenticated." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "de775a3773ba" + }, + "source": [ + "**2. 
Local JupyterLab instance, uncomment and run:**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "254614fa0c46" + }, + "outputs": [], + "source": [ + "# ! gcloud auth login" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ef21552ccea8" + }, + "source": [ + "**3. Colab, uncomment and run:**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "603adbbf0532" + }, + "outputs": [], + "source": [ + "# from google.colab import auth\n", + "# auth.authenticate_user()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f6b2ccc891ed" + }, + "source": [ + "**4. Service account or other**\n", + "* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zgPO1eR3CYjk" + }, + "source": [ + "### Create a Cloud Storage bucket\n", + "\n", + "Create a storage bucket to store intermediate artifacts such as datasets.\n", + "\n", + "- *{Note to notebook author: For any user-provided strings that need to be unique (like bucket names or model ID's), append \"-unique\" to the end so proper testing can occur}*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MzGDU7TWdts_" + }, + "outputs": [], + "source": [ + "BUCKET_URI = \"gs://autosxs-demo\" # @param {type:\"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-EcIXiGsCePi" + }, + "source": [ + "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NIq7R4HZCfIc" + }, + "outputs": [], + "source": [ + "! 
gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "960505627ddf" + }, + "source": [ + "### Import libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PyQmSRbKA8r-" + }, + "outputs": [], + "source": [ + "import os\n", + "import pprint\n", + "\n", + "import pandas as pd\n", + "from google.cloud import aiplatform\n", + "from google.protobuf.json_format import MessageToDict\n", + "from IPython.display import HTML, display" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fQnzv1Gs_cwN" + }, + "source": [ + "### Define constant" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "I9JsWvCz_eo0" + }, + "outputs": [], + "source": [ + "EVALUATION_FILE_URI = (\n", + " \"gs://github-repo/evaluate-gemini/sum_eval_gemini_dataset_001.jsonl\"\n", + ")\n", + "HUMAN_EVALUATION_FILE_URI = (\n", + " \"gs://github-repo/evaluate-gemini/sum_human_eval_gemini_dataset_001.jsonl\"\n", + ")\n", + "TEMPLATE_URI = \"https://us-kfp.pkg.dev/ml-pipeline/llm-rlhf/autosxs-template/2.8.0\"\n", + "PIPELINE_ROOT = f\"{BUCKET_URI}/pipeline\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NRkfTNeaHbZd" + }, + "source": [ + "### Helpers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ivbHUDiEHd2Q" + }, + "outputs": [], + "source": [ + "def print_autosxs_judgments(df, n=3):\n", + " \"\"\"Print AutoSxS judgments in the notebook\"\"\"\n", + "\n", + " style = \"white-space: pre-wrap; width: 800px; overflow-x: auto;\"\n", + " df = df.sample(n=n)\n", + "\n", + " for index, row in df.iterrows():\n", + " if row[\"confidence\"] >= 0.5:\n", + " display(\n", + " HTML(\n", + " f\"
<h2>Document:</h2> <div style='{style}'>{row['id_columns']['document']}</div>\"\n",
"                )\n",
"            )\n",
"            display(\n",
"                HTML(\n",
"                    f\"<h2>Response A:</h2> <div style='{style}'>{row['response_a']}</div>\"\n",
"                )\n",
"            )\n",
"            display(\n",
"                HTML(\n",
"                    f\"<h2>Response B:</h2> <div style='{style}'>{row['response_b']}</div>\"\n",
"                )\n",
"            )\n",
"            display(\n",
"                HTML(\n",
"                    f\"<h2>Explanation:</h2> <div style='{style}'>{row['explanation']}</div>\"\n",
"                )\n",
"            )\n",
"            display(\n",
"                HTML(\n",
"                    f\"<h2>Confidence score:</h2> <div style='{style}'>{row['confidence']}</div>\"\n",
"                )\n",
"            )\n",
"            display(HTML(\"<hr>\"))\n",
"\n",
"\n",
"def print_aggregated_metrics(scores):\n",
"    \"\"\"Print AutoSxS aggregated metrics\"\"\"\n",
"\n",
"    score_b = round(scores[\"autosxs_model_b_win_rate\"] * 100)\n",
"    display(\n",
"        HTML(\n",
"            f\"<h3>AutoSxS Autorater prefers Model B over Model A {score_b}% of the time.</h3>\"\n",
"        )\n",
"    )\n",
"\n",
"\n",
"def print_human_preference_metrics(metrics):\n",
"    \"\"\"Print AutoSxS human-preference alignment metrics\"\"\"\n",
"    display(\n",
"        HTML(\n",
"            f\"<h3>Human-preference alignment metrics: {metrics}</h3>\"\n",
"        )\n",
"    )"
   ]
  },
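The helper functions above assume specific fields in the AutoSxS pipeline outputs. As an optional sanity check, you can exercise them with a tiny hand-built sample before the pipeline runs. This is an illustrative sketch only: the field names (`id_columns`, `response_a`, `response_b`, `explanation`, `confidence`, `autosxs_model_b_win_rate`) mirror the outputs used later in this notebook, but the values are made up.

```python
import pandas as pd

# Illustrative-only row shaped like the AutoSxS judgments table.
_sample_judgments = pd.DataFrame(
    [
        {
            "id_columns": {"id": 1, "document": "A short demo article."},
            "response_a": "Summary A.",
            "response_b": "Summary B.",
            "explanation": "Response B covers the key points more completely.",
            "confidence": 0.9,
        }
    ]
)

print_autosxs_judgments(_sample_judgments, n=1)
print_aggregated_metrics({"autosxs_model_b_win_rate": 0.75})
```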
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "init_aip:mbsdk,all"
   },
   "source": [
    "### Initialize Vertex AI SDK for Python\n",
    "\n",
    "Initialize the Vertex AI SDK for Python for your project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "j4KEcQEWROby"
   },
   "outputs": [],
   "source": [
    "aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qoiqIyiMvc3n"
   },
   "source": [
    "## Evaluate LLMs using Vertex AI Model Evaluation AutoSxS\n",
    "\n",
    "Suppose you've already obtained LLM-generated predictions for a summarization task. To evaluate an LLM such as Gemini Pro on Vertex AI against another model using AutoSxS, follow these steps:\n",
    "\n",
    "1. **Prepare the Evaluation Dataset**: Gather the prompts, contexts, generated responses and human preferences required for the evaluation.\n",
    "\n",
    "2. **Convert the Evaluation Dataset:** Convert the dataset into JSONL format and store it in a Cloud Storage bucket. Alternatively, you can save the dataset to a BigQuery table.\n",
    "\n",
    "3. **Run a Model Evaluation Job:** Use Vertex AI to run a model evaluation job to assess the performance of the LLMs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZEIlO0eHbsQh"
   },
   "source": [
    "### Read the evaluation data\n",
    "\n",
    "In this summarization use case, you use `sum_eval_gemini_dataset_001`, a JSONL-formatted evaluation dataset that contains content-response pairs without human preferences.\n",
    "\n",
    "In the dataset, each row represents a single example. The dataset includes ID fields, such as \"id\" and \"document\", which are used to identify each unique example. The \"document\" field contains the newspaper articles to be summarized.\n",
    "\n",
    "While the dataset does not have [data fields](https://cloud.google.com/vertex-ai/docs/generative-ai/models/side-by-side-eval#prep-eval-dataset) for prompts and contexts, it does include pre-generated predictions. These predictions contain the generated responses for the LLM task, with \"response_a\" and \"response_b\" representing different article summaries.\n",
    "\n",
    "**Note: For experimentation, you can provide only a few examples. The documentation recommends at least 400 examples to ensure high-quality aggregate metrics.**\n"
   ]
  },
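To make the expected schema concrete, here is a hedged sketch of what a single record in this JSONL file could look like. The field names match the ID and response columns referenced by the pipeline parameters below; the article text and summaries are invented for illustration.

```python
# Illustrative example of one evaluation record (values are made up;
# the keys match the id and response columns used by AutoSxS below).
example_record = {
    "id": 1,
    "document": "Full text of the newspaper article to be summarized ...",
    "response_a": "Candidate summary produced by the first model.",
    "response_b": "Candidate summary produced by the second model.",
}
```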
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "R-_ettKRxfxT"
   },
   "outputs": [],
   "source": [
    "evaluation_gemini_df = pd.read_json(EVALUATION_FILE_URI, lines=True)\n",
    "evaluation_gemini_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1lZHraNFkDz8"
   },
   "source": [
    "### Run a model evaluation job\n",
    "\n",
    "AutoSxS relies on Vertex AI pipelines to run the model evaluation. Here are some of the required pipeline parameters:\n",
    "\n",
    "* `evaluation_dataset` to indicate the location of the evaluation dataset. In this case, it is the Cloud Storage URI of the JSONL file.\n",
    "\n",
    "* `id_columns` to distinguish evaluation examples that are unique. Here, you have the `id` and `document` fields.\n",
    "\n",
    "* `task` to indicate the task type you want to evaluate, in `{task}@{version}` form. It can be `summarization` or `question_answer`. In this case, it is `summarization`.\n",
    "\n",
    "* `autorater_prompt_parameters` to configure the autorater task behaviour. You can specify inference instructions to guide task completion, as well as the inference context to refer to during task execution.\n",
    "\n",
    "Lastly, you have to provide `response_column_a` and `response_column_b` with the names of the columns containing the predefined predictions, in order to calculate the evaluation metrics. In this case, they are `response_a` and `response_b`, respectively.\n",
    "\n",
    "To learn more about all supported parameters, see the [official documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/models/side-by-side-eval#perform-eval).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Cp7e-hOmNMhA"
   },
   "outputs": [],
   "source": [
    "display_name = f\"autosxs-eval-{generate_uuid()}\"\n",
    "parameters = {\n",
    "    \"evaluation_dataset\": EVALUATION_FILE_URI,\n",
    "    \"id_columns\": [\"id\", \"document\"],\n",
    "    \"task\": \"summarization\",\n",
    "    \"autorater_prompt_parameters\": {\n",
    "        \"inference_context\": {\"column\": \"document\"},\n",
    "        \"inference_instruction\": {\"template\": \"Summarize the following text: \"},\n",
    "    },\n",
    "    \"response_column_a\": \"response_a\",\n",
    "    \"response_column_b\": \"response_b\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Bp0YIvSv-zhB"
   },
   "source": [
    "After you define the model evaluation parameters, you can run a model evaluation pipeline job using the predefined pipeline template with the Vertex AI Python SDK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AjFHT5ze9m4L"
   },
   "outputs": [],
   "source": [
    "job = aiplatform.PipelineJob(\n",
    "    job_id=display_name,\n",
    "    display_name=display_name,\n",
    "    pipeline_root=os.path.join(BUCKET_URI, display_name),\n",
    "    template_path=TEMPLATE_URI,\n",
    "    parameter_values=parameters,\n",
    "    enable_caching=False,\n",
    ")\n",
    "job.run(sync=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PunFdLfqGh0e"
   },
   "source": [
    "### Evaluate the results\n",
    "\n",
    "After the evaluation pipeline runs successfully, you can review the evaluation results both as artifacts generated by the pipeline in the Vertex AI Pipelines UI and in the notebook environment using the Vertex AI Python SDK.\n",
    "\n",
    "AutoSxS produces three types of evaluation results: a judgments table, aggregated metrics, and alignment metrics (if human preferences are provided).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "J4ShokOI9FDI"
   },
   "source": [
    "### AutoSxS Judgments\n",
    "\n",
    "The judgments table contains metrics that offer insights into LLM performance for each example.\n",
    "\n",
    "For each response pair, the judgments table includes a `choice` column indicating the better response based on the evaluation criteria used by the autorater.\n",
    "\n",
    "Each choice has a `confidence` score between 0 and 1, representing the autorater's level of confidence in the evaluation.\n",
    "\n",
    "Last but not least, AutoSxS provides an explanation for why the autorater preferred one response over the other.\n",
    "\n",
    "Below is an example of the AutoSxS judgments output."
   ]
  },
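As a cross-check on the aggregate metrics reported later, you can also derive a quick win rate directly from the judgments table once it is loaded in the next cell. This is a minimal sketch that assumes the `choice` column stores the labels `"A"` and `"B"`; adjust the label values to whatever your judgments table actually contains.

```python
def quick_win_rate(df, model: str = "B", min_confidence: float = 0.5) -> float:
    """Share of sufficiently confident judgments that prefer the given model."""
    confident = df[df["confidence"] >= min_confidence]
    if confident.empty:
        return float("nan")
    return float((confident["choice"] == model).mean())

# Example usage (after the judgments table is loaded below):
# quick_win_rate(judgments_df, model="B")
```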
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "MakdmpYCmehF"
   },
   "outputs": [],
   "source": [
    "for details in job.task_details:\n",
    "    if details.task_name == \"autosxs-arbiter\":\n",
    "        break\n",
    "\n",
    "judgments_uri = MessageToDict(details.outputs[\"judgments\"]._pb)[\"artifacts\"][0][\"uri\"]\n",
    "judgments_df = pd.read_json(judgments_uri, lines=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_gyM2-i3HHnP"
   },
   "outputs": [],
   "source": [
    "print_autosxs_judgments(judgments_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tJ5PJ9x69KrC"
   },
   "source": [
    "### AutoSxS Aggregate metrics\n",
    "\n",
    "AutoSxS also provides aggregated metrics as an additional evaluation result. These win-rate metrics are calculated from the judgments table by determining the percentage of times the autorater preferred one model's response.\n",
    "\n",
    "These metrics are useful for quickly finding out which model performs best in the context of the evaluated task.\n",
    "\n",
    "Below is an example of the AutoSxS aggregate metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "w2RISjQSJk9R"
   },
   "outputs": [],
   "source": [
    "for details in job.task_details:\n",
    "    if details.task_name == \"autosxs-metrics-computer\":\n",
    "        break\n",
    "\n",
    "win_rate_metrics = MessageToDict(details.outputs[\"autosxs_metrics\"]._pb)[\"artifacts\"][\n",
    "    0\n",
    "][\"metadata\"]\n",
    "print_aggregated_metrics(win_rate_metrics)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_5mYmHj6poXz"
   },
   "source": [
    "### Human-preference alignment metrics\n",
    "\n",
    "After reviewing the results of your initial AutoSxS evaluation, you may wonder how well the autorater's assessments align with the views of human raters.\n",
    "\n",
    "AutoSxS supports human preferences to validate the autorater's evaluation.\n",
    "\n",
    "To check alignment with a human-preference dataset, you need to add the ground truths as a column to the `evaluation_dataset` and pass the column name to `human_preference_column`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8vSedvz39-iu"
   },
   "source": [
    "#### Read the evaluation data\n",
    "\n",
    "In this case, the evaluation dataset `sum_human_eval_gemini_dataset_001` also includes human preferences.\n",
    "\n",
    "Below is a sample of the dataset."
   ]
  },
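Building on the record sketched earlier, the human-preference dataset simply adds one more column. The column name `actual` matches the `human_preference_column` passed to the pipeline below; the label format (`"A"`/`"B"`) is an assumption to adapt to your own annotations.

```python
# Illustrative record with a ground-truth human preference added.
example_record_with_preference = {
    "id": 1,
    "document": "Full text of the newspaper article to be summarized ...",
    "response_a": "Candidate summary produced by the first model.",
    "response_b": "Candidate summary produced by the second model.",
    "actual": "B",  # human-preferred response; label format is an assumption
}
```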
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mbfsO2uw9-i5"
   },
   "outputs": [],
   "source": [
    "human_evaluation_gemini_df = pd.read_json(HUMAN_EVALUATION_FILE_URI, lines=True)\n",
    "human_evaluation_gemini_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XpmAr6UX-Imb"
   },
   "source": [
    "#### Run a model evaluation job\n",
    "\n",
    "For the AutoSxS pipeline, you must also specify the human preference column in the pipeline parameters.\n",
    "\n",
    "Then, you can run the evaluation pipeline job using the Vertex AI Python SDK as shown below.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bFmvFt2a3MtN"
   },
   "outputs": [],
   "source": [
    "display_name = f\"autosxs-human-eval-{generate_uuid()}\"\n",
    "parameters = {\n",
    "    \"evaluation_dataset\": HUMAN_EVALUATION_FILE_URI,\n",
    "    \"id_columns\": [\"id\", \"document\"],\n",
    "    \"task\": \"summarization\",\n",
    "    \"autorater_prompt_parameters\": {\n",
    "        \"inference_context\": {\"column\": \"document\"},\n",
    "        \"inference_instruction\": {\"template\": \"Summarize the following text: \"},\n",
    "    },\n",
    "    \"response_column_a\": \"response_a\",\n",
    "    \"response_column_b\": \"response_b\",\n",
    "    \"human_preference_column\": \"actual\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "KbhIPY-_3SSB"
   },
   "outputs": [],
   "source": [
    "job = aiplatform.PipelineJob(\n",
    "    job_id=display_name,\n",
    "    display_name=display_name,\n",
    "    pipeline_root=os.path.join(BUCKET_URI, display_name),\n",
    "    template_path=TEMPLATE_URI,\n",
    "    parameter_values=parameters,\n",
    "    enable_caching=False,\n",
    ")\n",
    "job.run(sync=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QTzJ8BaWEusN"
   },
   "source": [
    "### Get human-aligned aggregated metrics\n",
    "\n",
    "Compared with the aggregated metrics you got before, the pipeline now returns additional measurements that utilize the human-preference data you provided.\n",
    "\n",
    "In addition to well-known metrics like accuracy, precision, and recall, you will receive both the human preference scores and the autorater preference scores. These scores indicate the level of agreement between the evaluations. To simplify this comparison, Cohen's Kappa is also provided. Cohen's Kappa measures the level of agreement between the autorater and the human raters. It ranges from -1 to 1, where 0 represents agreement equivalent to random chance and 1 indicates perfect agreement.\n",
    "\n",
    "Below is a view of the resulting human-aligned aggregated metrics.\n"
   ]
  },
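For intuition, Cohen's Kappa compares the observed agreement between the autorater and the human raters with the agreement expected by chance. A minimal sketch with made-up numbers:

```python
def cohens_kappa(p_observed: float, p_chance: float) -> float:
    """Cohen's Kappa from observed agreement and chance agreement rates."""
    return (p_observed - p_chance) / (1 - p_chance)

# Made-up example: 80% observed agreement vs. 50% agreement expected by chance.
print(cohens_kappa(0.80, 0.50))  # 0.6
```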
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "JLUOJFjA38ja"
   },
   "outputs": [],
   "source": [
    "for details in job.task_details:\n",
    "    if details.task_name == \"autosxs-metrics-computer\":\n",
    "        break\n",
    "\n",
    "human_aligned_metrics = {\n",
    "    k: round(v, 3)\n",
    "    for k, v in MessageToDict(details.outputs[\"autosxs_metrics\"]._pb)[\"artifacts\"][0][\n",
    "        \"metadata\"\n",
    "    ].items()\n",
    "}\n",
    "pprint.pprint(human_aligned_metrics)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TpV-iwP9qw9c"
   },
   "source": [
    "## Cleaning up\n",
    "\n",
    "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n",
    "\n",
    "Otherwise, you can delete the individual resources you created in this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "sx_vKniMq9ZX"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Delete Model Evaluation pipeline run\n",
    "delete_pipeline = True\n",
    "if delete_pipeline or os.getenv(\"IS_TESTING\"):\n",
    "    job.delete()\n",
    "\n",
    "# Delete Cloud Storage objects that were created\n",
    "delete_bucket = True\n",
    "if delete_bucket or os.getenv(\"IS_TESTING\"):\n",
    "    ! gsutil -m rm -r $BUCKET_URI"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "evaluate_gemini_with_autosxs.ipynb",
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}