diff --git a/gemini/evaluation/get_started_with_genai_model_eval_service.ipynb b/gemini/evaluation/get_started_with_genai_model_eval_service.ipynb
new file mode 100644
index 0000000000..d8ffaebe5d
--- /dev/null
+++ b/gemini/evaluation/get_started_with_genai_model_eval_service.ipynb
@@ -0,0 +1,1768 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ur8xi4C7S06n"
+ },
+ "outputs": [],
+ "source": [
+ "# Copyright 2024 Google LLC\n",
+ "#\n",
+ "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JAPoU8Sm5E6e"
+ },
+ "source": [
+ "# Get started with Generative AI evaluation service\n",
+ "\n",
+    "<!-- Open in Colab | Open in Colab Enterprise | Open in Workbench | View on GitHub -->"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sWXnzWLSmqm7"
+ },
+ "source": [
+ "| | |\n",
+ "|-|-|\n",
+ "|Author(s) | [Ivan Nardini](https://github.com/inardini)|"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tvgnzT1CKxrO"
+ },
+ "source": [
+ "## Overview\n",
+ "\n",
+ "Assessing the performance of Large Language Models (LLMs) remains a complex task, especially when it comes to integrating them into production systems. Unlike conventional software and non-generative machine learning models, evaluating LLMs is subjective, challenging to automate, and prone to highly visible errors.\n",
+ "\n",
+ "To tackle these challenges, Vertex AI offers a comprehensive evaluation framework through its Model Evaluation service. This framework encompasses the entire LLM lifecycle, from prompt engineering and model comparison to operationalizing automated model evaluation in production environments.\n",
+ "\n",
+ "Learn more about [Vertex AI Generative AI evaluation service](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluate-models)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "d975e698c9a4"
+ },
+ "source": [
+ "### Objective\n",
+ "\n",
+    "In this tutorial, you learn how to use the Vertex AI Model Evaluation framework to evaluate Gemini, PaLM 2, and Gemma on a summarization task.\n",
+ "\n",
+ "This tutorial uses the following Google Cloud ML services and resources:\n",
+ "\n",
+ "- Vertex AI Model Eval\n",
+ "- Vertex AI Pipelines\n",
+ "- Vertex AI Prediction\n",
+ "\n",
+ "The steps performed include:\n",
+ "\n",
+ "- Use Vertex AI Rapid Eval SDK to find the best prompt for a given model.\n",
+ "- Use Vertex AI Rapid Eval SDK to validate the best prompt across several models.\n",
+ "- Use Vertex AI Model Eval Pipeline service to measure performance and compare models with a more systematic evaluation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "08d289fa873f"
+ },
+ "source": [
+ "### Dataset\n",
+ "\n",
+ "The dataset is a modified sample of the [XSum](https://huggingface.co/datasets/EdinburghNLP/xsum) dataset for evaluation of abstractive single-document summarization systems."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aed92deeb4a0"
+ },
+ "source": [
+ "### Costs\n",
+ "\n",
+ "This tutorial uses billable components of Google Cloud:\n",
+ "\n",
+ "* Vertex AI\n",
+ "* Cloud Storage\n",
+ "\n",
+ "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "i7EUnXsZhAGF"
+ },
+ "source": [
+ "## Installation\n",
+ "\n",
+ "Install the following packages required to execute this notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "dMooEXPbhMw2"
+ },
+ "outputs": [],
+ "source": [
+ "! pip3 install --upgrade --quiet google-cloud-aiplatform[rapid_evaluation]==1.48.0\n",
+    "! pip3 install --upgrade --quiet datasets==2.18.0\n",
+    "! pip3 install --upgrade --quiet plotly==5.20.0\n",
+    "! pip3 install --upgrade --quiet nest-asyncio==1.6.0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "58707a750154"
+ },
+ "source": [
+ "### Colab only: Uncomment the following cell to restart the kernel."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "f200f10a1da3"
+ },
+ "outputs": [],
+ "source": [
+ "# import IPython\n",
+ "\n",
+ "# app = IPython.Application.instance()\n",
+ "# app.kernel.do_shutdown(True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BF1j6f9HApxa"
+ },
+ "source": [
+ "## Before you begin\n",
+ "\n",
+ "### Set up your Google Cloud project\n",
+ "\n",
+ "**The following steps are required, regardless of your notebook environment.**\n",
+ "\n",
+ "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
+ "\n",
+ "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n",
+ "\n",
+ "3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n",
+ "\n",
+ "4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WReHDGG5g0XY"
+ },
+ "source": [
+ "#### Set your project ID\n",
+ "\n",
+ "**If you don't know your project ID**, try the following:\n",
+ "* Run `gcloud config list`.\n",
+ "* Run `gcloud projects list`.\n",
+ "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "oM1iC_MfAts1"
+ },
+ "outputs": [],
+ "source": [
+ "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n",
+ "\n",
+ "# Set the project id\n",
+ "! gcloud config set project {PROJECT_ID}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "region"
+ },
+ "source": [
+ "#### Region\n",
+ "\n",
+ "You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "I6FmBV2_0fBP"
+ },
+ "outputs": [],
+ "source": [
+ "REGION = \"us-central1\" # @param {type: \"string\"}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sBCra4QMA2wR"
+ },
+ "source": [
+ "### Authenticate your Google Cloud account\n",
+ "\n",
+ "Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "74ccc9e52986"
+ },
+ "source": [
+ "**1. Vertex AI Workbench**\n",
+ "* Do nothing as you are already authenticated."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "de775a3773ba"
+ },
+ "source": [
+ "**2. Local JupyterLab instance, uncomment and run:**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "254614fa0c46"
+ },
+ "outputs": [],
+ "source": [
+ "# ! gcloud auth login"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ef21552ccea8"
+ },
+ "source": [
+ "**3. Colab, uncomment and run:**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "603adbbf0532"
+ },
+ "outputs": [],
+ "source": [
+ "from google.colab import auth\n",
+ "\n",
+ "auth.authenticate_user()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "f6b2ccc891ed"
+ },
+ "source": [
+ "**4. Service account or other**\n",
+ "* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zgPO1eR3CYjk"
+ },
+ "source": [
+ "### Create a Cloud Storage bucket\n",
+ "\n",
+ "Create a storage bucket to store intermediate artifacts such as datasets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MzGDU7TWdts_"
+ },
+ "outputs": [],
+ "source": [
+ "BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-EcIXiGsCePi"
+ },
+ "source": [
+ "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "NIq7R4HZCfIc"
+ },
+ "outputs": [],
+ "source": [
+ "! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "set_service_account"
+ },
+ "source": [
+ "#### Service Account\n",
+ "\n",
+    "**If you don't know your service account**, you can retrieve it with the `gcloud` command by running the second cell below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ssUJJqXJJHgC"
+ },
+ "outputs": [],
+ "source": [
+ "SERVICE_ACCOUNT = \"[your-service-account]\" # @param {type:\"string\"}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "autoset_service_account"
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import sys\n",
+ "\n",
+ "IS_COLAB = \"google.colab\" in sys.modules\n",
+ "if (\n",
+ " SERVICE_ACCOUNT == \"\"\n",
+ " or SERVICE_ACCOUNT is None\n",
+ " or SERVICE_ACCOUNT == \"[your-service-account]\"\n",
+ "):\n",
+ " # Get your service account from gcloud\n",
+ " if not IS_COLAB:\n",
+ " shell_output = !gcloud auth list 2>/dev/null\n",
+ " SERVICE_ACCOUNT = shell_output[2].replace(\"*\", \"\").strip()\n",
+ "\n",
+ " if IS_COLAB:\n",
+ " shell_output = ! gcloud projects describe $PROJECT_ID\n",
+ " project_number = shell_output[-1].split(\":\")[1].strip().replace(\"'\", \"\")\n",
+ " SERVICE_ACCOUNT = f\"{project_number}-compute@developer.gserviceaccount.com\"\n",
+ "\n",
+ " print(\"Service Account:\", SERVICE_ACCOUNT)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "set_service_account:pipelines"
+ },
+ "source": [
+ "#### Set service account access\n",
+ "\n",
+    "Run the following commands to grant your service account access to the Cloud Storage bucket."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "bkmKf1RDJHgD"
+ },
+ "outputs": [],
+ "source": [
+ "! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI\n",
+ "\n",
+ "! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ek1-iTbPjzdJ"
+ },
+ "source": [
+ "### Set tutorial folder\n",
+ "\n",
+ "Set a folder to collect data and any tutorial artifacts."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "BbfKRabXj3la"
+ },
+ "outputs": [],
+ "source": [
+    "from pathlib import Path\n",
+    "\n",
+    "root_path = Path.cwd()\n",
+ "tutorial_path = root_path / \"tutorial\"\n",
+ "data_path = tutorial_path / \"data\"\n",
+ "\n",
+ "data_path.mkdir(parents=True, exist_ok=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "960505627ddf"
+ },
+ "source": [
+ "### Import libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "PyQmSRbKA8r-"
+ },
+ "outputs": [],
+ "source": [
+ "# General\n",
+ "import logging\n",
+ "import random\n",
+ "import string\n",
+ "import warnings\n",
+ "from typing import Tuple, List\n",
+ "\n",
+ "# GenAI Evaluation\n",
+ "import datasets\n",
+ "import nest_asyncio\n",
+ "import pandas as pd\n",
+ "import plotly.graph_objects as go\n",
+ "import vertexai\n",
+ "from google.cloud import aiplatform\n",
+ "from google.protobuf.json_format import MessageToDict\n",
+ "from IPython.display import HTML, Markdown, display\n",
+ "from tqdm import tqdm\n",
+ "from vertexai.generative_models import GenerativeModel, HarmBlockThreshold, HarmCategory\n",
+    "from vertexai.language_models import EvaluationTextSummarizationSpec, TextGenerationModel\n",
+ "from vertexai.preview.evaluation import EvalTask, make_metric"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2mnF6ZiHw1l6"
+ },
+ "source": [
+ "### Libraries settings\n",
+ "\n",
+    "Configure warnings, logging, and Hugging Face datasets settings for this tutorial."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "JXctBVYrw5Zc"
+ },
+ "outputs": [],
+ "source": [
+ "warnings.filterwarnings(\"ignore\")\n",
+ "nest_asyncio.apply()\n",
+ "datasets.disable_progress_bar()\n",
+ "logging.getLogger(\"urllib3.connectionpool\").setLevel(logging.ERROR)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "init_aip:mbsdk,all"
+ },
+ "source": [
+ "### Initialize Vertex AI SDK for Python\n",
+ "\n",
+ "Initialize the Vertex AI SDK for Python for your project."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "bQMc2Uwf0fBQ"
+ },
+ "outputs": [],
+ "source": [
+ "vertexai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gxc7q4r-DFH4"
+ },
+ "source": [
+ "### Define constants\n",
+ "\n",
+    "Define the Cloud Storage URIs of the evaluation datasets, the AutoSxS pipeline template, and the pipeline root to use in this tutorial."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "0Y5t67f3DHNm"
+ },
+ "outputs": [],
+ "source": [
+ "AUTO_METRICS_EVALUATION_FILE_URI = (\n",
+ " \"gs://github-repo/evaluate-gemini/sum_eval_palm_dataset_001.jsonl\"\n",
+ ")\n",
+ "\n",
+ "AUTOSXS_EVALUATION_FILE_URI = (\n",
+ " \"gs://github-repo/evaluate-gemini/sum_eval_gemini_dataset_001.jsonl\"\n",
+ ")\n",
+ "\n",
+ "AUTO_SXS_TEMPLATE_URI = (\n",
+ " \"https://us-kfp.pkg.dev/ml-pipeline/google-cloud-registry/autosxs-template/2.11.0\"\n",
+ ")\n",
+ "PIPELINE_ROOT = f\"{BUCKET_URI}/pipeline\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oh6CJp8XCheE"
+ },
+ "source": [
+ "### Helper functions\n",
+ "\n",
+    "Define some helper functions to display evaluation results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "gT_OJBHfCg4Q"
+ },
+ "outputs": [],
+ "source": [
+ "def generate_uuid(length: int = 8) -> str:\n",
+    "    \"\"\"Generate a uuid of a specified length (default=8).\"\"\"\n",
+ " return \"\".join(random.choices(string.ascii_lowercase + string.digits, k=length))\n",
+ "\n",
+ "\n",
+ "def display_eval_report(\n",
+ " eval_result: Tuple[str, dict, pd.DataFrame], metrics: List[str] = None\n",
+ ") -> None:\n",
+ " \"\"\"Display the evaluation results.\"\"\"\n",
+ "\n",
+ " title, summary_metrics, report_df = eval_result\n",
+ " metrics_df = pd.DataFrame.from_dict(summary_metrics, orient=\"index\").T\n",
+ " if metrics:\n",
+ " metrics_df = metrics_df.filter(\n",
+ " [\n",
+ " metric\n",
+ " for metric in metrics_df.columns\n",
+ " if any(selected_metric in metric for selected_metric in metrics)\n",
+ " ]\n",
+ " )\n",
+ " report_df = report_df.filter(\n",
+ " [\n",
+ " metric\n",
+ " for metric in report_df.columns\n",
+ " if any(selected_metric in metric for selected_metric in metrics)\n",
+ " ]\n",
+ " )\n",
+ "\n",
+ " # Display the title with Markdown for emphasis\n",
+ " display(Markdown(f\"## {title}\"))\n",
+ "\n",
+ " # Display the metrics DataFrame\n",
+ " display(Markdown(\"### Summary Metrics\"))\n",
+ " display(metrics_df)\n",
+ "\n",
+ " # Display the detailed report DataFrame\n",
+ " display(Markdown(\"### Report Metrics\"))\n",
+ " display(report_df)\n",
+ "\n",
+ "\n",
+ "def display_explanations(\n",
+ " df: pd.DataFrame, metrics: List[str] = None, n: int = 1\n",
+ ") -> None:\n",
+ " \"\"\"Display the explanations for the evaluation results.\"\"\"\n",
+ "\n",
+ " # Set the style\n",
+ " style = \"white-space: pre-wrap; width: 800px; overflow-x: auto;\"\n",
+ "\n",
+ " # Sample the DataFrame\n",
+ " df = df.sample(n=n)\n",
+ "\n",
+ " # Filter the DataFrame based on the selected metrics\n",
+ " if metrics:\n",
+ " df = df.filter(\n",
+ " [\"context\", \"reference\", \"completed_prompt\", \"response\"]\n",
+ " + [\n",
+ " metric\n",
+ " for metric in df.columns\n",
+ " if any(selected_metric in metric for selected_metric in metrics)\n",
+ " ]\n",
+ " )\n",
+ "\n",
+ " # Display the explanations\n",
+ " for index, row in df.iterrows():\n",
+ " for col in df.columns:\n",
+    "            display(HTML(f\"<div style='{style}'><b>{col}:</b> {row[col]}</div>\"))\n",
+    "        display(HTML(\"<hr>\"))\n",
+ "\n",
+ "\n",
+ "def plot_radar_plot(eval_results, metrics=None):\n",
+ " \"\"\"Plot a radar plot for the evaluation results.\"\"\"\n",
+ "\n",
+ " # Set the figure\n",
+ " fig = go.Figure()\n",
+ "\n",
+ " # Create the radar plot for the evaluation metrics\n",
+ " for eval_result in eval_results:\n",
+ " title, summary_metrics, report_df = eval_result\n",
+ "\n",
+ " if metrics:\n",
+ " summary_metrics = {\n",
+ " k: summary_metrics[k]\n",
+ " for k, v in summary_metrics.items()\n",
+ " if any(selected_metric in k for selected_metric in metrics)\n",
+ " }\n",
+ "\n",
+ " fig.add_trace(\n",
+ " go.Scatterpolar(\n",
+ " r=list(summary_metrics.values()),\n",
+ " theta=list(summary_metrics.keys()),\n",
+ " fill=\"toself\",\n",
+ " name=title,\n",
+ " )\n",
+ " )\n",
+ "\n",
+ " # Update figure layout\n",
+ " fig.update_layout(\n",
+ " polar=dict(radialaxis=dict(visible=True, range=[0, 5])), showlegend=True\n",
+ " )\n",
+ "\n",
+ " fig.show()\n",
+ "\n",
+ "\n",
+ "def plot_bar_plot(\n",
+ " eval_results: Tuple[str, dict, pd.DataFrame], metrics: List[str] = None\n",
+ ") -> None:\n",
+ " \"\"\"Plot a bar plot for the evaluation results.\"\"\"\n",
+ "\n",
+ " # Create data for the bar plot\n",
+ " data = []\n",
+ " for eval_result in eval_results:\n",
+ " title, summary_metrics, _ = eval_result\n",
+ " if metrics:\n",
+ " summary_metrics = {\n",
+ " k: summary_metrics[k]\n",
+ " for k, v in summary_metrics.items()\n",
+ " if any(selected_metric in k for selected_metric in metrics)\n",
+ " }\n",
+ "\n",
+ " data.append(\n",
+ " go.Bar(\n",
+ " x=list(summary_metrics.keys()),\n",
+ " y=list(summary_metrics.values()),\n",
+ " name=title,\n",
+ " )\n",
+ " )\n",
+ "\n",
+ " # Update the figure with the data\n",
+ " fig = go.Figure(data=data)\n",
+ "\n",
+ " # Change the bar mode\n",
+ " fig.update_layout(barmode=\"group\")\n",
+ " fig.show()\n",
+ "\n",
+ "\n",
+ "def print_aggregated_metrics(job: aiplatform.PipelineJob) -> None:\n",
+ " \"\"\"Print AutoMetrics\"\"\"\n",
+ "\n",
+ " # Collect rougeLSum\n",
+    "    rougeLSum = round(job.rougeLSum * 100, 1)\n",
+ "\n",
+ " # Display the metric\n",
+ " display(\n",
+ " HTML(\n",
+    "            f\"{rougeLSum}% of the reference summary is reproduced by the LLM response, based on the longest common subsequence (LCS) of words.\"\n",
+ " )\n",
+ " )\n",
+ "\n",
+ "\n",
+ "def print_autosxs_judgments(df: pd.DataFrame, n: int = 3):\n",
+ " \"\"\"Print AutoSxS judgments\"\"\"\n",
+ "\n",
+ " # Set the style\n",
+ " style = \"white-space: pre-wrap; width: 800px; overflow-x: auto;\"\n",
+ "\n",
+ " # Sample the dataframe\n",
+ " df = df.sample(n=n)\n",
+ "\n",
+ " # Display the autorater explanations\n",
+ " for index, row in df.iterrows():\n",
+ " if row[\"confidence\"] >= 0.5:\n",
+    "            display(HTML(f\"<div style='{style}'><b>Document:</b> {row['document']}</div>\"))\n",
+    "            display(HTML(f\"<div style='{style}'><b>Response A:</b> {row['response_a']}</div>\"))\n",
+    "            display(HTML(f\"<div style='{style}'><b>Response B:</b> {row['response_b']}</div>\"))\n",
+    "            display(HTML(f\"<div style='{style}'><b>Explanation:</b> {row['explanation']}</div>\"))\n",
+    "            display(HTML(f\"<div style='{style}'><b>Confidence score:</b> {row['confidence']}</div>\"))\n",
+    "            display(HTML(\"<hr>\"))\n",
+ "\n",
+ "\n",
+ "def print_autosxs_win_metrics(scores: dict) -> None:\n",
+ " \"\"\"Print AutoSxS aggregated metrics\"\"\"\n",
+ "\n",
+ " score_b = round(scores[\"autosxs_model_b_win_rate\"] * 100)\n",
+ " display(\n",
+ " HTML(\n",
+    "            f\"The AutoSxS autorater prefers Model B over Model A {score_b}% of the time.\"\n",
+ " )\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_FD1yJjuF3_g"
+ },
+ "source": [
+    "### Initialize LLMs\n",
+ "\n",
+ "Initialize LLMs to evaluate."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "RF1VlV0fJhCK"
+ },
+ "outputs": [],
+ "source": [
+ "llm1_model = GenerativeModel(\"gemini-pro\")\n",
+ "llm2_model = TextGenerationModel.from_pretrained(\"text-bison@002\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YfxiQXYN6YEt"
+ },
+ "source": [
+ "## Vertex AI Model Evaluation for prompt engineering and model comparison using Rapid Eval SDK\n",
+ "\n",
+ "To create more effective prompts that generate better output, you need to repeatedly test different prompts and interact with the LLMs multiple times to evaluate and validate your prompts.\n",
+ "\n",
+    "The Rapid evaluation service allows you to evaluate prompts on demand using small data batches. You can use both predefined and custom metrics, and you can feed the evaluation outputs into downstream visualizations and reports for easier interpretation.\n",
+ "\n",
+    "To use the Rapid Eval SDK, you typically follow these steps:\n",
+ "\n",
+ "1. Initiate the evaluation dataset\n",
+ "2. Define prompt templates and metrics\n",
+ "3. Provide some model configurations\n",
+    "4. Initiate an Evaluation Task\n",
+ "5. Run an evaluation job"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pKPe8i8pdwin"
+ },
+ "source": [
+ "### Initiate the evaluation dataset\n",
+ "\n",
+ "Prepare the dataset to evaluate prompts and compare models."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ranK-Np0zrEG"
+ },
+ "outputs": [],
+ "source": [
+ "eval_dataset = datasets.load_dataset(\"xsum\", split=\"all\", data_dir=data_path)\n",
+ "eval_dataset = (\n",
+ " eval_dataset.filter(lambda example: len(example[\"document\"]) < 4096)\n",
+ " .filter(lambda example: len(example[\"summary\"]) < 4096)\n",
+ " .rename_columns({\"document\": \"context\", \"summary\": \"reference\"})\n",
+ " .remove_columns([\"id\"])\n",
+ ")\n",
+ "\n",
+ "eval_sample_df = (\n",
+ " eval_dataset.shuffle(seed=8)\n",
+ " .select(random.sample(range(0, len(eval_dataset)), 3))\n",
+ " .to_pandas()\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "dZpLWBsSzSvZ"
+ },
+ "outputs": [],
+ "source": [
+ "eval_sample_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TmYV0rMYOI4G"
+ },
+ "source": [
+ "### Evaluate prompt engineering using predefined metrics\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "p6dAMM4AIFMT"
+ },
+ "source": [
+ "#### Define prompt templates and metrics\n",
+ "\n",
+    "Provide the prompt templates you want to evaluate along with the evaluation metrics. The metrics you choose depend on whether you have access to ground truth data: if you do, you can use computation-based metrics; if you don't, you can use pairwise model-based metrics. Check out the [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview#determine-eval) to learn more."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "I0YS5X9IlRVS"
+ },
+ "outputs": [],
+ "source": [
+ "prompt_templates = [\n",
+ " \"Summarize the following article: {context}\",\n",
+ " \"Summarize the following article in one main sentence: {context}\",\n",
+ " \"Summarize the following article in three main sentences: {context}\",\n",
+ "]\n",
+ "\n",
+ "metrics = [\"rouge_l_sum\", \"fluency\", \"coherence\", \"safety\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9v8ObTwYpZaX"
+ },
+ "source": [
+ "#### Set model parameters\n",
+ "\n",
+ "Set both the generation and the safety settings of the LLM. For more information, see the [Vertex AI documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/overview).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MQHfkB_NJgft"
+ },
+ "outputs": [],
+ "source": [
+ "generation_config = {\n",
+ " \"max_output_tokens\": 128,\n",
+ " \"temperature\": 0.8,\n",
+ "}\n",
+ "\n",
+ "safety_settings = {\n",
+ " HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_NONE,\n",
+ " HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,\n",
+ " HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,\n",
+ " HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,\n",
+ " HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,\n",
+ "}\n",
+ "\n",
+ "llm1_model.generation_config = generation_config\n",
+ "llm1_model.safety_settings = safety_settings"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8iWfF1DPqVQU"
+ },
+ "source": [
+ "#### Run the evaluation\n",
+ "\n",
+    "To evaluate the prompt templates, you run one evaluation job per template against the evaluation dataset and its associated metrics. `EvalTask` integrates with Vertex AI Experiments, so the settings and results of each evaluation run are tracked automatically.\n",
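+    "\n",
+    "After the runs complete, you can also pull a summary of all runs in the experiment with the SDK's experiment utilities (a minimal sketch; it assumes the `experiment_name` defined in the next cell):\n",
+    "\n",
+    "```python\n",
+    "experiment_df = aiplatform.get_experiment_df(experiment_name)\n",
+    "experiment_df.head()\n",
+    "```"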
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "_HUstMFvqUwl"
+ },
+ "outputs": [],
+ "source": [
+ "experiment_name = \"rapid-eval-with-llm1\"\n",
+ "\n",
+ "eval_task = EvalTask(\n",
+ " dataset=eval_sample_df,\n",
+ " metrics=metrics,\n",
+ " experiment=experiment_name,\n",
+ " content_column_name=\"content\",\n",
+ " reference_column_name=\"reference\",\n",
+ ")\n",
+ "\n",
+ "run_id = generate_uuid()\n",
+ "\n",
+ "eval_results = []\n",
+ "\n",
+ "for i, prompt_template in tqdm(\n",
+ " enumerate(prompt_templates), total=len(prompt_templates)\n",
+ "):\n",
+ " experiment_run_name = f\"prompt-evaluation-llm1-{run_id}-{i}\"\n",
+ "\n",
+ " eval_result = eval_task.evaluate(\n",
+ " model=llm1_model,\n",
+ " prompt_template=prompt_template,\n",
+ " experiment_run_name=experiment_run_name,\n",
+ " )\n",
+ "\n",
+ " eval_results.append(\n",
+ " (f\"Prompt #{i}\", eval_result.summary_metrics, eval_result.metrics_table)\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3gTNR5s4reWQ"
+ },
+ "source": [
+ "#### Display Evaluation reports and explanations\n",
+ "\n",
+ "Display detailed evaluation reports, explanations, and useful charts to summarize key metrics in an informative manner."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "eMHH_R83pZ3S"
+ },
+ "outputs": [],
+ "source": [
+ "for eval_result in eval_results:\n",
+ " display_eval_report(eval_result)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "6Vft3PqFTsZR"
+ },
+ "outputs": [],
+ "source": [
+ "for eval_result in eval_results:\n",
+ " display_explanations(eval_result[2], metrics=[\"fluency\"])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Kr9eCzXLHRmJ"
+ },
+ "outputs": [],
+ "source": [
+ "plot_radar_plot(eval_results, metrics=[\"fluency/mean\", \"coherence/mean\", \"safety/mean\"])\n",
+ "plot_bar_plot(eval_results, metrics=[\"rouge_l_sum/mean\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "f0WSDZJV_aL1"
+ },
+ "source": [
+    "**Comment**: Here you evaluate your prompts by passing metrics individually. The Rapid Eval SDK also supports metric bundles, which group metrics by task, criteria, and inputs for more convenient usage. For more information, check out the [Metrics bundles](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#metric-bundles) documentation.\n",
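+    "\n",
+    "As a minimal sketch, a bundle name can be passed in place of individual metrics (the `text_generation_quality` bundle name here is taken from the metric-bundles documentation; confirm the exact names for your SDK version):\n",
+    "\n",
+    "```python\n",
+    "bundle_eval_task = EvalTask(\n",
+    "    dataset=eval_sample_df,\n",
+    "    metrics=[\"text_generation_quality\"],\n",
+    "    experiment=\"rapid-eval-metric-bundles\",\n",
+    ")\n",
+    "```"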
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9dnGxH9m-09p"
+ },
+ "source": [
+ "### Evaluate prompt engineering using locally-defined custom metrics\n",
+ "\n",
+    "To evaluate your prompts with a custom metric, you define a function that encapsulates the metric logic and register it as an evaluation metric. In this case, you define a custom faithfulness metric that returns both a score and an explanation. Once registered, the metric can be used directly in the evaluation task.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gE37qa56AZcu"
+ },
+ "source": [
+    "#### Register CustomMetrics locally\n",
+ "\n",
+    "Use the `make_metric` helper function to register your custom metric function.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ZvwL1WszAo_1"
+ },
+ "outputs": [],
+ "source": [
+    "def custom_faithfulness(instance: dict) -> dict:\n",
+    "    \"\"\"Evaluates the faithfulness of a generated text using an LLM.\"\"\"\n",
+    "\n",
+    "    response = instance[\"response\"]\n",
+    "\n",
+    "    def generate_prompt(task, score=None):\n",
+    "        \"\"\"Generates a prompt for the LLM based on the task and optional score.\"\"\"\n",
+    "        prompt_start = f\"\"\"You are examining written text content. Here is the text:\n",
+    "\n",
+    "    [BEGIN DATA]\n",
+    "    ************\n",
+    "    [Text]: {response}\n",
+    "    ************\n",
+    "    [END DATA]\n",
+    "\n",
+    "    \"\"\"\n",
+    "        if task == \"score\":\n",
+    "            prompt_end = \"\"\"Examine the text and determine whether the text is faithful or not.\n",
+    "    Faithfulness refers to how accurately a generated summary reflects the essential information and key concepts present in the original source document.\n",
+    "    A faithful summary stays true to the facts and meaning of the source text, without introducing distortions, hallucinations, or information that wasn't originally there.\n",
+    "    Your response must be a single integer on a scale of 0-5, with 0 being the least faithful and 5 being the most faithful.\"\"\"\n",
+    "        elif task == \"explain\":\n",
+    "            prompt_end = f\"\"\"Consider that the text has been scored as {score} for faithfulness using the following definition:\n",
+    "    Faithfulness refers to how accurately a generated summary reflects the essential information and key concepts present in the original source document.\n",
+    "    A faithful summary stays true to the facts and meaning of the source text, without introducing distortions, hallucinations, or information that wasn't originally there.\n",
+    "    Your response must be an explanation of why the text is faithful or not. If the score is -1.0, return 'No explanation can be provided for this prompt.'\"\"\"\n",
+    "        else:\n",
+    "            raise ValueError(\"Invalid task for prompt generation.\")\n",
+    "        return prompt_start + prompt_end\n",
+    "\n",
+    "    # Generate the scoring prompt and extract the score\n",
+    "    score_prompt = generate_prompt(\"score\")\n",
+    "    score_text = (\n",
+    "        llm1_model.generate_content(score_prompt).candidates[0].content.parts[0].text\n",
+    "    )\n",
+    "\n",
+    "    try:\n",
+    "        score = float(score_text)\n",
+    "    except ValueError:\n",
+    "        score = -1.0\n",
+    "\n",
+    "    # Generate the explanation prompt and extract the explanation\n",
+    "    explanation_prompt = generate_prompt(\"explain\", score)\n",
+    "    explanation = (\n",
+    "        llm1_model.generate_content(explanation_prompt)\n",
+    "        .candidates[0]\n",
+    "        .content.parts[0]\n",
+    "        .text\n",
+    "    )\n",
+    "\n",
+    "    return {\n",
+    "        \"custom_faithfulness\": score,\n",
+    "        \"explanation\": f\"The model gave a score of {score} with the following explanation: {explanation}\",\n",
+    "    }\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "1FJLSmrMAV8C"
+ },
+ "outputs": [],
+ "source": [
+ "custom_faithfulness_metric = make_metric(\n",
+ " name=\"custom_faithfulness\",\n",
+ " metric_function=custom_faithfulness,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "e4BVMy7xRdfS"
+ },
+ "source": [
+    "#### Run the evaluation using the custom metric\n",
+ "\n",
+ "Run evaluations for prompt templates against an evaluation dataset with the defined custom metrics."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "YB4BQAAe-4tb"
+ },
+ "outputs": [],
+ "source": [
+ "experiment_name = \"rapid-eval-with-llm1-custom-eval\"\n",
+ "\n",
+ "eval_task = EvalTask(\n",
+ " dataset=eval_sample_df,\n",
+ " metrics=metrics + [custom_faithfulness_metric],\n",
+ " experiment=experiment_name,\n",
+ ")\n",
+ "\n",
+ "run_id = generate_uuid()\n",
+ "\n",
+ "eval_results = []\n",
+ "\n",
+ "for i, prompt_template in tqdm(\n",
+ " enumerate(prompt_templates), total=len(prompt_templates)\n",
+ "):\n",
+ " experiment_run_name = f\"prompt-evaluation-llm1-{run_id}-{i}\"\n",
+ "\n",
+ " eval_result = eval_task.evaluate(\n",
+ " model=llm1_model,\n",
+ " prompt_template=prompt_template,\n",
+ " experiment_run_name=experiment_run_name,\n",
+ " )\n",
+ "\n",
+ " eval_results.append(\n",
+ " (f\"Prompt #{i}\", eval_result.summary_metrics, eval_result.metrics_table)\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "k_8ZYiyHNIK6"
+ },
+ "source": [
+ "#### Display Evaluation reports and explanations\n",
+ "\n",
+ "Display the resulting evaluation reports and explanations for the custom metrics."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "bwbNJyzJAh8P"
+ },
+ "outputs": [],
+ "source": [
+ "for eval_result in eval_results:\n",
+ " display_eval_report(eval_result, metrics=[\"row\", \"custom_faithfulness\"])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "SKMwsVWpVS54"
+ },
+ "outputs": [],
+ "source": [
+ "for eval_result in eval_results:\n",
+ " display_explanations(eval_result[2], metrics=[\"custom_faithfulness\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7Mrob5K8pPTG"
+ },
+ "source": [
+ "### Validate prompt by comparing LLM 1 with LLM 2\n",
+ "\n",
+ "Once you know which is the best prompt template according to your metrics, you can validate it across several models.\n",
+ "\n",
+ "Vertex AI Rapid Eval SDK allows you to compare any models, including Google proprietary and open models, against an evaluation dataset with a prompt template and the defined metrics."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "JhlLLWRgrpe6"
+ },
+ "outputs": [],
+ "source": [
+ "prompt_template = \"Summarize the following article in three main sentences: {context}\" # @param {type:\"string\"}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-fWKfGPgPFAr"
+ },
+ "source": [
+ "#### Set model function\n",
+ "\n",
+    "To compare a model that is not natively supported by the Vertex AI Rapid Eval SDK, you can define a generation function. The function takes a prompt as input and returns the generated text as output.\n",
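+    "\n",
+    "The same pattern works for any model you can call from Python. For example, an open model such as Gemma deployed on a Vertex AI endpoint could be wrapped like this (a sketch only; the endpoint ID, instance format, and response shape are placeholders that depend on how the model is deployed):\n",
+    "\n",
+    "```python\n",
+    "gemma_endpoint = aiplatform.Endpoint(\"YOUR_GEMMA_ENDPOINT_ID\")  # hypothetical endpoint\n",
+    "\n",
+    "\n",
+    "def gemma_model_fn(prompt: str) -> str:\n",
+    "    # The instance format depends on the serving container used for deployment.\n",
+    "    prediction = gemma_endpoint.predict(instances=[{\"prompt\": prompt}])\n",
+    "    return str(prediction.predictions[0])\n",
+    "```"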
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "qdEfm01LRZEB"
+ },
+ "outputs": [],
+ "source": [
+ "def llm2_model_fn(prompt):\n",
+ " return llm2_model.predict(prompt, **generation_config).text"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "F74Hi6otRbGF"
+ },
+ "source": [
+ "#### Run the evaluation\n",
+ "\n",
+    "To compare the models, you run an evaluation job for each model against the same evaluation dataset and metrics using `EvalTask`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "zs6OJEGdmQO_"
+ },
+ "outputs": [],
+ "source": [
+ "experiment_name = \"rapid-eval-llm1-llm2-comparison\"\n",
+ "\n",
+ "models = {\n",
+ " \"llm1\": llm1_model,\n",
+ " \"llm2\": llm2_model_fn,\n",
+ "}\n",
+ "\n",
+ "metrics = [\n",
+ " \"bleu\",\n",
+ " \"rouge_1\",\n",
+ " \"rouge_2\",\n",
+ " \"rouge_l\",\n",
+ " \"rouge_l_sum\",\n",
+ " \"fluency\",\n",
+ " \"coherence\",\n",
+ " \"safety\",\n",
+ "]\n",
+ "\n",
+ "eval_task = EvalTask(\n",
+ " dataset=eval_sample_df, metrics=metrics, experiment=experiment_name\n",
+ ")\n",
+ "\n",
+ "run_id = generate_uuid()\n",
+ "\n",
+ "eval_results = []\n",
+ "\n",
+ "for i, (model_name, model) in tqdm(\n",
+    "    enumerate(models.items()), total=len(models)\n",
+ "):\n",
+ " experiment_run_name = f\"prompt-evaluation-{model_name}-{run_id}-{i}\"\n",
+ "\n",
+ " eval_result = eval_task.evaluate(\n",
+ " model=model,\n",
+ " prompt_template=prompt_template,\n",
+ " experiment_run_name=experiment_run_name,\n",
+ " )\n",
+ "\n",
+ " eval_results.append(\n",
+ " (f\"Model {model_name}\", eval_result.summary_metrics, eval_result.metrics_table)\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Jc7hqp3jRzUF"
+ },
+ "source": [
+ "#### Display Evaluation reports and explanations\n",
+ "\n",
+ "Display the resulting evaluation reports and explanations for each model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "UW6_U6E0SA3f"
+ },
+ "outputs": [],
+ "source": [
+ "for eval_result in eval_results:\n",
+ " display_eval_report(eval_result)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "QzXEMRKJbtBF"
+ },
+ "outputs": [],
+ "source": [
+ "plot_radar_plot(eval_results, metrics=[\"fluency/mean\", \"coherence/mean\", \"safety/mean\"])\n",
+ "plot_bar_plot(\n",
+ " eval_results,\n",
+ " metrics=[\n",
+ " \"bleu/mean\",\n",
+ " \"rouge_1/mean\",\n",
+ " \"rouge_2/mean\",\n",
+ " \"rouge_l/mean\",\n",
+ " \"rouge_l_sum/mean\",\n",
+ " ],\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VmhTNyiDcHf8"
+ },
+ "source": [
+ "## Vertex AI Model Evaluation at scale\n",
+ "\n",
+    "If you're planning to deploy, or are already running, a GenAI application with a prompt template and model, you might want to evaluate the model or compare different models on a larger dataset and on a recurring basis.\n",
+ "\n",
+ "In this scenario, you need a more systematic and scalable way to evaluate GenAI application components.\n",
+ "\n",
+ "The Vertex AI Eval service provides end-to-end prebuilt evaluation pipelines for evaluating generative AI models at scale by leveraging Vertex AI Pipelines. Two distinct evaluation pipelines are available:\n",
+ "\n",
+ "* Computation-based for pointwise metric-based evaluation.\n",
+ "* AutoSxS for pairwise model-based evaluations.\n",
+ "\n",
+ "To learn more, [check out](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview#pipeline_services_autosxs_and_computation-based) the official documentation.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "v86hSlFHe38D"
+ },
+ "source": [
+    "### Using Vertex AI Model Evaluation computation-based metrics\n",
+    "\n",
+    "You use computation-based metrics (ROUGE-L-Sum) to evaluate an LLM on a summarization task.\n",
+    "\n",
+    "To run a computation-based evaluation pipeline job, you provide an evaluation dataset that contains both the context and the ground-truth summaries, define an evaluation task configuration (in this case, a text summarization task), and then submit the evaluation job for your LLM."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZEIlO0eHbsQh"
+ },
+ "source": [
+ "#### Read the evaluation data\n",
+ "\n",
+    "Read the evaluation dataset into a Pandas DataFrame for a quick look."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "R-_ettKRxfxT"
+ },
+ "outputs": [],
+ "source": [
+ "evaluation_df = pd.read_json(AUTO_METRICS_EVALUATION_FILE_URI, lines=True)\n",
+ "evaluation_df = evaluation_df.rename(\n",
+ " columns={\"prompt\": \"input_text\", \"ground_truth\": \"output_text\"}\n",
+ ")\n",
+ "evaluation_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1lZHraNFkDz8"
+ },
+ "source": [
+ "#### Run a model evaluation job\n",
+ "\n",
+    "Define a specification for the text summarization evaluation task and run the evaluation job.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "yKg_5Yd7K_Mr"
+ },
+ "outputs": [],
+ "source": [
+ "task_spec = EvaluationTextSummarizationSpec(\n",
+ " ground_truth_data=evaluation_df, task_name=\"summarization\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "AjFHT5ze9m4L"
+ },
+ "outputs": [],
+ "source": [
+ "job = llm2_model.evaluate(task_spec=task_spec)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PunFdLfqGh0e"
+ },
+ "source": [
+ "#### Evaluate the results\n",
+ "\n",
+ "Display resulting metrics.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "nfPxYfeD2K-f"
+ },
+ "outputs": [],
+ "source": [
+ "print_aggregated_metrics(job)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2dPLc6SmmmdL"
+ },
+ "source": [
+ "### Using Vertex AI Model Evaluation AutoSxS metrics\n",
+ "\n",
+    "You use AutoSxS to compare responses from different models and evaluate which model generates better summaries.\n",
+    "\n",
+    "To run an AutoSxS evaluation job, you provide an evaluation dataset that contains the context and the responses of the models you want to compare. Then you define the AutoSxS parameters, including the task to evaluate, the inference context and instructions, and the model response columns. Finally, you submit the evaluation pipeline job."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "m3xclNvXmj3L"
+ },
+ "source": [
+ "#### Read the evaluation data\n",
+ "\n",
+    "Read the evaluation dataset into a Pandas DataFrame for a quick look."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "rYK2Lg0fmj3R"
+ },
+ "outputs": [],
+ "source": [
+ "evaluation_df = pd.read_json(AUTOSXS_EVALUATION_FILE_URI, lines=True)\n",
+ "evaluation_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qiVMRP8Emj3S"
+ },
+ "source": [
+ "#### Run a model evaluation job\n",
+ "\n",
+    "Define the AutoSxS parameters and run the AutoSxS evaluation pipeline job.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Cp7e-hOmNMhA"
+ },
+ "outputs": [],
+ "source": [
+ "display_name = f\"autosxs-eval-{generate_uuid()}\"\n",
+ "parameters = {\n",
+ " \"evaluation_dataset\": AUTOSXS_EVALUATION_FILE_URI,\n",
+ " \"id_columns\": [\"id\", \"document\"],\n",
+ " \"task\": \"summarization\",\n",
+ " \"autorater_prompt_parameters\": {\n",
+ " \"inference_context\": {\"column\": \"document\"},\n",
+ " \"inference_instruction\": {\n",
+ " \"template\": \"Summarize the following article in three main sentences: \"\n",
+ " },\n",
+ " },\n",
+ " \"response_column_a\": \"response_a\",\n",
+ " \"response_column_b\": \"response_b\",\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Y4LijMHHmj3T"
+ },
+ "outputs": [],
+ "source": [
+ "job = aiplatform.PipelineJob(\n",
+ " job_id=display_name,\n",
+ " display_name=display_name,\n",
+ " pipeline_root=os.path.join(BUCKET_URI, display_name),\n",
+ " template_path=AUTO_SXS_TEMPLATE_URI,\n",
+ " parameter_values=parameters,\n",
+ " enable_caching=False,\n",
+ ")\n",
+ "job.run(sync=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UtZSYF37mj3T"
+ },
+ "source": [
+ "#### Evaluate the results\n",
+ "\n",
+    "The Vertex AI AutoSxS evaluation pipeline produces the following artifacts:\n",
+    "\n",
+    "* The judgments table, produced by the AutoSxS arbiter, helps you understand model performance at the example level.\n",
+    "* Aggregate metrics, produced by the AutoSxS metrics component, help you understand which model performs better on the task under evaluation.\n",
+    "\n",
+    "To learn more about the AutoSxS artifacts, [check out](https://cloud.google.com/vertex-ai/generative-ai/docs/models/side-by-side-eval#view-eval-results) the documentation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "J4ShokOI9FDI"
+ },
+ "source": [
+ "##### AutoSxS Judgments"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MakdmpYCmehF"
+ },
+ "outputs": [],
+ "source": [
+ "for details in job.task_details:\n",
+ " if details.task_name == \"online-evaluation-pairwise\":\n",
+ " break\n",
+ "\n",
+ "judgments_uri = MessageToDict(details.outputs[\"judgments\"]._pb)[\"artifacts\"][0][\"uri\"]\n",
+ "judgments_df = pd.read_json(judgments_uri, lines=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "_gyM2-i3HHnP"
+ },
+ "outputs": [],
+ "source": [
+ "print_autosxs_judgments(judgments_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tJ5PJ9x69KrC"
+ },
+ "source": [
+ "##### AutoSxS Aggregate metrics"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "w2RISjQSJk9R"
+ },
+ "outputs": [],
+ "source": [
+ "for details in job.task_details:\n",
+ " if details.task_name == \"model-evaluation-text-generation-pairwise\":\n",
+ " break\n",
+ "\n",
+ "win_rate_metrics = MessageToDict(details.outputs[\"autosxs_metrics\"]._pb)[\"artifacts\"][\n",
+ " 0\n",
+ "][\"metadata\"]\n",
+ "print_autosxs_win_metrics(win_rate_metrics)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TpV-iwP9qw9c"
+ },
+ "source": [
+ "## Cleaning up\n",
+ "\n",
+ "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n",
+ "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n",
+ "\n",
+ "Otherwise, you can delete the individual resources you created in this tutorial."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "sx_vKniMq9ZX"
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "# Delete Experiments\n",
+ "delete_experiments = True\n",
+ "if delete_experiments or os.getenv(\"IS_TESTING\"):\n",
+ " experiments_list = aiplatform.Experiment.list()\n",
+ " for experiment in experiments_list:\n",
+ " experiment.delete()\n",
+ "\n",
+ "# Delete Pipeline\n",
+ "delete_pipeline = False\n",
+ "if delete_pipeline or os.getenv(\"IS_TESTING\"):\n",
+    "    pipelines_list = aiplatform.PipelineJob.list()\n",
+ " for pipeline in pipelines_list:\n",
+ " pipeline.delete()\n",
+ "\n",
+ "# Delete Cloud Storage\n",
+ "delete_bucket = True\n",
+ "if delete_bucket or os.getenv(\"IS_TESTING\"):\n",
+ " ! gsutil -m rm -r $BUCKET_URI"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "name": "get_started_with_genai_model_eval_service.ipynb",
+ "toc_visible": true
+ },
+ "environment": {
+ "kernel": "python3",
+ "name": "tf2-cpu.2-11.m116",
+ "type": "gcloud",
+ "uri": "gcr.io/deeplearning-platform-release/tf2-cpu.2-11:m116"
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/gemini/grounding/intro-grounding-gemini.ipynb b/gemini/grounding/intro-grounding-gemini.ipynb
index a2613dfcd9..d34bf87cd1 100644
--- a/gemini/grounding/intro-grounding-gemini.ipynb
+++ b/gemini/grounding/intro-grounding-gemini.ipynb
@@ -83,7 +83,7 @@
"source": [
"## Overview\n",
"\n",
- "[Grounding in Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/grounding/ground-language-models) lets you use generative text models to generate content grounded in your own documents and data. This capability lets the model access information at runtime that goes beyond its training data. By grounding model responses in Google Search results or data stores within [Vertex AI Search](https://cloud.google.com/generative-ai-app-builder/docs/enterprise-search-introduction), LLMs that are grounded in data can produce more accurate, up-to-date, and relevant responses.\n",
+ "[Grounding in Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/ground-gemini) lets you use generative text models to generate content grounded in your own documents and data. This capability lets the model access information at runtime that goes beyond its training data. By grounding model responses in Google Search results or data stores within [Vertex AI Search](https://cloud.google.com/generative-ai-app-builder/docs/enterprise-search-introduction), LLMs that are grounded in data can produce more accurate, up-to-date, and relevant responses.\n",
"\n",
"Grounding provides the following benefits:\n",
"\n",