From de828e5faa23d8a74bf18a40cb18d697c45868f4 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu, 18 Apr 2024 11:26:04 -0500 Subject: [PATCH 1/2] chore(deps): bump the pip group across 2 directories with 1 update (#554) Bumps the pip group with 1 update in the /gemini/sample-apps/genwealth/function-scripts/analyze-prospectus directory: [aiohttp](https://github.com/aio-libs/aiohttp). Bumps the pip group with 1 update in the /gemini/sample-apps/genwealth/function-scripts/process-pdf directory: [aiohttp](https://github.com/aio-libs/aiohttp). Updates `aiohttp` from 3.9.3 to 3.9.4
Release notes

Sourced from aiohttp's releases.

3.9.4

Bug fixes

... (truncated)

Changelog

Sourced from aiohttp's changelog.

3.9.4 (2024-04-11)

Bug fixes

... (truncated)

Commits

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. ---
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore <dependency name> major version` will close this group update PR and stop Dependabot creating any more for the specific dependency's major version (unless you unignore this specific dependency's major version or upgrade to it yourself)
- `@dependabot ignore <dependency name> minor version` will close this group update PR and stop Dependabot creating any more for the specific dependency's minor version (unless you unignore this specific dependency's minor version or upgrade to it yourself)
- `@dependabot ignore <dependency name>` will close this group update PR and stop Dependabot creating any more for the specific dependency (unless you unignore this specific dependency or upgrade to it yourself)
- `@dependabot unignore <dependency name>` will remove all of the ignore conditions of the specified dependency
- `@dependabot unignore <dependency name> <ignore condition>` will remove the ignore condition of the specified dependency and ignore conditions

You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/GoogleCloudPlatform/generative-ai/network/alerts).
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- .../function-scripts/analyze-prospectus/requirements.txt | 2 +- .../genwealth/function-scripts/process-pdf/requirements.txt | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/gemini/sample-apps/genwealth/function-scripts/analyze-prospectus/requirements.txt b/gemini/sample-apps/genwealth/function-scripts/analyze-prospectus/requirements.txt index 73519c4e55..3ad5db7f94 100644 --- a/gemini/sample-apps/genwealth/function-scripts/analyze-prospectus/requirements.txt +++ b/gemini/sample-apps/genwealth/function-scripts/analyze-prospectus/requirements.txt @@ -1,5 +1,5 @@ functions-framework==3.* -aiohttp==3.9.3 +aiohttp==3.9.4 aiomysql==0.2.0 aiosignal==1.3.1 annotated-types==0.6.0 diff --git a/gemini/sample-apps/genwealth/function-scripts/process-pdf/requirements.txt b/gemini/sample-apps/genwealth/function-scripts/process-pdf/requirements.txt index f5aede86e9..b1d18c2d01 100644 --- a/gemini/sample-apps/genwealth/function-scripts/process-pdf/requirements.txt +++ b/gemini/sample-apps/genwealth/function-scripts/process-pdf/requirements.txt @@ -1,6 +1,6 @@ functions-framework==3.* google-cloud-pubsub==2.20.2 -aiohttp==3.9.3 +aiohttp==3.9.4 aiomysql==0.2.0 aiosignal==1.3.1 annotated-types==0.6.0 From c1de2214b4f42f7d4df6cd4decc05a301656e8a4 Mon Sep 17 00:00:00 2001 From: Ivan Nardini <88703814+inardini@users.noreply.github.com> Date: Thu, 18 Apr 2024 19:10:42 +0200 Subject: [PATCH 2/2] feat: adding an e2e notebook to show how to tune the embeddings model on Vertex AI (#551) --- .../get_started_with_embedding_tuning.ipynb | 1438 +++++++++++++++++ 1 file changed, 1438 insertions(+) create mode 100644 embeddings/get_started_with_embedding_tuning.ipynb diff --git a/embeddings/get_started_with_embedding_tuning.ipynb b/embeddings/get_started_with_embedding_tuning.ipynb new file mode 100644 index 0000000000..dca290ff4e --- /dev/null +++ b/embeddings/get_started_with_embedding_tuning.ipynb @@ -0,0 +1,1438 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ur8xi4C7S06n" + }, + "outputs": [], + "source": [ + "# Copyright 2024 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JAPoU8Sm5E6e" + }, + "source": [ + "# Get started with embeddings tuning on Vertex AI\n", + "\n", + "
\n", + " \n", + " \"Google
Open in Colab Enterprise\n", + "
\n", + "
\n", + " \n", + " \"Vertex
Open in Workbench\n", + "
\n", + "
\n", + " \n", + " \"GitHub
View on GitHub\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Vzj1qV_dPeO" + }, + "source": [ + "| | |\n", + "|-|-|\n", + "|Author(s) | [Ivan Nardini](https://github.com/inardini)|" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvgnzT1CKxrO" + }, + "source": [ + "## Overview\n", + "\n", + "This notebook guides you through the process of tuning the text embedding model on Vertex AI. Tuning an embeddings model for specific domains/tasks enhances understanding and improves retrival performance.\n", + "\n", + "Learn more about [Tune text embeddings](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d975e698c9a4" + }, + "source": [ + "### Objective\n", + "\n", + "In this tutorial, you learn how to tune the text embedding model, `textembedding-gecko`.\n", + "\n", + "This tutorial uses the following Google Cloud ML services and resources:\n", + "\n", + "- Document AI\n", + "- Vertex AI\n", + "- Google Cloud Storage\n", + "\n", + "The steps include:\n", + "\n", + "- Prepare your model tuning dataset using Document AI, Gemini API, and LangChain on Vertex AI. \n", + "- Run an embedding tuning job on Vertex AI Pipelines.\n", + "- Evaluate the embedding tuned model.\n", + "- Deploy the embedding tuned model on Vertex AI Prediction.\n", + "- Retrive similar items using the tuned embedding model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "08d289fa873f" + }, + "source": [ + "### Dataset\n", + "\n", + "During the tutorial, you will create a set of synthetic query-chunk pairs using the [2023 Q3 Alphabet Earnings Release](https://www.abc.xyz/assets/95/eb/9cef90184e09bac553796896c633/2023q4-alphabet-earnings-release.pdf)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aed92deeb4a0" + }, + "source": [ + "### Costs\n", + "\n", + "This tutorial uses billable components of Google Cloud:\n", + "\n", + "* Document AI\n", + "* Vertex AI\n", + "* Cloud Storage\n", + "\n", + "Learn about [Document AI pricing](https://cloud.google.com/document-ai/pricing), [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),\n", + "and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),\n", + "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", + "to generate a cost estimate based on your projected usage." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i7EUnXsZhAGF" + }, + "source": [ + "## Installation\n", + "\n", + "Install the following packages required to execute this notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2b4ef9b72d43" + }, + "outputs": [], + "source": [ + "! pip3 install --upgrade --user google-cloud-aiplatform==1.48.0 google-cloud-documentai==2.26.0 google-cloud-documentai-toolbox==0.13.3a0\n", + "! pip3 install --upgrade --user langchain-core==0.1.44 langchain-text-splitters==0.0.1 langchain-google-community==1.0.2 gcsfs==2024.3.1 etils==1.7.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "58707a750154" + }, + "source": [ + "### Colab only: Uncomment the following cell to restart the kernel." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "f200f10a1da3" + }, + "outputs": [], + "source": [ + "# import IPython\n", + "\n", + "# app = IPython.Application.instance()\n", + "# app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BF1j6f9HApxa" + }, + "source": [ + "## Before you begin\n", + "\n", + "### Set up your Google Cloud project\n", + "\n", + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", + "\n", + "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", + "\n", + "3. [Enable APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,documentai.googleapis.com).\n", + "\n", + "4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WReHDGG5g0XY" + }, + "source": [ + "#### Set your project ID\n", + "\n", + "**If you don't know your project ID**, try the following:\n", + "* Run `gcloud config list`.\n", + "* Run `gcloud projects list`.\n", + "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oM1iC_MfAts1" + }, + "outputs": [], + "source": [ + "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", + "\n", + "# Set the project id\n", + "! gcloud config set project {PROJECT_ID}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "region" + }, + "source": [ + "#### Region\n", + "\n", + "You can change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ChcYWoVdhzVb" + }, + "outputs": [], + "source": [ + "REGION = \"us-central1\" # @param {type: \"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "timestamp" + }, + "source": [ + "#### Timestamp\n", + "\n", + "If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions on the resources you create, generate a timestamp for each session and append it to the names of those resources." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W6Le1schAziq" + }, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "\n", + "TIMESTAMP = datetime.now().strftime(\"%Y%m%d%H%M%S\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sBCra4QMA2wR" + }, + "source": [ + "### Authenticate your Google Cloud account\n", + "\n", + "Depending on your Jupyter environment, you may have to authenticate manually. Follow the relevant instructions below." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "74ccc9e52986" + }, + "source": [ + "**1. Vertex AI Workbench and Colab Enterprise**\n", + "* Do nothing, as you are already authenticated." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "de775a3773ba" + }, + "source": [ + "**2. 
Local JupyterLab instance, uncomment and run:**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "254614fa0c46" + }, + "outputs": [], + "source": [ + "# ! gcloud auth login" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f6b2ccc891ed" + }, + "source": [ + "**3. Service account or other**\n", + "* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zgPO1eR3CYjk" + }, + "source": [ + "### Create a Cloud Storage bucket\n", + "\n", + "Create a storage bucket to store intermediate artifacts such as datasets." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MzGDU7TWdts_" + }, + "outputs": [], + "source": [ + "BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-EcIXiGsCePi" + }, + "source": [ + "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NIq7R4HZCfIc" + }, + "outputs": [], + "source": [ + "! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eckavkeph5zB" + }, + "source": [ + "### Set up tutorial folder\n", + "\n", + "Set up a folder for tutorial content, including data and metadata." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Kr90HWKmh8H0" + }, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "root_path = Path.cwd()\n", + "tutorial_path = root_path / \"tutorial\"\n", + "data_path = tutorial_path / \"data\"\n", + "\n", + "data_path.mkdir(parents=True, exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "960505627ddf" + }, + "source": [ + "### Import libraries\n", + "\n", + "Import libraries to run the tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PyQmSRbKA8r-" + }, + "outputs": [], + "source": [ + "import random\n", + "import string\n", + "import time\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "import vertexai\n", + "import vertexai.preview.generative_models as generative_models\n", + "from etils import epath\n", + "from google.api_core.client_options import ClientOptions\n", + "from google.cloud import aiplatform, documentai\n", + "from google.protobuf.json_format import MessageToDict\n", + "from langchain_community.document_loaders.blob_loaders import Blob\n", + "from langchain_community.document_loaders.parsers import DocAIParser\n", + "from langchain_core.documents.base import Document\n", + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", + "from vertexai.generative_models import GenerationConfig, GenerativeModel" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TvQ81PjSiCuZ" + }, + "source": [ + "### Set variables\n", + "\n", + "Set variables to run the tutorial."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ajbQb0eXu3xh" + }, + "outputs": [], + "source": [ + "ID = \"\".join(random.choices(string.ascii_lowercase + string.digits, k=4))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ct35zelZiELu" + }, + "outputs": [], + "source": [ + "# Dataset\n", + "PROCESSOR_ID = f\"preprocess-docs-llm-{ID}\"\n", + "LOCATION = REGION.split(\"-\")[0]\n", + "RAW_DATA_URI = \"gs://github-repo/embeddings/get_started_with_embedding_tuning\"\n", + "PROCESSED_DATA_URI = f\"{BUCKET_URI}/data/processed\"\n", + "PREPARED_DATA_URI = f\"{BUCKET_URI}/data/prepared\"\n", + "PROCESSED_DATA_OCR_URI = f\"{BUCKET_URI}/data/processed/ocr\"\n", + "PROCESSED_DATA_TUNING_URI = f\"{BUCKET_URI}/data/processed/tuning\"\n", + "\n", + "# Tuning\n", + "PIPELINE_ROOT = f\"{BUCKET_URI}/pipelines\"\n", + "BATCH_SIZE = 32 # @param {type:\"integer\"}\n", + "TRAINING_ACCELERATOR_TYPE = \"NVIDIA_TESLA_T4\" # @param {type:\"string\"}\n", + "TRAINING_MACHINE_TYPE = \"n1-standard-16\" # @param {type:\"string\"}\n", + "\n", + "# Serving\n", + "PREDICTION_ACCELERATOR_TYPE = \"NVIDIA_TESLA_A100\" # @param {type:\"string\"}\n", + "PREDICTION_ACCELERATOR_COUNT = 1 # @param {type:\"integer\"}\n", + "PREDICTION_MACHINE_TYPE = \"a2-highgpu-1g\" # @param {type:\"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F2mkgxcciGiZ" + }, + "source": [ + "### Helpers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "40tjc8jmiH4B" + }, + "outputs": [], + "source": [ + "def create_processor(project_id: str, location: str, processor_display_name: str):\n", + "    \"\"\"Create a Document AI processor.\"\"\"\n", + "    client_options = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\")\n", + "    client = documentai.DocumentProcessorServiceClient(client_options=client_options)\n", + "\n", + "    parent = client.common_location_path(project_id, location)\n", + "\n", + "    return client.create_processor(\n", + "        parent=parent,\n", + "        processor=documentai.Processor(\n", + "            display_name=processor_display_name, type_=\"OCR_PROCESSOR\"\n", + "        ),\n", + "    )\n", + "\n", + "\n", + "def generate_queries(\n", + "    chunk: Document,\n", + ") -> Document:\n", + "    \"\"\"Generate one contextual query for a preprocessed chunk.\"\"\"\n", + "\n", + "    model = GenerativeModel(\"gemini-1.0-pro-001\")\n", + "\n", + "    generation_config = GenerationConfig(\n", + "        max_output_tokens=2048, temperature=0.9, top_p=1\n", + "    )\n", + "\n", + "    safety_settings = {\n", + "        generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", + "        generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", + "        generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", + "        generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", + "    }\n", + "\n", + "    prompt_template = \"\"\"\n", + "    You are an examiner. 
Your task is to create one QUESTION for an exam using only the context below.\n", + "\n", + "    <CONTEXT>\n", + "    {chunk}\n", + "    </CONTEXT>\n", + "\n", + "    QUESTION:\n", + "    \"\"\"\n", + "\n", + "    query = prompt_template.format(chunk=chunk.page_content)\n", + "\n", + "    response = model.generate_content(\n", + "        [query],\n", + "        generation_config=generation_config,\n", + "        safety_settings=safety_settings,\n", + "    ).text\n", + "\n", + "    return Document(\n", + "        page_content=response, metadata={\"page\": chunk.metadata[\"page\"]}\n", + "    )\n", + "\n", + "\n", + "def get_task_by_name(job: aiplatform.PipelineJob, task_name: str):\n", + "    \"\"\"Get a Vertex AI Pipeline job task by its name\"\"\"\n", + "    for task in job.task_details:\n", + "        if task.task_name == task_name:\n", + "            return task\n", + "    raise ValueError(f\"Task {task_name} not found\")\n", + "\n", + "\n", + "def get_metrics(\n", + "    job: aiplatform.PipelineJob, task_name: str = \"text-embedding-evaluator\"\n", + "):\n", + "    \"\"\"Get metrics for the evaluation task\"\"\"\n", + "    evaluation_task = get_task_by_name(job, task_name)\n", + "    metrics = MessageToDict(evaluation_task.outputs[\"metrics\"]._pb)[\"artifacts\"][0][\n", + "        \"metadata\"\n", + "    ]\n", + "    metrics_df = pd.DataFrame([metrics])\n", + "    return metrics_df\n", + "\n", + "\n", + "def get_uploaded_model(\n", + "    job: aiplatform.PipelineJob, task_name: str = \"text-embedding-model-uploader\"\n", + ") -> aiplatform.Model:\n", + "    \"\"\"Get uploaded model from the pipeline job\"\"\"\n", + "    evaluation_task = get_task_by_name(job, task_name)\n", + "    upload_metadata = MessageToDict(evaluation_task.execution._pb)[\"metadata\"]\n", + "    return aiplatform.Model(upload_metadata[\"output:model_resource_name\"])\n", + "\n", + "\n", + "def get_training_output_dir(\n", + "    job: aiplatform.PipelineJob, task_name: str = \"text-embedding-trainer\"\n", + ") -> str:\n", + "    \"\"\"Get training output directory for the pipeline job\"\"\"\n", + "    trainer_task = get_task_by_name(job, task_name)\n", + "    output_artifacts = MessageToDict(trainer_task.outputs[\"training_output\"]._pb)[\n", + "        \"artifacts\"\n", + "    ][0]\n", + "    return output_artifacts[\"uri\"]\n", + "\n", + "\n", + "def get_top_k_scores(\n", + "    query_embedding: pd.DataFrame, corpus_embeddings: pd.DataFrame, k=10\n", + ") -> pd.DataFrame:\n", + "    \"\"\"Get top k similar scores for each query\"\"\"\n", + "    similarity = corpus_embeddings.dot(query_embedding.T)\n", + "    topk_index = pd.DataFrame({c: v.nlargest(n=k).index for c, v in similarity.items()})\n", + "    return topk_index\n", + "\n", + "\n", + "def get_top_k_documents(\n", + "    query_text: list[str],\n", + "    corpus_text: pd.DataFrame,\n", + "    corpus_embeddings: pd.DataFrame,\n", + "    task_type: str = \"RETRIEVAL_DOCUMENT\",\n", + "    title: str = \"\",\n", + "    k: int = 10,\n", + ") -> pd.DataFrame:\n", + "    \"\"\"Get top k similar documents for each query\"\"\"\n", + "    instances = []\n", + "    for text in query_text:\n", + "        instances.append(\n", + "            {\n", + "                \"content\": text,\n", + "                \"task_type\": task_type,\n", + "                \"title\": title,\n", + "            }\n", + "        )\n", + "\n", + "    # NOTE: uses the global `endpoint` created in the deployment section below.\n", + "    response = endpoint.predict(instances=instances)\n", + "    query_embedding = np.asarray(response.predictions)\n", + "    topk = get_top_k_scores(query_embedding, corpus_embeddings, k)\n", + "    return pd.DataFrame.from_dict(\n", + "        {\n", + "            query_text[c]: corpus_text.loc[v.values].values.ravel()\n", + "            for c, v in topk.items()\n", + "        },\n", + "        orient=\"columns\",\n", +
" )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "init_aip:mbsdk,all" + }, + "source": [ + "### Initialize Vertex AI SDK for Python\n", + "\n", + "Initialize the Vertex AI SDK for Python for your project." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "u4p_LXiGhzVc" + }, + "outputs": [], + "source": [ + "vertexai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FpDyETD1iKi-" + }, + "source": [ + "## Tuning text embeddings\n", + "\n", + "To tune the model, you should start by preparing your model tuning dataset and then upload it to a Cloud Storage bucket. Text embedding models support supervised tuning, which uses labeled examples to demonstrate the desired output from the model during inference.\n", + "\n", + "Next, you create a model tuning job and deploy the tuned model to a Vertex AI endpoint.\n", + "\n", + "Finally, you retrive similar items using the tuned embedding model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WJx_WMx2wioP" + }, + "source": [ + "### Prepare your model tuning dataset using Document AI, Gemini API, and LangChain on Vertex AI\n", + "\n", + "The tuning dataset consists of the following files:\n", + "\n", + "- `corpus` file is a JSONL file where each line has the fields `_id`, `title` (optional), and `text` of each relevant chuck.\n", + "\n", + "- `query` file is a JSONL file where each line has the fields `_id`, and `text` of each relevant query.\n", + "\n", + "- `labels` files are TSV files (train, test and val) with three columns: `query-id`,`corpus-id`, and `score`. `query-id` represents the query id in the query file, `corpus-id` represents the corpus id in the corpus file, and `score` indicates relevance with higher scores meaning greater relevance. A default score of 1 is used if none is specified. The `train` file is required while `test` and `val` are optional.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0eDQ8BSFiOH9" + }, + "source": [ + "#### Create a Document AI preprocessor\n", + "\n", + "Create the OCR processor to identify and extract text in PDF document." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZIJXJt6ciNdj" + }, + "outputs": [], + "source": [ + "processor = create_processor(PROJECT_ID, LOCATION, PROCESSOR_ID)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P8podNK-rxm6" + }, + "source": [ + "#### Parse the document using DocAI Parser in LangChain" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jCUWtEq3oLqs" + }, + "source": [ + "Initiate a LangChain parser." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "D_PU-I_-teWQ" + }, + "outputs": [], + "source": [ + "blob = Blob(\n", + "    path=f\"{RAW_DATA_URI}/goog-10-k-2023.pdf\",\n", + ")\n", + "\n", + "parser = DocAIParser(\n", + "    processor_name=processor.name,\n", + "    location=LOCATION,\n", + "    gcs_output_path=PROCESSED_DATA_OCR_URI,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hz254MGnoPgR" + }, + "source": [ + "Run a Google Document AI PDF Batch Processing job.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sjm3LSh7oKJk" + }, + "outputs": [], + "source": [ + "operations = parser.docai_parse([blob])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5_DpZodtvATw" + }, + "outputs": [], + "source": [ + "while True:\n", + "    if parser.is_running(operations):\n", + "        print(\"Waiting for DocAI to finish...\")\n", + "        time.sleep(10)\n", + "    else:\n", + "        print(\"DocAI successfully processed!\")\n", + "        break" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6MjwFF6loZfF" + }, + "source": [ + "Get the resulting LangChain Documents containing the extracted text and metadata." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oBh2AKdIvX8g" + }, + "outputs": [], + "source": [ + "results = parser.get_results(operations)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZGfde504wIcF" + }, + "outputs": [], + "source": [ + "docs = list(parser.parse_from_results(results))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rFsBLHpwWRCy" + }, + "outputs": [], + "source": [ + "docs[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MXGdZYB_x9R6" + }, + "source": [ + "#### Create document chunks using `RecursiveCharacterTextSplitter`\n", + "\n", + "You can create chunks using `RecursiveCharacterTextSplitter` in LangChain. The splitter divides text into smaller chunks of a chosen size based on a set of specified characters." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y_JIYFd7qePe" + }, + "source": [ + "Initialize the splitter." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TxBSDPGPyA4l" + }, + "outputs": [], + "source": [ + "text_splitter = RecursiveCharacterTextSplitter(\n", + "    chunk_size=2500,\n", + "    chunk_overlap=250,\n", + "    length_function=len,\n", + "    is_separator_regex=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Le_WIyGCqg7O" + }, + "source": [ + "Create text chunks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "M8BxdLjUycm4" + }, + "outputs": [], + "source": [ + "document_content = [doc.page_content for doc in docs]\n", + "document_metadata = [{\"page\": idx} for idx, doc in enumerate(docs, 1)]\n", + "chunks = text_splitter.create_documents(document_content, metadatas=document_metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3wm91wQnB-Xd" + }, + "source": [ + "#### Create queries\n", + "\n", + "You can utilize Gemini on Vertex AI to produce hypothetical questions that are relevant to a given piece of context (chunk). 
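For example (illustrative only), a chunk about quarterly revenues might yield a generated query like \"What were Alphabet's consolidated revenues for the quarter?\". 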
\n", + "This approach enables the generation of synthetic positive pairs of (query, relevant documents) in a scalable manner.\n", + "\n", + "Running the query generation would require **some minutes** depending on the number of chunks you have. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LsnSHzZBCCij" + }, + "outputs": [], + "source": [ + "generated_queries = [generate_queries(chuck=chuck, num_questions=3) for chuck in chunks]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x2XYblPb8Uvy" + }, + "source": [ + "#### Create the tuning training and test dataset files." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b78Kh1K0vP3s" + }, + "source": [ + "Create the `corpus` file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Zbl6sB3-8YdP" + }, + "outputs": [], + "source": [ + "corpus_df = pd.DataFrame(\n", + " {\n", + " \"_id\": [\"text_\" + str(idx) for idx in range(len(generated_queries))],\n", + " \"text\": [chuck.page_content for chuck in chunks],\n", + " \"doc_id\": [chuck.metadata[\"page\"] for chuck in chunks],\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dDCRCw-buq8U" + }, + "outputs": [], + "source": [ + "corpus_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_tu0Lx7FvVJe" + }, + "source": [ + "Create the `query` file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Fu1fFhdkrCoq" + }, + "outputs": [], + "source": [ + "query_df = pd.DataFrame(\n", + " {\n", + " \"_id\": [\"query_\" + str(idx) for idx in range(len(generated_queries))],\n", + " \"text\": [query.page_content for query in generated_queries],\n", + " \"doc_id\": [query.metadata[\"page\"] for query in generated_queries],\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Uo4yaw4Xu8wJ" + }, + "outputs": [], + "source": [ + "query_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Iv9dcRqFvYN-" + }, + "source": [ + "Create the `score` file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NMDpCp06wAcX" + }, + "outputs": [], + "source": [ + "score_df = corpus_df.merge(query_df, on=\"doc_id\")\n", + "score_df = score_df.rename(columns={\"_id_x\": \"corpus-id\", \"_id_y\": \"query-id\"})\n", + "score_df = score_df.drop(columns=[\"doc_id\", \"text_x\", \"text_y\"])\n", + "score_df[\"score\"] = 1\n", + "train_df = score_df.sample(frac=0.8)\n", + "test_df = score_df.drop(train_df.index)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Tc-JAxKwxoar" + }, + "outputs": [], + "source": [ + "train_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YGgY3L-CyhAQ" + }, + "source": [ + "#### Save the tuning dataset\n", + "\n", + "Upload the model tuning datasets to a Cloud Storage bucket." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HHiC-Qg8yorH" + }, + "outputs": [], + "source": [ + "corpus_df.to_json(\n", + "    f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/corpus.jsonl\",\n", + "    orient=\"records\",\n", + "    lines=True,\n", + ")\n", + "query_df.to_json(\n", + "    f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/query.jsonl\", orient=\"records\", lines=True\n", + ")\n", + "train_df.to_csv(\n", + "    f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/train.tsv\",\n", + "    sep=\"\\t\",\n", + "    header=True,\n", + "    index=False,\n", + ")\n", + "test_df.to_csv(\n", + "    f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/test.tsv\",\n", + "    sep=\"\\t\",\n", + "    header=True,\n", + "    index=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6p9NR9I31fo6" + }, + "source": [ + "### Run an embedding tuning job on Vertex AI Pipelines\n", + "\n", + "Next, set the tuning pipeline parameters, including the Cloud Storage paths of the train and test datasets, the training batch size, and the number of steps to perform model tuning.\n", + "\n", + "For more information about pipeline parameters, see the [official tuning documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings#create-embedding-tuning-job)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YePRoZg31iSJ" + }, + "outputs": [], + "source": [ + "ITERATIONS = len(train_df) // BATCH_SIZE\n", + "\n", + "params = {\n", + "    \"batch_size\": BATCH_SIZE,\n", + "    \"iterations\": ITERATIONS,\n", + "    \"accelerator_type\": TRAINING_ACCELERATOR_TYPE,\n", + "    \"machine_type\": TRAINING_MACHINE_TYPE,\n", + "    \"base_model_version_id\": \"textembedding-gecko@003\",\n", + "    \"queries_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/query.jsonl\",\n", + "    \"corpus_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/corpus.jsonl\",\n", + "    \"train_label_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/train.tsv\",\n", + "    \"test_label_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/test.tsv\",\n", + "    \"project\": PROJECT_ID,\n", + "    \"location\": REGION,\n", + "}\n", + "\n", + "template_uri = \"https://us-kfp.pkg.dev/ml-pipeline/llm-text-embedding/tune-text-embedding-model/v1.1.1\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0rN_XWFjxWZn" + }, + "source": [ + "Run the model tuning pipeline job." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "m7JEoMT-1mAC" + }, + "outputs": [], + "source": [ + "job = aiplatform.PipelineJob(\n", + "    display_name=\"tune-text-embedding\",\n", + "    parameter_values=params,\n", + "    template_path=template_uri,\n", + "    pipeline_root=PIPELINE_ROOT,\n", + "    project=PROJECT_ID,\n", + "    location=REGION,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ExF5xlj0uBjK" + }, + "outputs": [], + "source": [ + "job.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UJCufZmS7SVM" + }, + "source": [ + "### Evaluate the tuned model\n", + "\n", + "Evaluate the tuned embedding model. The Vertex AI Pipeline automatically produces NDCG (Normalized Discounted Cumulative Gain) for both training and test datasets. 
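For reference, the standard formulation is $\\mathrm{NDCG@}k = \\mathrm{DCG@}k / \\mathrm{IDCG@}k$ with $\\mathrm{DCG@}k = \\sum_{i=1}^{k} rel_i / \\log_2(i + 1)$, where $\\mathrm{IDCG@}k$ is the DCG of the ideal ranking. 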
NDCG measures ranking effectiveness, taking into account the position of relevant items in the ranked list.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eldc535Y7xmD" + }, + "outputs": [], + "source": [ + "metric_df = get_metrics(job)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W-s_AEuoaKSd" + }, + "outputs": [], + "source": [ + "metric_df.to_dict()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qMBUmPFgBRWi" + }, + "outputs": [], + "source": [ + "metric_df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZP6nCnJ2_6wU" + }, + "source": [ + "### Deploy the tuned embedding model on Vertex AI Prediction\n", + "\n", + "To deploy the tuned embedding model, you need to create a Vertex AI endpoint.\n", + "\n", + "Then you deploy the tuned embedding model to the endpoint." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d5LtEGEbAHPd" + }, + "source": [ + "#### Create the endpoint" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Q2KRaRzHAF8f" + }, + "outputs": [], + "source": [ + "endpoint = aiplatform.Endpoint.create(\n", + "    display_name=\"tuned_custom_embedding_endpoint\",\n", + "    description=\"Endpoint for tuned model embeddings.\",\n", + "    project=PROJECT_ID,\n", + "    location=REGION,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xGLjbFY-AYi3" + }, + "source": [ + "#### Deploy the tuned model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mdfedMEj1ZNy" + }, + "source": [ + "Get the tuned model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ndE5WGVkBQA2" + }, + "outputs": [], + "source": [ + "model = get_uploaded_model(job)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CqkAAzvV1ewQ" + }, + "source": [ + "Deploy the tuned model to the endpoint." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nOZyyPALDcSP" + }, + "outputs": [], + "source": [ + "endpoint.deploy(\n", + "    model,\n", + "    accelerator_type=PREDICTION_ACCELERATOR_TYPE,\n", + "    accelerator_count=PREDICTION_ACCELERATOR_COUNT,\n", + "    machine_type=PREDICTION_MACHINE_TYPE,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8VVac8ah2DzJ" + }, + "source": [ + "### Retrieve similar items using the tuned embedding model\n", + "\n", + "To retrieve similar items using the tuned embedding model, you need both the corpus text and the generated embeddings. Given a query, you calculate its embedding with the tuned model and apply a similarity function to find the most relevant documents with respect to the query." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L2ohcM4a6mOx" + }, + "source": [ + "Read the corpus text and the generated embeddings."
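, + "\n", + "(Reminder: `get_top_k_scores` ranks documents by the dot product between query and document embeddings; for unit-norm vectors this equals cosine similarity, e.g. `np.dot([0.6, 0.8], [0.8, 0.6])` gives 0.96.)"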
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "C5O2WP_i1z3I" + }, + "outputs": [], + "source": [ + "training_output_dir = get_training_output_dir(job)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TJcTLpV263tK" + }, + "outputs": [], + "source": [ + "corpus_text = pd.read_json(\n", + "    epath.Path(training_output_dir) / \"corpus_text.jsonl\", lines=True\n", + ")\n", + "corpus_text.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3g-8ECt9VU_P" + }, + "outputs": [], + "source": [ + "# NOTE: `get_df_from_jsonl` is not provided by the SDK or the pipeline, so a\n", + "# minimal sketch is defined here. It assumes each JSONL record either is a\n", + "# flat mapping of numeric values or carries an `embedding` list field;\n", + "# adjust the parsing if your output schema differs.\n", + "def get_df_from_jsonl(path: epath.Path) -> pd.DataFrame:\n", + "    \"\"\"Load JSONL embedding records into a (documents x dimensions) float DataFrame.\"\"\"\n", + "    with path.open(\"r\") as f:\n", + "        records = pd.read_json(f, lines=True)\n", + "    if \"embedding\" in records.columns:\n", + "        records = pd.DataFrame(records[\"embedding\"].tolist())\n", + "    return records.astype(float)\n", + "\n", + "\n", + "corpus_embeddings = get_df_from_jsonl(\n", + "    epath.Path(training_output_dir) / \"corpus_custom.jsonl\"\n", + ")\n", + "\n", + "corpus_embeddings.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Dwbak5t670Y" + }, + "source": [ + "Find the most relevant documents for each query." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IV9d2afP11em" + }, + "outputs": [], + "source": [ + "queries = [\n", + "    \"\"\"What about the revenues?\"\"\",\n", + "    \"\"\"Who is Alphabet?\"\"\",\n", + "    \"\"\"What about the costs?\"\"\",\n", + "]\n", + "output = get_top_k_documents(queries, corpus_text, corpus_embeddings, k=10)\n", + "\n", + "with pd.option_context(\"display.max_colwidth\", 200):\n", + "    display(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TpV-iwP9qw9c" + }, + "source": [ + "## Cleaning up\n", + "\n", + "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", + "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", + "\n", + "Otherwise, you can delete the individual resources you created in this tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sx_vKniMq9ZX" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "delete_endpoint = False\n", + "delete_model = False\n", + "delete_job = False\n", + "delete_bucket = False\n", + "\n", + "# Delete endpoint resource (force=True undeploys any deployed models first)\n", + "if delete_endpoint or os.getenv(\"IS_TESTING\"):\n", + "    endpoint.delete(force=True)\n", + "\n", + "# Delete model resource\n", + "if delete_model or os.getenv(\"IS_TESTING\"):\n", + "    model.delete()\n", + "\n", + "# Delete pipeline job\n", + "if delete_job or os.getenv(\"IS_TESTING\"):\n", + "    job.delete()\n", + "\n", + "# Delete Cloud Storage objects that were created\n", + "if delete_bucket or os.getenv(\"IS_TESTING\"):\n", + "    ! gsutil -m rm -r $BUCKET_URI" + ] + } + ], + "metadata": { + "colab": { + "name": "get_started_with_embedding_tuning.ipynb", + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}