diff --git a/embeddings/get_started_with_embedding_tuning.ipynb b/embeddings/get_started_with_embedding_tuning.ipynb index bd1474596b..5857226220 100644 --- a/embeddings/get_started_with_embedding_tuning.ipynb +++ b/embeddings/get_started_with_embedding_tuning.ipynb @@ -1,1438 +1,1457 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ur8xi4C7S06n" - }, - "outputs": [], - "source": [ - "# Copyright 2024 Google LLC\n", - "#\n", - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "JAPoU8Sm5E6e" - }, - "source": [ - "# Get started with embeddings tuning on Vertex AI\n", - "\n", - "\n", - " \n", - " \n", - " \n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3Vzj1qV_dPeO" - }, - "source": [ - "| | |\n", - "|-|-|\n", - "|Author(s) | [Ivan Nardini](https://github.com/inardini)|" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tvgnzT1CKxrO" - }, - "source": [ - "## Overview\n", - "\n", - "This notebook guides you through the process of tuning the text embedding model on Vertex AI. Tuning an embeddings model for specific domains/tasks enhances understanding and improves retrival performance.\n", - "\n", - "Learn more about [Tune text embeddings](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "d975e698c9a4" - }, - "source": [ - "### Objective\n", - "\n", - "In this tutorial, you learn how to tune the text embedding model, `textembedding-gecko`.\n", - "\n", - "This tutorial uses the following Google Cloud ML services and resources:\n", - "\n", - "- Document AI\n", - "- Vertex AI\n", - "- Google Cloud Storage\n", - "\n", - "The steps include:\n", - "\n", - "- Prepare your model tuning dataset using Document AI, Gemini API, and LangChain on Vertex AI. \n", - "- Run an embedding tuning job on Vertex AI Pipelines.\n", - "- Evaluate the embedding tuned model.\n", - "- Deploy the embedding tuned model on Vertex AI Prediction.\n", - "- Retrive similar items using the tuned embedding model." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "08d289fa873f" - }, - "source": [ - "### Dataset\n", - "\n", - "During the tutorial, you will create a set of synthetic query-chunk pairs using the [2023 Q3 Alphabet Earnings Release](https://www.abc.xyz/assets/95/eb/9cef90184e09bac553796896c633/2023q4-alphabet-earnings-release.pdf)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aed92deeb4a0" - }, - "source": [ - "### Costs\n", - "\n", - "This tutorial uses billable components of Google Cloud:\n", - "\n", - "* Document AI\n", - "* Vertex AI\n", - "* Cloud Storage\n", - "\n", - "Learn about [Document AI pricing](https://cloud.google.com/document-ai/pricing), [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),\n", - "and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),\n", - "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", - "to generate a cost estimate based on your projected usage." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "i7EUnXsZhAGF" - }, - "source": [ - "## Installation\n", - "\n", - "Install the following packages required to execute this notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2b4ef9b72d43" - }, - "outputs": [], - "source": [ - "! pip3 install --upgrade --user google-cloud-aiplatform==1.48.0 google-cloud-documentai==2.26.0 google-cloud-documentai-toolbox==0.13.3a0\n", - "! pip3 install --upgrade --user langchain==0.1.16 langchain-core==0.1.44 langchain-text-splitters==0.0.1 langchain-google-community==1.0.2 gcsfs==2024.3.1 etils==1.7.0" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "58707a750154" - }, - "source": [ - "### Colab only: Uncomment the following cell to restart the kernel." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "f200f10a1da3" - }, - "outputs": [], - "source": [ - "# import IPython\n", - "\n", - "# app = IPython.Application.instance()\n", - "# app.kernel.do_shutdown(True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BF1j6f9HApxa" - }, - "source": [ - "## Before you begin\n", - "\n", - "### Set up your Google Cloud project\n", - "\n", - "**The following steps are required, regardless of your notebook environment.**\n", - "\n", - "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", - "\n", - "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", - "\n", - "3. [Enable APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,documentai.googleapis.com).\n", - "\n", - "4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WReHDGG5g0XY" - }, - "source": [ - "#### Set your project ID\n", - "\n", - "**If you don't know your project ID**, try the following:\n", - "* Run `gcloud config list`.\n", - "* Run `gcloud projects list`.\n", - "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "oM1iC_MfAts1" - }, - "outputs": [], - "source": [ - "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", - "\n", - "# Set the project id\n", - "! gcloud config set project {PROJECT_ID}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "region" - }, - "source": [ - "#### Region\n", - "\n", - "You can also change the `REGION` variable used by Vertex AI. 
Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ChcYWoVdhzVb" - }, - "outputs": [], - "source": [ - "REGION = \"us-central1\" # @param {type: \"string\"}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "timestamp" - }, - "source": [ - "#### Timestamp\n", - "\n", - "If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append the timestamp onto the name of resources you create in this tutorial." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "W6Le1schAziq" - }, - "outputs": [], - "source": [ - "from datetime import datetime\n", - "\n", - "TIMESTAMP = datetime.now().strftime(\"%Y%m%d%H%M%S\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sBCra4QMA2wR" - }, - "source": [ - "### Authenticate your Google Cloud account\n", - "\n", - "Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "74ccc9e52986" - }, - "source": [ - "**1. Vertex AI Workbench and Colab Enterprise**\n", - "* Do nothing as you are already authenticated." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "de775a3773ba" - }, - "source": [ - "**2. Local JupyterLab instance, uncomment and run:**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "254614fa0c46" - }, - "outputs": [], - "source": [ - "# ! gcloud auth login" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "f6b2ccc891ed" - }, - "source": [ - "**3. 
Service account or other**\n", - "* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zgPO1eR3CYjk" - }, - "source": [ - "### Create a Cloud Storage bucket\n", - "\n", - "Create a storage bucket to store intermediate artifacts such as datasets." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "MzGDU7TWdts_" - }, - "outputs": [], - "source": [ - "BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-EcIXiGsCePi" - }, - "source": [ - "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "NIq7R4HZCfIc" - }, - "outputs": [], - "source": [ - "! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eckavkeph5zB" - }, - "source": [ - "### Set up tutorial folder\n", - "\n", - "Set up a folder for tutorial content including data, metadata and more." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Kr90HWKmh8H0" - }, - "outputs": [], - "source": [ - "from pathlib import Path as path\n", - "\n", - "root_path = path.cwd()\n", - "tutorial_path = root_path / \"tutorial\"\n", - "data_path = tutorial_path / \"data\"\n", - "\n", - "data_path.mkdir(parents=True, exist_ok=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "960505627ddf" - }, - "source": [ - "### Import libraries\n", - "\n", - "Import libraries to run the tutorial." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "PyQmSRbKA8r-" - }, - "outputs": [], - "source": [ - "import random\n", - "import string\n", - "import time\n", - "\n", - "import langchain_core\n", - "import numpy as np\n", - "import pandas as pd\n", - "import vertexai\n", - "import vertexai.preview.generative_models as generative_models\n", - "from etils import epath\n", - "from google.api_core.client_options import ClientOptions\n", - "from google.cloud import aiplatform, documentai\n", - "from google.protobuf.json_format import MessageToDict\n", - "from langchain_community.document_loaders.blob_loaders import Blob\n", - "from langchain_community.document_loaders.parsers import DocAIParser\n", - "from langchain_core.documents.base import Document\n", - "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", - "from vertexai.generative_models import GenerationConfig, GenerativeModel" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TvQ81PjSiCuZ" - }, - "source": [ - "### Set Variables\n", - "\n", - "Set variables to run the tutorial." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ajbQb0eXu3xh" - }, - "outputs": [], - "source": [ - "ID = \"\".join(random.choices(string.ascii_lowercase + string.digits, k=4))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ct35zelZiELu" - }, - "outputs": [], - "source": [ - "# Dataset\n", - "PROCESSOR_ID = f\"preprocess-docs-llm-{ID}\"\n", - "LOCATION = REGION.split(\"-\")[0]\n", - "RAW_DATA_URI = \"gs://github-repo/embeddings/get_started_with_embedding_tuning\"\n", - "PROCESSED_DATA_URI = f\"{BUCKET_URI}/data/processed\"\n", - "PREPARED_DATA_URI = f\"{BUCKET_URI}/data/prepared\"\n", - "PROCESSED_DATA_OCR_URI = f\"{BUCKET_URI}/data/processed/ocr\"\n", - "PROCESSED_DATA_TUNING_URI = f\"{BUCKET_URI}/data/processed/tuning\"\n", - "\n", - "# Tuning\n", - "PIPELINE_ROOT = f\"{BUCKET_URI}/pipelines\"\n", - "BATCH_SIZE = 32 # @param {type:\"integer\"}\n", - "TRAINING_ACCELERATOR_TYPE = \"NVIDIA_TESLA_T4\" # @param {type:\"string\"}\n", - "TRAINING_MACHINE_TYPE = \"n1-standard-16\" # @param {type:\"string\"}\n", - "\n", - "# Serving\n", - "PREDICTION_ACCELERATOR_TYPE = \"NVIDIA_TESLA_A100\" # @param {type:\"string\"}\n", - "PREDICTION_ACCELERATOR_COUNT = 1 # @param {type:\"integer\"}\n", - "PREDICTION_MACHINE_TYPE = \"a2-highgpu-1g\" # @param {type:\"string\"}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "F2mkgxcciGiZ" - }, - "source": [ - "### Helpers" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "40tjc8jmiH4B" - }, - "outputs": [], - "source": [ - "def create_processor(project_id: str, location: str, processor_display_name: str):\n", - " \"\"\"Create a Document AI processor.\"\"\"\n", - " client_options = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\")\n", - " client = documentai.DocumentProcessorServiceClient(client_options=client_options)\n", - "\n", - " parent = client.common_location_path(project_id, 
location)\n", - "\n", - " return client.create_processor(\n", - " parent=parent,\n", - " processor=documentai.Processor(\n", - " display_name=processor_display_name, type_=\"OCR_PROCESSOR\"\n", - " ),\n", - " )\n", - "\n", - "\n", - "def generate_queries(\n", - " chuck: str,\n", - " num_questions: int = 3,\n", - ") -> langchain_core.documents.base.Document:\n", - " \"\"\"A function to generate contextual queries based on preprocessed chuck\"\"\"\n", - "\n", - " model = GenerativeModel(\"gemini-1.0-pro-001\")\n", - "\n", - " generation_config = GenerationConfig(\n", - " max_output_tokens=2048, temperature=0.9, top_p=1\n", - " )\n", - "\n", - " safety_settings = {\n", - " generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", - " generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", - " generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", - " generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", - " }\n", - "\n", - " prompt_template = \"\"\"\n", - " You are an examinator. 
Your task is to create one QUESTION for an exam using only.\n", - "\n", - " \n", - " {chuck}\n", - " \n", - "\n", - " QUESTION:\n", - " \"\"\"\n", - "\n", - " query = prompt_template.format(\n", - " chuck=chuck.page_content, num_questions=num_questions\n", - " )\n", - "\n", - " for idx in range(num_questions):\n", - " response = model.generate_content(\n", - " [query],\n", - " generation_config=generation_config,\n", - " safety_settings=safety_settings,\n", - " ).text\n", - "\n", - " return Document(\n", - " page_content=response, metadata={\"page\": chuck.metadata[\"page\"]}\n", - " )\n", - "\n", - "\n", - "def get_task_by_name(job: aiplatform.PipelineJob, task_name: str):\n", - " \"\"\"Get a Vertex AI Pipeline job task by its name\"\"\"\n", - " for task in job.task_details:\n", - " if task.task_name == task_name:\n", - " return task\n", - " raise ValueError(f\"Task {task_name} not found\")\n", - "\n", - "\n", - "def get_metrics(\n", - " job: aiplatform.PipelineJob, task_name: str = \"text-embedding-evaluator\"\n", - "):\n", - " \"\"\"Get metrics for the evaluation task\"\"\"\n", - " evaluation_task = get_task_by_name(job, task_name)\n", - " metrics = MessageToDict(evaluation_task.outputs[\"metrics\"]._pb)[\"artifacts\"][0][\n", - " \"metadata\"\n", - " ]\n", - " metrics_df = pd.DataFrame([metrics])\n", - " return metrics_df\n", - "\n", - "\n", - "def get_uploaded_model(\n", - " job: aiplatform.PipelineJob, task_name: str = \"text-embedding-model-uploader\"\n", - ") -> aiplatform.Model:\n", - " \"\"\"Get uploaded model from the pipeline job\"\"\"\n", - " evaluation_task = get_task_by_name(job, task_name)\n", - " upload_metadata = MessageToDict(evaluation_task.execution._pb)[\"metadata\"]\n", - " return aiplatform.Model(upload_metadata[\"output:model_resource_name\"])\n", - "\n", - "\n", - "def get_training_output_dir(\n", - " job: aiplatform.PipelineJob, task_name: str = \"text-embedding-trainer\"\n", - ") -> str:\n", - " \"\"\"Get training output directory for 
the pipeline job\"\"\"\n", - " trainer_task = get_task_by_name(job, task_name)\n", - " output_artifacts = MessageToDict(trainer_task.outputs[\"training_output\"]._pb)[\n", - " \"artifacts\"\n", - " ][0]\n", - " return output_artifacts[\"uri\"]\n", - "\n", - "\n", - "def get_top_k_scores(\n", - " query_embedding: pd.DataFrame, corpus_embeddings: pd.DataFrame, k=10\n", - ") -> pd.DataFrame:\n", - " \"\"\"Get top k similar scores for each query\"\"\"\n", - " similarity = corpus_embeddings.dot(query_embedding.T)\n", - " topk_index = pd.DataFrame({c: v.nlargest(n=k).index for c, v in similarity.items()})\n", - " return topk_index\n", - "\n", - "\n", - "def get_top_k_documents(\n", - " query_text: list[str],\n", - " corpus_text: pd.DataFrame,\n", - " corpus_embeddings: pd.DataFrame,\n", - " task_type: str = \"RETRIEVAL_DOCUMENT\",\n", - " title: str = \"\",\n", - " k: int = 10,\n", - ") -> pd.DataFrame:\n", - " \"\"\"Get top k similar documents for each query\"\"\"\n", - " instances = []\n", - " for text in query_text:\n", - " instances.append(\n", - " {\n", - " \"content\": text,\n", - " \"task_type\": task_type,\n", - " \"title\": title,\n", - " }\n", - " )\n", - "\n", - " response = endpoint.predict(instances=instances)\n", - " query_embedding = np.asarray(response.predictions)\n", - " topk = get_top_k_scores(query_embedding, corpus_embeddings, k)\n", - " return pd.DataFrame.from_dict(\n", - " {\n", - " query_text[c]: corpus_text.loc[v.values].values.ravel()\n", - " for c, v in topk.items()\n", - " },\n", - " orient=\"columns\",\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "init_aip:mbsdk,all" - }, - "source": [ - "### Initialize Vertex AI SDK for Python\n", - "\n", - "Initialize the Vertex AI SDK for Python for your project." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "u4p_LXiGhzVc" - }, - "outputs": [], - "source": [ - "vertexai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FpDyETD1iKi-" - }, - "source": [ - "## Tuning text embeddings\n", - "\n", - "To tune the model, you start by preparing your model tuning dataset and uploading it to a Cloud Storage bucket. Text embedding models support supervised tuning, which uses labeled examples to demonstrate the desired output from the model during inference.\n", - "\n", - "Next, you create a model tuning job and deploy the tuned model to a Vertex AI endpoint.\n", - "\n", - "Finally, you retrieve similar items using the tuned embedding model." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WJx_WMx2wioP" - }, - "source": [ - "### Prepare your model tuning dataset using Document AI, Gemini API, and LangChain on Vertex AI\n", - "\n", - "The tuning dataset consists of the following files:\n", - "\n", - "- `corpus` file is a JSONL file where each line has the fields `_id`, `title` (optional), and `text` of each relevant chunk.\n", - "\n", - "- `query` file is a JSONL file where each line has the fields `_id` and `text` of each relevant query.\n", - "\n", - "- `labels` files are TSV files (train, test and val) with three columns: `query-id`, `corpus-id`, and `score`. `query-id` represents the query id in the query file, `corpus-id` represents the corpus id in the corpus file, and `score` indicates relevance, with higher scores meaning greater relevance. A default score of 1 is used if none is specified. The `train` file is required while `test` and `val` are optional.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0eDQ8BSFiOH9" - }, - "source": [ - "#### Create a Document AI preprocessor\n", - "\n", - "Create the OCR processor to identify and extract text from a PDF document."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZIJXJt6ciNdj" - }, - "outputs": [], - "source": [ - "processor = create_processor(PROJECT_ID, LOCATION, PROCESSOR_ID)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "P8podNK-rxm6" - }, - "source": [ - "#### Parse the document using DocAI Parser in LangChain" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jCUWtEq3oLqs" - }, - "source": [ - "Initiate a LangChain parser." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "D_PU-I_-teWQ" - }, - "outputs": [], - "source": [ - "blob = Blob(\n", - " path=f\"{RAW_DATA_URI}/goog-10-k-2023.pdf\",\n", - ")\n", - "\n", - "parser = DocAIParser(\n", - " processor_name=processor.name,\n", - " location=LOCATION,\n", - " gcs_output_path=PROCESSED_DATA_OCR_URI,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hz254MGnoPgR" - }, - "source": [ - "Run a Google Document AI PDF Batch Processing job.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "sjm3LSh7oKJk" - }, - "outputs": [], - "source": [ - "operations = parser.docai_parse([blob])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5_DpZodtvATw" - }, - "outputs": [], - "source": [ - "while True:\n", - " if parser.is_running(operations):\n", - " print(\"Waiting for DocAI to finish...\")\n", - " time.sleep(10)\n", - " else:\n", - " print(\"DocAI successfully processed!\")\n", - " break" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6MjwFF6loZfF" - }, - "source": [ - "Get the resulting LangChain Documents containing the extracted text and metadata." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "oBh2AKdIvX8g" - }, - "outputs": [], - "source": [ - "results = parser.get_results(operations)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZGfde504wIcF" - }, - "outputs": [], - "source": [ - "docs = list(parser.parse_from_results(results))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "rFsBLHpwWRCy" - }, - "outputs": [], - "source": [ - "docs[0]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "MXGdZYB_x9R6" - }, - "source": [ - "#### Create document chunks using `RecursiveCharacterTextSplitter`\n", - "\n", - "You can create chunks using `RecursiveCharacterTextSplitter` in LangChain. The splitter divides text into smaller chunks of a chosen size based on a set of specified characters." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "y_JIYFd7qePe" - }, - "source": [ - "Initialize the splitter." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "TxBSDPGPyA4l" - }, - "outputs": [], - "source": [ - "text_splitter = RecursiveCharacterTextSplitter(\n", - " chunk_size=2500,\n", - " chunk_overlap=250,\n", - " length_function=len,\n", - " is_separator_regex=False,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Le_WIyGCqg7O" - }, - "source": [ - "Create text chunks."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "M8BxdLjUycm4" - }, - "outputs": [], - "source": [ - "document_content = [doc.page_content for doc in docs]\n", - "document_metadata = [{\"page\": idx} for idx, doc in enumerate(docs, 1)]\n", - "chunks = text_splitter.create_documents(document_content, metadatas=document_metadata)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3wm91wQnB-Xd" - }, - "source": [ - "#### Create queries\n", - "\n", - "You can utilize Gemini on Vertex AI to produce hypothetical questions that are relevant to a given piece of context (chunk). \n", - "This approach enables the generation of synthetic positive pairs of (query, relevant documents) in a scalable manner.\n", - "\n", - "Running the query generation would require **some minutes** depending on the number of chunks you have. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "LsnSHzZBCCij" - }, - "outputs": [], - "source": [ - "generated_queries = [generate_queries(chuck=chuck, num_questions=3) for chuck in chunks]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "x2XYblPb8Uvy" - }, - "source": [ - "#### Create the tuning training and test dataset files." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "b78Kh1K0vP3s" - }, - "source": [ - "Create the `corpus` file." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Zbl6sB3-8YdP" - }, - "outputs": [], - "source": [ - "corpus_df = pd.DataFrame(\n", - " {\n", - " \"_id\": [\"text_\" + str(idx) for idx in range(len(generated_queries))],\n", - " \"text\": [chuck.page_content for chuck in chunks],\n", - " \"doc_id\": [chuck.metadata[\"page\"] for chuck in chunks],\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "dDCRCw-buq8U" - }, - "outputs": [], - "source": [ - "corpus_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_tu0Lx7FvVJe" - }, - "source": [ - "Create the `query` file." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Fu1fFhdkrCoq" - }, - "outputs": [], - "source": [ - "query_df = pd.DataFrame(\n", - " {\n", - " \"_id\": [\"query_\" + str(idx) for idx in range(len(generated_queries))],\n", - " \"text\": [query.page_content for query in generated_queries],\n", - " \"doc_id\": [query.metadata[\"page\"] for query in generated_queries],\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Uo4yaw4Xu8wJ" - }, - "outputs": [], - "source": [ - "query_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Iv9dcRqFvYN-" - }, - "source": [ - "Create the `score` file." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "NMDpCp06wAcX" - }, - "outputs": [], - "source": [ - "score_df = corpus_df.merge(query_df, on=\"doc_id\")\n", - "score_df = score_df.rename(columns={\"_id_x\": \"corpus-id\", \"_id_y\": \"query-id\"})\n", - "score_df = score_df.drop(columns=[\"doc_id\", \"text_x\", \"text_y\"])\n", - "score_df[\"score\"] = 1\n", - "train_df = score_df.sample(frac=0.8)\n", - "test_df = score_df.drop(train_df.index)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Tc-JAxKwxoar" - }, - "outputs": [], - "source": [ - "train_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "YGgY3L-CyhAQ" - }, - "source": [ - "#### Save the tuning dataset\n", - "\n", - "Upload the model tuning datasets to a Cloud Storage bucket." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "HHiC-Qg8yorH" - }, - "outputs": [], - "source": [ - "corpus_df.to_json(\n", - " f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/corpus.jsonl\",\n", - " orient=\"records\",\n", - " lines=True,\n", - ")\n", - "query_df.to_json(\n", - " f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/query.jsonl\", orient=\"records\", lines=True\n", - ")\n", - "train_df.to_csv(\n", - " f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/train.tsv\",\n", - " sep=\"\\t\",\n", - " header=True,\n", - " index=False,\n", - ")\n", - "test_df.to_csv(\n", - " f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/test.tsv\",\n", - " sep=\"\\t\",\n", - " header=True,\n", - " index=False,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6p9NR9I31fo6" - }, - "source": [ - "### Run an embedding tuning job on Vertex AI Pipelines\n", - "\n", - "Next, set the tuning pipeline parameters including the Cloud Storage bucket paths with train and test datasets, the training batch size and the number of steps to perform model tuning. 
\n", - "\n", - "For more information about pipeline parameters, [check](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings#create-embedding-tuning-job) the official tuning documentation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "YePRoZg31iSJ" - }, - "outputs": [], - "source": [ - "ITERATIONS = len(train_df) // BATCH_SIZE\n", - "\n", - "params = {\n", - " \"batch_size\": BATCH_SIZE,\n", - " \"iterations\": ITERATIONS,\n", - " \"accelerator_type\": TRAINING_ACCELERATOR_TYPE,\n", - " \"machine_type\": TRAINING_MACHINE_TYPE,\n", - " \"base_model_version_id\": \"textembedding-gecko@003\",\n", - " \"queries_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/query.jsonl\",\n", - " \"corpus_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/corpus.jsonl\",\n", - " \"train_label_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/train.tsv\",\n", - " \"test_label_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/test.tsv\",\n", - " \"project\": PROJECT_ID,\n", - " \"location\": REGION,\n", - "}\n", - "\n", - "template_uri = \"https://us-kfp.pkg.dev/ml-pipeline/llm-text-embedding/tune-text-embedding-model/v1.1.1\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0rN_XWFjxWZn" - }, - "source": [ - "Run the model tuning pipeline job." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "m7JEoMT-1mAC" - }, - "outputs": [], - "source": [ - "job = aiplatform.PipelineJob(\n", - " display_name=\"tune-text-embedding\",\n", - " parameter_values=params,\n", - " template_path=template_uri,\n", - " pipeline_root=PIPELINE_ROOT,\n", - " project=PROJECT_ID,\n", - " location=REGION,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ExF5xlj0uBjK" - }, - "outputs": [], - "source": [ - "job.run()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UJCufZmS7SVM" - }, - "source": [ - "### Evaluate the tuned model\n", - "\n", - "Evaluate the tuned embedding model. The Vertex AI Pipeline automatically produces NDCG (Normalized Discounted Cumulative Gain) for both training and test datasets. NDCG measures ranking effectiveness taking position of relevant items in the ranked list.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "eldc535Y7xmD" - }, - "outputs": [], - "source": [ - "metric_df = get_metrics(job)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "W-s_AEuoaKSd" - }, - "outputs": [], - "source": [ - "metric_df.to_dict()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qMBUmPFgBRWi" - }, - "outputs": [], - "source": [ - "metric_df" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZP6nCnJ2_6wU" - }, - "source": [ - "### Deploy the embedding tuned model on Vertex AI Prediction\n", - "\n", - "To deploy the embedding tuned model, you need to create an Vertex AI Endpoint.\n", - "\n", - "Then you deploy the tuned embeddings model to the endpoint." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "d5LtEGEbAHPd" - }, - "source": [ - "#### Create the endpoint" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Q2KRaRzHAF8f" - }, - "outputs": [], - "source": [ - "endpoint = aiplatform.Endpoint.create(\n", - " display_name=\"tuned_custom_embedding_endpoint\",\n", - " description=\"Endpoint for tuned model embeddings.\",\n", - " project=PROJECT_ID,\n", - " location=REGION,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xGLjbFY-AYi3" - }, - "source": [ - "#### Deploy the tuned model" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mdfedMEj1ZNy" - }, - "source": [ - "Get the tuned model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ndE5WGVkBQA2" - }, - "outputs": [], - "source": [ - "model = get_uploaded_model(job)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "CqkAAzvV1ewQ" - }, - "source": [ - "Deploy the tuned model to the endpoint." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "nOZyyPALDcSP" - }, - "outputs": [], - "source": [ - "endpoint.deploy(\n", - " model,\n", - " accelerator_type=PREDICTION_ACCELERATOR_TYPE,\n", - " accelerator_count=PREDICTION_ACCELERATOR_COUNT,\n", - " machine_type=PREDICTION_MACHINE_TYPE,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8VVac8ah2DzJ" - }, - "source": [ - "### Retrieve similar items using the tuned embedding model\n", - "\n", - "To retrieve similar items using the tuned embedding model, you need both the corpus text and the generated embeddings. Given a query, you compute its embedding with the tuned model and apply a similarity function to find the most relevant documents with respect to the query. 
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "L2ohcM4a6mOx" - }, - "source": [ - "Read the corpus text and the generated embeddings." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "C5O2WP_i1z3I" - }, - "outputs": [], - "source": [ - "training_output_dir = get_training_output_dir(job)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "TJcTLpV263tK" - }, - "outputs": [], - "source": [ - "corpus_text = pd.read_json(\n", - " epath.Path(training_output_dir) / \"corpus_text.jsonl\", lines=True\n", - ")\n", - "corpus_text.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "3g-8ECt9VU_P" - }, - "outputs": [], - "source": [ - "corpus_embeddings = pd.read_json(\n", - " epath.Path(training_output_dir) / \"corpus_custom.jsonl\", lines=True\n", - ")\n", - "\n", - "corpus_embeddings.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6Dwbak5t670Y" - }, - "source": [ - "Find the most relevant documents for each query." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "IV9d2afP11em" - }, - "outputs": [], - "source": [ - "queries = [\n", - " \"\"\"What about the revenues?\"\"\",\n", - " \"\"\"Who is Alphabet?\"\"\",\n", - " \"\"\"What about the costs?\"\"\",\n", - "]\n", - "output = get_top_k_documents(queries, corpus_text, corpus_embeddings, k=10)\n", - "\n", - "with pd.option_context(\"display.max_colwidth\", 200):\n", - " display(output)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TpV-iwP9qw9c" - }, - "source": [ - "## Cleaning up\n", - "\n", - "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", - "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", - "\n", - "Otherwise, you can delete the individual resources you created in this tutorial." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "sx_vKniMq9ZX" - }, - "outputs": [], - "source": [ - "import os\n", - "\n", - "delete_endpoint = False\n", - "delete_model = False\n", - "delete_job = False\n", - "delete_bucket = False\n", - "\n", - "# Delete endpoint resource\n", - "if delete_endpoint or os.getenv(\"IS_TESTING\"):\n", - " endpoint.delete()\n", - "\n", - "# Delete model resource\n", - "if delete_model or os.getenv(\"IS_TESTING\"):\n", - " model.delete()\n", - "\n", - "# Delete pipeline job\n", - "if delete_job or os.getenv(\"IS_TESTING\"):\n", - " job.delete()\n", - "\n", - "# Delete Cloud Storage objects that were created\n", - "if delete_bucket or os.getenv(\"IS_TESTING\"):\n", - " ! 
gsutil -m rm -r $BUCKET_URI" - ] - } - ], - "metadata": { - "colab": { - "name": "get_started_with_embedding_tuning.ipynb", - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ur8xi4C7S06n" + }, + "outputs": [], + "source": [ + "# Copyright 2024 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JAPoU8Sm5E6e" + }, + "source": [ + "# Get started with embeddings tuning on Vertex AI\n", + "\n", + "\n", + " \n", + " \n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Vzj1qV_dPeO" + }, + "source": [ + "| | |\n", + "|-|-|\n", + "|Author(s) | [Ivan Nardini](https://github.com/inardini)|" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvgnzT1CKxrO" + }, + "source": [ + "## Overview\n", + "\n", + "This notebook guides you through the process of tuning the text embedding model on Vertex AI. Tuning an embeddings model for specific domains or tasks enhances understanding and improves retrieval performance.\n", + "\n", + "Learn more about [Tune text embeddings](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d975e698c9a4" + }, + "source": [ + "### Objective\n", + "\n", + "In this tutorial, you learn how to tune the text embedding model, `textembedding-gecko`.\n", + "\n", + "This tutorial uses the following Google Cloud ML services and resources:\n", + "\n", + "- Document AI\n", + "- Vertex AI\n", + "- Google Cloud Storage\n", + "\n", + "The steps include:\n", + "\n", + "- Prepare your model tuning dataset using Document AI, Gemini API, and LangChain on Vertex AI. \n", + "- Run an embedding tuning job on Vertex AI Pipelines.\n", + "- Evaluate the embedding tuned model.\n", + "- Deploy the embedding tuned model on Vertex AI Prediction.\n", + "- Retrieve similar items using the tuned embedding model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "08d289fa873f" + }, + "source": [ + "### Dataset\n", + "\n", + "During the tutorial, you will create a set of synthetic query-chunk pairs using the [2023 Q3 Alphabet Earnings Release](https://www.abc.xyz/assets/95/eb/9cef90184e09bac553796896c633/2023q4-alphabet-earnings-release.pdf)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aed92deeb4a0" + }, + "source": [ + "### Costs\n", + "\n", + "This tutorial uses billable components of Google Cloud:\n", + "\n", + "* Document AI\n", + "* Vertex AI\n", + "* Cloud Storage\n", + "\n", + "Learn about [Document AI pricing](https://cloud.google.com/document-ai/pricing), [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),\n", + "and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),\n", + "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", + "to generate a cost estimate based on your projected usage." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i7EUnXsZhAGF" + }, + "source": [ + "## Installation\n", + "\n", + "Install the following packages required to execute this notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2b4ef9b72d43" + }, + "outputs": [], + "source": [ + "! pip3 install --upgrade --user google-cloud-aiplatform==1.48.0 google-cloud-documentai==2.26.0 google-cloud-documentai-toolbox==0.13.3a0\n", + "! pip3 install --upgrade --user langchain==0.1.16 langchain-core==0.1.44 langchain-text-splitters==0.0.1 langchain-google-community==1.0.2 gcsfs==2024.3.1 etils==1.7.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "58707a750154" + }, + "source": [ + "### Colab only: Uncomment the following cell to restart the kernel." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "f200f10a1da3" + }, + "outputs": [], + "source": [ + "# import IPython\n", + "\n", + "# app = IPython.Application.instance()\n", + "# app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BF1j6f9HApxa" + }, + "source": [ + "## Before you begin\n", + "\n", + "### Set up your Google Cloud project\n", + "\n", + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", + "\n", + "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", + "\n", + "3. [Enable APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,documentai.googleapis.com).\n", + "\n", + "4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WReHDGG5g0XY" + }, + "source": [ + "#### Set your project ID\n", + "\n", + "**If you don't know your project ID**, try the following:\n", + "* Run `gcloud config list`.\n", + "* Run `gcloud projects list`.\n", + "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oM1iC_MfAts1" + }, + "outputs": [], + "source": [ + "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", + "\n", + "# Set the project id\n", + "! gcloud config set project {PROJECT_ID}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "region" + }, + "source": [ + "#### Region\n", + "\n", + "You can also change the `REGION` variable used by Vertex AI. 
Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ChcYWoVdhzVb" + }, + "outputs": [], + "source": [ + "REGION = \"us-central1\" # @param {type: \"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "timestamp" + }, + "source": [ + "#### Timestamp\n", + "\n", + "If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append the timestamp onto the name of resources you create in this tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W6Le1schAziq" + }, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "\n", + "TIMESTAMP = datetime.now().strftime(\"%Y%m%d%H%M%S\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sBCra4QMA2wR" + }, + "source": [ + "### Authenticate your Google Cloud account\n", + "\n", + "Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "74ccc9e52986" + }, + "source": [ + "**1. Vertex AI Workbench and Colab Enterprise**\n", + "* Do nothing as you are already authenticated." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "de775a3773ba" + }, + "source": [ + "**2. Local JupyterLab instance, uncomment and run:**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "254614fa0c46" + }, + "outputs": [], + "source": [ + "# ! gcloud auth login" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f6b2ccc891ed" + }, + "source": [ + "**3. 
Service account or other**\n", + "* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zgPO1eR3CYjk" + }, + "source": [ + "### Create a Cloud Storage bucket\n", + "\n", + "Create a storage bucket to store intermediate artifacts such as datasets." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MzGDU7TWdts_" + }, + "outputs": [], + "source": [ + "BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-EcIXiGsCePi" + }, + "source": [ + "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NIq7R4HZCfIc" + }, + "outputs": [], + "source": [ + "! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eckavkeph5zB" + }, + "source": [ + "### Set up tutorial folder\n", + "\n", + "Set up a folder for tutorial content including data, metadata and more." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Kr90HWKmh8H0" + }, + "outputs": [], + "source": [ + "from pathlib import Path as path\n", + "\n", + "root_path = path.cwd()\n", + "tutorial_path = root_path / \"tutorial\"\n", + "data_path = tutorial_path / \"data\"\n", + "\n", + "data_path.mkdir(parents=True, exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "960505627ddf" + }, + "source": [ + "### Import libraries\n", + "\n", + "Import libraries to run the tutorial." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PyQmSRbKA8r-" + }, + "outputs": [], + "source": [ + "import random\n", + "import string\n", + "import time\n", + "\n", + "import langchain_core\n", + "import numpy as np\n", + "import pandas as pd\n", + "import vertexai\n", + "import vertexai.preview.generative_models as generative_models\n", + "from etils import epath\n", + "from google.api_core.client_options import ClientOptions\n", + "from google.cloud import aiplatform, documentai\n", + "from google.protobuf.json_format import MessageToDict\n", + "from langchain_community.document_loaders.blob_loaders import Blob\n", + "from langchain_community.document_loaders.parsers import DocAIParser\n", + "from langchain_core.documents.base import Document\n", + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", + "from vertexai.generative_models import GenerationConfig, GenerativeModel" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TvQ81PjSiCuZ" + }, + "source": [ + "### Set Variables\n", + "\n", + "Set variables to run the tutorial." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ajbQb0eXu3xh" + }, + "outputs": [], + "source": [ + "ID = \"\".join(random.choices(string.ascii_lowercase + string.digits, k=4))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ct35zelZiELu" + }, + "outputs": [], + "source": [ + "# Dataset\n", + "PROCESSOR_ID = f\"preprocess-docs-llm-{ID}\"\n", + "LOCATION = REGION.split(\"-\")[0]\n", + "RAW_DATA_URI = \"gs://github-repo/embeddings/get_started_with_embedding_tuning\"\n", + "PROCESSED_DATA_URI = f\"{BUCKET_URI}/data/processed\"\n", + "PREPARED_DATA_URI = f\"{BUCKET_URI}/data/prepared\"\n", + "PROCESSED_DATA_OCR_URI = f\"{BUCKET_URI}/data/processed/ocr\"\n", + "PROCESSED_DATA_TUNING_URI = f\"{BUCKET_URI}/data/processed/tuning\"\n", + "\n", + "# Tuning\n", + "PIPELINE_ROOT = f\"{BUCKET_URI}/pipelines\"\n", + "BATCH_SIZE = 32 # @param {type:\"integer\"}\n", + "TRAINING_ACCELERATOR_TYPE = \"NVIDIA_TESLA_T4\" # @param {type:\"string\"}\n", + "TRAINING_MACHINE_TYPE = \"n1-standard-16\" # @param {type:\"string\"}\n", + "\n", + "# Serving\n", + "PREDICTION_ACCELERATOR_TYPE = \"NVIDIA_TESLA_A100\" # @param {type:\"string\"}\n", + "PREDICTION_ACCELERATOR_COUNT = 1 # @param {type:\"integer\"}\n", + "PREDICTION_MACHINE_TYPE = \"a2-highgpu-1g\" # @param {type:\"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F2mkgxcciGiZ" + }, + "source": [ + "### Helpers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "40tjc8jmiH4B" + }, + "outputs": [], + "source": [ + "def create_processor(project_id: str, location: str, processor_display_name: str):\n", + " \"\"\"Create a Document AI processor.\"\"\"\n", + " client_options = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\")\n", + " client = documentai.DocumentProcessorServiceClient(client_options=client_options)\n", + "\n", + " parent = client.common_location_path(project_id, 
location)\n", + "\n", + " return client.create_processor(\n", + " parent=parent,\n", + " processor=documentai.Processor(\n", + " display_name=processor_display_name, type_=\"OCR_PROCESSOR\"\n", + " ),\n", + " )\n", + "\n", + "\n", + "def generate_queries(\n", + " chuck: Document,\n", + " num_questions: int = 3,\n", + ") -> langchain_core.documents.base.Document:\n", + " \"\"\"Generate a contextual query based on a preprocessed chunk\"\"\"\n", + "\n", + " model = GenerativeModel(\"gemini-1.0-pro-001\")\n", + "\n", + " generation_config = GenerationConfig(\n", + " max_output_tokens=2048, temperature=0.9, top_p=1\n", + " )\n", + "\n", + " safety_settings = {\n", + " generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", + " generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", + " generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", + " generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_NONE,\n", + " }\n", + "\n", + " prompt_template = \"\"\"\n", + " You are an examiner. 
Your task is to create one QUESTION for an exam using only the context below.\n", + "\n", + " \n", + " {chuck}\n", + " \n", + "\n", + " QUESTION:\n", + " \"\"\"\n", + "\n", + " query = prompt_template.format(\n", + " chuck=chuck.page_content, num_questions=num_questions\n", + " )\n", + "\n", + " # Each iteration overwrites `response`, so only the last generated question is returned.\n", + " for idx in range(num_questions):\n", + " response = model.generate_content(\n", + " [query],\n", + " generation_config=generation_config,\n", + " safety_settings=safety_settings,\n", + " ).text\n", + "\n", + " return Document(\n", + " page_content=response, metadata={\"page\": chuck.metadata[\"page\"]}\n", + " )\n", + "\n", + "\n", + "def get_task_by_name(job: aiplatform.PipelineJob, task_name: str):\n", + " \"\"\"Get a Vertex AI Pipeline job task by its name\"\"\"\n", + " for task in job.task_details:\n", + " if task.task_name == task_name:\n", + " return task\n", + " raise ValueError(f\"Task {task_name} not found\")\n", + "\n", + "\n", + "def get_metrics(\n", + " job: aiplatform.PipelineJob, task_name: str = \"text-embedding-evaluator\"\n", + "):\n", + " \"\"\"Get metrics for the evaluation task\"\"\"\n", + " evaluation_task = get_task_by_name(job, task_name)\n", + " metrics = MessageToDict(evaluation_task.outputs[\"metrics\"]._pb)[\"artifacts\"][0][\n", + " \"metadata\"\n", + " ]\n", + " metrics_df = pd.DataFrame([metrics])\n", + " return metrics_df\n", + "\n", + "\n", + "def get_uploaded_model(\n", + " job: aiplatform.PipelineJob, task_name: str = \"text-embedding-model-uploader\"\n", + ") -> aiplatform.Model:\n", + " \"\"\"Get uploaded model from the pipeline job\"\"\"\n", + " evaluation_task = get_task_by_name(job, task_name)\n", + " upload_metadata = MessageToDict(evaluation_task.execution._pb)[\"metadata\"]\n", + " return aiplatform.Model(upload_metadata[\"output:model_resource_name\"])\n", + "\n", + "\n", + "def get_training_output_dir(\n", + " job: aiplatform.PipelineJob, task_name: str = \"text-embedding-trainer\"\n", + ") -> str:\n", + " \"\"\"Get training output directory for 
the pipeline job\"\"\"\n", + " trainer_task = get_task_by_name(job, task_name)\n", + " output_artifacts = MessageToDict(trainer_task.outputs[\"training_output\"]._pb)[\n", + " \"artifacts\"\n", + " ][0]\n", + " return output_artifacts[\"uri\"]\n", + "\n", + "\n", + "def get_top_k_scores(\n", + " query_embedding: pd.DataFrame, corpus_embeddings: pd.DataFrame, k=10\n", + ") -> pd.DataFrame:\n", + " \"\"\"Get top k similar scores for each query\"\"\"\n", + " similarity = corpus_embeddings.dot(query_embedding.T)\n", + " topk_index = pd.DataFrame({c: v.nlargest(n=k).index for c, v in similarity.items()})\n", + " return topk_index\n", + "\n", + "\n", + "def get_top_k_documents(\n", + " query_text: list[str],\n", + " corpus_text: pd.DataFrame,\n", + " corpus_embeddings: pd.DataFrame,\n", + " task_type: str = \"RETRIEVAL_DOCUMENT\",\n", + " title: str = \"\",\n", + " k: int = 10,\n", + ") -> pd.DataFrame:\n", + " \"\"\"Get top k similar documents for each query\"\"\"\n", + " instances = []\n", + " for text in query_text:\n", + " instances.append(\n", + " {\n", + " \"content\": text,\n", + " \"task_type\": task_type,\n", + " \"title\": title,\n", + " }\n", + " )\n", + "\n", + " response = endpoint.predict(instances=instances)\n", + " query_embedding = np.asarray(response.predictions)\n", + " topk = get_top_k_scores(query_embedding, corpus_embeddings, k)\n", + " return pd.DataFrame.from_dict(\n", + " {\n", + " query_text[c]: corpus_text.loc[v.values].values.ravel()\n", + " for c, v in topk.items()\n", + " },\n", + " orient=\"columns\",\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "init_aip:mbsdk,all" + }, + "source": [ + "### Initialize Vertex AI SDK for Python\n", + "\n", + "Initialize the Vertex AI SDK for Python for your project." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "u4p_LXiGhzVc" + }, + "outputs": [], + "source": [ + "vertexai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FpDyETD1iKi-" + }, + "source": [ + "## Tuning text embeddings\n", + "\n", + "To tune the model, you start by preparing your model tuning dataset and uploading it to a Cloud Storage bucket. Text embedding models support supervised tuning, which uses labeled examples to demonstrate the desired output from the model during inference.\n", + "\n", + "Next, you create a model tuning job and deploy the tuned model to a Vertex AI endpoint.\n", + "\n", + "Finally, you retrieve similar items using the tuned embedding model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WJx_WMx2wioP" + }, + "source": [ + "### Prepare your model tuning dataset using Document AI, Gemini API, and LangChain on Vertex AI\n", + "\n", + "The tuning dataset consists of the following files:\n", + "\n", + "- `corpus` file is a JSONL file where each line has the fields `_id`, `title` (optional), and `text` of each relevant chunk.\n", + "\n", + "- `query` file is a JSONL file where each line has the fields `_id` and `text` of each relevant query.\n", + "\n", + "- `labels` files are TSV files (train, test and val) with three columns: `query-id`, `corpus-id`, and `score`. `query-id` represents the query id in the query file, `corpus-id` represents the corpus id in the corpus file, and `score` indicates relevance, with higher scores meaning greater relevance. A default score of 1 is used if none is specified. The `train` file is required while `test` and `val` are optional.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0eDQ8BSFiOH9" + }, + "source": [ + "#### Create a Document AI preprocessor\n", + "\n", + "Create the OCR processor to identify and extract text from the PDF document." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZIJXJt6ciNdj" + }, + "outputs": [], + "source": [ + "processor = create_processor(PROJECT_ID, LOCATION, PROCESSOR_ID)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P8podNK-rxm6" + }, + "source": [ + "#### Parse the document using DocAI Parser in LangChain" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jCUWtEq3oLqs" + }, + "source": [ + "Initiate a LangChain parser." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "D_PU-I_-teWQ" + }, + "outputs": [], + "source": [ + "blob = Blob(\n", + " path=f\"{RAW_DATA_URI}/goog-10-k-2023.pdf\",\n", + ")\n", + "\n", + "parser = DocAIParser(\n", + " processor_name=processor.name,\n", + " location=LOCATION,\n", + " gcs_output_path=PROCESSED_DATA_OCR_URI,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hz254MGnoPgR" + }, + "source": [ + "Run a Google Document AI PDF Batch Processing job.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sjm3LSh7oKJk" + }, + "outputs": [], + "source": [ + "operations = parser.docai_parse([blob])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5_DpZodtvATw" + }, + "outputs": [], + "source": [ + "while True:\n", + " if parser.is_running(operations):\n", + " print(\"Waiting for DocAI to finish...\")\n", + " time.sleep(10)\n", + " else:\n", + " print(\"DocAI successfully processed!\")\n", + " break" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6MjwFF6loZfF" + }, + "source": [ + "Get the resulting LangChain Documents containing the extracted text and metadata." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oBh2AKdIvX8g" + }, + "outputs": [], + "source": [ + "results = parser.get_results(operations)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZGfde504wIcF" + }, + "outputs": [], + "source": [ + "docs = list(parser.parse_from_results(results))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rFsBLHpwWRCy" + }, + "outputs": [], + "source": [ + "docs[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MXGdZYB_x9R6" + }, + "source": [ + "#### Create document chunks using `RecursiveCharacterTextSplitter`\n", + "\n", + "You can create chunks using `RecursiveCharacterTextSplitter` in LangChain. The splitter recursively splits text on a list of separator characters until each chunk fits the chosen size." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y_JIYFd7qePe" + }, + "source": [ + "Initialize the splitter." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TxBSDPGPyA4l" + }, + "outputs": [], + "source": [ + "text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=2500,\n", + " chunk_overlap=250,\n", + " length_function=len,\n", + " is_separator_regex=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Le_WIyGCqg7O" + }, + "source": [ + "Create text chunks." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "M8BxdLjUycm4" + }, + "outputs": [], + "source": [ + "document_content = [doc.page_content for doc in docs]\n", + "document_metadata = [{\"page\": idx} for idx, doc in enumerate(docs, 1)]\n", + "chunks = text_splitter.create_documents(document_content, metadatas=document_metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3wm91wQnB-Xd" + }, + "source": [ + "#### Create queries\n", + "\n", + "You can use Gemini on Vertex AI to produce hypothetical questions that are relevant to a given piece of context (chunk). \n", + "This approach generates synthetic positive pairs of (query, relevant document) in a scalable manner.\n", + "\n", + "Query generation takes **several minutes**, depending on the number of chunks you have. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LsnSHzZBCCij" + }, + "outputs": [], + "source": [ + "generated_queries = [generate_queries(chuck=chuck, num_questions=3) for chuck in chunks]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x2XYblPb8Uvy" + }, + "source": [ + "#### Create the tuning training and test dataset files." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b78Kh1K0vP3s" + }, + "source": [ + "Create the `corpus` file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Zbl6sB3-8YdP" + }, + "outputs": [], + "source": [ + "corpus_df = pd.DataFrame(\n", + " {\n", + " \"_id\": [\"text_\" + str(idx) for idx in range(len(generated_queries))],\n", + " \"text\": [chuck.page_content for chuck in chunks],\n", + " \"doc_id\": [chuck.metadata[\"page\"] for chuck in chunks],\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dDCRCw-buq8U" + }, + "outputs": [], + "source": [ + "corpus_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_tu0Lx7FvVJe" + }, + "source": [ + "Create the `query` file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Fu1fFhdkrCoq" + }, + "outputs": [], + "source": [ + "query_df = pd.DataFrame(\n", + " {\n", + " \"_id\": [\"query_\" + str(idx) for idx in range(len(generated_queries))],\n", + " \"text\": [query.page_content for query in generated_queries],\n", + " \"doc_id\": [query.metadata[\"page\"] for query in generated_queries],\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Uo4yaw4Xu8wJ" + }, + "outputs": [], + "source": [ + "query_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Iv9dcRqFvYN-" + }, + "source": [ + "Create the `score` file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NMDpCp06wAcX" + }, + "outputs": [], + "source": [ + "score_df = corpus_df.merge(query_df, on=\"doc_id\")\n", + "score_df = score_df.rename(columns={\"_id_x\": \"corpus-id\", \"_id_y\": \"query-id\"})\n", + "score_df = score_df.drop(columns=[\"doc_id\", \"text_x\", \"text_y\"])\n", + "score_df[\"score\"] = 1\n", + "train_df = score_df.sample(frac=0.8)\n", + "test_df = score_df.drop(train_df.index)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Tc-JAxKwxoar" + }, + "outputs": [], + "source": [ + "train_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YGgY3L-CyhAQ" + }, + "source": [ + "#### Save the tuning dataset\n", + "\n", + "Upload the model tuning datasets to a Cloud Storage bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HHiC-Qg8yorH" + }, + "outputs": [], + "source": [ + "corpus_df.to_json(\n", + " f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/corpus.jsonl\",\n", + " orient=\"records\",\n", + " lines=True,\n", + ")\n", + "query_df.to_json(\n", + " f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/query.jsonl\", orient=\"records\", lines=True\n", + ")\n", + "train_df.to_csv(\n", + " f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/train.tsv\",\n", + " sep=\"\\t\",\n", + " header=True,\n", + " index=False,\n", + ")\n", + "test_df.to_csv(\n", + " f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/test.tsv\",\n", + " sep=\"\\t\",\n", + " header=True,\n", + " index=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6p9NR9I31fo6" + }, + "source": [ + "### Run an embedding tuning job on Vertex AI Pipelines\n", + "\n", + "Next, set the tuning pipeline parameters including the Cloud Storage bucket paths with train and test datasets, the training batch size and the number of steps to perform model tuning. 
\n", + "\n", + "For more information about pipeline parameters, see the [official tuning documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings#create-embedding-tuning-job)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YePRoZg31iSJ" + }, + "outputs": [], + "source": [ + "ITERATIONS = len(train_df) // BATCH_SIZE\n", + "\n", + "params = {\n", + " \"batch_size\": BATCH_SIZE,\n", + " \"iterations\": ITERATIONS,\n", + " \"accelerator_type\": TRAINING_ACCELERATOR_TYPE,\n", + " \"machine_type\": TRAINING_MACHINE_TYPE,\n", + " \"base_model_version_id\": \"textembedding-gecko@003\",\n", + " \"queries_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/query.jsonl\",\n", + " \"corpus_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/corpus.jsonl\",\n", + " \"train_label_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/train.tsv\",\n", + " \"test_label_path\": f\"{PROCESSED_DATA_TUNING_URI}/{TIMESTAMP}/test.tsv\",\n", + " \"project\": PROJECT_ID,\n", + " \"location\": REGION,\n", + "}\n", + "\n", + "template_uri = \"https://us-kfp.pkg.dev/ml-pipeline/llm-text-embedding/tune-text-embedding-model/v1.1.1\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0rN_XWFjxWZn" + }, + "source": [ + "Run the model tuning pipeline job." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "m7JEoMT-1mAC" + }, + "outputs": [], + "source": [ + "job = aiplatform.PipelineJob(\n", + " display_name=\"tune-text-embedding\",\n", + " parameter_values=params,\n", + " template_path=template_uri,\n", + " pipeline_root=PIPELINE_ROOT,\n", + " project=PROJECT_ID,\n", + " location=REGION,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ExF5xlj0uBjK" + }, + "outputs": [], + "source": [ + "job.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UJCufZmS7SVM" + }, + "source": [ + "### Evaluate the tuned model\n", + "\n", + "Evaluate the tuned embedding model. The Vertex AI Pipeline automatically produces NDCG (Normalized Discounted Cumulative Gain) for both training and test datasets. NDCG measures ranking effectiveness by taking into account the position of relevant items in the ranked list.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eldc535Y7xmD" + }, + "outputs": [], + "source": [ + "metric_df = get_metrics(job)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W-s_AEuoaKSd" + }, + "outputs": [], + "source": [ + "metric_df.to_dict()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qMBUmPFgBRWi" + }, + "outputs": [], + "source": [ + "metric_df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZP6nCnJ2_6wU" + }, + "source": [ + "### Deploy the embedding tuned model on Vertex AI Prediction\n", + "\n", + "To deploy the embedding tuned model, you need to create a Vertex AI Endpoint.\n", + "\n", + "Then you deploy the tuned embeddings model to the endpoint." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d5LtEGEbAHPd" + }, + "source": [ + "#### Create the endpoint" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Q2KRaRzHAF8f" + }, + "outputs": [], + "source": [ + "endpoint = aiplatform.Endpoint.create(\n", + " display_name=\"tuned_custom_embedding_endpoint\",\n", + " description=\"Endpoint for tuned model embeddings.\",\n", + " project=PROJECT_ID,\n", + " location=REGION,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xGLjbFY-AYi3" + }, + "source": [ + "#### Deploy the tuned model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mdfedMEj1ZNy" + }, + "source": [ + "Get the tuned model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ndE5WGVkBQA2" + }, + "outputs": [], + "source": [ + "model = get_uploaded_model(job)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CqkAAzvV1ewQ" + }, + "source": [ + "Deploy the tuned model to the endpoint." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nOZyyPALDcSP" + }, + "outputs": [], + "source": [ + "endpoint.deploy(\n", + " model,\n", + " accelerator_type=PREDICTION_ACCELERATOR_TYPE,\n", + " accelerator_count=PREDICTION_ACCELERATOR_COUNT,\n", + " machine_type=PREDICTION_MACHINE_TYPE,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8VVac8ah2DzJ" + }, + "source": [ + "### Retrieve similar items using the tuned embedding model\n", + "\n", + "To retrieve similar items using the tuned embedding model, you need both the corpus text and the generated embeddings. Given a query, you will calculate the associated embeddings with the tuned model and you will apply a similarity function to find the most relevant document with respect to the query. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L2ohcM4a6mOx" + }, + "source": [ + "Read the corpus text and the generated embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "C5O2WP_i1z3I" + }, + "outputs": [], + "source": [ + "training_output_dir = get_training_output_dir(job)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TJcTLpV263tK" + }, + "outputs": [], + "source": [ + "corpus_text = pd.read_json(\n", + " epath.Path(training_output_dir) / \"corpus_text.jsonl\", lines=True\n", + ")\n", + "corpus_text.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3g-8ECt9VU_P" + }, + "outputs": [], + "source": [ + "corpus_embeddings = pd.read_json(\n", + " epath.Path(training_output_dir) / \"corpus_custom.jsonl\", lines=True\n", + ")\n", + "\n", + "corpus_embeddings.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Dwbak5t670Y" + }, + "source": [ + "Find the most relevant documents for each query." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IV9d2afP11em" + }, + "outputs": [], + "source": [ + "queries = [\n", + " \"\"\"What about the revenues?\"\"\",\n", + " \"\"\"Who is Alphabet?\"\"\",\n", + " \"\"\"What about the costs?\"\"\",\n", + "]\n", + "output = get_top_k_documents(queries, corpus_text, corpus_embeddings, k=10)\n", + "\n", + "with pd.option_context(\"display.max_colwidth\", 200):\n", + " display(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TpV-iwP9qw9c" + }, + "source": [ + "## Cleaning up\n", + "\n", + "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", + "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", + "\n", + "Otherwise, you can delete the individual resources you created in this tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sx_vKniMq9ZX" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "delete_endpoint = False\n", + "delete_model = False\n", + "delete_job = False\n", + "delete_bucket = False\n", + "\n", + "# Delete endpoint resource\n", + "if delete_endpoint or os.getenv(\"IS_TESTING\"):\n", + " endpoint.delete()\n", + "\n", + "# Delete model resource\n", + "if delete_model or os.getenv(\"IS_TESTING\"):\n", + " model.delete()\n", + "\n", + "# Delete pipeline job\n", + "if delete_job or os.getenv(\"IS_TESTING\"):\n", + " job.delete()\n", + "\n", + "# Delete Cloud Storage objects that were created\n", + "if delete_bucket or os.getenv(\"IS_TESTING\"):\n", + " ! 
gsutil -m rm -r $BUCKET_URI" + ] + } + ], + "metadata": { + "colab": { + "name": "get_started_with_embedding_tuning.ipynb", + "toc_visible": true + }, + "environment": { + "kernel": "python3", + "name": "tf2-cpu.2-11.m116", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/tf2-cpu.2-11:m116" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 }