Commit

polong review
inardini committed Apr 22, 2024
1 parent 036be20 commit ded90ad
Showing 1 changed file with 37 additions and 56 deletions.
@@ -33,17 +33,22 @@
"\n",
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fembeddings%2Fget_started_with_embedding_tuning.ipynb\">\n",
" <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/intro_embeddings_tuning.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fembeddings%2Fintro_embeddings_tuning.ipynb\">\n",
" <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
" </a>\n",
" </td> \n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/embeddings/get_started_with_embedding_tuning.ipynb\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/embeddings/intro_embeddings_tuning.ipynb\">\n",
" <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Workbench\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/embeddings/get_started_with_embedding_tuning.ipynb\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/embeddings/intro_embeddings_tuning.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
@@ -82,6 +87,8 @@
"source": [
"### Objective\n",
"\n",
"Large Language Models (LLMs) face challenges in information retrieval due to hallucination, where they generate potentially inaccurate information. Retrieval-Augmented Generation (RAG) addresses this issue by using a retrieval component to identify relevant information in a knowledge base before passing it to the LLM for generation. To improve retrieval effectiveness, meaningful representation of queries and content is crucial, which can be achieved by fine-tuning the embedding model with retrieval-specific domain data. \n",
"\n",
"In this tutorial, you learn how to tune the text embedding model, `textembedding-gecko`.\n",
"\n",
"This tutorial uses the following Google Cloud ML services and resources:\n",
@@ -274,60 +281,33 @@
{
"cell_type": "markdown",
"metadata": {
"id": "sBCra4QMA2wR"
"tags": []
},
"source": [
"### Authenticate your Google Cloud account\n",
"### Authenticate your notebook environment (Colab only)\n",
"\n",
"Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "74ccc9e52986"
},
"source": [
"**1. Vertex AI Workbench and Colab Enterprise**\n",
"* Do nothing as you are already authenticated."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "de775a3773ba"
},
"source": [
"**2. Local JupyterLab instance, uncomment and run:**"
"If you are running this notebook on Google Colab, run the cell below to authenticate your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "254614fa0c46"
},
"metadata": {},
"outputs": [],
"source": [
"# ! gcloud auth login"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f6b2ccc891ed"
},
"source": [
"**3. Service account or other**\n",
"* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples."
"import sys\n",
"\n",
"if \"google.colab\" in sys.modules:\n",
" from google.colab import auth\n",
"\n",
" auth.authenticate_user()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zgPO1eR3CYjk"
},
"metadata": {},
"source": [
"\n",
"### Create a Cloud Storage bucket\n",
"\n",
"Create a storage bucket to store intermediate artifacts such as datasets."
@@ -515,10 +495,10 @@
"\n",
"\n",
"def generate_queries(\n",
" chuck: str,\n",
" chunk: str,\n",
" num_questions: int = 3,\n",
") -> langchain_core.documents.base.Document:\n",
" \"\"\"A function to generate contextual queries based on preprocessed chuck\"\"\"\n",
" \"\"\"A function to generate contextual queries based on preprocessed chunk\"\"\"\n",
"\n",
" model = GenerativeModel(\"gemini-1.0-pro-001\")\n",
"\n",
@@ -537,14 +517,14 @@
" You are an examinator. Your task is to create one QUESTION for an exam using <DOCUMENT> only.\n",
"\n",
" <DOCUMENT>\n",
" {chuck}\n",
" {chunk}\n",
" <DOCUMENT/>\n",
"\n",
" QUESTION:\n",
" \"\"\"\n",
"\n",
" query = prompt_template.format(\n",
" chuck=chuck.page_content, num_questions=num_questions\n",
" chunk=chunk.page_content, num_questions=num_questions\n",
" )\n",
"\n",
" for idx in range(num_questions):\n",
@@ -555,7 +535,7 @@
" ).text\n",
"\n",
" return Document(\n",
" page_content=response, metadata={\"page\": chuck.metadata[\"page\"]}\n",
" page_content=response, metadata={\"page\": chunk.metadata[\"page\"]}\n",
" )\n",
"\n",
"\n",
@@ -585,7 +565,7 @@
" \"\"\"Get uploaded model from the pipeline job\"\"\"\n",
" evaluation_task = get_task_by_name(job, task_name)\n",
" upload_metadata = MessageToDict(evaluation_task.execution._pb)[\"metadata\"]\n",
" return aiplatform.Model(upload_metadata[\"output:model_resource_name\"])\n",
" return vertexai.Model(upload_metadata[\"output:model_resource_name\"])\n",
"\n",
"\n",
"def get_training_output_dir(\n",
@@ -686,7 +666,7 @@
"\n",
"The tuning dataset consists of the following files:\n",
"\n",
"- `corpus` file is a JSONL file where each line has the fields `_id`, `title` (optional), and `text` of each relevant chuck.\n",
"- `corpus` file is a JSONL file where each line has the fields `_id`, `title` (optional), and `text` of each relevant chunk.\n",
"\n",
"- `query` file is a JSONL file where each line has the fields `_id`, and `text` of each relevant query.\n",
"\n",
@@ -839,7 +819,7 @@
"source": [
"#### Create document chunks using `RecursiveCharacterTextSplitter`\n",
"\n",
"You can create chucks using `RecursiveCharacterTextSplitter` in LangChain. The splitter divides text into smaller chunks of a chosen size based on a set of specified characters."
"You can create chunks using `RecursiveCharacterTextSplitter` in LangChain. The splitter divides text into smaller chunks of a chosen size based on a set of specified characters."
]
},
{
@@ -911,7 +891,7 @@
},
"outputs": [],
"source": [
"generated_queries = [generate_queries(chuck=chuck, num_questions=3) for chuck in chunks]"
"generated_queries = [generate_queries(chunk=chunk, num_questions=3) for chunk in chunks]"
]
},
{
@@ -943,8 +923,8 @@
"corpus_df = pd.DataFrame(\n",
" {\n",
" \"_id\": [\"text_\" + str(idx) for idx in range(len(generated_queries))],\n",
" \"text\": [chuck.page_content for chuck in chunks],\n",
" \"doc_id\": [chuck.metadata[\"page\"] for chuck in chunks],\n",
" \"text\": [chunk.page_content for chunk in chunks],\n",
" \"doc_id\": [chunk.metadata[\"page\"] for chunk in chunks],\n",
" }\n",
")"
]
@@ -1131,7 +1111,7 @@
},
"outputs": [],
"source": [
"job = aiplatform.PipelineJob(\n",
"job = vertexai.PipelineJob(\n",
" display_name=\"tune-text-embedding\",\n",
" parameter_values=params,\n",
" template_path=template_uri,\n",
@@ -1160,7 +1140,8 @@
"source": [
"### Evaluate the tuned model\n",
"\n",
"Evaluate the tuned embedding model. The Vertex AI Pipeline automatically produces NDCG (Normalized Discounted Cumulative Gain) for both training and test datasets. NDCG measures ranking effectiveness taking position of relevant items in the ranked list.\n"
"Evaluate the tuned embedding model. The Vertex AI Pipeline automatically produces NDCG (Normalized Discounted Cumulative Gain) for both training and test datasets. \n",
"Tuning the model should results in a nDCG@10 improvement compared with the base textembedding-gecko model which means that top 10 chunks that will be retrieved are now more likely to be exactly the relevant ones for answering the input query. In other words, the most relevant information is now easier to find with the new tuned embedding model. \n"
]
},
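The pipeline reports NDCG for you; purely for intuition about what the metric measures, here is a minimal sketch of NDCG@k with binary relevance labels (the example rankings are invented, not pipeline output):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevant items count more near the top."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k = DCG of the actual ranking / DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Binary relevance of the top-5 retrieved chunks (1 = relevant to the query).
base_model = [0, 1, 0, 0, 1]   # relevant chunks ranked lower
tuned_model = [1, 1, 0, 0, 0]  # relevant chunks ranked first
print(ndcg_at_k(base_model, k=5), ndcg_at_k(tuned_model, k=5))
```

A higher NDCG after tuning means relevant chunks are being ranked closer to the top of the retrieved list.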
{
@@ -1226,7 +1207,7 @@
},
"outputs": [],
"source": [
"endpoint = aiplatform.Endpoint.create(\n",
"endpoint = vertexai.Endpoint.create(\n",
" display_name=\"tuned_custom_embedding_endpoint\",\n",
" description=\"Endpoint for tuned model embeddings.\",\n",
" project=PROJECT_ID,\n",
