Commit

polong review
inardini committed Apr 22, 2024
1 parent 036be20 commit ded90ad
Showing 1 changed file with 37 additions and 56 deletions.
@@ -33,17 +33,22 @@
"\n",
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fembeddings%2Fget_started_with_embedding_tuning.ipynb\">\n",
" <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/intro_embeddings_tuning.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fembeddings%2Fintro_embeddings_tuning.ipynb\">\n",
" <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
" </a>\n",
" </td> \n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/embeddings/get_started_with_embedding_tuning.ipynb\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/embeddings/intro_embeddings_tuning.ipynb\">\n",
" <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Workbench\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/embeddings/get_started_with_embedding_tuning.ipynb\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/embeddings/intro_embeddings_tuning.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
@@ -82,6 +87,8 @@
"source": [
"### Objective\n",
"\n",
"Large Language Models (LLMs) face challenges in information retrieval due to hallucination, where they generate potentially inaccurate information. Retrieval-Augmented Generation (RAG) addresses this issue by using a retrieval component to identify relevant information in a knowledge base before passing it to the LLM for generation. To improve retrieval effectiveness, meaningful representation of queries and content is crucial, which can be achieved by fine-tuning the embedding model with retrieval-specific domain data. \n",
"\n",
"In this tutorial, you learn how to tune the text embedding model, `textembedding-gecko`.\n",
"\n",
"This tutorial uses the following Google Cloud ML services and resources:\n",
@@ -274,60 +281,33 @@
{
"cell_type": "markdown",
"metadata": {
"id": "sBCra4QMA2wR"
"tags": []
},
"source": [
"### Authenticate your Google Cloud account\n",
"### Authenticate your notebook environment (Colab only)\n",
"\n",
"Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "74ccc9e52986"
},
"source": [
"**1. Vertex AI Workbench and Colab Enterprise**\n",
"* Do nothing as you are already authenticated."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "de775a3773ba"
},
"source": [
"**2. Local JupyterLab instance, uncomment and run:**"
"If you are running this notebook on Google Colab, run the cell below to authenticate your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "254614fa0c46"
},
"metadata": {},
"outputs": [],
"source": [
"# ! gcloud auth login"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f6b2ccc891ed"
},
"source": [
"**3. Service account or other**\n",
"* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples."
"import sys\n",
"\n",
"if \"google.colab\" in sys.modules:\n",
" from google.colab import auth\n",
"\n",
" auth.authenticate_user()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zgPO1eR3CYjk"
},
"metadata": {},
"source": [
"\n",
"### Create a Cloud Storage bucket\n",
"\n",
"Create a storage bucket to store intermediate artifacts such as datasets."
@@ -515,10 +495,10 @@
"\n",
"\n",
"def generate_queries(\n",
" chuck: str,\n",
" chunk: str,\n",
" num_questions: int = 3,\n",
") -> langchain_core.documents.base.Document:\n",
" \"\"\"A function to generate contextual queries based on preprocessed chuck\"\"\"\n",
" \"\"\"A function to generate contextual queries based on preprocessed chunk\"\"\"\n",
"\n",
" model = GenerativeModel(\"gemini-1.0-pro-001\")\n",
"\n",
@@ -537,14 +517,14 @@
" You are an examinator. Your task is to create one QUESTION for an exam using <DOCUMENT> only.\n",
"\n",
" <DOCUMENT>\n",
" {chuck}\n",
" {chunk}\n",
" <DOCUMENT/>\n",
"\n",
" QUESTION:\n",
" \"\"\"\n",
"\n",
" query = prompt_template.format(\n",
" chuck=chuck.page_content, num_questions=num_questions\n",
" chunk=chunk.page_content, num_questions=num_questions\n",
" )\n",
"\n",
" for idx in range(num_questions):\n",
@@ -555,7 +535,7 @@
" ).text\n",
"\n",
" return Document(\n",
" page_content=response, metadata={\"page\": chuck.metadata[\"page\"]}\n",
" page_content=response, metadata={\"page\": chunk.metadata[\"page\"]}\n",
" )\n",
"\n",
"\n",
@@ -585,7 +565,7 @@
" \"\"\"Get uploaded model from the pipeline job\"\"\"\n",
" evaluation_task = get_task_by_name(job, task_name)\n",
" upload_metadata = MessageToDict(evaluation_task.execution._pb)[\"metadata\"]\n",
" return aiplatform.Model(upload_metadata[\"output:model_resource_name\"])\n",
" return vertexai.Model(upload_metadata[\"output:model_resource_name\"])\n",
"\n",
"\n",
"def get_training_output_dir(\n",
@@ -686,7 +666,7 @@
"\n",
"The tuning dataset consists of the following files:\n",
"\n",
"- `corpus` file is a JSONL file where each line has the fields `_id`, `title` (optional), and `text` of each relevant chuck.\n",
"- `corpus` file is a JSONL file where each line has the fields `_id`, `title` (optional), and `text` of each relevant chunk.\n",
"\n",
"- `query` file is a JSONL file where each line has the fields `_id`, and `text` of each relevant query.\n",
"\n",
@@ -839,7 +819,7 @@
"source": [
"#### Create document chunks using `RecursiveCharacterTextSplitter`\n",
"\n",
"You can create chucks using `RecursiveCharacterTextSplitter` in LangChain. The splitter divides text into smaller chunks of a chosen size based on a set of specified characters."
"You can create chunks using `RecursiveCharacterTextSplitter` in LangChain. The splitter divides text into smaller chunks of a chosen size based on a set of specified characters."
]
},
{
@@ -911,7 +891,7 @@
},
"outputs": [],
"source": [
"generated_queries = [generate_queries(chuck=chuck, num_questions=3) for chuck in chunks]"
"generated_queries = [generate_queries(chunk=chunk, num_questions=3) for chunk in chunks]"
]
},
{
@@ -943,8 +923,8 @@
"corpus_df = pd.DataFrame(\n",
" {\n",
" \"_id\": [\"text_\" + str(idx) for idx in range(len(generated_queries))],\n",
" \"text\": [chuck.page_content for chuck in chunks],\n",
" \"doc_id\": [chuck.metadata[\"page\"] for chuck in chunks],\n",
" \"text\": [chunk.page_content for chunk in chunks],\n",
" \"doc_id\": [chunk.metadata[\"page\"] for chunk in chunks],\n",
" }\n",
")"
]
@@ -1131,7 +1111,7 @@
},
"outputs": [],
"source": [
"job = aiplatform.PipelineJob(\n",
"job = vertexai.PipelineJob(\n",
" display_name=\"tune-text-embedding\",\n",
" parameter_values=params,\n",
" template_path=template_uri,\n",
@@ -1160,7 +1140,8 @@
"source": [
"### Evaluate the tuned model\n",
"\n",
"Evaluate the tuned embedding model. The Vertex AI Pipeline automatically produces NDCG (Normalized Discounted Cumulative Gain) for both training and test datasets. NDCG measures ranking effectiveness taking position of relevant items in the ranked list.\n"
"Evaluate the tuned embedding model. The Vertex AI Pipeline automatically produces NDCG (Normalized Discounted Cumulative Gain) for both training and test datasets. \n",
"Tuning the model should results in a nDCG@10 improvement compared with the base textembedding-gecko model which means that top 10 chunks that will be retrieved are now more likely to be exactly the relevant ones for answering the input query. In other words, the most relevant information is now easier to find with the new tuned embedding model. \n"
]
},
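The pipeline reports NDCG for you; purely for intuition about what the metric measures, here is a minimal sketch of NDCG@k with binary relevance labels (the example rankings are invented, not pipeline output):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevant items count more near the top."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k = DCG of the actual ranking / DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Binary relevance of the top-5 retrieved chunks (1 = relevant to the query).
base_model = [0, 1, 0, 0, 1]   # relevant chunks ranked lower
tuned_model = [1, 1, 0, 0, 0]  # relevant chunks ranked first
print(ndcg_at_k(base_model, k=5), ndcg_at_k(tuned_model, k=5))
```

A higher NDCG after tuning means relevant chunks are being ranked closer to the top of the retrieved list.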
{
@@ -1226,7 +1207,7 @@
},
"outputs": [],
"source": [
"endpoint = aiplatform.Endpoint.create(\n",
"endpoint = vertexai.Endpoint.create(\n",
" display_name=\"tuned_custom_embedding_endpoint\",\n",
" description=\"Endpoint for tuned model embeddings.\",\n",
" project=PROJECT_ID,\n",
