
Commit

Updated NS, data preprocessing, watsonx.ai, and source attribution pages
Shirley Han committed Aug 7, 2024
1 parent 124d0b3 commit e4f0748
Showing 6 changed files with 61 additions and 7 deletions.
6 changes: 4 additions & 2 deletions docs/02-Create/01-Data Preprocessing.mdx
@@ -10,13 +10,15 @@ custom_edit_url: null
Data preprocessing is a crucial step in implementing Retrieval Augmented Generation (RAG) functionality, as it enables the transformation of raw data into a structured and contextualized format that can be effectively processed by Large Language Models (LLMs). This process involves a series of steps, including data merging, contextualizing, and preparation, to ensure that the input data is accurate, consistent, and suitable for RAG analysis. By preprocessing the data, we can unlock the full potential of LLMs and Watson Discovery, enabling more accurate and reliable results in RAG-related applications.
## 2. Input Data
- Multiple Excel sheets with product information
- Columns: Product family, part number, description, license metric, and other relevant data
- This is an example Excel sheet with product number and product description columns:
![](../../assets/screenshots/data_proprocessing_excel.png)
## 3. Data Merging
- Merging multiple Excel sheets into a single dataset
- Importance of data consistency and accuracy
- Beware of stray spaces in data (e.g., "Yes" vs. "Yes ")
- Merged cells need to be unmerged so that each row contains values for all the necessary columns
- Be sure to add a new column for URLs if the information needs to be displayed from the watsonx Assistant side (a minimal cleanup sketch follows this list)
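
A minimal sketch of this merge-and-clean step using pandas is shown below. The file name, column names, and URL pattern are hypothetical placeholders, not part of the original pipeline; adapt them to your own spreadsheets.

```python
import pandas as pd  # reading .xlsx files also requires openpyxl

# Read every sheet of the workbook into a dict of DataFrames
# ("products.xlsx" and the column names below are hypothetical examples).
sheets = pd.read_excel("products.xlsx", sheet_name=None)

# Merge all sheets into a single dataset.
df = pd.concat(sheets.values(), ignore_index=True)

# Cells that were merged in Excel import as blanks; forward-fill so
# every row carries all the necessary values.
df["Product family"] = df["Product family"].ffill()

# Trim stray whitespace so values like "Yes " match "Yes".
for col in df.columns:
    df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

# Add a URL column if links should be surfaced from the watsonx Assistant side.
df["url"] = "https://www.ibm.com/products/" + df["Part number"].astype(str)
```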
## 4. Contextualizing Data for LLM and RAG
- When a CSV file is uploaded to Watson Discovery, each row is transformed into a JSON object, allowing for efficient processing and analysis. However, for RAG use cases, the data can be scattered and fragmented, making it challenging for Large Language Models (LLMs) to process effectively.
- To address this limitation, we developed a custom data pipeline that transforms all columns of data into a single paragraph of information. This pipeline enables the creation of a cohesive and structured input for LLMs, facilitating more accurate and reliable processing of RAG-related data.
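
Below is a simplified, hypothetical sketch of what such a pipeline step can look like: every non-empty column of a row is flattened into one paragraph and stored in a single text field for Watson Discovery. The sample rows, part numbers, and column names are illustrative only.

```python
import pandas as pd

def row_to_paragraph(row: pd.Series) -> str:
    """Flatten every non-empty column of a row into one paragraph of text."""
    parts = [f"{col}: {val}" for col, val in row.items() if pd.notna(val)]
    return ". ".join(parts) + "."

# Hypothetical example rows; in practice this is the merged dataset from the previous step.
df = pd.DataFrame({
    "Product family": ["Db2", "Db2"],
    "Part number": ["D0ABCLL", "D0XYZLL"],
    "Description": ["Db2 Standard Edition", "Db2 Advanced Edition"],
    "License metric": ["Virtual Processor Core", "Virtual Processor Core"],
})

df["text"] = df.apply(row_to_paragraph, axis=1)
print(df["text"].iloc[0])
# Product family: Db2. Part number: D0ABCLL. Description: Db2 Standard Edition. License metric: Virtual Processor Core.
```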
@@ -30,4 +32,4 @@
## 5. Preparing Data for Watson Discovery
- Save the data as CSV files; Watson Discovery separates each CSV row into an independent document that can then be used for RAG and LLM search.
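
A hedged sketch of uploading the resulting CSV with the Watson Discovery v2 Python SDK (`ibm-watson` package) is shown below; the credentials, service URL, and IDs are placeholders, and the CSV can equally be uploaded through the Discovery tooling UI.

```python
from ibm_watson import DiscoveryV2
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and IDs -- replace with your own service details.
authenticator = IAMAuthenticator("<api-key>")
discovery = DiscoveryV2(version="2020-08-30", authenticator=authenticator)
discovery.set_service_url("<discovery-service-url>")

with open("products_contextualized.csv", "rb") as csv_file:
    result = discovery.add_document(
        project_id="<project-id>",
        collection_id="<collection-id>",
        file=csv_file,
        filename="products_contextualized.csv",
        file_content_type="text/csv",
    ).get_result()

print(result)  # returns the new document_id and its ingestion status
```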
## 6. Conclusion
- Recap of data preprocessing steps
- With spreadsheets as input data, it is important to clean and contextualize the data using a custom data pipeline. This ensures that the RAG technique works properly and helps the LLM understand the input data better. Then, store the data in Watson Discovery before integrating with watsonx Assistant, watsonx.ai, and NeuralSeek.
3 changes: 3 additions & 0 deletions docs/02-Create/03-Watsonx Assistant Setup.mdx
@@ -92,3 +92,6 @@ Extension setup will look similar to screenshot below
- **In the NeuralSeek Search action, set query_text to the expression `input.text`**. This allows the autocorrected input to be passed to the NeuralSeek extension. Originally, query_text was set to the expression `input.original_text`; if users entered typos, the misspelled text would be passed to NeuralSeek.
![](./assets/watsonx-assistant-neuralseek-search-inputtext.png)
- Reference: [Correcting user input](https://cloud.ibm.com/docs/assistant-data?topic=assistant-data-dialog-runtime-spell-check)

## Dynamic Links
- To create dynamic links based on user input, store the user input in a variable and then append the variable to a specific URL within a watsonx Assistant action step. Then, have the assistant output the link as part of its response.
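- As a purely hypothetical illustration (the variable name and URL are examples, not part of the original setup): if the user's product ID is stored in a session variable named `product_id`, a step response can output a link such as `https://www.ibm.com/products/${product_id}`, using the same `${...}` variable syntax shown in the source attribution example later in this guide.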
45 changes: 41 additions & 4 deletions docs/02-Create/04-NeuralSeek Setup.mdx
@@ -42,17 +42,54 @@ First make sure you have a project created within watsonx.ai.

## Configuration & Tuning
Please see below for recommended settings:
- Knowledgebase Connection:
- Curation Data Field: text
- Link Field: url
- Document Name Field: extracted_metadata.title
- Attribute sources inside LLM Context by Document Name: Enabled
- Return the full document instead of passages (only enable this if all of your documents are short): Enabled

- Knowledge Base Tuning:
- Document Score Range: 0.8
- Max Documents Per Seek: 1
- Document Data Penalty: 0
- Snippet Size: 1000
- KnowledgeBase Query Cache (minutes): 0

- Prompt Engineering
- Weight Tuning Section
- Temperature: 0
- Top Probability: 0.7
- Frequency Penalty: -0.3
- Maximum Tokens: 0

- Answer Engineering & Preferences

- For "How verbose should an answer be", select second tier using the slider.

- Intent Matching & Cache Configuration

- Edited answer cache: 3
- Normal answer cache: 5
- Required Cache to Follow Context? Yes
- Required Cache to match the exact KB for the question and not the intent? No

- Governance & Guardrails

- For the Semantic Score section, please turn on the following using the toggle:
- Enable the Semantic Score Model
- Use Semantic Score as the basis for Warning & Minimum confidence. Do NOT enable this for use cases requiring language translation.
- Rerank the search results based on the Semantic Match
- Semantic Tuning
- Missing key search term penalty: 0.4
- Missing search term penalty: 0.25
- Source jump penalty: 6
- Total coverage weight: 0.25
- Rerank min coverage %: 0.25
- Warning confidence
- Confidence % for warning: 5
- Minimum Confidence
- Minimum Confidence %: 0
- Minimum Confidence % to display a URL: 5
- Minimum Text: 0
- Maximum Length: 100

## Testing
- Navigate to "Seek" tab. Test NeuralSeek with questions that are relevant to your documents, e.g. "What products or services do you offer?" You will be able to see the NeuralSeek answer with response details, metrics, and source.
14 changes: 13 additions & 1 deletion docs/02-Create/05-Source Attribution.mdx
@@ -6,5 +6,17 @@ custom_edit_url: null

# Source Attribution

This page details the technical solution for how we configured watsonx Assistant and NeuralSeek to enable Source Attribution and display clickable links to users.
In NeuralSeek under Knowledgebase Connection, make sure to have the following fields:
- Curation Data Field: text
- Link Field: url
- Document Name Field: extracted_metadata.title
- Attribute sources inside LLM Context by Document Name: Enabled
- Return the full document instead of passages (only enable this if all of your documents are short): Enabled

In watsonx Assistant, we can create a specific action step that displays the corresponding URL when a keyword is mentioned by the user.
For example, we can set a condition: if body.answer contains 'Lead Architect', a specific step ran successfully, and body.passages is defined, then the assistant replies with the corresponding link from the LLM response.
![](./assets/assistant_action_source_attribution.png)

Toggle to the JSON view to verify that the URL information is accessed correctly. In this case, it is ${step_519_result_1.body.passages[0].url}.
![](./assets/assistant_response_json.png)
