Merge pull request #124 from ibm-client-engineering/shirley-dev-2

Minor updates on preprocessing instruction
ibm-client-engineering · Jun 7, 2024 · e0de608
2 parents 19bbce3 + b41bd35
commit e0de608

Showing 3 changed files with 3 additions and 2 deletions.
diff --git a/docs/3-Use-Cases/NeuralSeek.mdx b/docs/3-Use-Cases/NeuralSeek.mdx
@@ -96,7 +96,7 @@ In addition to testing on NeuralSeek, we have written a script to allow testing
 We performed Pre-Processing and No OCR, No Pre-Processing and No OCR, and OCR experiments using the testing notebook.
 You can run the different experiments just by changing the Discovery collection ID and providing the questions and expected responses as string arrays.
 It uses the NeuralSeek API.
-Please refer to [Testing Notebook](Tables%20Testing.ipynb) for detailed steps.
+Please refer to [Testing Notebook](testing.ipynb) for detailed steps.
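
The experiment loop described above can be sketched as follows. This is a minimal illustration, not the notebook's code: `run_experiment`, `fake_ask`, and the substring-based match are all assumptions made for the example, and `ask` is a hypothetical stand-in for whatever call the notebook makes to the NeuralSeek API.

```python
from typing import Callable, Dict, List


def run_experiment(
    collection_id: str,
    questions: List[str],
    expected: List[str],
    ask: Callable[[str, str], str],
) -> List[Dict[str, object]]:
    """Ask every question against one collection and compare to the expected text.

    `ask(collection_id, question)` is a hypothetical callable that returns the
    answer as a string; in practice it would wrap the real API request.
    """
    if len(questions) != len(expected):
        raise ValueError("questions and expected must have the same length")

    results = []
    for question, want in zip(questions, expected):
        answer = ask(collection_id, question)
        results.append(
            {
                "question": question,
                "expected": want,
                "answer": answer,
                # Assumption: a case-insensitive substring check counts as a match.
                "match": want.lower() in answer.lower(),
            }
        )
    return results


def fake_ask(collection_id: str, question: str) -> str:
    """Offline stand-in for the API call, used only to exercise the loop."""
    return f"[{collection_id}] The basic plan costs $5."


results = run_experiment(
    "demo-collection",
    ["How much is the basic plan?"],
    ["$5"],
    fake_ask,
)
```

Switching experiments then amounts to passing a different `collection_id` while keeping the same two string arrays.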
 
 ## Download Logs
 - Proceed to API on Integrate tab

diff --git a/docs/3-Use-Cases/Watson Discovery.mdx b/docs/3-Use-Cases/Watson Discovery.mdx
@@ -8,7 +8,8 @@ custom_edit_url: null
 # Data Preprocessing
 
 - Data containing tables needs to be pre-processed so that the LLM can properly read the content in tables. 
-- Before uploading to watson discovery, run the following script on your files and upload the files generated by the output to Watson Discovery: [link](preprocess_file.ipynb)
+- Before uploading to Watson Discovery, run the following script on your files if needed, then upload the files it generates to Watson Discovery: [link](preprocess_file.ipynb). The provided script was written for the specific set of documents we used, so you will need to adapt it to your own PDF documents.
+- The script iterates through each page of the PDF file, finds all the tables, and transforms each table into natural language using an LLM. Having tables in natural-language form helps with question answering. The script takes PDFs as input and outputs HTML files.
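
The page-by-page flow described above can be sketched roughly as below. This is not the contents of `preprocess_file.ipynb`: the page dictionaries, `table_to_sentences`, and `pages_to_html` are hypothetical names, PDF parsing is assumed to have already happened, and the LLM step is replaced by a simple template so the example runs offline.

```python
from html import escape
from typing import Callable, Dict, Iterable, List

Table = List[List[str]]  # rows of cells; the first row is assumed to be the header


def table_to_sentences(table: Table) -> List[str]:
    """Stand-in for the LLM step: turn each data row into one plain sentence."""
    header, *rows = table
    return [
        "; ".join(f"{name}: {cell}" for name, cell in zip(header, row)) + "."
        for row in rows
    ]


def pages_to_html(
    pages: Iterable[Dict[str, object]],
    describe: Callable[[Table], List[str]] = table_to_sentences,
) -> str:
    """Walk the pages in order and emit HTML with tables rewritten as prose.

    Each page is assumed to be a dict with optional "text" and "tables" keys,
    produced by whatever PDF extraction step runs beforehand.
    """
    parts = ["<html><body>"]
    for number, page in enumerate(pages, start=1):
        parts.append(f"<h2>Page {number}</h2>")
        text = page.get("text")
        if text:
            parts.append(f"<p>{escape(str(text))}</p>")
        for table in page.get("tables", []):
            for sentence in describe(table):
                parts.append(f"<p>{escape(sentence)}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


pages = [
    {"text": "Rates", "tables": [[["Plan", "Price"], ["Basic", "$5"]]]},
]
html_output = pages_to_html(pages)
```

Passing a different `describe` callable is where an actual LLM call would plug in; the rest of the loop stays the same.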
 
 # Create Project and Collection
 

diff --git a/docs/3-Use-Cases/Tables Testing.ipynb → docs/3-Use-Cases/testing.ipynb b/docs/3-Use-Cases/Tables Testing.ipynb → docs/3-Use-Cases/testing.ipynb