updated README with information about nlm-ingestor, added bbox to blocks and added to_html, to_text to doc
nlmatics · Jan 24, 2024 · 35bd80f
1 parent ca66df1
commit 35bd80f
Showing 3 changed files with 44 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -2,6 +2,17 @@
 
 LLM Sherpa provides strategic APIs to accelerate large language model (LLM) use cases.
 
+## What's New
+
+> [!IMPORTANT]
+> llmsherpa back end service is now fully open sourced under Apache 2.0 Licence. See [https://github.com/nlmatics/nlm-ingestor](https://github.com/nlmatics/nlm-ingestor)
+> - You can now run your own servers using a docker image!
+> - Support for different file formats: DOCX, PPTX, HTML, TXT, XML
+> - OCR Support is built in
+> - Blocks now have coordinates: use the `bbox` property of blocks such as sections
+> - A new indent parser to better align all headings in a document to their corresponding level
+> - The free server and paid server are not updated with the latest code; users are requested to run their own servers using the instructions in nlm-ingestor
+
 ## LayoutPDFReader
 
 Most PDF to text parsers do not provide layout information. Often, even the sentences are split with arbitrary CR/LFs, making it very difficult to find paragraph boundaries. This poses various challenges in chunking and adding long running contextual information such as section header to the passages while indexing/vectorizing PDFs for LLM applications such as retrieval augmented generation (RAG). 
@@ -19,27 +30,29 @@ LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layo
 
 With LayoutPDFReader, developers can find optimal chunks of text to vectorize, and a solution for limited context window sizes of LLMs. 
 
-You can experiment with the library directly in Google Colab [here](https://colab.research.google.com/drive/1hx5Y2TxWriAuFXcwcjsu3huKyn39Q2id?usp=sharing)
+You can experiment with the library directly in **Google Colab** [here](https://colab.research.google.com/drive/1hx5Y2TxWriAuFXcwcjsu3huKyn39Q2id?usp=sharing)
 
 Here's a [writeup](https://open.substack.com/pub/ambikasukla/p/efficient-rag-with-document-layout?r=ft8uc&utm_campaign=post&utm_medium=web) explaining the problem and our approach. 
 
 Here's a LlamaIndex [blog](https://medium.com/@kirankurup/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125) explaining the need for smart chunking. 
 
-API Reference: [https://llmsherpa.readthedocs.io/](https://llmsherpa.readthedocs.io/)
+**API Reference**: [https://llmsherpa.readthedocs.io/](https://llmsherpa.readthedocs.io/)
+
+[How to use with Google Gemini Pro](https://medium.com/nlmatics/using-google-gemini-pro-with-your-pdfs-7c191a2fcd98)
+[How to use with Cohere Embed3](https://medium.com/nlmatics/ask-your-pdf-with-cohere-embed-v3-3eb5dab36945)
 
 ### Important Notes
 
  * The LayoutPDFReader is tested on a wide variety of PDFs. That being said, it is still challenging to get every PDF parsed correctly.
 * OCR is currently not supported. Only PDFs with a text layer are supported.
 
 > [!NOTE]
-> LLMSherpa uses a free and open api server. The server does not store your PDFs except for temporary storage during parsing.
+> LLMSherpa uses a free and open api server. The server does not store your PDFs except for temporary storage during parsing. This server will be decommissioned soon. 
+> Self-host your own private server using instructions at [https://github.com/nlmatics/nlm-ingestor](https://github.com/nlmatics/nlm-ingestor)
 
 > [!IMPORTANT]
-> Private hosting is now available via [Microsoft Azure Marketplace](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/nlmaticscorp1686371242615.layout_pdf_parser?tab=Overview)!
-
-
-*For on premise hosting options, premium support or custom license options, create a custom licensing ticket [here](https://nlmatics.atlassian.net/servicedesk/customer/portals).*
+> The private hosting available at [Microsoft Azure Marketplace](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/nlmaticscorp1686371242615.layout_pdf_parser?tab=Overview) 
+> will be decommissioned soon. Please move to your self-hosted instance using instructions at [https://github.com/nlmatics/nlm-ingestor](https://github.com/nlmatics/nlm-ingestor).
 
 
 ### Installation

diff --git a/llmsherpa/readers/layout_reader.py b/llmsherpa/readers/layout_reader.py
@@ -14,9 +14,11 @@ class Block:
     block_idx: int
         id of the block as returned from the server. It starts from 0 and is -1 if the id is not available
     top: float
-        top position of the block in the page and it is -1 if the position is not available
+        top position of the block in the page and it is -1 if the position is not available - only available for tables
     left: float
-        left position of the block in the page and it is -1 if the position is not available
+        left position of the block in the page and it is -1 if the position is not available - only available for tables
+    bbox: [float]
+        bounding box of the block in the page and it is [] if the bounding box is not available
     sentences: list
         list of sentences in the block
     children: list
@@ -34,6 +36,7 @@ def __init__(self, block_json=None):
         self.block_idx = block_json['block_idx'] if block_json and 'block_idx' in block_json else -1
         self.top = block_json['top'] if block_json and 'top' in block_json else -1
         self.left = block_json['left'] if block_json and 'left' in block_json else -1
+        self.bbox = block_json['bbox'] if block_json and 'bbox' in block_json else []
         self.sentences = block_json['sentences'] if block_json and 'sentences' in block_json else []
         self.children = []
         self.parent = None
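
The `bbox` default added in this hunk follows the same pattern as `top` and `left`: a missing key, or no payload at all, yields a sentinel value rather than raising. The following is a minimal sketch of that behaviour only. `BlockSketch` and its `has_bbox` helper are hypothetical stand-ins written for illustration (they are not the library's `Block` class), and the coordinate values are made up; the diff does not state the ordering of the numbers inside `bbox`, so the sketch does not interpret them.

```python
class BlockSketch:
    """Hypothetical stand-in mirroring only the field defaults shown in the diff."""

    def __init__(self, block_json=None):
        # Same conditional-expression pattern as the diff: fall back to a
        # sentinel when the payload is missing or lacks the key.
        self.top = block_json['top'] if block_json and 'top' in block_json else -1
        self.left = block_json['left'] if block_json and 'left' in block_json else -1
        self.bbox = block_json['bbox'] if block_json and 'bbox' in block_json else []

    def has_bbox(self):
        # Helper invented for this sketch: an empty list means "no bounding box".
        return len(self.bbox) > 0


with_box = BlockSketch({'bbox': [72.0, 90.5, 540.0, 120.25]})  # illustrative values
without_box = BlockSketch({'block_idx': 3})  # payload present, but no 'bbox' key
empty = BlockSketch()  # no payload at all
```

Callers that want coordinates can therefore check for an empty list before using `bbox`, in the same way they would check `top` or `left` against `-1`.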
@@ -518,3 +521,21 @@ def sections(self):
         Returns all the sections in the document. This is useful for getting all the sections in a document.
         """
         return self.root_node.sections()
+    def to_text(self):
+        """
+        Returns the text of the document by iterating through all the sections, appending a newline after each
+        """
+        text = ""
+        for section in self.sections():
+            text = text + section.to_text(include_children=True, recurse=True) + "\n"
+        return text
+
+    def to_html(self):
+        """
+        Returns html for the document by iterating through all the sections
+        """
+        html_str = "<html>"
+        for section in self.sections():
+            html_str = html_str + section.to_html(include_children=True, recurse=True)
+        html_str = html_str + "</html>"
+        return html_str
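
To make the shape of the two new methods concrete, here is a self-contained sketch of the same aggregation loops. `SectionSketch` and `DocumentSketch` are hypothetical stand-ins, not the library's classes: the stand-in section ignores the `include_children` and `recurse` flags and emits simplified text and HTML, so only the document-level concatenation reflects the diff.

```python
class SectionSketch:
    """Hypothetical stand-in for a section node; not the library's Section class."""

    def __init__(self, title, body):
        self.title = title
        self.body = body

    def to_text(self, include_children=False, recurse=False):
        # The real method walks child blocks; this sketch ignores both flags.
        return self.title + "\n" + self.body

    def to_html(self, include_children=False, recurse=False):
        # Simplified markup chosen for the sketch only.
        return "<h1>" + self.title + "</h1><p>" + self.body + "</p>"


class DocumentSketch:
    """Hypothetical stand-in reproducing the to_text/to_html loops from the diff."""

    def __init__(self, sections):
        self._sections = sections

    def sections(self):
        return self._sections

    def to_text(self):
        # A newline is appended after every section, including the last one.
        text = ""
        for section in self.sections():
            text = text + section.to_text(include_children=True, recurse=True) + "\n"
        return text

    def to_html(self):
        # Section fragments are concatenated inside a single <html> wrapper.
        html_str = "<html>"
        for section in self.sections():
            html_str = html_str + section.to_html(include_children=True, recurse=True)
        html_str = html_str + "</html>"
        return html_str


doc = DocumentSketch([SectionSketch("Intro", "Hello."), SectionSketch("Usage", "Run it.")])
plain = doc.to_text()
markup = doc.to_html()
```

In this sketch the text output ends with a trailing newline and the HTML output has no `<body>` element, which matches the concatenation in the diff; whether that is suitable depends on how the result is consumed.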
diff --git a/setup.py b/setup.py
@@ -2,7 +2,7 @@
 
 setup(
     name='llmsherpa',
-    version='0.1.3',    
+    version='0.1.4',    
     description='Strategic APIs to Accelerate LLM Use Cases',
     long_description=open('README.md').read(),
     long_description_content_type='text/markdown',