Commit

updated README with information about nlm-ingestor, added bbox to blocks and added to_html, to_text to doc
Ambika Sukla committed Jan 24, 2024
1 parent ca66df1 commit 35bd80f
Showing 3 changed files with 44 additions and 10 deletions.
27 changes: 20 additions & 7 deletions README.md
@@ -2,6 +2,17 @@

LLM Sherpa provides strategic APIs to accelerate large language model (LLM) use cases.

## What's New

> [!IMPORTANT]
> The llmsherpa backend service is now fully open source under the Apache 2.0 License. See [https://github.com/nlmatics/nlm-ingestor](https://github.com/nlmatics/nlm-ingestor)
> - You can now run your own servers using a docker image!
> - Support for different file formats: DOCX, PPTX, HTML, TXT, XML
> - OCR Support is built in
> - Blocks now have coordinates: use the bbox property of blocks such as sections
> - A new indent parser to better align all headings in a document to their corresponding level
> - The free server and paid server are not updated with the latest code; users are requested to spawn their own servers using the instructions in nlm-ingestor
## LayoutPDFReader

Most PDF-to-text parsers do not provide layout information. Often, even the sentences are split with arbitrary CR/LFs, making it very difficult to find paragraph boundaries. This poses various challenges in chunking and in adding long-running contextual information, such as section headers, to the passages while indexing/vectorizing PDFs for LLM applications such as retrieval augmented generation (RAG).
@@ -19,27 +30,29 @@ LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layo

With LayoutPDFReader, developers can find optimal chunks of text to vectorize, and a solution for limited context window sizes of LLMs.

You can experiment with the library directly in Google Colab [here](https://colab.research.google.com/drive/1hx5Y2TxWriAuFXcwcjsu3huKyn39Q2id?usp=sharing)
You can experiment with the library directly in **Google Colab** [here](https://colab.research.google.com/drive/1hx5Y2TxWriAuFXcwcjsu3huKyn39Q2id?usp=sharing)

Here's a [writeup](https://open.substack.com/pub/ambikasukla/p/efficient-rag-with-document-layout?r=ft8uc&utm_campaign=post&utm_medium=web) explaining the problem and our approach.

Here's a LlamaIndex [blog](https://medium.com/@kirankurup/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125) explaining the need for smart chunking.

API Reference: [https://llmsherpa.readthedocs.io/](https://llmsherpa.readthedocs.io/)
**API Reference**: [https://llmsherpa.readthedocs.io/](https://llmsherpa.readthedocs.io/)

[How to use with Google Gemini Pro](https://medium.com/nlmatics/using-google-gemini-pro-with-your-pdfs-7c191a2fcd98)
[How to use with Cohere Embed3](https://medium.com/nlmatics/ask-your-pdf-with-cohere-embed-v3-3eb5dab36945)

### Important Notes

* The LayoutPDFReader is tested on a wide variety of PDFs. That being said, it is still challenging to get every PDF parsed correctly.
* OCR is currently not supported. Only PDFs with a text layer are supported.

> [!NOTE]
> LLMSherpa uses a free and open api server. The server does not store your PDFs except for temporary storage during parsing.
> LLMSherpa uses a free and open api server. The server does not store your PDFs except for temporary storage during parsing. This server will be decommissioned soon.
> Self-host your own private server using instructions at [https://github.com/nlmatics/nlm-ingestor](https://github.com/nlmatics/nlm-ingestor)
> [!IMPORTANT]
> Private hosting is now available via [Microsoft Azure Marketplace](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/nlmaticscorp1686371242615.layout_pdf_parser?tab=Overview)!

*For on premise hosting options, premium support or custom license options, create a custom licensing ticket [here](https://nlmatics.atlassian.net/servicedesk/customer/portals).*
> The private hosting available at [Microsoft Azure Marketplace](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/nlmaticscorp1686371242615.layout_pdf_parser?tab=Overview)
> will be decommissioned soon. Please move to a self-hosted instance using the instructions at [https://github.com/nlmatics/nlm-ingestor](https://github.com/nlmatics/nlm-ingestor).

### Installation
25 changes: 23 additions & 2 deletions llmsherpa/readers/layout_reader.py
@@ -14,9 +14,11 @@ class Block:
    block_idx: int
        id of the block as returned from the server. It starts from 0 and is -1 if the id is not available
    top: float
        top position of the block in the page and it is -1 if the position is not available
        top position of the block in the page and it is -1 if the position is not available - only available for tables
    left: float
        left position of the block in the page and it is -1 if the position is not available
        left position of the block in the page and it is -1 if the position is not available - only available for tables
    bbox: [float]
        bounding box of the block in the page and it is [] if the bounding box is not available
    sentences: list
        list of sentences in the block
    children: list
@@ -34,6 +36,7 @@ def __init__(self, block_json=None):
        self.block_idx = block_json['block_idx'] if block_json and 'block_idx' in block_json else -1
        self.top = block_json['top'] if block_json and 'top' in block_json else -1
        self.left = block_json['left'] if block_json and 'left' in block_json else -1
        self.bbox = block_json['bbox'] if block_json and 'bbox' in block_json else []
        self.sentences = block_json['sentences'] if block_json and 'sentences' in block_json else []
        self.children = []
        self.parent = None
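The `__init__` hunk above reads every optional field with the same guard: use the server's value when the key is present, otherwise fall back to a sentinel. The following is a minimal sketch of that pattern only. `MiniBlock` and `has_position` are hypothetical names, not part of llmsherpa, and the sample coordinates are made up; the diff does not state the order of the values inside `bbox`.

```python
class MiniBlock:
    """Hypothetical stand-in that mirrors only the optional-field guard in the diff."""

    def __init__(self, block_json=None):
        # Scalar positions fall back to -1 when the server omits them.
        self.top = block_json['top'] if block_json and 'top' in block_json else -1
        self.left = block_json['left'] if block_json and 'left' in block_json else -1
        # The bounding box falls back to an empty list rather than -1.
        self.bbox = block_json['bbox'] if block_json and 'bbox' in block_json else []


def has_position(block):
    """Return True when the block carries a non-empty bounding box."""
    return len(block.bbox) > 0


# Illustrative values only; the real server decides what the list contains.
with_box = MiniBlock({'bbox': [72.0, 90.5, 540.0, 120.25]})
without_box = MiniBlock()
print(has_position(with_box), has_position(without_box))
```

Because the fallback for `bbox` is `[]` while `top` and `left` fall back to `-1`, callers checking for a missing position need a length or truthiness test for `bbox` rather than a comparison with `-1`.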
@@ -518,3 +521,21 @@ def sections(self):
        Returns all the sections in the document. This is useful for getting all the sections in a document.
        """
        return self.root_node.sections()

    def to_text(self):
        """
        Returns the text of the document by iterating through all the sections, appending '\n' after each
        """
        text = ""
        for section in self.sections():
            text = text + section.to_text(include_children=True, recurse=True) + "\n"
        return text

    def to_html(self):
        """
        Returns html for the document by iterating through all the sections
        """
        html_str = "<html>"
        for section in self.sections():
            html_str = html_str + section.to_html(include_children=True, recurse=True)
        html_str = html_str + "</html>"
        return html_str
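The two new methods build their result by concatenating per-section output. Below is a self-contained sketch of that aggregation, not the library's code: `FakeSection` and `FakeDocument` are hypothetical stand-ins, and the `<h1>`/`<p>` markup produced by the stand-in section is an assumption chosen for illustration, since the diff does not show what a real section's `to_html` emits.

```python
class FakeSection:
    """Hypothetical stand-in for a section; it only mimics the two method signatures."""

    def __init__(self, title, body):
        self.title = title
        self.body = body

    def to_text(self, include_children=False, recurse=False):
        # The stand-in has no children, so the flags are accepted but unused.
        return self.title + "\n" + self.body

    def to_html(self, include_children=False, recurse=False):
        # Assumed markup for illustration only.
        return "<h1>" + self.title + "</h1><p>" + self.body + "</p>"


class FakeDocument:
    """Hypothetical stand-in that aggregates sections the way the diff does."""

    def __init__(self, sections):
        self._sections = sections

    def sections(self):
        return self._sections

    def to_text(self):
        # A newline is appended after every section, including the last one.
        return "".join(
            s.to_text(include_children=True, recurse=True) + "\n"
            for s in self.sections()
        )

    def to_html(self):
        # All section fragments are wrapped in a single <html> element.
        parts = [s.to_html(include_children=True, recurse=True) for s in self.sections()]
        return "<html>" + "".join(parts) + "</html>"


doc = FakeDocument([FakeSection("Intro", "Hello."), FakeSection("Usage", "Run it.")])
print(doc.to_text())
print(doc.to_html())
```

The sketch uses `"".join(...)` where the diff uses repeated `+` concatenation; both produce the same string, and `join` avoids creating an intermediate string per section.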
2 changes: 1 addition & 1 deletion setup.py
@@ -2,7 +2,7 @@

setup(
    name='llmsherpa',
    version='0.1.3',
    version='0.1.4',
    description='Strategic APIs to Accelerate LLM Use Cases',
    long_description=open('README.md').read(),
    long_description_content_type='text/markdown',