Merge pull request #20 from nlmatics/ambika
Added documentation and metadata
kiran-nlmatics authored Nov 1, 2023
2 parents 3ebef24 + c383d5a commit f3cdc25
Showing 15 changed files with 438 additions and 76 deletions.
Binary file added .DS_Store
Binary file not shown.
3 changes: 3 additions & 0 deletions .gitignore
@@ -8,6 +8,9 @@ __pycache__/

# Distribution / packaging
.Python
_build/
_static/
_templates/
build/
develop-eggs/
dist/
35 changes: 35 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,35 @@
# Read the Docs configuration file for Sphinx projects
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.12"
    # You can also specify other tool versions:
    # nodejs: "20"
    # rust: "1.70"
    # golang: "1.20"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
  configuration: docs/conf.py
  # You can configure Sphinx to use a different builder, for instance use the dirhtml builder for simpler URLs
  # builder: "dirhtml"
  # Fail on all warnings to avoid broken references
  # fail_on_warning: true

# Optionally build your docs in additional formats such as PDF and ePub
# formats:
#   - pdf
#   - epub

# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
# python:
#   install:
#     - requirements: docs/requirements.txt
96 changes: 50 additions & 46 deletions README.md
@@ -28,7 +28,9 @@ You can experiment with the library directly in Google Colab [here](https://cola

Here's a [writeup](https://open.substack.com/pub/ambikasukla/p/efficient-rag-with-document-layout?r=ft8uc&utm_campaign=post&utm_medium=web) explaining the problem and our approach.

Here's another [blog](https://medium.com/@kirankurup/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125) explaining the solution.
Here's a LlamaIndex [blog](https://medium.com/@kirankurup/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125) explaining the need for smart chunking.

API Reference: [https://llmsherpa.readthedocs.io/](https://llmsherpa.readthedocs.io/)

### Installation

@@ -64,6 +66,53 @@ pip install llama-index
import openai
openai.api_key = "<Insert API Key>"  # replace the placeholder with your own key
```

### Vector search and Retrieval Augmented Generation with Smart Chunking

LayoutPDFReader does smart chunking, using the document structure to keep related text together:

* All list items are kept together, including the paragraph that precedes the list.
* Items in a table are chunked together.
* Contextual information from section headers and nested section headers is included.

The following code creates a LlamaIndex query engine from LayoutPDFReader document chunks:

```python
from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
query_engine = index.as_query_engine()
```

Let's run one query:

```python
response = query_engine.query("list all the tasks that work with bart")
print(response)
```

We get the following response:

```
BART works well for text generation, comprehension tasks, abstractive dialogue, question answering, and summarization tasks.
```

Let's try another query that needs an answer from a table:

```python
response = query_engine.query("what is the bart performance score on squad")
print(response)
```

Here's the response we get:

```
The BART performance score on SQuAD is 88.8 for EM and 94.6 for F1.
```
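
The third bullet above (header context) can be illustrated with a small, library-independent sketch. This is only an assumption about the general idea, not LayoutPDFReader's implementation: the `Chunk` class, its `headers` field, and the `context_text` helper are hypothetical names, and the sample values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for a parsed chunk; not an llmsherpa class."""
    text: str
    # Titles of the enclosing sections, outermost first.
    headers: List[str] = field(default_factory=list)


def context_text(chunk: Chunk) -> str:
    """Prefix the chunk text with its section path so it can stand on its own."""
    path = " > ".join(chunk.headers)
    return f"{path}\n{chunk.text}" if path else chunk.text


# Placeholder table row and made-up section titles, for illustration only.
chunk = Chunk(text="BART | 88.8 | 94.6", headers=["Results", "SQuAD"])
print(context_text(chunk))
```

Indexing the header path together with the row is what lets a query that mentions "SQuAD" match a table row that never repeats the word itself.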

### Summarize a Section using prompts

LayoutPDFReader offers powerful ways to pick sections and subsections from a large document and use LLMs to extract insights from a section.
@@ -179,51 +228,6 @@ R1 of BART for different datasets:
- For the XSum dataset, the R1 score of BART is 45.14.
```

### Vector search and Retrieval Augmented Generation with Smart Chunking

LayoutPDFReader does smart chunking keeping the integrity of related text together:

* All list items are kept together, including the paragraph that precedes the list.
* Items in a table are chunked together.
* Contextual information from section headers and nested section headers is included

The following code creates a LlamaIndex query engine from LayoutPDFReader document chunks

```python
from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
query_engine = index.as_query_engine()
```

Let's run one query:

```python
response = query_engine.query("list all the tasks that work with bart")
print(response)
```

We get the following response:

```
BART works well for text generation, comprehension tasks, abstractive dialogue, question answering, and summarization tasks.
```

Let's try another query that needs an answer from a table:

```python
response = query_engine.query("what is the bart performance score on squad")
print(response)
```

Here's the response we get:

```
The BART performance score on SQuAD is 88.8 for EM and 94.6 for F1.
```

### Get the Raw JSON

36 changes: 36 additions & 0 deletions docs/conf.py
@@ -0,0 +1,36 @@
import os
import sys
sys.path.insert(0, os.path.abspath('../'))
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'LLM Sherpa'
copyright = '2023, Ambika Sukla'
author = 'Ambika Sukla'
release = '0.1.3'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = [
    'sphinx.ext.doctest',
    'sphinx.ext.autodoc',
    'sphinx.ext.autosummary',
    'sphinx.ext.napoleon',
]

templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']



# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'alabaster'
html_static_path = ['_static']
20 changes: 20 additions & 0 deletions docs/index.rst
@@ -0,0 +1,20 @@
.. LLM Sherpa documentation master file, created by
   sphinx-quickstart on Wed Nov 1 09:09:16 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to LLM Sherpa's documentation!
======================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   API reference <modules>

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
29 changes: 29 additions & 0 deletions docs/llmsherpa.readers.rst
@@ -0,0 +1,29 @@
llmsherpa.readers package
=========================

Submodules
----------

llmsherpa.readers.file\_reader module
-------------------------------------

.. automodule:: llmsherpa.readers.file_reader
   :members:
   :undoc-members:
   :show-inheritance:

llmsherpa.readers.layout\_reader module
---------------------------------------

.. automodule:: llmsherpa.readers.layout_reader
   :members:
   :undoc-members:
   :show-inheritance:

Module contents
---------------

.. automodule:: llmsherpa.readers
   :members:
   :undoc-members:
   :show-inheritance:
18 changes: 18 additions & 0 deletions docs/llmsherpa.rst
@@ -0,0 +1,18 @@
llmsherpa package
=================

Subpackages
-----------

.. toctree::
   :maxdepth: 4

   llmsherpa.readers

Module contents
---------------

.. automodule:: llmsherpa
   :members:
   :undoc-members:
   :show-inheritance:
7 changes: 7 additions & 0 deletions docs/modules.rst
@@ -0,0 +1,7 @@
llmsherpa
=========

.. toctree::
   :maxdepth: 4

   llmsherpa
Binary file added llmsherpa/.DS_Store
Binary file not shown.
2 changes: 1 addition & 1 deletion llmsherpa/__init__.py
@@ -4,6 +4,6 @@
APIs to accelerate LLM use cases.
"""

__version__ = "0.1.2"
__version__ = "0.1.3"
__author__ = 'Ambika Sukla'
__credits__ = 'NLMATICS CORP.'
44 changes: 29 additions & 15 deletions llmsherpa/readers/file_reader.py
@@ -5,12 +5,30 @@
from llmsherpa.readers import Document

class LayoutPDFReader:
    """
    Reads PDF content and understands hierarchical layout of the document sections and structural components such as paragraphs, sentences, tables, lists, sublists

    Parameters
    ----------
    parser_api_url: str
        API url for LLM Sherpa. Use customer url for your private instance here
    """
    def __init__(self, parser_api_url):
        """
        Constructs a LayoutPDFReader from a parser endpoint.

        Parameters
        ----------
        parser_api_url: str
            API url for LLM Sherpa. Use customer url for your private instance here
        """
        self.parser_api_url = parser_api_url
        self.download_connection = urllib3.PoolManager()
        self.api_connection = urllib3.PoolManager()

    def _download_pdf(self, pdf_url):

        # some servers only allow a browser user_agent to download
        user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
        # add authorization headers if using external API (see upload_pdf for an example)
@@ -28,6 +46,16 @@ def _parse_pdf(self, pdf_file):
        return parser_response

    def read_pdf(self, path_or_url, contents=None):
        """
        Reads pdf from a url or path

        Parameters
        ----------
        path_or_url: str
            path or url to the pdf file e.g. https://someexample.com/myfile.pdf or /home/user/myfile.pdf
        contents: bytes
            contents of the pdf file. If contents is given, path_or_url is ignored. This is useful when you already have the pdf file contents in memory such as if you are using streamlit or flask.
        """
        # file contents were given
        if contents is not None:
            pdf_file = (path_or_url, contents, 'application/pdf')
@@ -43,18 +71,4 @@ def read_pdf(self, path_or_url, contents=None):
        parser_response = self._parse_pdf(pdf_file)
        response_json = json.loads(parser_response.data.decode("utf-8"))
        blocks = response_json['return_dict']['result']['blocks']
        return Document(blocks)
    # def read_file(file_path):

def main():
    llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
    pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"
    pdf_url = "/Users/ambikasukla/Documents/1910.13461.pdf"
    pdf_reader = LayoutPDFReader(llmsherpa_api_url)
    doc = pdf_reader.read_pdf(pdf_url)
    print(doc.sections()[5].to_html(include_children=True, recurse=True))

# Using the special variable
# __name__
if __name__=="__main__":
    main()
        return Document(blocks)
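
The `contents` parameter documented in `read_pdf` above may be easier to follow with a small sketch of the in-memory case (for example, bytes received from a Streamlit or Flask upload). The helper `build_upload` is hypothetical and only mirrors the `(name, bytes, MIME type)` tuple that `read_pdf` builds when `contents` is given; the actual library call is left in a comment because it needs a reachable parser endpoint.

```python
def build_upload(file_name, contents):
    """Hypothetical helper: mirror the tuple read_pdf builds when contents is given."""
    if not isinstance(contents, (bytes, bytearray)):
        raise TypeError("contents must be bytes")
    return (file_name, bytes(contents), "application/pdf")


# Placeholder payload standing in for bytes from an upload widget or request body.
uploaded = b"%PDF-1.4 placeholder"
pdf_file = build_upload("myfile.pdf", uploaded)

# With a reachable parser endpoint, the equivalent library call would be:
#   reader = LayoutPDFReader(parser_api_url)
#   doc = reader.read_pdf("myfile.pdf", contents=uploaded)
```

Note that in the tuple the first element is still the `path_or_url` argument, so in this branch it serves as the file name rather than a location to read from.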