Improve the READMEs and add documentation for ADI Skill (#7)
* Progress on readmes

* Add comprehensive documentation for ADI

* Fix json
BenConstable9 committed Sep 9, 2024
1 parent 02bc2d9 commit 721851f
Showing 5 changed files with 216 additions and 3 deletions.
14 changes: 12 additions & 2 deletions README.md
@@ -1,12 +1,22 @@
# Text2SQL and Image Processing in AI Search

This repo provides sample code for improving RAG applications with rich data sources.
This repo provides sample code for improving RAG applications with rich data sources including SQL Warehouses and documents analysed with Azure Document Intelligence.

The plugins and skills provided in this repository are intended to be adapted and added to your new or existing RAG application to improve the response quality.

## Components

- `./text2sql` contains a Multi-Shot implementation for Text2SQL generation and querying, which can be used to answer questions backed by a database as a knowledge base.
- `./ai_search_with_adi` contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models to interpret and understand these.
- `./ai_search_with_adi` contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models (gpt4o) to interpret and understand these.

The above components have been successfully used on production RAG projects to increase the quality of responses. The code provided in this repo is a sample of the implementation and should be adjusted before being used in production.

## High Level Implementation

The following diagram shows a workflow for how the Text2SQL and AI Search plugins would be incorporated into a RAG application. Using the plugins available, alongside the Function Calling capabilities of LLMs, the LLM can do Chain of Thought reasoning to determine the steps needed to answer the question. This allows the LLM to recognise the intent of the question and therefore pick the appropriate data source, or a combination of both.

![High level workflow for a plugin driven RAG application](./images/Plugin%20Based%20RAG%20Flow.png "High Level Workflow")
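
As a rough illustration of the plugin selection described above, the sketch below routes a function call chosen by the LLM to the matching plugin. It is a minimal sketch under stated assumptions: the plugin names `text2sql` and `ai_search`, the placeholder functions, and the shape of the tool call (a name plus a JSON string of arguments) are hypothetical and are not taken from the code in this repo.

```python
import json
from typing import Callable


def run_text2sql(question: str) -> str:
    """Hypothetical placeholder for the Text2SQL plugin."""
    return f"[text2sql placeholder] {question}"


def run_ai_search(question: str) -> str:
    """Hypothetical placeholder for the AI Search plugin."""
    return f"[ai_search placeholder] {question}"


# Registry of plugins that the LLM is allowed to call (names are assumptions).
PLUGINS: dict[str, Callable[..., str]] = {
    "text2sql": run_text2sql,
    "ai_search": run_ai_search,
}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the plugin that the LLM selected via function calling."""
    if name not in PLUGINS:
        raise ValueError(f"Unknown plugin: {name}")
    arguments = json.loads(arguments_json)
    return PLUGINS[name](**arguments)
```

In a real application the name and arguments would come from the model's function call response, and the plugin result would be passed back to the model so it can decide on the next step.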

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
196 changes: 196 additions & 0 deletions ai_search_with_adi/README.md
@@ -0,0 +1,196 @@
# AI Search Indexing with Azure Document Intelligence

This portion of the repo contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models (gpt4o) to interpret and understand these.

The implementation is in Python, although it can easily be adapted for C# or another language. The code is designed to run in an Azure Function App inside the tenant.

**This approach makes use of Azure Document Intelligence v4.0 which is still in preview.**

## High Level Workflow

A common way to perform document indexing is to either extract the text content or use [optical character recognition](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-ocr) to gather the text content before indexing. Whilst this works well for simple files that contain mainly text-based information, the response quality diminishes significantly when the documents contain mainly charts and images, such as a PowerPoint presentation.

To solve this issue and to ensure that good quality information is extracted from the document, an indexer using [Azure Document Intelligence (ADI)](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-4.0.0) is developed with [Custom Skills](https://learn.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-web-api):

![High level workflow for indexing with Azure Document Intelligence based skills](./images/Indexing%20vs%20Indexing%20with%20ADI.png "Indexing with Azure Document Intelligence Approach")

Instead of using OCR to extract the contents of the document, ADIv4 is used to analyse the layout of the document and convert it to a Markdown format. The Markdown format brings benefits such as:

- Table layout
- Section and header extraction with Markdown headings
- Figure and image extraction

Once the Markdown is obtained, several steps are carried out:

1. **Extraction of images / charts**. The figures identified are extracted from the original document and passed to a multi-modal model (gpt4o in this case) for analysis. We obtain a description and summary of the chart / image to infer the meaning of the figure. This allows us to index, and perform RAG analysis on, information that is visually obtainable from a chart without being explicitly mentioned in the surrounding text. The description is added back into the original Markdown content.

2. **Extraction of sections and headers**. The sections and headers are extracted from the document and returned additionally to the indexer under a separate field. This allows us to store them as a separate field in the index and therefore surface the most relevant chunks.

3. **Cleaning of Markdown**. The final markdown content is cleaned of any characters or unsupported Markdown elements that we do not want in the chunk e.g. non-relevant images.

Page-wise analysis in ADI is used to avoid splitting tables / figures across multiple chunks when the chunking is performed.
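
The section and header extraction in step 2 can be sketched as follows. This is a minimal illustration that assumes the Markdown uses ATX-style headings (`#` to `######`); the function name `extract_sections` is hypothetical and this is not the repo's own implementation.

```python
import re

# Matches ATX headings such as "## Results" and captures the heading text.
HEADING_PATTERN = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_sections(markdown: str) -> list[str]:
    """Return the heading texts found in a Markdown string, in document order."""
    return [match.group(1) for match in HEADING_PATTERN.finditer(markdown)]
```

Note that this simple pattern would also pick up lines starting with `#` inside fenced code blocks, so a production version would need to skip those.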

The properties returned from the ADI Custom Skill are then used to perform the following skills:

- Pre-vectorisation cleaning
- Keyphrase extraction
- Vectorisation

## Provided Notebooks & Utilities

- `./ai_search.py`, `./deployment.py` provide an easy Python-based utility for deploying an index, indexer and corresponding skillset for AI Search.
- `./function_apps/indexer` provides a pre-built Python function app that communicates with Azure Document Intelligence, Azure OpenAI etc. to perform the Markdown conversion, extraction of figures, figure understanding and corresponding cleaning of Markdown.
- `./rag_with_ai_search.ipynb` provides an example of how to utilise the AI Search plugin to query the index.

## ADI Custom Skill

Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.

To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.

### function_app.py

`./function_apps/indexer/function_app.py` contains the HTTP entrypoints for the ADI skill and the other provided utility skills.

### adi_2_aisearch

`./function_apps/indexer/adi_2_aisearch.py` contains the methods for content extraction with ADI. The key methods are:

#### analyse_document

This method takes the passed file, uploads it to ADI and retrieves the Markdown format.

#### process_figures_from_extracted_content

This method takes the detected figures and crops them out of the page to save them as images. It uses `understand_image_with_vlm` to communicate with Azure OpenAI to understand the meaning of the extracted figure.

`update_figure_description` is used to update the original Markdown content with the description and meaning of the figure.
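
The idea of writing a figure description back into the Markdown can be sketched as below. This is a hedged illustration only: it assumes figures appear in the Markdown as `<figure>...</figure>` blocks and that descriptions are supplied in document order. The function name `add_figure_descriptions` is hypothetical and does not reflect the signature of `update_figure_description`.

```python
import re

# Assumption: each figure is wrapped in a <figure>...</figure> block.
FIGURE_BLOCK = re.compile(r"<figure>(.*?)</figure>", re.DOTALL)


def add_figure_descriptions(markdown: str, descriptions: list[str]) -> str:
    """Append one description to each figure block, in document order."""
    remaining = iter(descriptions)

    def _with_description(match: re.Match) -> str:
        description = next(remaining, None)
        if description is None:
            # No description available: leave the figure block unchanged.
            return match.group(0)
        return f"<figure>{match.group(1).rstrip()}\n{description}\n</figure>"

    return FIGURE_BLOCK.sub(_with_description, markdown)
```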

#### clean_adi_markdown

This method performs the final cleaning of the Markdown contents. In this method, the section headings and page numbers are extracted for the content to be returned to the indexer.
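
A minimal sketch of this kind of cleaning step is shown below. It assumes that the unwanted elements are HTML comments (for example page markers) and runs of blank lines; the actual rules applied by `clean_adi_markdown` may differ, and the function name `clean_markdown` is hypothetical.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
EXTRA_BLANK_LINES = re.compile(r"\n{3,}")


def clean_markdown(markdown: str) -> str:
    """Remove HTML comments, trailing whitespace and surplus blank lines."""
    without_comments = HTML_COMMENT.sub("", markdown)
    trimmed = "\n".join(line.rstrip() for line in without_comments.splitlines())
    # Collapse three or more consecutive newlines into a single blank line.
    return EXTRA_BLANK_LINES.sub("\n\n", trimmed).strip()
```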

### Input Format

The ADI Skill conforms to the [Azure AI Search Custom Skill Input Format](https://learn.microsoft.com/en-gb/azure/search/cognitive-search-custom-skill-web-api?WT.mc_id=Portal-Microsoft_Azure_Search#sample-input-json-structure). AI Search will automatically build this format if you use the utility file provided in this repo to build your indexer and skillset.

```json
{
    "values": [
        {
            "recordId": "0",
            "data": {
                "source": "<FULL URI TO BLOB>"
            }
        },
        {
            "recordId": "1",
            "data": {
                "source": "<FULL URI TO BLOB>"
            }
        }
    ]
}
```
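
To experiment with the skill outside of an indexer, a request in this format can be built by hand. The sketch below only constructs the request; the base URL, the use of the `x-functions-key` header for authentication, and passing `chunk_by_page` as a string header value are assumptions about how the function app is deployed, and the helper name `build_skill_request` is hypothetical.

```python
import json
import urllib.request


def build_skill_request(
    function_app_url: str,
    blob_uris: list[str],
    function_key: str,
    chunk_by_page: bool = True,
) -> urllib.request.Request:
    """Build a POST request in the AI Search custom skill input format."""
    payload = {
        "values": [
            {"recordId": str(index), "data": {"source": uri}}
            for index, uri in enumerate(blob_uris)
        ]
    }
    return urllib.request.Request(
        url=f"{function_app_url.rstrip('/')}/adi_2_ai_search",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the function app is secured with a function key.
            "x-functions-key": function_key,
            "chunk_by_page": str(chunk_by_page),
        },
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(request)` once the function app and its dependent resources are deployed.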

### Output Format

The ADI Skill conforms to the [Azure AI Search Custom Skill Output Format](https://learn.microsoft.com/en-gb/azure/search/cognitive-search-custom-skill-web-api?WT.mc_id=Portal-Microsoft_Azure_Search#sample-output-json-structure).

If `chunk_by_page` header is `True` (recommended):

```json
{
    "values": [
        {
            "recordId": "0",
            "data": {
                "extracted_content": [
                    {
                        "page_number": 1,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 1>"
                    },
                    {
                        "page_number": 2,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 2>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
                    }
                ]
            }
        },
        {
            "recordId": "1",
            "data": {
                "extracted_content": [
                    {
                        "page_number": 1,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 1>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 1>"
                    },
                    {
                        "page_number": 2,
                        "sections": [
                            "<LIST OF DETECTED HEADINGS AND SECTIONS FOR PAGE NUMBER 2>"
                        ],
                        "content": "<CLEANED MARKDOWN CONTENT FOR PAGE NUMBER 2>"
                    }
                ]
            }
        }
    ]
}
```

If `chunk_by_page` header is `False`:

```json
{
    "values": [
        {
            "recordId": "0",
            "data": {
                "extracted_content": {
                    "sections": [
                        "<LIST OF DETECTED HEADINGS AND SECTIONS FOR THE ENTIRE DOCUMENT>"
                    ],
                    "content": "<CLEANED MARKDOWN CONTENT FOR THE ENTIRE DOCUMENT>"
                }
            }
        },
        {
            "recordId": "1",
            "data": {
                "extracted_content": {
                    "sections": [
                        "<LIST OF DETECTED HEADINGS AND SECTIONS FOR THE ENTIRE DOCUMENT>"
                    ],
                    "content": "<CLEANED MARKDOWN CONTENT FOR THE ENTIRE DOCUMENT>"
                }
            }
        }
    ]
}
```

**Page-wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks when the chunking is performed.**
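
Because `extracted_content` is a list when chunking by page and a single object otherwise, a consumer of the skill output may want to handle both shapes uniformly. The sketch below is one possible way to do this, assuming the response has already been parsed from JSON into a dictionary; the function name `flatten_skill_output` and the output field `record_id` are hypothetical.

```python
from typing import Any


def flatten_skill_output(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Normalise both output shapes into a flat list of chunks."""
    chunks: list[dict[str, Any]] = []
    for record in response.get("values", []):
        extracted = (record.get("data") or {}).get("extracted_content")
        if extracted is None:
            continue
        # Page-wise output is a list; whole-document output is a single object.
        pages = extracted if isinstance(extracted, list) else [extracted]
        for page in pages:
            chunks.append(
                {
                    "record_id": record.get("recordId"),
                    "page_number": page.get("page_number"),
                    "sections": page.get("sections", []),
                    "content": page.get("content", ""),
                }
            )
    return chunks
```

For whole-document output, `page_number` is `None` because the skill does not return it in that mode.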


## Production Considerations

Below are some of the considerations that should be made before using this custom skill in production:

- This approach makes use of Azure Document Intelligence v4.0 which is still in preview. Features may change before the GA release. ADI v4.0 preview is only available in select regions.
- Azure Document Intelligence output quality varies significantly by file type. In our testing, a PDF file produces richer outputs, in terms of figure detection etc., than a PPTX file.

## Possible Improvements

Below are some possible improvements that could be made to the vectorisation approach:

- Storing the extracted figures in blob storage for access later. This would allow the LLM to resurface the correct figure, or provide a link to the figure in the reference system so that it can be displayed in the UI.
File renamed without changes
9 changes: 8 additions & 1 deletion text2sql/README.md
@@ -10,7 +10,7 @@ The sample provided works with Azure SQL Server, although it has been easily ada

The following diagram shows a workflow for how the Text2SQL plugin would be incorporated into a RAG application. Using the plugins available, alongside the [Function Calling](https://platform.openai.com/docs/guides/function-calling) capabilities of LLMs, the LLM can do [Chain of Thought](https://learn.microsoft.com/en-us/dotnet/ai/conceptual/chain-of-thought-prompting) reasoning to determine the steps needed to answer the question. This allows the LLM to recognise intent and therefore pick appropriate data sources based on the intent of the question.

![High level workflow for a plugin driven RAG application](./images/Plugin%20Based%20RAG%20Flow.png "High Level Workflow")
![High level workflow for a plugin driven RAG application](../images/Plugin%20Based%20RAG%20Flow.png "High Level Workflow")

## Why Text2SQL instead of indexing the database contents?

@@ -177,3 +177,10 @@ Below are some of the considerations that should be made before using this plugi
- Consider limiting the permissions of the identity or connection string to only allow access to certain tables or perform certain query types.
- If possible, run the queries under the identity of the end user so that any row or column level security is applied to the data.
- Consider data masking for sensitive columns that you do not wish to be exposed.

## Possible Improvements

Below are some possible improvements that could be made to the Text2SQL approach:

- Storing the entity names / definitions / selectors in a vector database and using a vector search to obtain the most relevant entities.
- Due to the small number of tokens that the current approach uses, vector search was not considered. If the number of tables is significantly larger, it may provide benefits in selecting the most appropriate tables (untested).
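
The vector search idea above can be sketched as a simple similarity ranking. This is an illustrative sketch only: it assumes embeddings for the question and for each entity definition have already been computed elsewhere, it uses an in-memory dictionary in place of a vector database, and the function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_entities(
    question_vector: list[float],
    entity_vectors: dict[str, list[float]],
    k: int = 3,
) -> list[str]:
    """Return the names of the k entities most similar to the question."""
    ranked = sorted(
        entity_vectors.items(),
        key=lambda item: cosine_similarity(question_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]
```

Only the definitions of the returned entities would then be included in the prompt, rather than the definitions of every table.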
