LongChat-PDF: AI-Powered Conversational Interface for PDF Documents

Purpose

LongChat-PDF is an advanced conversational AI system designed to interact with users based on the content of any given PDF document. It leverages natural language processing and semantic search technologies to provide accurate, context-aware responses while maintaining long-term conversation coherence.

Key Features

PDF text extraction and preprocessing
Semantic chunking of document content
Advanced embedding generation for text segments
Efficient semantic search functionality
Long-term context management in conversations
Integration with OpenAI's GPT model for natural language generation

Detailed Approach

Text Extraction and Cleaning:
- Extract raw text from the input PDF using the fitz library.
- Clean and normalize the extracted text, removing irrelevant information and standardizing format.
Text Chunking:
- Split the cleaned text into semantically meaningful chunks.
- Optimize chunk size for balance between context preservation and search efficiency.
Embedding Creation:
- Generate vector representations (embeddings) for each text chunk using OpenAI's text-embedding-ada-002 model.
- Store embeddings alongside text chunks for efficient retrieval.
Knowledge Base Construction:
- Create a searchable knowledge base from the embedded text chunks.
- Implement efficient indexing for fast query processing.
Semantic Search:
- Convert user queries into embeddings.
- Use cosine similarity to identify the most relevant text chunks.
- Retrieve top-k most similar chunks for each query.
Conversational Interface:
- Maintain conversation history for context-aware responses.
- Combine relevant text chunks with conversation history.
- Use OpenAI's GPT model to generate coherent and informative responses.
Long-Term Context Management:
- Implement mechanisms to maintain context over extended conversations.
- Summarize and store key points from the ongoing dialogue.
- Periodically refresh context to keep responses relevant and accurate.

System Components

PDF Extractor (src/extract/pdf_extractor.py): Handles the extraction of text content from PDF files.
Text Preprocessor (src/preprocess/text_preprocessing.py): Cleans and normalizes the extracted text.
Text Chunker (src/preprocess/text_chunking.py): Splits preprocessed text into manageable, semantic chunks.
Embedding Generator (src/embeddings/create_embeddings.py): Creates vector representations of text chunks using OpenAI's API.
Semantic Search Engine (src/search/semantic_search.py): Implements similarity-based search functionality for finding relevant text chunks.
Knowledge Base (src/chatbot/knowledge_base.py): Manages the storage, indexing, and retrieval of embedded text chunks. Features include:
- Efficient loading of preprocessed and embedded chunks.
- Fast semantic similarity search for query processing.
- Caching mechanisms for improved response times.
- Extensible design for future updates and improvements.
Chatbot Core (src/chatbot/chatbot.py): Orchestrates the conversation flow, integrating all components to generate responses.

Setup and Usage

Environment Setup:

git clone https://github.com/yourusername/LongChat-PDF.git
cd LongChat-PDF
pip install -r requirements.txt

Configuration: Create a .env file in the project root:
```
OPENAI_API_KEY=your_api_key_here
```

PDF Processing:

python script/extract_text.py
python script/clean_text.py
python script/chunk_text.py
python script/create_embeddings.py

Launch Chatbot:
```
python script/run_chatbot.py
```
Interaction: Start asking questions about the content of your PDF document.

Future Enhancements

Multi-document support for simultaneous querying across multiple PDFs.
Integration of a web interface for easier interaction.
Implementation of active learning to improve response quality over time.
Support for additional document formats beyond PDF.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data		data
script		script
src		src
.DS_Store		.DS_Store
.gitignore		.gitignore
Readme.md		Readme.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LongChat-PDF: AI-Powered Conversational Interface for PDF Documents

Purpose

Key Features

Detailed Approach

System Components

Setup and Usage

Future Enhancements

About

Releases

Packages

Languages

thurakhant/ContextualPDF-AI

Folders and files

Latest commit

History

Repository files navigation

LongChat-PDF: AI-Powered Conversational Interface for PDF Documents

Purpose

Key Features

Detailed Approach

System Components

Setup and Usage

Future Enhancements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages