Table of Contents
Overview
Setup
- Prerequisites
- Installation
Project Structure
LLM Class Documentation
Comparison of Large Language Models
Collection Class Documentation
Summarizer Class Documentation
PDF Text Extraction Function Documentation
TextPreProcessor Class Documentation
Comparison of Different Text Processing Methods
RAG Class Documentation
Comparison of different EFs with different LLMs in RAG Class

Overview

This project demonstrates how to use several natural language processing (NLP) models for text generation, summarization, and information retrieval. It builds on models from Hugging Face's Transformers library and uses ChromaDB as a vector database for context retrieval. The project includes classes and functions for:

Loading and using different language models (LLMs) like GPT-2, T5, BERT, GPT-Neo, and others.
Generating text based on input queries and context.
Summarizing large text documents.
Extracting and processing text from PDF files.
Storing and retrieving context data using a vector database.

Setup

Prerequisites

Ensure you have Python installed on your system. This project requires the following Python packages:

chromadb
sentence-transformers
pymupdf
huggingface_hub
transformers
torch
matplotlib
nltk

Installation

Clone the repository or download the project files.
Install the required packages:

pip install -r requirements.txt

Project Structure

main.ipynb: Main Jupyter notebook containing the project code.
models/: Directory to store downloaded and saved models.
tests/: Directory containing test PDF files.

LLM Class Documentation

Class Overview
The LLM class provides a flexible interface for loading, managing, and utilizing various large language models (LLMs) such as GPT-2, T5, BERT, and others. This class handles device selection, model loading, and memory management, allowing for both online and offline model usage.

Supported Models

The LLM class currently supports the following models:

GPT-2
T5
BERT
DistilBERT
GPT-Neo
Gemma

model_classes
A dictionary mapping LLM types to their respective tokenizer class, model class, and model path.
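
To make the shape of such a mapping concrete, here is a minimal sketch. The name `MODEL_CLASSES`, the helper `lookup_model`, and the two entries are assumptions for illustration; the project's actual dictionary may use different keys, classes, and paths. Class names are stored as strings so the sketch runs without Transformers installed.

```python
# Hypothetical registry sketch; the entries are illustrative assumptions,
# not the project's actual table.
MODEL_CLASSES = {
    # llm_type: (tokenizer class name, model class name, model path)
    "gpt2": ("GPT2Tokenizer", "GPT2LMHeadModel", "gpt2"),
    "t5": ("T5Tokenizer", "T5ForConditionalGeneration", "t5-small"),
}


def lookup_model(llm_type: str) -> tuple[str, str, str]:
    """Return the registry entry for llm_type, or raise a clear error."""
    try:
        return MODEL_CLASSES[llm_type]
    except KeyError:
        supported = ", ".join(sorted(MODEL_CLASSES))
        raise ValueError(
            f"Unsupported llm_type {llm_type!r}; expected one of: {supported}"
        ) from None
```

Failing early with the list of supported keys tends to be easier to debug than a bare `KeyError` raised deep inside model loading.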

Initialization

LLM(llm_type: str, load_online=False, save_model=False)

llm_type: The type of language model to load (e.g., 'gpt2', 't5').
load_online: If True, the model is loaded from an online source (e.g., Hugging Face Hub). If False, the model is loaded from a local directory.
save_model: If True, the loaded model and tokenizer are saved to the local directory.

Methods

load_llm

Loads the specified language model and tokenizer.

load_llm(llm_type: str, load_online: bool, save_model: bool)

llm_type: Type of LLM to load.
load_online: If True, loads the model from an online source.
save_model: If True, saves the model and tokenizer locally.

Returns: A tuple containing the loaded tokenizer and model.
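
The online/offline choice described above could be sketched as follows. This is an assumption-laden illustration, not the class's actual code: `model_source`, `load_llm_sketch`, and the `models_dir` default are hypothetical names, and the generic `AutoTokenizer` / `AutoModelForCausalLM` classes stand in for the per-model classes the project maps in `model_classes`.

```python
from pathlib import Path


def model_source(model_path: str, load_online: bool, models_dir: str = "models") -> str:
    """Pick where to load from: the Hub identifier or a local directory."""
    if load_online:
        return model_path
    return str(Path(models_dir) / model_path)


def load_llm_sketch(model_path: str, load_online: bool, save_model: bool,
                    models_dir: str = "models"):
    """Load a tokenizer and causal LM, optionally saving both locally."""
    # Imported lazily so the path logic above works without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    source = model_source(model_path, load_online, models_dir)
    tokenizer = AutoTokenizer.from_pretrained(source)
    model = AutoModelForCausalLM.from_pretrained(source)

    if save_model:
        # Assumed layout: one sub-directory per model under models_dir.
        local_dir = Path(models_dir) / model_path
        tokenizer.save_pretrained(local_dir)
        model.save_pretrained(local_dir)

    return tokenizer, model
```

With this layout, a first run with `load_online=True, save_model=True` would populate the local directory that later offline runs read from.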

select_device

Selects the appropriate device for running the model ('cuda' if GPU is available, otherwise 'cpu').

@staticmethod
select_device() -> str

Returns: A string indicating the device ('cuda' or 'cpu').
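
A minimal sketch of this check, written as a standalone function rather than the class's static method. The fallback to 'cpu' when PyTorch is not installed is an assumption added for the sketch.

```python
def select_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: fall back to CPU if PyTorch is missing.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```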

generate_text

A placeholder method to be implemented by subclasses for generating text based on input.

generate_text(input_text: str, context: str = '') -> str

input_text: The input text for the model.
context: Optional context for text generation.

Raises: NotImplementedError
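
The placeholder pattern can be illustrated without loading any model. `BaseLLM` and `EchoLLM` below are hypothetical stand-ins, not classes from the project; the toy subclass only formats a prompt so the override mechanism is visible.

```python
class BaseLLM:
    """Hypothetical stand-in for the LLM base class, without model loading."""

    def generate_text(self, input_text: str, context: str = "") -> str:
        raise NotImplementedError("Subclasses must implement generate_text")


class EchoLLM(BaseLLM):
    """Toy subclass that formats the prompt instead of calling a model."""

    def generate_text(self, input_text: str, context: str = "") -> str:
        return f"Context: {context}\nQuestion: {input_text}\nAnswer:"
```

Calling `generate_text` on the base class raises `NotImplementedError`, while the subclass returns its own result.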

free_memory

Frees up memory by deleting the model and tokenizer and clearing the cache.

free_memory()
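
One way such cleanup might look is sketched below. `ModelHolder` is a hypothetical stand-in for the class, and setting the attributes to `None` (instead of deleting them) is a choice made for this sketch so later attribute access does not raise `AttributeError`.

```python
import gc


class ModelHolder:
    """Hypothetical stand-in for an object that owns a model and tokenizer."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def free_memory(self) -> None:
        # Drop the references so the objects become collectable.
        self.model = None
        self.tokenizer = None
        gc.collect()
        try:
            import torch
        except ImportError:
            return  # Nothing GPU-related to clear without PyTorch.
        if torch.cuda.is_available():
            # Release cached GPU memory held by PyTorch's allocator.
            torch.cuda.empty_cache()
```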

Usage Example

Here's a step-by-step example of how to use the LLM class:

Define a subclass and implement the generate_text method:

class GPT2(LLM):
    def __init__(self, load_online=False, save_model=False):
        super().__init__('gpt2', load_online, save_model)
        self.max_length = 1024

    def generate_text(self, input_text: str, context: str = '') -> str:
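        # Character budget for the context: reserve room for the prompt
        # template and the question. GPT-2's 1024 limit counts tokens, so
        # this character count is only an approximation.
        template_length = len("Context: \nQuestion: \nAnswer:") + len(input_text)
        max_context_length = max(self.max_length - template_length, 0)
        context = context[:max_context_length]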

        prompt = f"Context: {context}\nQuestion: {input_text}\nAnswer:"

        inputs = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)
        outputs = self.model.generate(
            inputs,
            max_new_tokens=150,
            temperature=0.7,
            top_p=0.9,
            top_k=50,
            num_return_sequences=1,
            pad_token_id=self.tokenizer.eos_token_id,
            do_sample=True
        )

        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        response = response.replace(prompt, '').strip()
        return response.split('\n')[0]

Generate Text

llm = GPT2()
question = "What is the capital of Iran?"
context = 'some text ...'
output = llm.generate_text(question, context)
print(output)

Free Memory
```
llm.free_memory()
```

Notes

Model Classes: The model_classes dictionary maps LLM types to their respective tokenizer class, model class, and model path, ensuring the correct components are loaded for each model type.
Initialization: The constructor (__init__) sets up the device, loads the specified model and tokenizer, and optionally saves the model locally.
Device Selection: The select_device method automatically selects 'cuda' if a GPU is available, otherwise defaults to 'cpu'.
Text Generation: The generate_text method is a placeholder and should be implemented by subclasses to define specific text generation behavior.
Memory Management: The free_memory method helps manage memory by cleaning up model and tokenizer instances and clearing the GPU cache if necessary.

By following these steps and methods, you can effectively manage and utilize various large language models for different natural language processing tasks.

Comparison of Large Language Models

Overview

The LLM class supports a range of large language models, each tailored for specific natural language processing tasks. This comparison provides an overview of different models, their unique characteristics, typical use cases, and effectiveness, helping you choose the right model for your needs.

Comparison Table

| Model | Description | Use Case | Effectiveness |
| --- | --- | --- | --- |
| `gpt2` | GPT-2 is an open-source language model developed by OpenAI with a 1.5 billion parameter version available. | Suitable for general-purpose text generation | High |
| `t5` | T5 (Text-to-Text Transfer Transformer) converts all NLP tasks into a text-to-text format. | Excellent for translation, summarization, and Q&A | High |
| `bert` | BERT (Bidirectional Encoder Representations from Transformers) is designed to pre-train deep bidirectional representations. | Best for question answering and text classification | High |
| `distil-bert` | A smaller, faster, cheaper version of BERT that retains 97% of its language understanding capabilities. | Ideal for applications requiring faster inference | Medium |
| `gpt-neo` | An open-source model developed by EleutherAI, comparable to GPT-3 in terms of architecture. | Great for large-scale text generation and completion | High |
| `gemma` | A state-of-the-art model developed by Google, designed for efficient large-scale language tasks. | Best for complex and high-accuracy requirements | High |
| `llama` | A variant optimized for efficiency and speed, developed by OpenBMB. | Ideal for tasks requiring quick responses | Medium |
| `textbase` | A model designed for foundational text understanding and generation tasks. | Good for foundational NLP tasks | Medium |

Overview of Model Characteristics

GPT-2: Known for its robust performance in general text generation tasks. It is widely used for applications such as text completion and story generation.
T5: This versatile model excels in tasks that can be framed as text-to-text transformations, including translation, summarization, and question answering.
BERT: A powerful model for tasks that require deep understanding of text, like question answering and text classification.
DistilBERT: Provides a good balance between performance and speed, making it suitable for applications where inference time is critical.
GPT-Neo: Offers capabilities similar to GPT-3 and is suitable for generating large amounts of text and completing complex text inputs.
Gemma: Optimized for high-accuracy requirements and complex tasks, it is ideal for advanced NLP applications.
Llama: Focuses on efficiency and speed, making it a good choice for applications that need quick responses.
TextBase: A foundational model useful for a broad range of basic NLP tasks.

This comprehensive comparison aids in selecting the most suitable large language model based on specific requirements and application contexts.

Collection Class Documentation

Class Overview
The Collection class provides an interface for creating and managing a collection of documents with vector embeddings, utilizing a transformer model for encoding text data. This class integrates with ChromaDB to add and retrieve contexts efficiently.

Initialization

Collection(collection_name: str, transformer_type: str = 'all-MiniLM-L6-v2', load_online=False, save_transformer=False)

collection_name: The name of the collection to create or use.
transformer_type: The type of transformer model to use for encoding (default is 'all-MiniLM-L6-v2').
load_online: If True, the transformer model is loaded from an online source. If False, the model is loaded from a local directory.
save_transformer: If True, the transformer model is saved to the local directory.

Methods

load_sentence_transformer

Loads the specified sentence transformer model.

load_sentence_transformer(transformer_type: str, load_online: bool, save_transformer: bool)

transformer_type: Type of transformer model to load.
load_online: If True, loads the model from an online source.
save_transformer: If True, saves the model locally.

Returns: The loaded sentence transformer model.

add_contexts

Adds context documents to the collection with vector embeddings.

add_contexts(context_data: list)

context_data: A list of context documents to add to the collection.

retrieve_contexts

Retrieves the most relevant context documents based on a query.

retrieve_contexts(question: str, top_n: int = 1)

question: The query for which to find relevant context documents.
top_n: The number of top results to retrieve (default is 1).

Returns: A list of the top n most relevant context documents.
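
To show the add-then-retrieve flow without ChromaDB or a transformer model, here is a dependency-free sketch. Everything in it is a stand-in: `ToyCollection`, the word-count `embed` function, and the `id_N` naming are assumptions, and real sentence embeddings capture meaning far better than word counts.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyCollection:
    """In-memory stand-in for a vector collection."""

    def __init__(self):
        self.ids, self.docs, self.vectors = [], [], []

    def add_contexts(self, context_data: list) -> None:
        for doc in context_data:
            self.ids.append(f"id_{len(self.ids)}")  # unique, sequential ID
            self.docs.append(doc)
            self.vectors.append(embed(doc))

    def retrieve_contexts(self, question: str, top_n: int = 1) -> list:
        query = embed(question)
        # Rank document indices by similarity to the query, best first.
        ranked = sorted(
            range(len(self.docs)),
            key=lambda i: cosine(query, self.vectors[i]),
            reverse=True,
        )
        return [self.docs[i] for i in ranked[:top_n]]
```

The same two steps apply in the real class: documents are encoded once when added, and each query is encoded and compared against the stored vectors.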

Usage Example

Here's a step-by-step example of how to use the Collection class:

Initialize the Collection:
```
collection = Collection('rag')
```

Add Contexts to the Collection:

context_data = [
    "The capital of France is Paris. It is known for its art, culture, and cuisine.",
    "The Great Wall of China is one of the greatest wonders of the world.",
    "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America.",
    "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South Asia."
]
collection.add_contexts(context_data)

Retrieve Relevant Contexts:

response = collection.retrieve_contexts('amazon', top_n=2)
print(response)

Notes

Initialization: If a collection with the given name already exists, the constructor deletes it before creating a new one, so any previously stored contexts are lost.
Loading Transformers: The load_sentence_transformer method handles both online and offline loading of the transformer model, and optionally saves the model locally.
Adding Contexts: The add_contexts method encodes the context data into vectors and adds them to the collection with unique IDs.
Retrieving Contexts: The retrieve_contexts method queries the collection with a vectorized question and retrieves the most relevant documents.

By following these steps and methods, you can effectively manage and utilize a collection of context documents with vector embeddings for various applications.

Summarizer Class Documentation

Class Overview
The Summarizer class provides a flexible interface for loading, managing, and utilizing various text summarization models such as T5, BART, and Pegasus. This class handles device selection, model loading, and memory management, allowing for both online and offline model usage.

Supported Models

The Summarizer class currently supports the following models:

T5
BART
Pegasus

summarizer_models
A dictionary mapping summarizer model types to their respective tokenizer class, model class, and model path.

Initialization

Summarizer(summarizer_model: str = 't5', load_online=False, save_model=False)

summarizer_model: The type of summarizer model to load (e.g., 't5', 'bart').
load_online: If True, the model is loaded from an online source (e.g., Hugging Face Hub). If False, the model is loaded from a local directory.
save_model: If True, the loaded model and tokenizer are saved to the local directory.

Methods

load_summarizer

Loads the specified summarizer model and tokenizer.

load_summarizer(summarizer_model: str, load_online: bool, save_model: bool)

summarizer_model: Type of summarizer model to load.
load_online: If True, loads the model from an online source.
save_model: If True, saves the model and tokenizer locally.

Returns: A tuple containing the loaded tokenizer and model.

select_device

Selects the appropriate device for running the model ('cuda' if GPU is available, otherwise 'cpu').

@staticmethod
select_device() -> str

Returns: A string indicating the device ('cuda' or 'cpu').

summarize_text

A placeholder method to be implemented by subclasses for summarizing text based on input.

summarize_text(input_text: str, context: str = '') -> str

input_text: The input text to be summarized.
context: Optional context for text summarization.

Raises: NotImplementedError

free_memory

Frees up memory by deleting the model and tokenizer and clearing the cache.

free_memory()

Usage Example

Here's a step-by-step example of how to use the Summarizer class:

Implement a subclass and the summarize_text method:

class T5_Summarizer(Summarizer):
    def __init__(self, load_online=False, save_model=False):
        super().__init__('t5', load_online, save_model)

    def summarize_text(self, input_text: str) -> str:
        # Tokenize and move the tensors to the same device as the model.
        inputs = self.tokenizer(input_text, return_tensors='pt', max_length=1024, truncation=True).to(self.model.device)
        max_length = min(len(input_text.split()), 150)  # Adjust max length based on input length
        min_length = min(len(input_text.split()) // 5, 40)  # Adjust min length based on input length
        summary_ids = self.model.generate(inputs['input_ids'], max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True)
        summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        return summary

Initialize the Model:

summarizer = T5_Summarizer(load_online=True)

Summarize Text:

context = """
large amount of text ....
"""
summary = summarizer.summarize_text(context)
print(summary)
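
The length rule used in the subclass above can be isolated into a small helper to see how the bounds behave. `summary_length_bounds` and its parameter names are hypothetical; the arithmetic mirrors the example.

```python
def summary_length_bounds(input_text: str, max_cap: int = 150,
                          min_cap: int = 40) -> tuple[int, int]:
    """Mirror the example's rule: scale generation bounds with the word count."""
    n_words = len(input_text.split())
    max_length = min(n_words, max_cap)       # never longer than the input
    min_length = min(n_words // 5, min_cap)  # about a fifth of the input
    return max_length, min_length


print(summary_length_bounds("word " * 10))    # (10, 2)
print(summary_length_bounds("word " * 1000))  # (150, 40)
```

Note that the bounds are computed from word counts, while `generate` interprets `max_length` and `min_length` as token counts, so the two only roughly correspond.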

Notes

Model Classes: The summarizer_models dictionary maps summarizer model types to their respective tokenizer class, model class, and model path, ensuring the correct components are loaded for each model type.
Initialization: The constructor (__init__) sets up the device, loads the specified model and tokenizer, and optionally saves the model locally.
Device Selection: The select_device method automatically selects 'cuda' if a GPU is available, otherwise defaults to 'cpu'.
Text Summarization: The summarize_text method is a placeholder and should be implemented by subclasses to define specific text summarization behavior.
Memory Management: The free_memory method helps manage memory by cleaning up model and tokenizer instances and clearing the GPU cache if necessary.

By following these steps and methods, you can effectively manage and utilize various text summarization models for different natural language processing tasks.

PDF Text Extraction Function Documentation

Function Overview
The extract_text_from_pdf function extracts text from a given PDF file. It iterates through each page of the PDF and concatenates the extracted text into a single string.

Function Definition

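import fitz  # PyMuPDF


def extract_text_from_pdf(pdf_path):
    """Return the plain text of every page in the PDF, concatenated."""
    pages = []
    # The context manager closes the document even if extraction fails.
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pages.append(page.get_text("text"))
    return "".join(pages)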

pdf_path: The file path to the PDF from which to extract text.

Returns: A string containing the concatenated text extracted from all pages of the PDF.

Usage Example

Here's an example of how to use the extract_text_from_pdf function:

pdf_path = f"{project_path}/tests/micro led 1.pdf"
text = extract_text_from_pdf(pdf_path)
print(text)

Notes

Library Dependency: This function uses the PyMuPDF library (imported as fitz) for PDF processing. Ensure you have installed the library using pip install pymupdf.
Text Extraction: The function extracts text from each page of the PDF and concatenates it into a single string. The get_text("text") method extracts the plain text from each page.
File Path: The pdf_path parameter should be the complete path to the PDF file from which you want to extract text.
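
Since the function expects a complete path, a small helper can build it and fail early when the file is missing. `resolve_pdf_path` is a hypothetical helper, and the `tests/` sub-directory follows the project structure described earlier.

```python
from pathlib import Path


def resolve_pdf_path(project_path: str, file_name: str) -> str:
    """Build the full path to a PDF under tests/ and fail early if missing."""
    pdf_path = Path(project_path) / "tests" / file_name
    if not pdf_path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    return str(pdf_path)
```

The returned string could then be passed directly to `extract_text_from_pdf`.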

By following these steps and utilizing this function, you can efficiently extract text from PDF files for further processing or analysis.

TextPreProcessor Class Documentation

Class Overview
The TextPreProcessor class provides various methods for cleaning and chunking text. It supports multiple chunking strategies such as by sentence, word count, character count, recursive split, and a custom preprocessing method. This class is useful for preparing text data for tasks like language modeling and text generation.

Initialization

TextPreProcessor(method='sentence', chunk_size=500)

method: The chunking method to use (e.g., 'sentence', 'word_count', 'char_count', 'recursive', 'custom').
chunk_size: The size of chunks to create, depending on the method.

Methods

clean_text

Cleans the input text by removing multiple newlines and spaces.

clean_text(text)

text: The input text to clean.

Returns: Cleaned text as a string.
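
A possible implementation, sketched as a standalone function. The exact rules are assumptions: this version collapses runs of spaces or tabs to one space and runs of newlines to one newline, which may differ from what the class does.

```python
import re


def clean_text(text: str) -> str:
    """Collapse repeated newlines and repeated spaces or tabs, then trim."""
    text = re.sub(r"[ \t]+", " ", text)   # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)  # drop spaces around line breaks
    text = re.sub(r"\n+", "\n", text)     # runs of newlines -> one newline
    return text.strip()


print(repr(clean_text("a  b\n\n\nc")))  # 'a b\nc'
```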

split_by_sentence

Splits the text by sentences.

split_by_sentence(text)

text: The input text to split.

Returns: A list of sentences.
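
As a rough illustration, sentence splitting can be approximated with a regular expression. This is a dependency-free stand-in, not necessarily the class's method (the project lists nltk among its prerequisites, which may be what it uses); the naive rule below will mis-split abbreviations such as "e.g.".

```python
import re


def split_by_sentence(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace (a naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_by_sentence("RAG retrieves context. Then it generates an answer! Does it help?"))
# ['RAG retrieves context.', 'Then it generates an answer!', 'Does it help?']
```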

split_by_word_count

Splits the text by word count.

split_by_word_count(text, chunk_size=100)

text: The input text to split.
chunk_size: Number of words per chunk.
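
A minimal sketch of word-count chunking, written as a standalone function. Splitting on whitespace and rejecting non-positive chunk sizes are assumptions of this sketch.

```python
def split_by_word_count(text: str, chunk_size: int = 100) -> list[str]:
    """Group whitespace-separated words into chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


print(split_by_word_count("one two three four five", chunk_size=2))
# ['one two', 'three four', 'five']
```

The last chunk may be shorter than `chunk_size` when the word count is not an exact multiple.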