MansurPro/DocuParse

📄 DocuParse

DocuParse is a high-performance tool for converting PDF documents into clean, structured Markdown files. Designed for speed and accuracy, it extracts and formats content while minimizing errors like hallucinations and repetitions.


🚀 Key Features

  • Multi-Format Support: Converts PDFs, EPUBs, and MOBIs into Markdown.
  • Accurate Layout Detection: Uses AI models to detect page layouts and columns, and formats equations in LaTeX.
  • Enhanced Formatting: Cleans headers, footers, and artifacts, while preserving code blocks and tables.
  • Multi-Language Support: Processes documents in various languages, optimized for English, French, Spanish, and more.
  • Cloud-Based Processing: Leverages Google Colab for GPU-accelerated operations.

πŸ› οΈ How It Works

DocuParse runs each document through a four-stage AI pipeline:

  1. Text Extraction: Extracts text with or without OCR as needed.
  2. Layout Analysis: Identifies and segments content using advanced AI models.
  3. Content Cleaning: Applies heuristics to clean and format content blocks.
  4. Post-Processing: Combines content blocks into a structured Markdown document.
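The four stages above can be pictured as a chain of small functions. The sketch below is illustrative only and is not DocuParse's actual code: every name in it (`Block`, `extract_text`, `analyze_layout`, `clean_blocks`, `to_markdown`, `convert`) is hypothetical, the "layout model" is replaced by a trivial all-caps rule, and the only cleaning heuristic shown is dropping text that repeats across pages, as a running header or footer would.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    page: int   # index of the page the block came from
    kind: str   # "heading" or "paragraph" (hypothetical labels)
    text: str


def extract_text(pages):
    """Stage 1 stand-in: pages are already plain strings here.
    A real extractor would read the PDF and fall back to OCR when needed."""
    return [page.splitlines() for page in pages]


def analyze_layout(page_lines):
    """Stage 2 stand-in: treat short all-caps lines as headings.
    A real system would use a layout model instead of this rule."""
    blocks = []
    for page_no, lines in enumerate(page_lines):
        for line in lines:
            line = line.strip()
            if not line:
                continue
            is_heading = line.isupper() and len(line) < 60
            blocks.append(Block(page_no, "heading" if is_heading else "paragraph", line))
    return blocks


def clean_blocks(blocks, n_pages):
    """Stage 3 stand-in: drop text that appears on more than half of the
    pages (and on at least two), which is typical of headers and footers."""
    seen = {(b.page, b.text) for b in blocks}
    pages_with_text = Counter(text for _, text in seen)
    threshold = max(2, n_pages // 2 + 1)
    return [b for b in blocks if pages_with_text[b.text] < threshold]


def to_markdown(blocks):
    """Stage 4 stand-in: join the surviving blocks into one Markdown string."""
    parts = [
        f"## {b.text.title()}" if b.kind == "heading" else b.text
        for b in blocks
    ]
    return "\n\n".join(parts) + "\n"


def convert(pages):
    page_lines = extract_text(pages)
    blocks = analyze_layout(page_lines)
    blocks = clean_blocks(blocks, n_pages=len(pages))
    return to_markdown(blocks)
```

Calling `convert` on a list of page strings that all start with the same banner line returns Markdown with that banner removed and the all-caps lines turned into `##` headings.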

📂 Project Structure

DocuParse/
│
├── scripts/             # Utility scripts for setup and processing
├── data/                # Sample input and output files
├── examples/            # Example documents and Markdown outputs
├── models/              # AI models used for text extraction and layout analysis
└── README.md            # Project documentation

🖥️ Usage

1. Clone the Repository

git clone https://github.com/MansurPro/DocuParse.git
cd DocuParse

2. Set Up the Environment

Run DocuParse in Google Colab for GPU-accelerated processing. Install the required Python packages:

pip install -r requirements.txt

3. Convert a Single File

python convert_single.py /path/to/file.pdf /path/to/output.md --parallel_factor 2 --max_pages 10

4. Convert Multiple Files

python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10
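If you want per-file control (for example, to keep going when one document fails), a small wrapper around the single-file command from step 3 may help. This is a sketch under assumptions: it assumes `convert_single.py` sits in the current directory and accepts exactly the `--parallel_factor` and `--max_pages` flags shown above; `build_command` and `plan_batch` are hypothetical helper names, not part of DocuParse.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path, out_dir, parallel_factor=2, max_pages=None):
    """Build the argument list for one convert_single.py run.
    The flags mirror the single-file example in this README."""
    pdf_path = Path(pdf_path)
    out_path = Path(out_dir) / (pdf_path.stem + ".md")
    cmd = [
        sys.executable, "convert_single.py",
        str(pdf_path), str(out_path),
        "--parallel_factor", str(parallel_factor),
    ]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    return cmd


def plan_batch(in_dir, out_dir, **options):
    """Return one command per PDF found directly inside in_dir."""
    pdfs = sorted(Path(in_dir).glob("*.pdf"))
    return [build_command(pdf, out_dir, **options) for pdf in pdfs]


if __name__ == "__main__" and len(sys.argv) == 3:
    in_dir, out_dir = sys.argv[1], sys.argv[2]
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in plan_batch(in_dir, out_dir, max_pages=10):
        # check=False so that one failing document does not stop the batch
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            print(f"failed: {cmd[2]}", file=sys.stderr)
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.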

🎨 Examples

| Input (PDF) | Output (Markdown) |
|---|---|
| Textbook: Think Python | View |
| Scientific Paper: Switch Transformers | View |

📊 Performance

  • Speed: DocuParse processes documents up to 10x faster than similar tools.
  • Accuracy: Minimizes hallucinations and ensures well-structured Markdown outputs.
  • GPU Utilization: Efficiently leverages GPU resources for parallel processing.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.


🙌 Acknowledgments

This project was made possible by incredible open-source models and datasets, including:

  • Tesseract OCR for text recognition.
  • HuggingFace Transformers for layout and content analysis.
  • LaTeX for equation formatting.

Thank you to the open-source community for their invaluable contributions!
