DocuParse is a high-performance tool for converting PDF documents, as well as EPUB and MOBI files, into clean, structured Markdown. Designed for speed and accuracy, it extracts and formats content while minimizing errors such as hallucinated or repeated text.
- Multi-Format Support: Converts PDFs, EPUBs, and MOBIs into Markdown.
- Accurate Layout Detection: Uses AI models to detect page layouts and columns, and formats equations as LaTeX.
- Enhanced Formatting: Cleans headers, footers, and artifacts, while preserving code blocks and tables.
- Multi-Language Support: Processes documents in various languages, optimized for English, French, Spanish, and more.
- Cloud-Based Processing: Leverages Google Colab for GPU-accelerated operations.
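As an illustration of the header and footer cleanup mentioned in the list above, here is a minimal sketch of one common heuristic: a line that recurs on most pages is probably a running header or footer. This is not DocuParse's actual code; the function name `strip_repeated_lines` and the `min_fraction` parameter are assumptions made for the example.

```python
import math
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Remove lines that recur on at least `min_fraction` of the pages.

    `pages` is a list of page texts. Hypothetical helper, not part of DocuParse.
    """
    page_lines = [[line.strip() for line in page.splitlines()] for page in pages]

    # Count each distinct non-empty line once per page.
    counts = Counter()
    for lines in page_lines:
        counts.update({line for line in lines if line})

    # Require at least two occurrences so single-page documents are left alone.
    threshold = max(2, math.ceil(min_fraction * len(pages)))
    repeated = {line for line, n in counts.items() if n >= threshold}

    return ["\n".join(line for line in lines if line not in repeated)
            for lines in page_lines]
```

A real cleaner would likely also look at the position of a line on the page and tolerate small variations such as page numbers embedded in the footer.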
DocuParse is powered by a robust AI pipeline:
- Text Extraction: Extracts text with or without OCR as needed.
- Layout Analysis: Identifies and segments content using advanced AI models.
- Content Cleaning: Applies heuristics to clean and format content blocks.
- Post-Processing: Combines content blocks into a structured Markdown document.
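The four stages above can be pictured as a chain of functions, each consuming the previous stage's output. The sketch below is a toy stand-in under stated assumptions: every name (`Block`, `extract_text`, `analyze_layout`, `clean_blocks`, `to_markdown`, `run_pipeline`) is hypothetical, and simple string rules replace the OCR and AI models that DocuParse uses.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One segmented piece of content (hypothetical structure)."""
    kind: str  # "heading" or "paragraph"
    text: str


def extract_text(pages):
    """Stage 1 stand-in: pages are assumed to be plain strings already."""
    return [page.strip() for page in pages]


def analyze_layout(page_texts):
    """Stage 2 stand-in: split on blank lines; short all-caps chunks are headings."""
    blocks = []
    for text in page_texts:
        for chunk in text.split("\n\n"):
            chunk = chunk.strip()
            if not chunk:
                continue
            kind = "heading" if chunk.isupper() and len(chunk) < 60 else "paragraph"
            blocks.append(Block(kind, chunk))
    return blocks


def clean_blocks(blocks):
    """Stage 3 stand-in: collapse whitespace and drop bare page numbers."""
    cleaned = []
    for block in blocks:
        text = " ".join(block.text.split())
        if text.isdigit():
            continue
        cleaned.append(Block(block.kind, text))
    return cleaned


def to_markdown(blocks):
    """Stage 4 stand-in: join blocks into a single Markdown string."""
    parts = [f"## {b.text.title()}" if b.kind == "heading" else b.text
             for b in blocks]
    return "\n\n".join(parts) + "\n"


def run_pipeline(pages):
    return to_markdown(clean_blocks(analyze_layout(extract_text(pages))))
```

For example, `run_pipeline(["INTRODUCTION\n\nSome text.\n\n1"])` yields a `## Introduction` heading followed by the paragraph, with the stray page number removed.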
DocuParse/
│
├── scripts/    # Utility scripts for setup and processing
├── data/       # Sample input and output files
├── examples/   # Example documents and Markdown outputs
├── models/     # AI models used for text extraction and layout analysis
└── README.md   # Project documentation
git clone https://github.com/MansurPro/DocuParse.git
cd DocuParse
Run DocuParse in Google Colab for GPU-accelerated processing. Install the required Python packages:
pip install -r requirements.txt
python convert_single.py /path/to/file.pdf /path/to/output.md --parallel_factor 2 --max_pages 10
python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10
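If you prefer to drive the single-file converter from Python, for instance inside a Colab notebook cell, a thin wrapper around the command above might look like the following. This is a sketch, not part of DocuParse: `build_single_command` and `convert_folder` are hypothetical helpers, only the `--parallel_factor` and `--max_pages` flags shown above are used, and it is assumed that `convert_single.py` is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_single_command(pdf_path, output_path, parallel_factor=2, max_pages=None):
    """Build the argument list for convert_single.py (hypothetical helper)."""
    cmd = [
        sys.executable, "convert_single.py",
        str(pdf_path), str(output_path),
        "--parallel_factor", str(parallel_factor),
    ]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    return cmd


def convert_folder(input_dir, output_dir, **options):
    """Convert every PDF in input_dir, one subprocess call per file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        target = output_dir / (pdf.stem + ".md")
        # check=True raises CalledProcessError if the converter exits non-zero.
        subprocess.run(build_single_command(pdf, target, **options), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces. For large folders, the `convert.py` command above is likely the better choice, since it manages its own workers.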
Input (PDF) | Output (Markdown) |
---|---|
Textbook: Think Python | View |
Scientific Paper: Switch Transformers | View |
- Speed: DocuParse processes documents up to 10x faster than similar tools.
- Accuracy: Minimizes hallucinations and ensures well-structured Markdown outputs.
- GPU Utilization: Efficiently leverages GPU resources for parallel processing.
This project is licensed under the MIT License. See the LICENSE file for details.
This project was made possible by excellent open-source tools and libraries, including:
- Tesseract OCR for text recognition.
- HuggingFace Transformers for layout and content analysis.
- LaTeX for equation formatting.
Thank you to the open-source community for their invaluable contributions!