
Bulk Text Extractor

This Python script extracts text from several document formats and splits it into segments using the unstructured library from Unstructured-IO.

Features

Extracts text from PDF, MOBI, EPUB, and DJVU files.
Splits documents into segments based on specified settings.
Saves extracted segments as JSON files.
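
As a rough illustration of the first feature, the sketch below shows one way to pick out the supported files in a directory. The names SUPPORTED_EXTENSIONS and find_documents are hypothetical and are not taken from the script; the extension set simply mirrors the list above.

```python
from pathlib import Path

# Extensions mirror the feature list above; matching is case-insensitive.
SUPPORTED_EXTENSIONS = {".pdf", ".mobi", ".epub", ".djvu"}


def find_documents(root):
    """Return the supported documents directly inside `root`, sorted by path."""
    return sorted(
        path
        for path in Path(root).iterdir()
        # Skip directories, even if their name ends in a supported extension.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

This version does not descend into subdirectories; whether the actual script does is not stated here.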

Requirements

Python 3.x
unstructured (the Unstructured-IO library), installed with the all-docs extra

Usage

Install dependencies: pip install "unstructured[all-docs]" (the quotes keep shells such as zsh from interpreting the square brackets)
Clone or download the repository.
Run the script: python bulk_text_extractor.py <directory>
- Replace <directory> with the path to your documents directory.
The script will process each document and save extracted segments in a subdirectory within the specified directory.
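
The usage steps above can be sketched as a small command-line skeleton: read the directory argument, derive an output path inside a subdirectory of it, and write segments as JSON. This is only a sketch under assumptions. The subdirectory name "segments", the helper names, and the JSON layout (a plain array of strings) are all hypothetical; the script's real output may look different.

```python
import argparse
import json
from pathlib import Path

# Hypothetical name; the subdirectory the script really creates may differ.
OUTPUT_DIRNAME = "segments"


def parse_args(argv=None):
    """Parse the single positional <directory> argument."""
    parser = argparse.ArgumentParser(
        description="Extract and segment documents in a directory."
    )
    parser.add_argument("directory", type=Path, help="path to the documents directory")
    return parser.parse_args(argv)


def output_path_for(document, output_dirname=OUTPUT_DIRNAME):
    """Map <dir>/book.pdf to <dir>/<output_dirname>/book.json."""
    document = Path(document)
    return document.parent / output_dirname / (document.stem + ".json")


def save_segments(segments, destination):
    """Write a list of segment strings to `destination` as a JSON array."""
    destination = Path(destination)
    # Create the output subdirectory on first use.
    destination.parent.mkdir(parents=True, exist_ok=True)
    with destination.open("w", encoding="utf-8") as handle:
        json.dump(segments, handle, ensure_ascii=False, indent=2)
```

For example, save_segments(["first segment", "second segment"], output_path_for("docs/book.pdf")) would write docs/segments/book.json under these assumptions.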

Options

You can modify the BulkTextExtract class to customize settings like chunking strategy, page break handling, etc.
Refer to the unstructured documentation for more advanced functionality.
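
To give a feel for the kind of settings involved, here is a minimal sketch of passing chunking and page-break options to unstructured's partition function. ChunkSettings and partition_document are hypothetical names and do not describe how BulkTextExtract is written; the keyword arguments chunking_strategy, max_characters, and include_page_breaks are unstructured options, but check the documentation for the version you have installed, and note that not every format is necessarily handled by partition directly.

```python
from dataclasses import dataclass


@dataclass
class ChunkSettings:
    """Hypothetical settings holder; not the script's BulkTextExtract class."""

    chunking_strategy: str = "by_title"  # unstructured also offers "basic"
    max_characters: int = 1000           # upper bound on the length of a chunk
    include_page_breaks: bool = False    # keep page-break elements where supported

    def to_kwargs(self):
        """Return the settings as keyword arguments for partition()."""
        return {
            "chunking_strategy": self.chunking_strategy,
            "max_characters": self.max_characters,
            "include_page_breaks": self.include_page_breaks,
        }


def partition_document(path, settings=None):
    """Partition one document and return the text of each resulting chunk."""
    # Imported lazily so ChunkSettings can be used without unstructured installed.
    from unstructured.partition.auto import partition

    settings = settings or ChunkSettings()
    elements = partition(filename=str(path), **settings.to_kwargs())
    return [element.text for element in elements]
```

Keeping the options in one small object makes it easy to try a different strategy, for example ChunkSettings(chunking_strategy="basic", max_characters=500), without touching the partitioning call.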

License

MIT License