This Python script extracts text from various document formats and splits it into segments using the unstructured-io library.
- Extracts text from PDF, MOBI, EPUB, and DJVU files.
- Splits documents into segments based on specified settings.
- Saves extracted segments as JSON files.
- Python 3.x
- unstructured-io (with the all-docs extras package)
- Install dependencies:
pip install "unstructured[all-docs]"
- Clone or download the repository.
- Run the script:
python bulk_text_extractor.py <directory>
- Replace <directory> with the path to your documents directory.
- The script will process each document and save extracted segments in a subdirectory within the specified directory.
- You can modify the BulkTextExtract class to customize settings such as the chunking strategy and page break handling.
- Refer to the unstructured-io documentation for more advanced functionality.
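
The flow described above (find documents, split them into segments, save the segments as JSON in a subdirectory) can be sketched roughly as follows. This is an illustrative sketch, not the code in bulk_text_extractor.py: the function names, the "segments" output folder name, the one-JSON-file-per-document layout, and the max_characters value are all assumptions made for the example. It uses partition and chunk_by_title from unstructured as one possible chunking strategy; whether every listed format is handled depends on your installed version and its system dependencies.

```python
import json
from pathlib import Path

# Extensions taken from the feature list in this README.
SUPPORTED_EXTENSIONS = {".pdf", ".mobi", ".epub", ".djvu"}


def find_documents(directory):
    """Return supported files directly inside `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def extract_segments(path, max_characters=1000):
    """Partition one document and chunk it by title (requires unstructured)."""
    # Imported lazily so the rest of the sketch runs without the library.
    from unstructured.partition.auto import partition
    from unstructured.chunking.title import chunk_by_title

    elements = partition(filename=str(path))
    chunks = chunk_by_title(elements, max_characters=max_characters)
    return [chunk.text for chunk in chunks]


def process_directory(directory, extract=extract_segments, output_name="segments"):
    """Write one JSON file per document into `directory/output_name` (hypothetical layout)."""
    output_dir = Path(directory) / output_name
    output_dir.mkdir(exist_ok=True)
    written = []
    for doc in find_documents(directory):
        # Keep the original extension in the name so book.pdf and book.epub do not collide.
        target = output_dir / (doc.name + ".json")
        payload = json.dumps(extract(doc), ensure_ascii=False, indent=2)
        target.write_text(payload, encoding="utf-8")
        written.append(target)
    return written
```

Passing a different callable as extract is one way to swap in another chunking strategy without touching the directory handling.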
MIT License