doc curation

A package for curating doc file collections. Prominent features:

  • Scrape texts off various sites, such as Wikisource. See example here. (PS: Consider contributing to the raw_etexts repo.)
  • OCR a PDF with Google Drive. The PDF is automatically split into 25-page chunks, and each chunk is OCRed individually. See usage example here, function here.
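The splitting step described above can be illustrated with a small sketch. This is not the package's own implementation; `page_chunks` is a hypothetical helper that only shows how a page count maps to fixed-size page ranges, assuming 1-based inclusive page numbers and a chunk size of 25.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, chunk_size: int = 25) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges covering a document.

    Hypothetical helper for illustration; not part of doc_curation.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Step through the document chunk_size pages at a time; the last
    # chunk is clipped to the final page.
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


# A 60-page PDF yields two full chunks and one shorter one.
print(page_chunks(60))  # [(1, 25), (26, 50), (51, 60)]
```

Each range would then be written out as its own small PDF and OCRed separately, which keeps every upload small.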

For users

Installation or upgrade

  • For stable version pip install doc_curation -U -e.[all]
  • For latest code pip install git+https://github.com/sanskrit-coders/doc_curation/@master -U -e.[all]
  • Web.

Usage

Google Drive API wrapper

  • Enable the Google Drive API and download a service account key file that has Google Drive API access. (See details in the split_and_ocr_on_drive function documentation, e.g. in the GitHub source.)
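from doc_curation.pdf import drive_ocr

# Path to the PDF to OCR, and to the service account key file.
pdf_file = '/home/file.pdf'
key_file = '/home/key.json'
# Split the PDF into 5-page pieces and OCR each piece on Google Drive.
drive_ocr.split_and_ocr_on_drive(pdf_path=pdf_file, google_key=key_file, small_pdf_pages=5)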

Command line invocation:

# For help and details:
/usr/bin/python3 -m doc_curation.pdf.drive_ocr --help
# To OCR a file or a directory:
/usr/bin/python3 -m doc_curation.pdf.drive_ocr --input_path=/some/Dir/Or/File --google_key=/some/path/service_account_key.json --small_pdf_pages=5
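The --input_path option above may point at either a directory or a single file. How drive_ocr resolves that path is not shown here; the following is only a sketch of one plausible approach, assuming a recursive search for files ending in .pdf (case-insensitive). `collect_pdfs` is a hypothetical name, not a doc_curation function.

```python
from pathlib import Path
from typing import List


def collect_pdfs(input_path: str) -> List[Path]:
    """Resolve a file-or-directory argument to a sorted list of PDF paths.

    Hypothetical helper for illustration; not part of doc_curation.
    """
    path = Path(input_path)
    if path.is_file():
        # A single file is accepted only if it looks like a PDF.
        return [path] if path.suffix.lower() == ".pdf" else []
    if path.is_dir():
        # Walk the directory tree and keep regular files with a .pdf suffix.
        return sorted(
            p for p in path.rglob("*")
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    raise FileNotFoundError(f"No such file or directory: {input_path}")
```

Each returned path could then be passed as pdf_path to split_and_ocr_on_drive, one call per file.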

Using google_vision_pdf.py to OCR a PDF into txt files:

  • Set GOOGLE_APPLICATION_CREDENTIALS to the path of your service account key file. E.g.:
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"
  • Invoke the script, passing in the input file. E.g.:
python3 google_vision_pdf.py --input-file <input.pdf>
/usr/bin/python3 -m doc_curation.pdf.google_vision_pdf  --input-file <input.pdf>
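A missing or mistyped GOOGLE_APPLICATION_CREDENTIALS value is an easy mistake to make before running the script. The sketch below is an optional pre-flight check and not part of google_vision_pdf.py; `credentials_path` is a hypothetical helper that only confirms the variable is set and points at an existing file.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def credentials_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the key file named by GOOGLE_APPLICATION_CREDENTIALS, or raise.

    Hypothetical helper for illustration; not part of doc_curation.
    """
    # Default to the real process environment; a mapping can be passed for testing.
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Key file not found: {path}")
    return path
```

Calling credentials_path() before starting the OCR run surfaces a configuration problem immediately rather than partway through the job.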

For contributors

Contact

Have a problem or question? Please head to GitHub.

Packaging

  • ~/.pypirc should contain your PyPI login credentials.
python setup.py bdist_wheel
twine upload dist/* --skip-existing

Build documentation

  • Sphinx HTML docs can be generated with cd docs; make html

Testing

Run pytest in the root directory.

Auxiliary tools
