This tool takes an arbitrary PDF file, runs it through Google Cloud Vision, and generates hOCR and searchable PDF output for it.
It uses the DOCUMENT_TEXT_DETECTION operation on Cloud Vision, but could easily be adapted to use TEXT_DETECTION instead.
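As an illustration of how the two operations differ at the request level, here is a minimal sketch of a request body for the Cloud Vision images:annotate REST endpoint. The helper name buildAnnotateRequest is hypothetical and is not taken from this project's scripts; only the feature type string changes between the two modes.

```javascript
// Hypothetical helper (not part of this project): builds the JSON body for a
// Cloud Vision `images:annotate` request for a single page image.
function buildAnnotateRequest(imageBuffer, featureType = 'DOCUMENT_TEXT_DETECTION') {
  const allowed = ['DOCUMENT_TEXT_DETECTION', 'TEXT_DETECTION'];
  if (!allowed.includes(featureType)) {
    throw new Error(`Unsupported feature type: ${featureType}`);
  }
  return {
    requests: [
      {
        // The image bytes are sent inline as base64.
        image: { content: imageBuffer.toString('base64') },
        // Switching operations only means changing this one string.
        features: [{ type: featureType }],
      },
    ],
  };
}
```

Calling buildAnnotateRequest(buffer, 'TEXT_DETECTION') would produce the same body with the other feature type.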
To convert the GCV JSON output to hOCR, a modified version of gcv2hocr (which only works with TEXT_DETECTION) is used. To convert this hOCR output to a searchable PDF, the hocr-pdf script from the hocr-tools package is used. This script is included at ./lib/hocr-pdf.py.
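To give a feel for the JSON-to-hOCR step, the sketch below converts one word from a DOCUMENT_TEXT_DETECTION response into an hOCR ocrx_word span. It is not the modified gcv2hocr used here; the function names and the id value are hypothetical. It assumes the word object has the shape found under fullTextAnnotation (a boundingBox with vertices and a list of symbols), and that a vertex may omit x or y when the value is 0.

```javascript
// Hypothetical helpers (not the project's gcv2hocr): one GCV word -> one hOCR span.

// Collapse the four polygon vertices into an hOCR bbox: x0 y0 x1 y1.
// A missing coordinate is treated as 0.
function toBbox(vertices) {
  const xs = vertices.map((v) => v.x || 0);
  const ys = vertices.map((v) => v.y || 0);
  return [Math.min(...xs), Math.min(...ys), Math.max(...xs), Math.max(...ys)];
}

// Escape the characters that would break the surrounding markup.
function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Join the word's symbols into text and wrap it in an ocrx_word span.
function wordToHocr(word, id) {
  const text = word.symbols.map((s) => s.text).join('');
  const bbox = toBbox(word.boundingBox.vertices).join(' ');
  return `<span class='ocrx_word' id='${id}' title='bbox ${bbox}'>${escapeHtml(text)}</span>`;
}
```

A full converter would nest these spans inside ocr_line, ocr_par and ocr_page elements; this sketch covers only the innermost level.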
To convert a PDF using the default options, just run the command npm run all <pdfFile>, which will output the searchable PDF file to STDOUT. It also creates a folder called <pdfFile>_ocr, which contains the extracted images, the XML representation of the original PDF, and the hOCR output for each page.
N.B. The Node.js scripts contain some code that is not currently used but is kept for future functionality (including making mixed-content searchable PDFs).
Create an IAM user in Google Cloud Console with a "Service Account User" role and give that account permissions to use the Cloud Vision API. Save the credentials from the console to credentials.json in the root of this project.
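A quick sanity check of the saved file can catch a wrong download before any API call is made. The sketch below assumes credentials.json is a standard Google service account key file, which carries fields such as type, project_id, private_key and client_email. The function names are hypothetical and this check is not part of the project's scripts.

```javascript
// Hypothetical helpers (not part of this project): sanity-check a service
// account key before using it.
function checkServiceAccountKey(key) {
  const required = ['type', 'project_id', 'private_key', 'client_email'];
  const missing = required.filter((f) => typeof key[f] !== 'string' || key[f] === '');
  if (missing.length > 0) {
    throw new Error(`credentials.json is missing: ${missing.join(', ')}`);
  }
  if (key.type !== 'service_account') {
    throw new Error(`Expected type "service_account", got "${key.type}"`);
  }
  return key;
}

// Read and check the key file; the default path assumes the project root
// is the current working directory.
function loadCredentials(path = './credentials.json') {
  const raw = require('fs').readFileSync(path, 'utf8');
  return checkServiceAccountKey(JSON.parse(raw));
}
```

Running loadCredentials() from the project root either returns the parsed key or throws an error naming what is wrong with the file.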