Apply OCR to Scanned image PDF files

Apply OCR to scanned PDF files to extract the text from their page images. This version expects the text to be written in Brazilian Portuguese (pt-BR).

Setup

To set up the environment on Ubuntu, run the following commands in a terminal:

chmod a+x setup.sh   # run this line only the first time
./setup.sh

The setup script installs Tesseract, the Brazilian Portuguese language data for Tesseract, and ImageMagick. It also configures ImageMagick's policy.xml file so that ImageMagick is allowed to convert PDF files.
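
For readers who want to see roughly what these steps involve, here is a minimal sketch. It is not the contents of setup.sh. The package names (tesseract-ocr, tesseract-ocr-por, imagemagick), the policy path /etc/ImageMagick-6/policy.xml, the RUN_SETUP switch, and both function names are assumptions based on a typical Ubuntu installation with ImageMagick 6, so check your system before running it.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the setup steps; the real setup.sh may differ.

# Install Tesseract, its Portuguese language data, and ImageMagick.
# Package names are assumed from the standard Ubuntu repositories.
install_packages() {
    sudo apt-get update &&
    sudo apt-get install -y tesseract-ocr tesseract-ocr-por imagemagick
}

# Relax the "PDF" coder rule in the given ImageMagick policy file so that
# PDFs can be read and written. $2 is an optional command prefix (e.g. sudo).
enable_pdf_policy() {
    local policy_file="$1"
    local prefix="${2:-}"
    $prefix sed -i -E \
        's#(<policy domain="coder" rights=")none(" pattern="PDF")#\1read|write\2#' \
        "$policy_file"
}

# Run the full setup only when explicitly requested, e.g. RUN_SETUP=1.
if [ "${RUN_SETUP:-0}" = "1" ]; then
    install_packages &&
    enable_pdf_policy /etc/ImageMagick-6/policy.xml sudo
fi
```

The policy edit only touches the rule whose pattern is "PDF"; other coder rules in the file are left unchanged.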

Execute OCR

Copy the script pdf_ocr.sh to the folder containing the scanned PDF files and execute it:

chmod a+x pdf_ocr.sh   # run this line only the first time
./pdf_ocr.sh
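
If the script fails, it can help to confirm that the tools from the setup step are actually available. The following is a minimal pre-flight sketch and is not part of pdf_ocr.sh. The function names are hypothetical, and it assumes that the ImageMagick command is called convert (as in ImageMagick 6) and that the Portuguese language data is registered under the code por.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of pdf_ocr.sh.

# Print the name of every given command that is not available on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed if the given Tesseract language code is installed.
# Both output streams are searched because the listing stream varies by version.
has_language() {
    tesseract --list-langs 2>&1 | grep -qx "$1"
}

# Report what is missing, or confirm that the environment looks usable.
preflight() {
    local missing
    missing="$(missing_tools tesseract convert)"
    if [ -n "$missing" ]; then
        echo "Missing tools: $missing" >&2
        return 1
    fi
    if ! has_language por; then
        echo "Tesseract language 'por' is not installed" >&2
        return 1
    fi
    echo "Environment looks ready."
}
```

Calling preflight in the terminal before ./pdf_ocr.sh prints either the missing pieces or a confirmation message.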

Output

For each scanned PDF, the script outputs the following:

a .txt file containing the plain text extracted from the PDF;
a searchable PDF file in which the extracted text can be selected and searched;
an hOCR (.hocr) file containing the extracted text together with its layout information.
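
The three outputs listed above match three of Tesseract's standard output configs (txt, pdf, and hocr). As an illustration only, the sketch below shows one way a script could produce them; it is not the contents of pdf_ocr.sh. The function names, the RUN_OCR switch, the "_ocr" suffix, and the ImageMagick options are assumptions, and the convert command name assumes ImageMagick 6.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an OCR loop; the real pdf_ocr.sh may differ.

# Build the output base name: drop the directory and .pdf, then add "_ocr"
# so the searchable PDF does not overwrite the scanned input.
output_base() {
    printf '%s_ocr\n' "$(basename "$1" .pdf)"
}

# Rasterize one PDF with ImageMagick, then OCR the result with Tesseract.
ocr_pdf() {
    local pdf="$1"
    local base tiff
    base="$(output_base "$pdf")"
    tiff="${base}.tiff"

    # Render at 300 DPI into a multi-page TIFF with an opaque white background.
    convert -density 300 "$pdf" -depth 8 -strip -background white -alpha off "$tiff" &&
    # "-l por" selects Portuguese; txt, pdf and hocr select the output formats.
    tesseract "$tiff" "$base" -l por txt pdf hocr &&
    rm -f "$tiff"
}

# Process every PDF in the current directory only when explicitly requested.
if [ "${RUN_OCR:-0}" = "1" ]; then
    for pdf in ./*.pdf; do
        [ -e "$pdf" ] || continue                   # glob matched nothing
        case "$pdf" in *_ocr.pdf) continue ;; esac  # skip earlier results
        ocr_pdf "$pdf"
    done
fi
```

Under these assumptions, a file named report.pdf would yield report_ocr.txt, report_ocr.pdf, and report_ocr.hocr in the same folder.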

paulocressoni/scanned_pdf_ocr
