install.sh is for Debian Linux.
Prerequisites: apt and Python above 3.8
The project includes 3 main parts:
- PDF Text Extractor: extracts text from a PDF
- Image Extractor from PDF: extracts images and saves them to a folder
- Text Visualizer: visualizes the text to show what the computer recognizes
If on Debian Linux, follow these steps:
- Install tesseract-ocr and libtesseract-dev using your OS package manager
- Create a virtual environment: python3 -m venv venv
- Activate it: source venv/bin/activate
- Install all required libraries: pip install -r requirments.txt
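
After going through the setup steps above, a quick sanity check can confirm the environment is ready. The following is only a minimal sketch and is not part of the project: the function name `check_environment` and its parameters are hypothetical, "above 3.8" is assumed to mean 3.8 or newer, and the check assumes the tesseract-ocr package puts a `tesseract` executable on PATH.

```python
import shutil
import sys

# Assumption: "above 3.8" is read here as 3.8 or newer.
MIN_PYTHON = (3, 8)


def check_environment(version_info=sys.version_info, prefix=sys.prefix,
                      base_prefix=sys.base_prefix, which=shutil.which):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Compare only (major, minor) against the minimum version.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python %d.%d or newer is required" % MIN_PYTHON)
    # Inside a venv, sys.prefix differs from sys.base_prefix.
    if prefix == base_prefix:
        problems.append("no virtual environment is active")
    # Assumption: the tesseract-ocr package provides a `tesseract` executable.
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

The parameters default to the live interpreter values, so calling `check_environment()` with no arguments inspects the current environment, while passing explicit values makes the function easy to try out in isolation.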
Depending on your workload, use main.py if you want a graphical interface, or mainCLI.py if you want to use command-line arguments.
For mainCLI.py, you can use either syntax:
python3 mainCLI.py PDFfile
or
python3 mainCLI.py PDFfile -o outputFileName
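
The two command-line forms above amount to one required positional argument and one optional `-o` value. The project's actual argument handling is not shown here, so the following is only a sketch of how such an interface could be parsed with the standard `argparse` module. The names `build_parser`, `pdf_file`, and `output`, and the sample file names, are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser accepting `PDFfile` and an optional `-o outputFileName`."""
    parser = argparse.ArgumentParser(description="Extract text from a PDF file.")
    # Required positional argument: the PDF to read.
    parser.add_argument("pdf_file", metavar="PDFfile",
                        help="path to the input PDF")
    # Optional output name; stays None when -o is not given.
    parser.add_argument("-o", dest="output", metavar="outputFileName",
                        default=None, help="name of the output file")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["report.pdf", "-o", "report.txt"])
print(args.pdf_file, args.output)  # prints "report.pdf report.txt"
```

In a real script the explicit list would be dropped, so that `parse_args()` reads the arguments typed on the command line.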
For visualizer.py, the syntax is:
python3 visualizer.py PDFfile