This project uses OCR to extract text from PDF files. It runs text extraction with different language models, merges the results, and annotates the text on images converted from the PDF pages.
The pytesseract library is used for text extraction from images. The models used are:
- ISOCP.traineddata: A model for the German language
- ISOCP1.traineddata: A model for the English language
The extracted texts are merged, and rectangles are drawn around the text on the converted image.
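The flow described above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the function names (`words_from_data`, `merge_words`, `annotate_pdf`) are hypothetical, and the merge rule is an assumption, since the README does not say how the two results are combined. Here the word boxes from both models are pooled, and when two words occupy exactly the same box, the one with the higher confidence is kept. The model names `ISOCP` and `ISOCP1` are assumed to be installed in the Tesseract tessdata directory.

```python
def words_from_data(data, min_conf=0.0):
    """Turn a pytesseract image_to_data dict into a list of word records."""
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # conf may arrive as str, int, or float
        if text.strip() and conf >= min_conf:
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            words.append({"text": text, "conf": conf, "box": box})
    return words


def merge_words(first, second):
    """Pool two word lists; for identical boxes keep the higher confidence.

    Assumed merge rule, chosen for illustration only.
    """
    best = {}
    for word in first + second:
        kept = best.get(word["box"])
        if kept is None or word["conf"] > kept["conf"]:
            best[word["box"]] = word
    # Sort top-to-bottom, then left-to-right, for a stable reading order.
    return sorted(best.values(), key=lambda w: (w["box"][1], w["box"][0]))


def annotate_pdf(pdf_path, langs=("ISOCP", "ISOCP1")):
    """OCR each page with every model, merge, and draw boxes on the page."""
    # Imported here so the helpers above can be used without the OCR stack.
    import pytesseract
    from pdf2image import convert_from_path
    from PIL import ImageDraw

    pages = convert_from_path(pdf_path)  # one PIL image per PDF page
    for page in pages:
        merged = []
        for lang in langs:
            data = pytesseract.image_to_data(
                page, lang=lang, output_type=pytesseract.Output.DICT
            )
            merged = merge_words(merged, words_from_data(data))
        draw = ImageDraw.Draw(page)
        for word in merged:
            left, top, width, height = word["box"]
            draw.rectangle(
                [left, top, left + width, top + height],
                outline="red", width=2,
            )
    return pages
```

Calling `annotate_pdf("input.pdf")` (a placeholder file name) would return the annotated page images, which can then be saved with `page.save(...)`. A real merge may need to treat overlapping but non-identical boxes as duplicates as well.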
- Install the required libraries using pip:
  pip install -r requirements.txt
- Language: Python
- Dependencies: pytesseract, PIL (Pillow), pdf2image
- License: MIT
- Clone the repository:
  git clone https://github.com/ghonim0007/OCR-Text-Extraction-Annotation.git
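
After cloning and installing, a quick check like the one below can confirm that the environment is ready. It is a minimal sketch with hypothetical helper names (`missing_modules`, `tesseract_on_path`). It only verifies that the Python modules can be found and that a `tesseract` executable is on the PATH, which pytesseract needs because it is a wrapper around that program. It does not check that the ISOCP models themselves are installed.

```python
import importlib.util
import shutil


def missing_modules(names=("pytesseract", "PIL", "pdf2image")):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def tesseract_on_path():
    """Report whether a tesseract executable is discoverable on the PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules() or "none")
    print("tesseract executable found:", tesseract_on_path())
```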