A Python-based project for OCR text extraction from PDF files and image annotation using extracted text.

ghonim0007/OCR-Text-Extraction-Annotation

OCR Text Extraction and Annotation

This project uses OCR to extract text from PDF files, merges the extraction results from different language models, and then annotates the extracted text on images converted from the PDF pages.

Description

The pytesseract library is used for text extraction from images. The models used are:

  • ISOCP.traineddata: A model for the German language
  • ISOCP1.traineddata: A model for the English language

The extracted texts are merged, and rectangles are drawn around the text on the converted image.
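
The README does not show how the two results are merged, so the following is only a minimal sketch of one possible approach. It assumes that word boxes from the two models are matched by overlap and that the higher-confidence reading wins. The helper names (`words_from_ocr`, `overlap_ratio`, `merge_words`, `annotate_pdf`), the 0.5 overlap threshold, and the red outline are illustrative choices, not taken from the project code. The language names `ISOCP` and `ISOCP1` assume both `.traineddata` files are in Tesseract's tessdata directory.

```python
def words_from_ocr(data, min_conf=0.0):
    """Convert a pytesseract image_to_data dict into word records."""
    words = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        conf = float(data["conf"][i])  # conf may arrive as str, int, or float
        if not text or conf < min_conf:
            continue  # skip empty entries and non-word rows (conf == -1)
        left, top = data["left"][i], data["top"][i]
        box = (left, top, left + data["width"][i], top + data["height"][i])
        words.append({"text": text, "box": box, "conf": conf})
    return words


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (ix * iy) / smaller if smaller > 0 else 0.0


def merge_words(first, second, threshold=0.5):
    """Merge two word lists; on overlapping boxes keep the higher confidence."""
    merged = list(first)
    for cand in second:
        for i, kept in enumerate(merged):
            if overlap_ratio(kept["box"], cand["box"]) >= threshold:
                if cand["conf"] > kept["conf"]:
                    merged[i] = cand
                break
        else:
            merged.append(cand)  # no overlap with any kept word: new word
    return merged


def annotate_pdf(pdf_path, langs=("ISOCP", "ISOCP1")):
    """Run OCR once per model on each page, merge, and draw the boxes."""
    # Third-party imports are local so the helpers above run without them.
    import pytesseract
    from pdf2image import convert_from_path
    from PIL import ImageDraw

    pages = convert_from_path(pdf_path)  # one PIL image per PDF page
    for page in pages:
        merged = []
        for lang in langs:
            data = pytesseract.image_to_data(
                page, lang=lang, output_type=pytesseract.Output.DICT
            )
            merged = merge_words(merged, words_from_ocr(data))
        draw = ImageDraw.Draw(page)
        for word in merged:
            draw.rectangle(word["box"], outline="red", width=2)
    return pages
```

Calling `annotate_pdf("input.pdf")` (a placeholder file name) would return the annotated page images, which can then be saved with `page.save(...)`. Matching by box overlap rather than by text is a deliberate choice here, because the two models may read the same word differently.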

How to Use

  1. Install the required libraries

    Install the required libraries with pip:

    pip install -r requirements.txt
    

🚀 Features

  • Language: Python
  • Dependencies: pytesseract, Pillow (PIL), pdf2image
  • License: MIT

Installation

  1. Clone the repository:
    git clone https://github.com/ghonim0007/OCR-Text-Extraction-Annotation.git