OCR Text Extraction and Annotation

This project uses OCR to extract text from PDF files. Each PDF page is converted to an image, text is extracted from it with two different language models, the results are merged, and the recognized text is annotated on the page image.

Description

The pytesseract library is used for text extraction from images. The models used are:

ISOCP.traineddata: A model for the German language
ISOCP1.traineddata: A model for the English language
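
As a rough illustration of how the two models could be applied, the sketch below converts each PDF page to an image and runs pytesseract once per model. It is not taken from the project's code: the function names `lang_code` and `ocr_pdf`, the `dpi` default, and the assumption that both `.traineddata` files are already in Tesseract's tessdata directory are all hypothetical.

```python
from pathlib import Path

# Model files named in this README. Assumption: Tesseract can find them
# in its tessdata directory on the local machine.
MODEL_FILES = ["ISOCP.traineddata", "ISOCP1.traineddata"]


def lang_code(model_file):
    """Return the name Tesseract uses for a model: the file name without its extension."""
    return Path(model_file).stem


def ocr_pdf(pdf_path, dpi=300):
    """Run every model on every page (hypothetical helper, not the project's API)."""
    # Imported here so lang_code() can be used without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path

    pages = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        results = {}
        for model_file in MODEL_FILES:
            lang = lang_code(model_file)
            # Output.DICT yields parallel lists: text, left, top, width, height, conf, ...
            results[lang] = pytesseract.image_to_data(
                image, lang=lang, output_type=pytesseract.Output.DICT
            )
        pages.append({"image": image, "results": results})
    return pages
```

Calling `ocr_pdf("input.pdf")` (a placeholder path) would give one entry per page, each holding the page image and one result dictionary per model.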

The extracted texts are merged, and rectangles are drawn around the text on the converted image.
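
The README does not say how the results are merged, so the following is only one possible strategy, sketched under assumptions: where words from the two models overlap on the page, the one with the higher confidence is kept, and all other words are carried over. The input is assumed to be the dictionary returned by `pytesseract.image_to_data` with `output_type=pytesseract.Output.DICT`; the helper names `words_from_data`, `merge_words`, and `annotate` are hypothetical.

```python
def words_from_data(data):
    """Turn an image_to_data dict into (text, box, conf) tuples, skipping empty entries."""
    words = []
    for text, left, top, width, height, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"],
    ):
        # Tesseract reports conf -1 for layout entries that carry no word.
        if text.strip() and float(conf) >= 0:
            words.append((text, (left, top, left + width, top + height), float(conf)))
    return words


def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_words(first, second):
    """Assumed merge rule: on overlap, keep the word with the higher confidence."""
    merged = list(first)
    for word in second:
        rivals = [i for i, kept in enumerate(merged) if overlaps(kept[1], word[1])]
        if not rivals:
            merged.append(word)
        elif all(word[2] > merged[i][2] for i in rivals):
            merged = [kept for i, kept in enumerate(merged) if i not in rivals]
            merged.append(word)
    return merged


def annotate(image, words):
    """Draw a rectangle around each word on a copy of the page image."""
    from PIL import ImageDraw  # provided by the Pillow package

    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for _text, box, _conf in words:
        draw.rectangle(box, outline="red", width=2)
    return annotated
```

A confidence-based rule is simple to reason about, but it treats any overlap as a conflict; a real implementation might compare boxes more carefully or merge at line level instead.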

How to Use

Install the required libraries

Ensure that the required libraries are installed using pip:
```
pip install -r requirements.txt
```

🚀 Features

Language: Python
Dependencies: pytesseract, Pillow (PIL), pdf2image
License: MIT

Installation

Clone the repository:

git clone https://github.com/ghonim0007/OCR-Text-Extraction-Annotation.git
