Document NER

Master's thesis defense project.

General info

This project combines classic optical character recognition (OCR), which processes an image and extracts its text, with named entity recognition (NER). The NER model is trained on a set of business cards.

Technologies

Python
Jupyter Notebook
OCR
OpenCV
Tesseract OCR
NER
Pandas
SpaCy
RegEx

Solution architecture

Computer vision scans the document, locates the text and extracts it from the image; natural language processing then extracts entities from that text. Concretely, the document image is read with OCR to obtain text in editable form. The extracted text is cleaned and passed to a model trained to recognize named entities. Finally, the model outputs the named entities it found.
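The flow described above can be sketched roughly as follows. This is a minimal illustration and not the code from predictions.py: the function names `ocr_image`, `clean_text` and `extract_entities`, the cleaning regex and the `(text, label)` output shape are all assumptions made for the example. The OCR step uses OpenCV and pytesseract, and the NER step is passed in as a callable so that a trained spaCy pipeline can be plugged in.

```python
import re
from typing import Callable, List, Optional, Tuple

Entity = Tuple[str, str]  # (entity text, entity label); shape assumed for this sketch


def ocr_image(image_path: str) -> str:
    """Read an image and return its text using OpenCV and Tesseract.

    The imports are local so the rest of the sketch runs without these packages.
    """
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray)


def clean_text(raw: str) -> str:
    """Replace unusual symbols with spaces and collapse whitespace (hypothetical rules)."""
    text = re.sub(r"[^\w\s@.,:+\-/()&]", " ", raw)
    return re.sub(r"\s+", " ", text).strip()


def extract_entities(
    image_path: str,
    ner: Callable[[str], List[Entity]],
    ocr: Optional[Callable[[str], str]] = None,
) -> List[Entity]:
    """Run OCR, clean the text and hand it to the NER step."""
    read = ocr if ocr is not None else ocr_image
    return ner(clean_text(read(image_path)))
```

With a trained spaCy pipeline loaded as `nlp`, the NER callable could be `lambda text: [(ent.text, ent.label_) for ent in nlp(text).ents]`.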

Scheme of the process

The process and the operation of the application can be described in the following steps:

Documents are sent from desktop or mobile devices.
The input consists of paper documents, submissions and emails containing scans or photos of documents.
A number of these document files are collected to form the document base.
OCR analyzes the photos and scans of the documents.
Text is extracted from the document base.
The text data is generated, preprocessed and cleaned.
The text data is labeled with the BIO scheme for training the NER model.
The NER model is trained.
Text data with named entities is extracted from the documents.
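The BIO labeling step can be illustrated with a short example. In the BIO scheme the first token of an entity is tagged `B-<label>`, the following tokens of the same entity `I-<label>`, and every other token `O`. The helper name `bio_tags`, the sample tokens and the labels `NAME`, `DES` and `EMAIL` are hypothetical; the project's actual label set is not given here.

```python
from typing import Dict, List, Tuple


def bio_tags(tokens: List[str], spans: Dict[Tuple[int, int], str]) -> List[str]:
    """Tag tokens with the BIO scheme.

    spans maps (start, end) token indices, end exclusive, to an entity label.
    """
    tags = ["O"] * len(tokens)
    for (start, end), label in spans.items():
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the same entity
    return tags


# Hypothetical business card text, already split into tokens.
tokens = ["John", "Doe", "CEO", "john@acme.com"]
spans = {(0, 2): "NAME", (2, 3): "DES", (3, 4): "EMAIL"}

for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(f"{token}\t{tag}")
```

Here "John Doe" becomes `B-NAME` followed by `I-NAME`, while the single-token entities each receive only a `B-` tag.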

Setup

The whole pipeline lives in a single file, predictions.py, which contains all the necessary functions. The code is commented.
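The functions that predictions.py exposes are not listed here, so the following is only a sketch of one step a pipeline like this usually needs at the end: merging token-level BIO predictions back into whole entities. The function name `group_entities` and the labels in the example are hypothetical.

```python
from typing import Dict, List


def group_entities(tokens: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Merge BIO-tagged tokens into entity strings grouped by label."""
    entities: Dict[str, List[str]] = {}
    label, words = None, []
    # The trailing ("", "O") pair is a sentinel that flushes the last open entity.
    for token, tag in list(zip(tokens, tags)) + [("", "O")]:
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == label:
            words.append(token)          # continue the current entity
            continue
        if label is not None:
            entities.setdefault(label, []).append(" ".join(words))
        # Start a new entity on B- (or a stray I-), otherwise reset.
        label, words = (name, [token]) if prefix in ("B", "I") else (None, [])
    return entities


result = group_entities(
    ["John", "Doe", "CEO", "john@acme.com"],
    ["B-NAME", "I-NAME", "O", "B-EMAIL"],
)
print(result)
```

A stray `I-` tag with no preceding `B-` is treated as the start of a new entity, which is a lenient design choice for noisy model output.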


PatrykBala/DocumentNER
