Master's thesis defense project.
The project combines classic optical character recognition (OCR), which processes an image and extracts its text, with named entity recognition (NER), using a model trained on a set of business cards.
Python
Jupyter Notebook
OCR
OpenCV
Tesseract OCR
NER
Pandas
spaCy
RegEx
Computer vision scans the document, locates the text, and extracts it from the image; natural language processing then extracts named entities from that text. The document image is read with OCR to produce editable text. The extracted text is cleaned and passed to a model trained to recognize named entities. Finally, the model outputs the named entities it found.
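The OCR and cleaning stages described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the `opencv-python` and `pytesseract` packages and a local Tesseract installation, and the names `ocr_words`, `clean_token`, and `group_into_lines`, as well as the set of punctuation characters that is kept, are hypothetical.

```python
import re
import string

# Assumption: punctuation worth keeping on a business card
# (emails, URLs, phone numbers). Everything else is dropped.
_KEEP = "@.+-_:/&#()"
_DROP_TABLE = str.maketrans(
    "", "", "".join(ch for ch in string.punctuation if ch not in _KEEP)
)
_WHITESPACE = re.compile(r"\s+")


def ocr_words(image_path):
    """Run Tesseract on an image and return its word-level output as a dict.

    Assumption: opencv-python and pytesseract are installed and the
    Tesseract binary is on PATH. The imports are local so the pure-Python
    helpers below can be used without them.
    """
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)


def clean_token(text):
    """Remove whitespace and unwanted punctuation from one OCR token."""
    return _WHITESPACE.sub("", text).translate(_DROP_TABLE)


def group_into_lines(data, min_conf=0.0):
    """Group cleaned OCR words into text lines, in reading order.

    `data` is a dict shaped like pytesseract's image_to_data output,
    with parallel lists under "text", "conf", "block_num", "par_num"
    and "line_num".
    """
    lines = {}
    for i, raw in enumerate(data["text"]):
        token = clean_token(raw)
        # Tesseract reports conf = -1 for boxes that are not words.
        if not token or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines.setdefault(key, []).append(token)
    return [" ".join(tokens) for _, tokens in sorted(lines.items())]
```

With a real image, `group_into_lines(ocr_words("card.jpg"))` would return one cleaned string per detected text line; the file name `card.jpg` is only a placeholder.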
The process and the operation of the application can be described in the following steps:
- Submission of documents from desktop or mobile devices.
- Input sources: paper documents, submissions, and emails containing scans or photos of documents.
- Collection of these files into a document base.
- Analysis of the document photos and scans with OCR.
- Extraction of text from the document base.
- Generation, preprocessing, and cleaning of the text data.
- Labeling of the text data with the BIO scheme for training the NER model.
- Training of the NER model.
- Extraction of named entities from the document text.
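The BIO labeling step in the list above can be illustrated with a small sketch. In the BIO scheme, the first token of an entity is tagged `B-<label>`, the following tokens of the same entity `I-<label>`, and everything else `O`. The function name `bio_tags` and the labels `NAME`, `DES`, and `EMAIL` are assumptions made for this example, not the label set used in the project.

```python
def bio_tags(tokens, spans):
    """Return one BIO tag per token.

    `spans` is a list of (start, end, label) tuples, where `start` and
    `end` are token indices and `end` is exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span ({start}, {end}) is out of range")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span ({start}, {end}) overlaps another span")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


# Hypothetical tokens from one business card.
tokens = ["John", "Doe", "Data", "Scientist", "|", "john@example.com"]
spans = [(0, 2, "NAME"), (2, 4, "DES"), (5, 6, "EMAIL")]
tags = bio_tags(tokens, spans)
```

Token and tag pairs produced this way are the kind of annotation an NER model is trained on; how they are converted into the training format of a particular library is outside this sketch.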
The whole pipeline is in a single file, predictions.py, which contains all the necessary functions. The code is commented.
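One step a pipeline of this kind typically needs after the NER model has tagged the tokens is grouping them back into entities. The following is a minimal sketch under the assumption that predictions arrive as BIO tags; `collect_entities` and the labels in the example are hypothetical and are not taken from predictions.py.

```python
from collections import defaultdict


def collect_entities(tokens, tags):
    """Group BIO-tagged tokens into a {label: [entity text, ...]} dict."""
    entities = defaultdict(list)
    current_label, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any, and reset the state.
        nonlocal current_label, current_tokens
        if current_label is not None:
            entities[current_label].append(" ".join(current_tokens))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            flush()
            current_label, current_tokens = label, [token]
        elif prefix == "I" and label == current_label:
            current_tokens.append(token)
        else:
            # "O" tags and stray "I-" tags end the current entity;
            # the token itself is not kept.
            flush()
    flush()
    return dict(entities)


# Hypothetical model output for one business card.
tokens = ["John", "Doe", "|", "+48", "123", "john@example.com"]
tags = ["B-NAME", "I-NAME", "O", "B-PHONE", "I-PHONE", "B-EMAIL"]
result = collect_entities(tokens, tags)
```

Here `result` maps each label to the list of entity strings found for it, which is a convenient shape for writing into a table or a Pandas DataFrame.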