The main idea of this project is to extract entities from a scanned business card.
- Extract entities (text and data) from an image of a business card
- Entities: Name, Organization, Phone, Email, and Web Address
- For each entity:
  - 1- Location of the entity
  - 2- Text of the corresponding entity
- Name
- Designation
- Organization
- Phone
- Web Address
- Computer Vision
  - Scanning the document
  - Identifying the location of text
  - Extracting the text from the image
  - Using OpenCV and Tesseract OCR
- Natural Language Processing
  - Extracting entities from text
  - Cleaning and parsing
  - Using Pandas, spaCy, RegEx
1-Setting up Project
- Installations
2-Data Preparation
- Extract Text and Location from Business Card
3-Labelling
- BIO Tagging
4-Data Preprocessing
- Text Cleaning and Processing
5-Training the Named Entity Recognition Model
- Train Machine Learning Model
6-Prediction
- Parsing and Bounding Box
7-Document Scanner App
- Automatic Document Scanner App
Business Card -> Extract Text from Image Using OCR -> Text -> Text Cleaning -> Deep Learning Model Trained in spaCy for NER -> Entities
Collected Data -> Extract Text from Image Using OCR -> Text -> Labeling -> Text Cleaning -> Train NER Model in spaCy
conda create -n docscanner python=3.9
conda activate docscanner
pip install -r requirements.txt
If you do not use Anaconda, create a virtual environment instead:
python -m venv docscanner
Activation on Windows:
.\docscanner\Scripts\activate
For Linux or macOS:
source docscanner/bin/activate
pip install -r requirements.txt
https://tesseract-ocr.github.io/tessdoc/Installation.html
For Windows, go to https://digi.bib.uni-mannheim.de/tesseract/ and download the installer, e.g. tesseract-ocr-w32-setup-v4.1.0.20190314.exe
Note: when you install Tesseract OCR, save the path where it is installed. It will be required in the environment setup.
After installing Tesseract, open “Environment Variables”, click Path, and check that the installation path is listed. If it is not there, add it manually.
After this installation, go to the terminal and type:
pip install pytesseract
Go to this website: https://spacy.io/usage
For Windows
pip install -U spacy
python -m spacy download en_core_web_sm
Open a Jupyter Notebook, import all the libraries we installed, and confirm that everything works without any errors!
There are 5 levels in PyTesseract:
- Level 1: the page. If there is only one image, there is only one page.
- Level 2: the block.
- Level 3: the paragraph.
- Level 4: the line.
- Level 5: the word.
First, Level 1 defines the page. Within that page it detects blocks; within each block, paragraphs; within each paragraph, lines; and within each line, words. Finally, it detects the letters in each word.
After all these steps, each letter is passed to the machine learning model.
In this case we only have one image, which means there is only one page.
After all these steps, it will have detected all the letters (I am kinda lazy to frame each word here :) but I will draw them, and you will find them below).
Once the letters are detected, the machine learning model classifies each one as a letter, number, etc.
Now we will get this hierarchy from the image into data using PyTesseract.
PyTesseract has a dedicated function for this, called image_to_data.
When you execute it, here is what happens:
And now, I will split the data into lines.
Next, I will take every element of that list, split it by the tab character “\t”, and create a new list. As you can see, the first element is separated; I will apply this to all elements and turn them into a DataFrame.
You should also notice that one of the columns is called level, which is what I mentioned before; it also defines the block numbers. This is how we extract data from an image into a pandas DataFrame, and through it we get much clearer information.
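As a sketch of this step, the snippet below parses the kind of tab-separated text that image_to_data returns into a DataFrame. The sample string is illustrative only; on a real image you would pass the output of pytesseract.image_to_data(cv2.imread(...)) instead.

```python
import pandas as pd

# data = pytesseract.image_to_data(cv2.imread("businessCard.jpg"))  # real usage
# Illustrative sample of the TSV text that image_to_data returns:
data = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t600\t400\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t25\t30\t120\t22\t96\tJames\n"
    "5\t1\t1\t1\t1\t2\t150\t30\t90\t22\t95\tBond\n"
)

# Split into lines, then split each line on the tab character "\t"
rows = [line.split("\t") for line in data.split("\n") if line]

# First row is the header, the rest are the data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df[["level", "conf", "text"]])
```

The resulting DataFrame has one row per detected element, with the level column telling you whether the row describes a page, block, paragraph, line, or word.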
To show this, I will draw boxes at those positions, keeping in mind what each level means.
- Level 2: Block
- Level 3: Paragraph
- Level 4: Line number
- Level 5: Word (the text itself)
Before drawing boxes, I should bring the missing values and column types into proper form:
1- Drop missing values 2- Convert the columns to integers
- l: level
- x: left
- y: top
- w: width
- h: height
- c: confidence score
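A minimal sketch of this cleanup and box-drawing step follows. The small DataFrame is hypothetical (in the project it comes from image_to_data); the OpenCV call is shown as a comment so the sketch stays self-contained.

```python
import pandas as pd
import numpy as np

# Hypothetical slice of the image_to_data DataFrame (text is NaN for non-word rows)
df = pd.DataFrame({
    "level": ["1", "5", "5"],
    "left": ["0", "25", "150"],
    "top": ["0", "30", "30"],
    "width": ["600", "120", "90"],
    "height": ["400", "22", "22"],
    "conf": ["-1", "96", "95"],
    "text": [np.nan, "James", "Bond"],
})

# 1- Drop rows without text  2- Convert the numeric columns to integers
df = df.dropna(subset=["text"]).copy()
cols = ["level", "left", "top", "width", "height", "conf"]
df[cols] = df[cols].astype(int)

# One box per word: (x, y) is the top-left corner, (x + w, y + h) the bottom-right
for _, r in df.iterrows():
    x, y, w, h = r["left"], r["top"], r["width"], r["height"]
    # cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)  # with OpenCV
    print(r["text"], (x, y), (x + w, y + h))
```

The l/x/y/w/h/c columns above map directly onto the rectangle call: left/top give the corner, width/height the extent.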
This is what PyTesseract does :)
Now I will apply all these steps to the whole dataset.
For this, I will open a new notebook and import the libraries I need. Using glob I get the paths of the images, and using os I separate the filenames.
As in the first notebook, I will do the same steps and get a DataFrame.
But now I will keep only the rows whose confidence is greater than 30, and create a new DataFrame called businessCard.
Here is the result.
I did these steps to see what would happen. Looks super! Now I will apply all of them to all the data.
After building a new DataFrame called allBusinessCard, I will save it to a csv file.
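A sketch of this aggregation step, assuming two tiny hypothetical per-card OCR frames (in the project each frame comes from image_to_data, iterated over glob paths); writing to a StringIO stands in for the real to_csv("allBusinessCard.csv"):

```python
import io
import pandas as pd

# Hypothetical OCR output for two cards (in the project this comes from image_to_data)
per_card = {
    "card1.jpg": pd.DataFrame({"conf": [-1, 96, 12], "text": ["", "James", "~"]}),
    "card2.jpg": pd.DataFrame({"conf": [91, 20], "text": ["Acme", "."]}),
}

frames = []
for name, df in per_card.items():
    card = df[df["conf"] > 30].copy()   # keep only words with confidence > 30
    card["id"] = name                    # remember which image each word came from
    frames.append(card)

allBusinessCard = pd.concat(frames, ignore_index=True)
buf = io.StringIO()
allBusinessCard.to_csv(buf, index=False)  # in the project: allBusinessCard.to_csv("allBusinessCard.csv")
print(allBusinessCard)
```

Keeping an id column per word makes it possible to label and group the words card by card later.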
The next step is labelling this data, for example: name, organization, phone number, etc.
Now I will tag every word in the csv file.
SOURCE: https://medium.com/analytics-vidhya/bio-tagged-text-to-original-text-99b05da6664 The BIO / IOB format (short for inside, outside, beginning) is a common tagging format for tagging tokens in a chunking task in computational linguistics (ex. named-entity recognition). The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix before a tag indicates that the tag is inside a chunk. The B- tag is used only when a tag is followed by a tag of the same type without O tokens between them. An O tag indicates that a token belongs to no entity / chunk.
The following figure shows what a BIO-tagged sentence looks like:
Description | Tag |
---|---|
Name | NAME |
Designation | DES |
Organization | ORG |
Phone Number | PHONE |
Email Address | EMAIL |
Website | WEB |
Unfortunately, there are no shortcuts for tagging; I have to do it manually inside the csv file.
After this long and boring process, I will prepare the data for the training.
In this example from the spaCy documentation, the annotation is [(0, 11, “BUILDING”)]: the building “Tokyo Tower” spans 11 characters, starting at index 0 and ending at index 11. That is what I need to do to prepare the data; I will specify the entities like this.
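This is the same example in code, showing the (text, annotations) shape that spaCy's training data uses and checking that the character offsets really point at the intended span:

```python
# spaCy training-example format: (text, {"entities": [(start, end, LABEL), ...]})
text = "Tokyo Tower is 333m tall."
annotations = {"entities": [(0, 11, "BUILDING")]}

# Slicing the text with the offsets must give back exactly the entity span
start, end, label = annotations["entities"][0]
span = text[start:end]
print(span, label)  # -> Tokyo Tower BUILDING
```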
Before starting, convert the csv file to tsv (tab-separated values). To convert it, just click “Save As” and choose Tab Delimited txt.
And now, time to open the file.
When we look at the data, it looks like this:
I will apply the same methods as in the 2nd notebook.
This is the data I have right now. I will also turn it into a pandas DataFrame.
This section is the cleaning process. I will remove white space and unwanted special characters because I do not need them.
First, I will define the white space; there are different ways to do this, but a useful one is the “string” library.
The next thing is defining the special characters. However, I will not remove all of them; “@”, for instance, is important for email addresses.
In the image above, I also defined a function which removes white spaces and special characters.
I will apply this function to the DataFrame
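A minimal sketch of such a cleaning function, built on the string library. The exact set of characters kept here ("@" and ".") is an assumption; the real function in the project may keep a slightly different set (and in the first version it also lowercased the text).

```python
import string

# Characters to delete: all whitespace, and all punctuation except the ones
# that matter for entities such as email addresses ("@" and ".")
whitespace = string.whitespace
punctuation = string.punctuation.replace("@", "").replace(".", "")

table_whitespace = str.maketrans("", "", whitespace)
table_punctuation = str.maketrans("", "", punctuation)

def clean_text(text: str) -> str:
    """Remove whitespace and unwanted special characters from one token."""
    text = str(text)
    text = text.translate(table_whitespace)
    text = text.translate(table_punctuation)
    return text

print(clean_text(" james.bond@mi6.gov! "))  # -> james.bond@mi6.gov
```

Applying it to the DataFrame is then a one-liner, e.g. df["text"].apply(clean_text).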
Next thing what I will do is, convert the data into SpaCy format.
Basically, what I am creating is: content holds all the information in the text, and annotations holds the labels with their start and end positions.
I am not interested in “O”, because it means outside; I only care about “B” and “I”.
Let's check whether the annotation is correct.
As can be seen, the start and end positions of the phone number are correct.
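The conversion from BIO-tagged tokens to (content, annotations) can be sketched as below. The tokens and tags are illustrative; in the project they come from the labelled csv. "I-" tags extend the previous entity rather than opening a new one, and "O" tokens are skipped.

```python
# Illustrative BIO-tagged tokens (in the project: one business card from the csv)
tokens = ["james", "bond", "007-007", "mi6"]
tags = ["B-NAME", "I-NAME", "B-PHONE", "B-ORG"]

content = ""
annotations = {"entities": []}
for token, tag in zip(tokens, tags):
    if content:
        content += " "               # words are joined by single spaces
    start = len(content)
    content += token
    end = len(content)
    if tag != "O":                   # "O" means outside: not part of any entity
        label = tag[2:]              # strip the "B-" / "I-" prefix
        if tag.startswith("I-") and annotations["entities"]:
            # "I-" continues the previous entity: extend its end position
            prev_start, _, prev_label = annotations["entities"][-1]
            annotations["entities"][-1] = (prev_start, end, prev_label)
        else:
            annotations["entities"].append((start, end, label))

print(content)
print(annotations)
```

Slicing content with each (start, end) pair gives back exactly the entity text, which is how the annotation can be verified.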
After this step, I will apply these steps to the whole dataset.
First, I will shuffle the dataset,
and then split the data 90% - 10%.
The next thing is saving the data into the data folder using the “pickle” library.
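The shuffle, split, and pickle steps can be sketched as follows. The allCards list is a hypothetical stand-in for the real (content, annotations) examples, and the seed is only there to make this sketch reproducible; serializing to bytes stands in for writing data/train.pickle and data/test.pickle.

```python
import pickle
import random

# Hypothetical list of (content, annotations) training examples
allCards = [(f"card text {i}", {"entities": []}) for i in range(100)]

random.seed(42)                 # seed only for reproducibility of this sketch
random.shuffle(allCards)

# 90% / 10% split
split = int(len(allCards) * 0.9)
train, test = allCards[:split], allCards[split:]

# In the project these go to ./data/train.pickle and ./data/test.pickle
train_bytes = pickle.dumps(train)
test_bytes = pickle.dumps(test)
print(len(train), len(test))    # -> 90 10
```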
In the next step, I will train a Named Entity Recognition (NER) model.
spaCy is one of the most popular and useful frameworks for Natural Language Processing. It is easy to use and offers many predefined models.
What I will do is take a model, use the framework, and train. It is very simple.
To get the model, first visit this website and go to the Quickstart. Choose what you need, and spaCy will give you the predefined config.
Then click download. That’s all!
To fill in all the details of the configuration, I need to type a magic command in the terminal.
When I open the config file, there is a note in it:
python -m spacy init fill-config ./base_config.cfg ./config.cfg
I will paste it to terminal.
And it worked!
Now I will train the model by following commands.
As you can see, the expected format is .spacy, but earlier I saved the train and test data as .pickle, so now I will convert them to .spacy. The documentation has a section called Preparing Training Data; the conversion is also very easy, so I will copy the code and just make some changes. That's all!
All I need to do is run the preprocess.py file, which converts from .pickle to .spacy format.
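The core of that conversion, following spaCy's Preparing Training Data pattern, can be sketched like this. The inline training_data is illustrative (in the project it is loaded from data/train.pickle), and serializing to bytes stands in for db.to_disk("data/train.spacy").

```python
import spacy
from spacy.tokens import DocBin

# Illustrative (content, annotations) examples; in the project: pickle.load(...)
training_data = [("james bond mi6", {"entities": [(0, 10, "NAME"), (11, 14, "ORG")]})]

nlp = spacy.blank("en")          # blank English pipeline, only used for tokenisation
db = DocBin()
for text, annot in training_data:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        # char_span returns None when the offsets don't align with token boundaries
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)

doc_bytes = db.to_bytes()        # in the project: db.to_disk("data/train.spacy")
print(len(training_data), "doc(s) converted")
```

The alignment_mode="contract" choice quietly shrinks slightly misaligned spans instead of dropping the whole example, which is handy with noisy OCR offsets.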
And It’s ready, that’s all here!
And as a final step in the training process, I need to train the model using the config file. Before this, I will create a folder called output to save output files.
python -m spacy train config.cfg --output output --paths.train data/train.spacy --paths.dev data/test.spacy
When I run this command, the training starts.
After the training, two folders appear inside the output directory.
model-best contains the highest-scoring model, with a score of 0.64, and model-last contains the model from the last training step, with a score of 0.62.
I will use the best one for prediction.
In this section, I will test the NER model that I trained with spaCy. All the steps I will apply are the same as before.
From old notebook, I will copy and paste the function for cleaning text.
There are several ways to render the content.
As you can see here, the NER (named entity recognition) model classified the content.
Now I will tag every word again.
There are different ways to do BIO tagging, but I will use the same doc.
A token is each word; I will convert this information into a DataFrame.
I also turned the tokens into a DataFrame, and now I will combine it with doc_text.
And this is basically what lambda function does:
After this step, I will add one more column to the DataFrame which is entities.
Here you can see there are some “NaN” values; I will replace them with “O”.
As the next step, I will combine the “label” with the data_clean column.
This join will make drawing the bounding boxes much more convenient.
The reason for adding “+1” is that the words are separated by one space. Taking a cumulative sum gives the end position of every word, and subtracting one removes the space. Here are the correct end positions.
I will also create the start positions: the start position is the end position minus the length of the word.
Now I will combine them inside df_clean.
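The cumulative-sum trick can be sketched on a tiny hypothetical token column; slicing the joined content with the computed offsets confirms they are correct:

```python
import pandas as pd

# Tokens of one card after cleaning (illustrative)
df_clean = pd.DataFrame({"token": ["james", "bond", "mi6"]})

# Words are joined by single spaces, so cumulatively summing len(word) + 1
# overshoots each word by one space; subtracting 1 gives the true end position.
df_clean["end"] = df_clean["token"].str.len().add(1).cumsum() - 1
# start position = end position - length of the word
df_clean["start"] = df_clean["end"] - df_clean["token"].str.len()

content = " ".join(df_clean["token"])
print(df_clean)
print([content[s:e] for s, e in zip(df_clean["start"], df_clean["end"])])
```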
And now, I will merge all dataframes into a new dataframe
Text and token may look the same, but let's check again by looking at the last columns of the DataFrame.
As can be seen, token contains the clean text.
To draw the bounding boxes, I need all the information except the label O, so I will filter the main DataFrame.
As the next step, I will combine the BIO information.
For this, I will separate the labels by applying a lambda function which removes the B-/I- prefix of each label.
And now I will define a class which groups consecutive tokens that carry the same label.
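One possible sketch of such a grouping helper (the class name and method are assumptions, not the project's exact code): it hands out an increasing group id, starting a new id whenever the incoming label differs from the previous one, so consecutive tokens of the same entity share a group.

```python
class GroupGen:
    """Assign an increasing group id; a new id whenever the key changes."""

    def __init__(self):
        self.id = 0
        self.text = ""

    def getgroup(self, text):
        if self.text == text:
            return self.id       # same label as before: same group
        self.id += 1             # label changed: open a new group
        self.text = text
        return self.id

grp = GroupGen()
labels = ["NAME", "NAME", "ORG", "ORG", "PHONE", "NAME"]
groups = [grp.getgroup(label) for label in labels]
print(groups)  # -> [1, 1, 2, 2, 3, 4]
```

With a group column like this, a groupby("group") merges the bounding boxes and the text of each multi-word entity.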
According to this information, I will draw the boxes again. But before doing that, I will create two more columns for the right and bottom positions.
For tagging, I will also group the DataFrame by the group column.
NOTE: I changed the image to get clearer values, so the next data will be different from the previous ones.
And entities are drawn.
Now I will combine the text where the B and I tags are. At the same time, I will also do parsing: for the phone number I will keep only digits, for the email address only the allowed characters, and so on.
It works well! It cleans the special characters, and this is how the parser will work. Now, using the entities, I will save them into a dictionary with a simple loop.
The basic idea is that, for instance in the image above, B-NAME: james and I-NAME: bond are combined into one name.
Except for the phone number, everything looks great.
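A sketch of such a per-entity parser follows. The exact character sets kept for each label are assumptions for illustration; the project's real parser may differ in detail.

```python
import re
import string

def parser(text, label):
    """Hypothetical per-entity parser: keep only the characters each entity needs."""
    if label == "PHONE":
        text = re.sub(r"\D", "", text)                      # digits only
    elif label == "EMAIL":
        allowed = string.ascii_letters + string.digits + "@._-"
        text = "".join(ch for ch in text.lower() if ch in allowed)
    elif label == "WEB":
        allowed = string.ascii_letters + string.digits + ":/.%#-"
        text = "".join(ch for ch in text.lower() if ch in allowed)
    elif label in ("NAME", "DES"):
        text = re.sub(r"[^a-z ]", "", text.lower()).title() # letters only, Title Case
    return text

print(parser("Tel: +1 (555) 010-99", "PHONE"))   # -> 155501099
print(parser("James.Bond@MI6.gov!", "EMAIL"))    # -> james.bond@mi6.gov
```

Looping over the predicted entities and feeding each (text, label) pair through this function fills the final dictionary of parsed fields.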
Now, to tidy up the code, I will define a pipeline that contains all these steps plus a prediction function.
From the notebook 04_Predictions.ipynb I copied all the steps and turned them into a function. I only deleted some useless lines of code; nothing actually changed.
You can find the code in prediction.py
The notebook 05_Final_Predictions.ipynb is my test notebook, where I test the prediction function from prediction.py.
And here are the results:
As you can see, the model got confused detecting the phone number again :)
I have created a folder called VERSION_2, in which I only changed the clean_text function. In the first version the text was lowercased; in Version 2 I removed that, because some organization names etc. contain uppercase words. It worked slightly better; it is not me saying this, the accuracy is. Here are the accuracy reports:
In the first version the best accuracy was 0.64; here it is 0.72. Much better!
Here are some examples of predictions.
In this notebook, I will work on fixing images that are rotated etc. This is necessary for PyTesseract to work properly, since it does not handle rotated images well.
1- Resize the image and set the aspect ratio 2- Image processing
- Enhance
- Gray Scale
- Blur
- Edge Detection
- Morphological Transform
- Contours
- Find Four Points
As you can see, there is some noise around the image; I will apply morphological functions to clean it.
After dilation, here is the result: as can be seen, the thickness increased. As my second step, I will apply closing.
What I will do is multiply these four_points by a multiplier: the width of the original image divided by the width of the resized image.
With these four points, I will warp the original image using the imutils library.
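The scaling step can be sketched as below; the corner coordinates and image widths are made-up values, and the imutils warp is shown as a comment since it needs the actual image.

```python
import numpy as np
# from imutils.perspective import four_point_transform  # used on the real image

# Corner points found on the resized image (hypothetical values)
four_points = np.array([[10, 12], [190, 15], [195, 140], [8, 138]])

# Scale the points back to the coordinate system of the original image
original_width, resized_width = 1200, 200
multiplier = original_width / resized_width
four_points_orig = (four_points * multiplier).astype(int)

print(four_points_orig)
# warped = four_point_transform(original_image, four_points_orig)
```

Detecting contours on a small resized copy is much faster, and this multiplication is what lets the warp still happen at the original resolution.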
To analyse the images, I also return the resized image (with its contours drawn) and the closing image.
Next, I will also define a function that finds good brightness and contrast.
As can be seen, the colour balance of the magic image is much clearer; when I apply the NER algorithm, it will be easier to read and detect.
In summary, the Magic Image function increases the contrast and brightness of the image.
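The core idea of such a brightness/contrast boost can be sketched with a simple linear transform. The alpha/beta values are assumptions for illustration; with OpenCV the same operation is cv2.convertScaleAbs(img, alpha=alpha, beta=beta).

```python
import numpy as np

def magic_image(img, alpha=1.5, beta=30):
    """Linear brightness/contrast boost: out = clip(alpha * img + beta, 0, 255).

    alpha > 1 stretches the contrast, beta > 0 lifts the brightness.
    """
    out = alpha * img.astype(np.float32) + beta
    return np.clip(out, 0, 255).astype(np.uint8)

gray = np.array([[0, 100, 200]], dtype=np.uint8)
print(magic_image(gray))  # -> [[ 30 180 255]]
```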
## Integration of NER Prediction
The first thing I do is import my best model, which is in Version 2, and then read one of the images.
Unfortunately, the model cannot predict well because I did not feed it enough data. If I gave it more data, I would definitely get much better results. But that is not my priority; I am doing this exercise to learn and practise with PyTesseract.