The main idea of this project is to extract entities from a scanned business card.
- Extract entities (text and data) from an image of a business card
- Entities: Name, Organization, Phone, Email, and Web Address
- For each entity:
  - 1- Location of the entity
  - 2- Text of the corresponding entity
- Name
- Designation
- Organization
- Phone
- Web Address
- Computer Vision
  - Scanning the document
  - Identifying the location of text
  - Extracting the text from the image
  - Using OpenCV and Tesseract OCR
- Natural Language Processing
  - Extracting entities from text
  - Cleaning and parsing
  - Using Pandas, spaCy, RegEx
1-Setting up Project
- Installations
2-Data Preparation
- Extract Text and Location from Business Card
3-Labelling
- BIO Tagging
4-Data Preprocessing
- Text Cleaning and Processing
5-Training the Named Entity Recognition Model
- Train Machine Learning Model
6-Prediction
- Parsing and Bounding Box
7-Document Scanner App
- Automatic Document Scanner App
Business Card -> Extract Text from Image Using OCR -> Text -> Text Cleaning -> Deep Learning Model Trained in spaCy for NER -> Entities
Collected Data -> Extract Text from Image Using OCR -> Text -> Labeling -> Text Cleaning -> Train NER Model in spaCy
conda create -n docscanner python=3.9
conda activate docscanner
pip install -r requirements.txt
If you do not use Anaconda, create a virtual environment instead:
python -m venv docscanner
Activation on Windows:
.\docscanner\Scripts\activate
For Linux or macOS:
source docscanner/bin/activate
pip install -r requirements.txt
https://tesseract-ocr.github.io/tessdoc/Installation.html
For Windows, go to https://digi.bib.uni-mannheim.de/tesseract/ and download the installer, e.g. tesseract-ocr-w32-setup-v4.1.0.20190314.exe
Note: when you install Tesseract OCR, save the path where it is installed. It will be required in the environment setup.
After installing Tesseract, open “Environment Variables”, click Path, and check that the installation path is listed. If it is not there, add it manually.
After this installation, go to the terminal and type:
pip install pytesseract
Go to this website: https://spacy.io/usage
For Windows
pip install -U spacy
python -m spacy download en_core_web_sm
Open a Jupyter Notebook, import all the libraries we installed, and confirm that everything works without any errors!
There are 5 levels in PyTesseract:
- Level 1: the page. If there is only one image, there is only one page.
- Level 2: the block.
- Level 3: the paragraph.
- Level 4: the line.
- Level 5: the word.
First, Level 1 defines the page. Within that page it detects blocks; within each block, paragraphs; within each paragraph, lines; and within each line, words. Finally, it detects the letters in each word.
After all these steps, each letter is passed to the machine learning model.
In this case we only have one image, which means there is only one page.
After all these steps, it will have detected all the letters (I am kinda lazy to frame each word here :) but I will draw them, and you will find them below).
Once the letters are detected, the machine learning model classifies each one as a letter, number, etc.
Now we will get this hierarchy from the image into data using PyTesseract.
PyTesseract has a dedicated function for this, called image_to_data.
When you execute it, here is what happens:
And now, I will split the data into lines.
Next, I will take every element of that list, split it by the tab character “\t”, and create a new list. As you can see, the first element is separated; I will apply this to all elements and turn them into a DataFrame.
You should also notice that one of the columns is called level, which is what I mentioned before; it also defines the block numbers. This is how we extract data from an image into a pandas DataFrame, and through it we get much clearer information.
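As a sketch of this step, the snippet below parses the kind of tab-separated text that image_to_data returns into a DataFrame. The sample string is illustrative only; on a real image you would pass the output of pytesseract.image_to_data(cv2.imread(...)) instead.

```python
import pandas as pd

# data = pytesseract.image_to_data(cv2.imread("businessCard.jpg"))  # real usage
# Illustrative sample of the TSV text that image_to_data returns:
data = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t600\t400\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t25\t30\t120\t22\t96\tJames\n"
    "5\t1\t1\t1\t1\t2\t150\t30\t90\t22\t95\tBond\n"
)

# Split into lines, then split each line on the tab character "\t"
rows = [line.split("\t") for line in data.split("\n") if line]

# First row is the header, the rest are the data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df[["level", "conf", "text"]])
```

The resulting DataFrame has one row per detected element, with the level column telling you whether the row describes a page, block, paragraph, line, or word.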
To show this, I will draw boxes at those positions, keeping in mind what each level means.
- Level 2: Block
- Level 3: Paragraph
- Level 4: Line number
- Level 5: Word (the text itself)
Before drawing boxes, I should bring the missing values and column types into proper form:
1- Drop missing values 2- Convert the columns to integers
- l: level
- x: left
- y: top
- w: width
- h: height
- c: confidence score
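A minimal sketch of this cleanup and box-drawing step follows. The small DataFrame is hypothetical (in the project it comes from image_to_data); the OpenCV call is shown as a comment so the sketch stays self-contained.

```python
import pandas as pd
import numpy as np

# Hypothetical slice of the image_to_data DataFrame (text is NaN for non-word rows)
df = pd.DataFrame({
    "level": ["1", "5", "5"],
    "left": ["0", "25", "150"],
    "top": ["0", "30", "30"],
    "width": ["600", "120", "90"],
    "height": ["400", "22", "22"],
    "conf": ["-1", "96", "95"],
    "text": [np.nan, "James", "Bond"],
})

# 1- Drop rows without text  2- Convert the numeric columns to integers
df = df.dropna(subset=["text"]).copy()
cols = ["level", "left", "top", "width", "height", "conf"]
df[cols] = df[cols].astype(int)

# One box per word: (x, y) is the top-left corner, (x + w, y + h) the bottom-right
for _, r in df.iterrows():
    x, y, w, h = r["left"], r["top"], r["width"], r["height"]
    # cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)  # with OpenCV
    print(r["text"], (x, y), (x + w, y + h))
```

The l/x/y/w/h/c columns above map directly onto the rectangle call: left/top give the corner, width/height the extent.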
This is what PyTesseract does :)
Now I will apply all these steps to the whole dataset.
For this, I will open a new notebook and import the libraries I need. Using glob I get the paths of the images, and using os I separate the filenames.
As in the first notebook, I will do the same steps and get a DataFrame.
But now I will keep only the rows whose confidence is greater than 30, and create a new DataFrame called businessCard.
Here is the result.
I did these steps to see what would happen. Looks super! Now I will apply all of them to all the data.
After building a new DataFrame called allBusinessCard, I will save it to a csv file.
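A sketch of this aggregation step, assuming two tiny hypothetical per-card OCR frames (in the project each frame comes from image_to_data, iterated over glob paths); writing to a StringIO stands in for the real to_csv("allBusinessCard.csv"):

```python
import io
import pandas as pd

# Hypothetical OCR output for two cards (in the project this comes from image_to_data)
per_card = {
    "card1.jpg": pd.DataFrame({"conf": [-1, 96, 12], "text": ["", "James", "~"]}),
    "card2.jpg": pd.DataFrame({"conf": [91, 20], "text": ["Acme", "."]}),
}

frames = []
for name, df in per_card.items():
    card = df[df["conf"] > 30].copy()   # keep only words with confidence > 30
    card["id"] = name                    # remember which image each word came from
    frames.append(card)

allBusinessCard = pd.concat(frames, ignore_index=True)
buf = io.StringIO()
allBusinessCard.to_csv(buf, index=False)  # in the project: allBusinessCard.to_csv("allBusinessCard.csv")
print(allBusinessCard)
```

Keeping an id column per word makes it possible to label and group the words card by card later.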
The next step is labelling this data, for example: name, organization, phone number, etc.
Now I will tag every word in the csv file.
SOURCE: https://medium.com/analytics-vidhya/bio-tagged-text-to-original-text-99b05da6664 The BIO / IOB format (short for inside, outside, beginning) is a common tagging format for tagging tokens in a chunking task in computational linguistics (ex. named-entity recognition). The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix before a tag indicates that the tag is inside a chunk. The B- tag is used only when a tag is followed by a tag of the same type without O tokens between them. An O tag indicates that a token belongs to no entity / chunk.
The following figure shows what a BIO-tagged sentence looks like:
Description | Tag |
---|---|
Name | NAME |
Designation | DES |
Organization | ORG |
Phone Number | PHONE |
Email Address | EMAIL |
Website | WEB |
Unfortunately, there are no shortcuts for tagging; I have to do it manually inside the csv file.
After this long and boring process, I will prepare the data for the training.
In this example from the spaCy documentation, the annotation is [(0, 11, “BUILDING”)]: the building “Tokyo Tower” spans 11 characters, starting at index 0 and ending at index 11. That is what I need to do to prepare the data; I will specify the entities like this.
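This is the same example in code, showing the (text, annotations) shape that spaCy's training data uses and checking that the character offsets really point at the intended span:

```python
# spaCy training-example format: (text, {"entities": [(start, end, LABEL), ...]})
text = "Tokyo Tower is 333m tall."
annotations = {"entities": [(0, 11, "BUILDING")]}

# Slicing the text with the offsets must give back exactly the entity span
start, end, label = annotations["entities"][0]
span = text[start:end]
print(span, label)  # -> Tokyo Tower BUILDING
```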
Before starting, convert the csv file to tsv (tab-separated values). To convert it, just click “Save As” and choose Tab Delimited txt.
And now, time to open the file.
When we look at the data, it looks like this:
I will apply the same methods as in the 2nd notebook.
This is the data I have right now. I will also turn it into a pandas DataFrame.
This section is the cleaning process. I will remove white space and unwanted special characters because I do not need them.
First, I will define the white space; there are different ways to do this, but a useful one is the “string” library.
The next thing is defining the special characters. However, I will not remove all of them; “@”, for instance, is important for email addresses.
In the image above, I also defined a function which removes white spaces and special characters.
I will apply this function to the DataFrame
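A minimal sketch of such a cleaning function, built on the string library. The exact set of characters kept here ("@" and ".") is an assumption; the real function in the project may keep a slightly different set (and in the first version it also lowercased the text).

```python
import string

# Characters to delete: all whitespace, and all punctuation except the ones
# that matter for entities such as email addresses ("@" and ".")
whitespace = string.whitespace
punctuation = string.punctuation.replace("@", "").replace(".", "")

table_whitespace = str.maketrans("", "", whitespace)
table_punctuation = str.maketrans("", "", punctuation)

def clean_text(text: str) -> str:
    """Remove whitespace and unwanted special characters from one token."""
    text = str(text)
    text = text.translate(table_whitespace)
    text = text.translate(table_punctuation)
    return text

print(clean_text(" james.bond@mi6.gov! "))  # -> james.bond@mi6.gov
```

Applying it to the DataFrame is then a one-liner, e.g. df["text"].apply(clean_text).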
Next thing what I will do is, convert the data into SpaCy format.
Basically, what I am creating is: content holds all the information in the text, and annotations holds the labels with their start and end positions.
I am not interested in “O”, because it means outside; I only care about “B” and “I”.
Let's check whether the annotation is correct.
As can be seen, the start and end positions of the phone number are correct.
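The conversion from BIO-tagged tokens to (content, annotations) can be sketched as below. The tokens and tags are illustrative; in the project they come from the labelled csv. "I-" tags extend the previous entity rather than opening a new one, and "O" tokens are skipped.

```python
# Illustrative BIO-tagged tokens (in the project: one business card from the csv)
tokens = ["james", "bond", "007-007", "mi6"]
tags = ["B-NAME", "I-NAME", "B-PHONE", "B-ORG"]

content = ""
annotations = {"entities": []}
for token, tag in zip(tokens, tags):
    if content:
        content += " "               # words are joined by single spaces
    start = len(content)
    content += token
    end = len(content)
    if tag != "O":                   # "O" means outside: not part of any entity
        label = tag[2:]              # strip the "B-" / "I-" prefix
        if tag.startswith("I-") and annotations["entities"]:
            # "I-" continues the previous entity: extend its end position
            prev_start, _, prev_label = annotations["entities"][-1]
            annotations["entities"][-1] = (prev_start, end, prev_label)
        else:
            annotations["entities"].append((start, end, label))

print(content)
print(annotations)
```

Slicing content with each (start, end) pair gives back exactly the entity text, which is how the annotation can be verified.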
After this step, I will apply these steps to the whole dataset.
First, I will shuffle the dataset,
and then split the data 90% - 10%.
The next thing is saving the data into the data folder using the “pickle” library.
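The shuffle, split, and pickle steps can be sketched as follows. The allCards list is a hypothetical stand-in for the real (content, annotations) examples, and the seed is only there to make this sketch reproducible; serializing to bytes stands in for writing data/train.pickle and data/test.pickle.

```python
import pickle
import random

# Hypothetical list of (content, annotations) training examples
allCards = [(f"card text {i}", {"entities": []}) for i in range(100)]

random.seed(42)                 # seed only for reproducibility of this sketch
random.shuffle(allCards)

# 90% / 10% split
split = int(len(allCards) * 0.9)
train, test = allCards[:split], allCards[split:]

# In the project these go to ./data/train.pickle and ./data/test.pickle
train_bytes = pickle.dumps(train)
test_bytes = pickle.dumps(test)
print(len(train), len(test))    # -> 90 10
```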
In the next step, I will train a Named Entity Recognition (NER) model.
spaCy is one of the most popular and useful frameworks for Natural Language Processing. It is easy to use and offers many predefined models.
What I will do is take a model, use the framework, and train. It is very simple.
To get the model, first visit this website and go to the Quickstart. Choose what you need, and spaCy will give you the predefined config.
Then click download. That’s all!
To fill in all the details of the configuration, I need to type a magic command in the terminal.
When I open the config file, there is a note in it:
python -m spacy init fill-config ./base_config.cfg ./config.cfg
I will paste it to terminal.
And it worked!
Now I will train the model by following commands.
As you can see, the expected format is .spacy, but earlier I saved the train and test data as .pickle, so now I will convert them to .spacy. The documentation has a section called Preparing Training Data; the conversion is also very easy, so I will copy the code and just make some changes. That's all!
All I need to do is run the preprocess.py file, which converts from .pickle to .spacy format.
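The core of that conversion, following spaCy's Preparing Training Data pattern, can be sketched like this. The inline training_data is illustrative (in the project it is loaded from data/train.pickle), and serializing to bytes stands in for db.to_disk("data/train.spacy").

```python
import spacy
from spacy.tokens import DocBin

# Illustrative (content, annotations) examples; in the project: pickle.load(...)
training_data = [("james bond mi6", {"entities": [(0, 10, "NAME"), (11, 14, "ORG")]})]

nlp = spacy.blank("en")          # blank English pipeline, only used for tokenisation
db = DocBin()
for text, annot in training_data:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        # char_span returns None when the offsets don't align with token boundaries
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)

doc_bytes = db.to_bytes()        # in the project: db.to_disk("data/train.spacy")
print(len(training_data), "doc(s) converted")
```

The alignment_mode="contract" choice quietly shrinks slightly misaligned spans instead of dropping the whole example, which is handy with noisy OCR offsets.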
And It’s ready, that’s all here!
And as a final step in the training process, I need to train the model using the config file. Before this, I will create a folder called output to save output files.
python -m spacy train config.cfg --output output --paths.train data/train.spacy --paths.dev data/test.spacy
When I run this command, the training starts.
After the training, two folders appear inside the output directory.
model-best contains the highest-scoring model, with a score of 0.64, and model-last contains the model from the last training step, with a score of 0.62.
I will use the best one for prediction.
In this section, I will test the NER model that I trained with spaCy. All the steps I will apply are the same as before.
From old notebook, I will copy and paste the function for cleaning text.
There are several ways to render the content.
As you can see here, the NER (named entity recognition) model classified the content.
Now I will tag every word again.
There are different ways to do BIO tagging, but I will use the same doc.
A token is each word; I will convert this information into a DataFrame.
I also turned the tokens into a DataFrame, and now I will combine it with doc_text.
And this is basically what lambda function does:
After this step, I will add one more column to the DataFrame which is entities.
Here you can see there are some “NaN” values; I will replace them with “O”.
As the next step, I will combine the “label” with the data_clean column.
This join will make drawing the bounding boxes much more convenient.
The reason for adding “+1” is that the words are separated by one space. Taking a cumulative sum gives the end position of every word, and subtracting one removes the space. Here are the correct end positions.
I will also create the start positions: the start position is the end position minus the length of the word.
Now I will combine them inside df_clean.
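The cumulative-sum trick can be sketched on a tiny hypothetical token column; slicing the joined content with the computed offsets confirms they are correct:

```python
import pandas as pd

# Tokens of one card after cleaning (illustrative)
df_clean = pd.DataFrame({"token": ["james", "bond", "mi6"]})

# Words are joined by single spaces, so cumulatively summing len(word) + 1
# overshoots each word by one space; subtracting 1 gives the true end position.
df_clean["end"] = df_clean["token"].str.len().add(1).cumsum() - 1
# start position = end position - length of the word
df_clean["start"] = df_clean["end"] - df_clean["token"].str.len()

content = " ".join(df_clean["token"])
print(df_clean)
print([content[s:e] for s, e in zip(df_clean["start"], df_clean["end"])])
```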
And now, I will merge all dataframes into a new dataframe
Text and token may look the same, but let's check again by looking at the last columns of the DataFrame.
As can be seen, token contains the clean text.
To draw the bounding boxes, I need all the information except the label O, so I will filter the main DataFrame.
As the next step, I will combine the BIO information.
For this, I will separate the labels by applying a lambda function which removes the B-/I- prefix of each label.
And now I will define a class which groups consecutive tokens that carry the same label.
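One possible sketch of such a grouping helper (the class name and method are assumptions, not the project's exact code): it hands out an increasing group id, starting a new id whenever the incoming label differs from the previous one, so consecutive tokens of the same entity share a group.

```python
class GroupGen:
    """Assign an increasing group id; a new id whenever the key changes."""

    def __init__(self):
        self.id = 0
        self.text = ""

    def getgroup(self, text):
        if self.text == text:
            return self.id       # same label as before: same group
        self.id += 1             # label changed: open a new group
        self.text = text
        return self.id

grp = GroupGen()
labels = ["NAME", "NAME", "ORG", "ORG", "PHONE", "NAME"]
groups = [grp.getgroup(label) for label in labels]
print(groups)  # -> [1, 1, 2, 2, 3, 4]
```

With a group column like this, a groupby("group") merges the bounding boxes and the text of each multi-word entity.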
According to this information, I will draw the boxes again. But before doing that, I will create two more columns for the right and bottom positions.
For tagging, I will also group the DataFrame by the group column.
NOTE: I changed the image to get clearer values, so the next data will be different from the previous ones.
And entities are drawn.
Now I will combine the text where the B and I tags are. At the same time, I will also do parsing: for the phone number I will keep only digits, for the email address only the allowed characters, and so on.
It works well! It cleans the special characters, and this is how the parser will work. Now, using the entities, I will save them into a dictionary with a simple loop.
The basic idea is that, for instance in the image above, B-NAME: james and I-NAME: bond are combined into one name.
Except for the phone number, everything looks great.
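A sketch of such a per-entity parser follows. The exact character sets kept for each label are assumptions for illustration; the project's real parser may differ in detail.

```python
import re
import string

def parser(text, label):
    """Hypothetical per-entity parser: keep only the characters each entity needs."""
    if label == "PHONE":
        text = re.sub(r"\D", "", text)                      # digits only
    elif label == "EMAIL":
        allowed = string.ascii_letters + string.digits + "@._-"
        text = "".join(ch for ch in text.lower() if ch in allowed)
    elif label == "WEB":
        allowed = string.ascii_letters + string.digits + ":/.%#-"
        text = "".join(ch for ch in text.lower() if ch in allowed)
    elif label in ("NAME", "DES"):
        text = re.sub(r"[^a-z ]", "", text.lower()).title() # letters only, Title Case
    return text

print(parser("Tel: +1 (555) 010-99", "PHONE"))   # -> 155501099
print(parser("James.Bond@MI6.gov!", "EMAIL"))    # -> james.bond@mi6.gov
```

Looping over the predicted entities and feeding each (text, label) pair through this function fills the final dictionary of parsed fields.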
Now, to tidy up the code, I will define a pipeline that contains all these steps plus a prediction function.
From the notebook 04_Predictions.ipynb I copied all the steps and turned them into a function. I only deleted some useless lines of code; nothing actually changed.
You can find the code in prediction.py
The notebook 05_Final_Predictions.ipynb is my test notebook, where I test the prediction function from prediction.py.
And here are the results:
As you can see, the model got confused detecting the phone number again :)
I have created a folder called VERSION_2, in which I only changed the clean_text function. In the first version the text was lowercased; in Version 2 I removed that, because some organization names etc. contain uppercase words. It worked slightly better; it is not me saying this, the accuracy is. Here are the accuracy reports:
In the first version the best accuracy was 0.64; here it is 0.72. Much better!
Here are some examples of predictions.
In this notebook, I will work on fixing images that are rotated etc. This is necessary for PyTesseract to work properly, since it does not handle rotated images well.
1- Resize the image and set the aspect ratio 2- Image processing
- Enhance
- Gray Scale
- Blur
- Edge Detection
- Morphological Transform
- Contours
- Find Four Points
As you can see, there is some noise around the image; I will apply morphological functions to clean it.
After dilation, here is the result: as can be seen, the thickness increased. As my second step, I will apply closing.
What I will do is multiply these four_points by a multiplier: the width of the original image divided by the width of the resized image.
With these four points, I will warp the original image using the imutils library.
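The scaling step can be sketched as below; the corner coordinates and image widths are made-up values, and the imutils warp is shown as a comment since it needs the actual image.

```python
import numpy as np
# from imutils.perspective import four_point_transform  # used on the real image

# Corner points found on the resized image (hypothetical values)
four_points = np.array([[10, 12], [190, 15], [195, 140], [8, 138]])

# Scale the points back to the coordinate system of the original image
original_width, resized_width = 1200, 200
multiplier = original_width / resized_width
four_points_orig = (four_points * multiplier).astype(int)

print(four_points_orig)
# warped = four_point_transform(original_image, four_points_orig)
```

Detecting contours on a small resized copy is much faster, and this multiplication is what lets the warp still happen at the original resolution.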
To analyse the images, I also return the resized image (with its contours drawn) and the closing image.
Next, I will also define a function that finds good brightness and contrast.
As can be seen, the colour balance of the magic image is much clearer; when I apply the NER algorithm, it will be easier to read and detect.
In summary, the Magic Image function increases the contrast and brightness of the image.
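The core idea of such a brightness/contrast boost can be sketched with a simple linear transform. The alpha/beta values are assumptions for illustration; with OpenCV the same operation is cv2.convertScaleAbs(img, alpha=alpha, beta=beta).

```python
import numpy as np

def magic_image(img, alpha=1.5, beta=30):
    """Linear brightness/contrast boost: out = clip(alpha * img + beta, 0, 255).

    alpha > 1 stretches the contrast, beta > 0 lifts the brightness.
    """
    out = alpha * img.astype(np.float32) + beta
    return np.clip(out, 0, 255).astype(np.uint8)

gray = np.array([[0, 100, 200]], dtype=np.uint8)
print(magic_image(gray))  # -> [[ 30 180 255]]
```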
## Integration of NER Prediction
The first thing I do is import my best model, which is in Version 2, and then read one of the images.
Unfortunately, the model cannot predict well because I did not feed it enough data. If I gave it more data, I would definitely get much better results. But that is not my priority; I am doing this exercise to learn and practise with PyTesseract.