I submitted my results to the competition with the last model checkpoint. Unfortunately, my results are not reproducible across different machines, although they were identical across multiple runs on my own machine. Indeed, even after seeding the training, the results obtained on the borrowed NVIDIA GTX 1080 were slightly different. This is not surprising, as it is mentioned in the PyTorch documentation on reproducibility. That is why I gave up on seeding the training. One last important thing to note is that this repository aims to be the base code for scientific research on OCR.
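For context, a typical seeding setup along the lines of the PyTorch reproducibility notes might look like the sketch below. This is not the repo's code, and even with all of this, results can still differ across GPU models, as observed above:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Seed the common sources of randomness in a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN kernels; results may still
    # differ across GPUs and driver versions, as observed on the GTX 1080.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```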
- PyTorch 1.7.0: Currently, PyTorch 1.7.0 or higher is supported.
- Metrics visualization: Supports Visdom for real-time loss visualization during training.
- Automatic mixed precision training: Train faster with AMP and less GPU memory on NVIDIA tensor cores.
- Pretrained models provided: Load the pretrained weights.
- GPU/CPU support for inference: Runs on GPU or CPU at inference time.
- Intelligent training procedure: The state of the model, optimizer, learning rate scheduler, and so on is saved at each checkpoint, so you can stop your training and resume it exactly from that checkpoint (a sketch is shown after this list).
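As a rough illustration of what such save/resume logic can look like, here is a minimal sketch; the checkpoint keys and file layout are illustrative, not the repo's actual format:

```python
import torch

def save_checkpoint(path, epoch, model, optimizer, scheduler):
    """Bundle everything needed to resume training exactly where it stopped."""
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    """Restore the saved state in place and return the epoch to resume from."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"] + 1
```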
Scanned receipt OCR is the process of recognizing text from scanned structured and semi-structured receipts, and invoices in general. Indeed,
Scanned receipts OCR and information extraction (SROIE) play critical roles in streamlining document-intensive processes and office automation in many financial, accounting and taxation areas.
For further info, check the ICDAR19-RRC-SROIE competition.
The dataset has 1000 whole scanned receipt images. Each receipt image contains around four key text fields, such as goods name, unit price and total cost. The text annotated in the dataset mainly consists of digits and English characters. An example scanned receipt is shown below:
The dataset is split into a training/validation set (“trainval”), and a test set (“test”). The “trainval” set consists of 600 receipt images along with their annotations. The “test” set consists of 400 images.
For the receipt OCR task, each image in the dataset is annotated with text bounding boxes (bbox) and the transcript of each text bbox. Locations are annotated as rectangles with four vertices, in clockwise order starting from the top. Annotations for an image are stored in a text file with the same file name. The annotation format is similar to that of the ICDAR 2015 dataset, as shown below:
x1_1,y1_1,x2_1,y2_1,x3_1,y3_1,x4_1,y4_1,transcript_1
x1_2,y1_2,x2_2,y2_2,x3_2,y3_2,x4_2,y4_2,transcript_2
x1_3,y1_3,x2_3,y2_3,x3_3,y3_3,x4_3,y4_3,transcript_3
…
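Since the transcript itself may contain commas, a parser for this format should only treat the first eight fields as coordinates. A minimal sketch (the function name and sample line are illustrative):

```python
def parse_annotation_line(line):
    """Split one annotation line into four (x, y) vertices and a transcript."""
    parts = line.strip().split(",", 8)  # maxsplit=8 keeps commas in the transcript
    coords = list(map(int, parts[:8]))
    vertices = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    return vertices, parts[8]

vertices, text = parse_annotation_line("72,25,326,25,326,64,72,64,TAN WOON YANN")
# vertices == [(72, 25), (326, 25), (326, 64), (72, 64)], text == "TAN WOON YANN"
```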
For the information extraction task, each image in the dataset is annotated with a text file with format shown below:
{
"company": "STARBUCKS STORE #10208",
"address": "11302 EUCLID AVENUE, CLEVELAND, OH (216) 229-0749",
"date": "14/03/2015",
"total": "4.95"
}
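Since these annotations are plain JSON, reading them back is a one-liner; a minimal sketch (the file name is illustrative):

```python
import json

with open("X00016469612.txt", encoding="utf-8") as f:
    fields = json.load(f)

print(fields["company"], fields["date"], fields["total"])
```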
The original dataset provided for the SROIE 2019 competition contains several significant mistakes. One of them is a missing file in task1_2_test(361p). Indeed, the numbers of files in task1_2_test(361p) and text.task1_2-test(361p) are not the same (360 and 361, respectively). The reason is that the file X51006619570.jpg is missing, and it turns out it was placed in the folder task3-test 347p) - instead.
Other mistakes lie within the folders 0325updated.task2train(626p) and 0325updated.task1train(626p), which are used for Task 3: Keyword Information Extraction. Indeed, there are files for which the date format, address or company name is wrong, and here are the corrections that were made:
Directory | Filename | Correction |
---|---|---|
0325updated.task2train(626p) | X51005447850 | Change 20180304 into 04/03/2018 |
0325updated.task1train(626p) & 0325updated.task2train(626p) | X51005715010 | Change 25032018 into 25.03.2018 |
0325updated.task1train(626p) & 0325updated.task2train(626p) | X51006466055 | Change 20180428 into 2018-04-28 |
0325updated.task2train(626p) | X51008114284 | Remove one occurrence of 'KAWASAN PERINDUSTRIAN BALAKONG,' from the address |
0325updated.task2train(626p) | X00016469620 | Remove " (MR DIY TESCO TERBAU)" from the address |
0325updated.task2train(626p) | X00016469623 | Remove " (TESCO PUTRA NILAI)" from the address |
0325updated.task2train(626p) | X51006502531 | Change "FAMILYMART" into "MAXINCOME RESOURCES SDN BHD (383322-D)" in the company field |
The original dataset can be found on Google Drive or Baidu NetDisk.
Taking into account the dataset mistakes mentioned above, you may of course want to make the changes on your own, but I have already made the corrections, and the corrected dataset can be downloaded via the bash script sroie2019.sh. Here is how to run it:
- Without specifying a directory, the dataset will be downloaded into the default directory ~/SROIE2019:
  sh scripts/datasets/sroie2019.sh
- Specifying a new directory, let's say ~/dataset/ICDAR2019:
  sh scripts/datasets/sroie2019.sh ~/dataset/ICDAR2019
Do not forget to specify the new directory inside this file.
For Windows users who do not have bash on their system, you may want to install Git Bash. Once it is installed, you can add the entire git bin folder to the environment variables.
Here are the methods used for the competition. Inside each folder representing the task name, there is documentation of the proposed method as well as the training, demo and evaluation procedures.
- Task 1 - Text Localization: Connectionist Text Proposal Network (CTPN).
- Task 3 - Keyword Information Extraction: Character-Aware CNN + Highway + BiLSTM (CharLM).
The results are listed as follows (note that for task 3, I manually fixed each and every OCR mismatch for a fair comparison):
Task | Recall | Precision | Hmean | Evaluation Method | Model | Parameters | Model Size | Weights |
---|---|---|---|---|---|---|---|---|
Task 1 | 97.52% | 97.40% | 97.46% | Deteval | CTPN | 18,450,332 | 280 MB | Last checkpoint |
Task 3 | 98.20% | 98.48% | 98.34% | / | CharLM | 4,740,590 | 72.4 MB | Last checkpoint |
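For reference, Hmean is the harmonic mean (i.e., the F1 score) of precision and recall, so the table values can be double-checked in a couple of lines:

```python
def hmean(precision, recall):
    """Harmonic mean (F1 score) of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f"{hmean(0.9740, 0.9752):.2%}")  # Task 1 -> 97.46%
print(f"{hmean(0.9848, 0.9820):.2%}")  # Task 3 -> 98.34%
```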
An NVIDIA GPU (with tensor cores, in order to use automatic mixed precision) plus cuDNN if possible is strongly recommended. It is also possible to run the program on CPU only, but it will be extremely slow. A sketch of an AMP training step is shown below.
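As an illustration of what AMP training looks like with torch.cuda.amp in PyTorch 1.7, here is a minimal sketch, assuming the model returns its loss; it is not the repo's actual trainer:

```python
import torch

def train_one_epoch(model, optimizer, data_loader, scaler):
    """One epoch of automatic mixed precision training (PyTorch >= 1.6)."""
    for images, targets in data_loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():           # forward pass in mixed precision
            loss = model(images.cuda(), targets)  # assumes the model returns its loss
        scaler.scale(loss).backward()             # scale loss to avoid fp16 underflow
        scaler.step(optimizer)                    # unscales gradients, then steps
        scaler.update()                           # adjust the scale for the next step

# Create the scaler once and reuse it across epochs (and checkpoint it too):
scaler = torch.cuda.amp.GradScaler()
```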
Besides, all the experiments and results were performed on my personal gaming computer:
- Alienware Area-51m R2
- 10th Gen Intel(R) Core(TM) i9 10900K (10-Core, 20 MB Cache, 3.7GHz to 5.3GHz w/Thermal Velocity Boost)
- NVIDIA® GeForce® RTX 2070 SUPER™, 8 GB GDDR6
- OS: Dual boot Windows/Ubuntu 20.04
and DIVA GPU cluster:
- 9th Gen Intel(R) Core(TM) i9 9900K (8-core, 16 MB cache)
- NVIDIA® GeForce® GTX 1080
- OS: Ubuntu 18.04
For Mac, Windows and Linux users, if conda is not installed, then you need to follow this documentation.
- Updating conda: Before installation, we need to make sure conda is updated:
  conda update conda
- Creating an environment from a file:
  conda env create -f env/environment.yml
  This will create a new conda environment named SROIE2019 on your system, which will give you all the packages needed for this repo. If you do not own any NVIDIA GPUs (with a CUDA-capable system), then you must remove the cudatoolkit and cudnn lines in the environment.yml file. Otherwise, make sure your graphics card supports the installed version of CUDA.
- Activating the new environment:
  conda activate SROIE2019
  or
  source activate SROIE2019
  If you want to deactivate the environment, you can simply do:
  conda deactivate
- Verifying that the new environment was installed correctly:
  conda env list
  or
  conda info --envs
  For further info, you can check conda's documentation on managing environments.
To use Visdom, you must make sure the server is running before you start the training:
python3 -m visdom.server
or simply
visdom
In your browser, you can go to http://localhost:8097 (Visdom's default port). You will see the Visdom interface:
One important thing to remember: you can launch the server with a specific port, say visdom --port 8198, but then you need to make sure the Visualizer runs with the port 8198.
For further info on Visdom, you can check this: https://github.com/fossasia/visdom
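On the client side, the Visualizer just needs to be pointed at the same port; a minimal sketch (the window name and dummy loss values are illustrative):

```python
import visdom

# The port must match the one the server was launched with (visdom --port 8198).
viz = visdom.Visdom(port=8198)

# Append one point per iteration to a live training-loss curve.
for iteration, loss_value in enumerate([1.2, 0.9, 0.7]):  # dummy loss values
    viz.line(X=[iteration], Y=[loss_value], win="train_loss",
             update="append", opts={"title": "Training loss"})
```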
If you have issues running or compiling this code, there is a list of common issues in TROUBLESHOOTING.md. If your issue is not present there, then feel free to open a new issue.
This repository is influenced by great works such as:
- Luffic-SSD for the data augmentation and anchor matching parts.
- eragonruan, courao and tranbahien for the implementation of the CTPN used to tackle the text localization.
- eadst for the implementation of removing the extra white space in the scanned receipts for the text localization.
- HephaestusProject for the implementation of Character-Aware Neural Language Models used to tackle the keyword information extraction.
If you use this project in your research, please cite it as follows:
@misc{blackstar1313_sroie_2019,
author = {Njoyim Tchoubith Peguy Calusha},
title = {ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction},
year = {2021},
url = {https://github.com/BlackStar1313/ICDAR-2019-RRC-SROIE}
}
Here is a to-do list which should be completed subsequently.
- Support for the Docker images.
- Support for the task 2: Scanned Receipts OCR.