ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction

I submitted my results to the competition using the last model checkpoint. Unfortunately, the results are not reproducible across different machines, although they were identical across multiple runs on my own machine. Indeed, even after seeding the training, the results obtained on the borrowed NVIDIA GTX 1080 were slightly different. This is not surprising, as it is mentioned in the PyTorch documentation on reproducibility. That is why I gave up on seeding the training. One last important thing to note: this repository aims to serve as base code for scientific research on OCR.

Table of contents

  1. Highlights
  2. Challenge
    1. Overview
    2. Dataset and Annotations
      1. Description
      2. Mistakes
      3. Downloads
  3. Methods
  4. Results
  5. User guide
    1. Software & hardware
    2. Conda environment setup
    3. Visualizer
  6. Troubleshooting
  7. References
  8. Citations
  9. TODO

Highlights

  • PyTorch 1.7.0: Currently PyTorch 1.7.0 or higher is supported.
  • Metrics visualization: Supports Visdom for real-time loss visualization during training.
  • Automatic mixed precision training: Train faster and use less GPU memory with AMP on NVIDIA tensor cores.
  • Pretrained models provided: Load the pretrained weights.
  • GPU/CPU support for inference: Runs on either GPU or CPU at inference time.
  • Intelligent training procedure: The state of the model, optimizer, learning rate scheduler, and so on is saved at each checkpoint, so you can stop training and resume exactly from where you left off (see the sketch below).
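To make the last two highlights concrete, here is a minimal sketch of an AMP training step plus checkpoint saving/resuming, assuming a placeholder model, optimizer, and scheduler; it illustrates the technique and is not the repository's actual training loop:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(128, 2).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
scaler = torch.cuda.amp.GradScaler(enabled=(device.type == "cuda"))

def save_checkpoint(path, epoch):
    # Persist everything needed to resume training exactly where it stopped.
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "scaler": scaler.state_dict(),
    }, path)

def resume_checkpoint(path):
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    scaler.load_state_dict(ckpt["scaler"])
    return ckpt["epoch"] + 1  # epoch to resume from

def train_step(inputs, targets):
    optimizer.zero_grad()
    # Forward pass in mixed precision (uses tensor cores when available).
    with torch.cuda.amp.autocast(enabled=(device.type == "cuda")):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```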

Challenge

Overview

Scanned receipt OCR is the process of recognizing text from scanned structured and semi-structured receipts, and from invoices in general. As the competition overview puts it:

Scanned receipts OCR and information extraction (SROIE) play critical roles in streamlining document-intensive processes and office automation in many financial, accounting and taxation areas.

For further info, check the ICDAR19-RRC-SROIE competition.

Dataset and Annotations

Description

The dataset has 1000 whole scanned receipt images. Each receipt image contains about four key text fields, such as goods name, unit price, and total cost. The text annotated in the dataset mainly consists of digits and English characters.

The dataset is split into a training/validation set (“trainval”) and a test set (“test”). The “trainval” set consists of 600 receipt images along with their annotations, while the “test” set consists of 400 images.

For the receipt OCR task, each image in the dataset is annotated with text bounding boxes (bboxes) and the transcript of each text bbox. Locations are annotated as rectangles with four vertices, which are in clockwise order starting from the top. Annotations for an image are stored in a text file with the same file name. The annotation format is similar to that of the ICDAR 2015 dataset and is shown below (a minimal parsing sketch follows the example):

x1_1,y1_1,x2_1,y2_1,x3_1,y3_1,x4_1,y4_1,transcript_1

x1_2,y1_2,x2_2,y2_2,x3_2,y3_2,x4_2,y4_2,transcript_2

x1_3,y1_3,x2_3,y2_3,x3_3,y3_3,x4_3,y4_3,transcript_3

…
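For illustration, here is a minimal sketch of how such a file can be parsed into (vertices, transcript) pairs; the parse_ocr_annotation helper is hypothetical and not part of the repository:

```python
from typing import List, Tuple

def parse_ocr_annotation(path: str) -> List[Tuple[List[int], str]]:
    """Parse one OCR annotation file into (vertices, transcript) pairs."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # The first 8 comma-separated values are the 4 (x, y) vertices;
            # the transcript itself may contain commas, so split at most 8 times.
            parts = line.split(",", 8)
            vertices = [int(v) for v in parts[:8]]
            samples.append((vertices, parts[8]))
    return samples
```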

For the information extraction task, each image in the dataset is annotated with a text file in the format shown below:

{
  "company": "STARBUCKS STORE #10208",
  "address": "11302 EUCLID AVENUE, CLEVELAND, OH (216) 229-0749",
  "date": "14/03/2015",
  "total": "4.95"
}

Mistakes

The original dataset provided by the SROIE 2019 competition contains several significant mistakes. One of them is a missing file in task1_2_test(361p): the numbers of files in task1_2_test(361p) and text.task1_2-test(361p) do not match (360 and 361, respectively). The file X51006619570.jpg is missing because it ended up in task3-test(347p) instead.

Other mistakes lie within the folders 0325updated.task1train(626p) and 0325updated.task2train(626p), which are used for Task 3: Keyword Information Extraction. Indeed, there are files for which the date format, address, or company is wrong; here are the corrections that were made (a sketch of how such a fix can be applied follows the table):

| Directory | Filename | Correction |
| --- | --- | --- |
| 0325updated.task2train(626p) | X51005447850 | Change 20180304 into 04/03/2018 |
| 0325updated.task1train(626p) & 0325updated.task2train(626p) | X51005715010 | Change 25032018 into 25.03.2018 |
| 0325updated.task1train(626p) & 0325updated.task2train(626p) | X51006466055 | Change 20180428 into 2018-04-28 |
| 0325updated.task2train(626p) | X51008114284 | Remove one occurrence of 'KAWASAN PERINDUSTRIAN BALAKONG,' from the address |
| 0325updated.task2train(626p) | X00016469620 | Remove " (MR DIY TESCO TERBAU)" from the address |
| 0325updated.task2train(626p) | X00016469623 | Remove " (TESCO PUTRA NILAI)" from the address |
| 0325updated.task2train(626p) | X51006502531 | Change "FAMILYMART" into "MAXINCOME RESOURCES SDN BHD (383322-D)" in the company field |
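For illustration, the first correction in the table could be applied with a tiny script like the following sketch; the task 2 annotations are the JSON-formatted text files shown earlier, and the path here is only an example:

```python
import json

# Hypothetical example: fix the date field of one task 2 annotation file.
path = "0325updated.task2train(626p)/X51005447850.txt"

with open(path, encoding="utf-8") as f:
    annotation = json.load(f)

annotation["date"] = "04/03/2018"  # was "20180304"

with open(path, "w", encoding="utf-8") as f:
    json.dump(annotation, f, ensure_ascii=False, indent=2)
```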

Downloads

The original dataset can be found on Google Drive or Baidu NetDisk.

Taking into account the dataset mistakes mentioned above, you may want to make the changes on your own, but I have already made the corrections; the corrected dataset can be downloaded via the bash script sroie2019.sh. Here is how to run it:

  • Without specifying a directory, it will download into the default directory ~/SROIE2019:

    sh scripts/datasets/sroie2019.sh
    
  • Specifying a new directory, let's say ~/dataset/ICDAR2019:

    sh scripts/datasets/sroie2019.sh ~/dataset/ICDAR2019
    

    Do not forget to specify the new directory inside this file.

For Windows users who do not have bash on their system, you may want to install Git Bash. Once it is installed, you can add the entire Git bin folder to your PATH environment variable.

Methods

Here are the methods used for the competition. Inside each folder named after a task, you will find documentation of the proposed method as well as the training, demo, and evaluation procedures.

  • Task 1 - Text Localization: Connectionist Text Proposal Network (CTPN).
  • Task 3 - Keyword Information Extraction: Character-Aware CNN + Highway + BiLSTM (CharLM); see the sketch after this list.
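To give a flavor of the task 3 model, here is a minimal sketch of the character-aware encoding idea behind CharLM (character embeddings, convolutions with max-over-time pooling, and a highway layer feeding a BiLSTM); all hyperparameters are placeholders, not the repository's actual configuration:

```python
import torch
import torch.nn as nn

class CharAwareEncoder(nn.Module):
    """Character embeddings -> char CNNs + max-over-time pooling -> highway."""

    def __init__(self, n_chars=100, char_dim=16, kernel_sizes=(2, 3, 4), n_filters=32):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, k) for k in kernel_sizes
        )
        out_dim = n_filters * len(kernel_sizes)
        # Highway layer: y = t * relu(W_h x) + (1 - t) * x
        self.transform = nn.Linear(out_dim, out_dim)
        self.gate = nn.Linear(out_dim, out_dim)

    def forward(self, char_ids):                  # (batch, word_length)
        x = self.embed(char_ids).transpose(1, 2)  # (batch, char_dim, word_length)
        # Max-over-time pooling over each convolution's output.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        h = torch.cat(pooled, dim=1)              # (batch, out_dim)
        t = torch.sigmoid(self.gate(h))
        return t * torch.relu(self.transform(h)) + (1 - t) * h

# The resulting word vectors are then fed to a BiLSTM tagger, e.g.:
# lstm = nn.LSTM(input_size=96, hidden_size=64, bidirectional=True, batch_first=True)
```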

Results

The results are listed below (note that for task 3, I manually fixed each and every OCR mismatch for a fair comparison):

| Task | Recall | Precision | Hmean | Evaluation Method | Model | Parameters | Model Size | Weights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Task 1 | 97.52% | 97.40% | 97.46% | Deteval | CTPN | 18,450,332 | 280 MB | Last checkpoint |
| Task 3 | 98.20% | 98.48% | 98.34% | / | CharLM | 4,740,590 | 72.4 MB | Last checkpoint |

User guide

Hardware

An NVIDIA GPU (with tensor cores, in order to use automatic mixed precision) plus cuDNN is strongly recommended, if possible. It is also possible to run the program on CPU only, but it will be extremely slow.

All the experiments and results were produced on my personal gaming computer:

  • Alienware Area-51m R2
  • 10th Gen Intel(R) Core(TM) i9 10900K (10-Core, 20 MB Cache, 3.7GHz to 5.3GHz w/Thermal Velocity Boost)
  • NVIDIA® GeForce® RTX 2070 SUPER™, 8 GB GDDR6
  • OS: Dual boot Windows/Ubuntu 20.04

and DIVA GPU cluster:

  • 9th Gen Intel(R) Core(TM) i9 9900K (8-core, 16 MB cache)
  • NVIDIA® GeForce® GTX 1080
  • OS: Ubuntu 18.04

Conda environment setup

For macOS, Windows, and Linux users: if conda is not installed, you need to follow this documentation.

  1. Updating conda

    Before installation, we need to make sure conda is updated.

    conda update conda
    
  2. Creating an environment from a file

    conda env create -f env/environment.yml
    

    This will create a new conda environment named SROIE2019 on your system, with all the packages needed for this repo. If you do not own an NVIDIA GPU (i.e., a CUDA-capable system), you must remove the cudatoolkit and cudnn lines in the environment.yml file. Otherwise, make sure your graphics card supports the installed version of CUDA.

  3. Activating the new environment

    conda activate SROIE2019
    

    or

    source activate SROIE2019
    

    If you want to deactivate the environment, you can simply do: conda deactivate

  4. Verify that the new environment was installed correctly

    conda env list
    

    or

    conda info --envs
    

For further info, you can check conda's environment management documentation.

Visualizer

To use Visdom, you must make sure the server is running before you start the training.

Start the server with:

python3 -m visdom.server

or simply

visdom

Visdom interface

In your browser, you can go to:

http://localhost:8097

You will see the Visdom interface.

One important thing to remember: you can launch the server on a specific port, say visdom --port 8198, but then you need to make sure the Visualizer connects to port 8198 as well, as in the sketch below.
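For example, here is a minimal client-side sketch matching a server started with visdom --port 8198 (the window and plot names are placeholders):

```python
import numpy as np
import visdom

# Connect to a Visdom server started with `visdom --port 8198`.
vis = visdom.Visdom(port=8198)

# Append one (iteration, loss) point to a real-time line plot.
vis.line(
    X=np.array([1]),
    Y=np.array([0.5]),
    win="train_loss",  # reuse the same window across calls
    update="append",
    opts=dict(title="Training loss", xlabel="iteration", ylabel="loss"),
)
```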

For further info on Visdom, you can check this: https://github.com/fossasia/visdom

Troubleshooting

If you have issues running or compiling this code, there is a list of common issues in TROUBLESHOOTING.md. If your issue is not listed there, feel free to open a new issue.

References

This repository is influenced by great works such as:

  • Luffic-SSD for the data augmentation and anchor matching parts.
  • eragonruan, courao, tranbahien for their implementations of the CTPN used to tackle the text localization.
  • eadst for the implementation that removes the extra white space in the scanned receipts for the text localization.
  • HephaestusProject for the implementation of Character-Aware Neural Language Models used to tackle the keyword-information extraction.

Citations

If you use this project in your research, please cite it as follows:

@misc{blackstar1313_sroie_2019,
  author = {Njoyim Tchoubith Peguy Calusha},
  title  = {ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction},
  year   = {2021},
  url    = {https://github.com/BlackStar1313/ICDAR-2019-RRC-SROIE}
}

TODO

Here is a to-do list to be completed subsequently.

  • Support for Docker images.
  • Support for Task 2: Scanned Receipts OCR.