Skip to content

HT0710/Receipt-Information-Extraction

Repository files navigation

Receipt Information Extraction (RIE)

Project completed on 21/03/2023

This project is inspired by MC-OCR, more information about this competition can access here: Link

The dataset is contain more than 1500 receipts from the competition and our collected, can be download here: Google Drive

ex_2


Table of Contents

Introduction

Receipt Information Extraction (RIE) is a task that involves extracting structured data from unstructured receipts. The goal is to identify and extract key information such as date, time, total amount, tax amount, items purchased, etc. from receipts in various formats and languages. This task can be useful for applications such as expense management, accounting, fraud detection and analytics.

pipeline

Install

Download the project

  1. Click here: RIE
  2. Unzip the main.zip file
  3. Navigate into the project folder:
cd path-to-the-project-folder

Clone with git: (NOT RECOMMENDED)

# very slow download speed
git clone https://github.com/HT0710/Receipt-Information-Extraction
cd Receipt-Information-Extraction

Using wget on Linux:

# faster download speed
wget https://github.com/HT0710/Receipt-Information-Extraction/archive/refs/heads/main.zip
unzip main.zip
cd Receipt-Information-Extraction-main

Version

  • Python 3.8

Using Conda:

# Installation: https://docs.conda.io/en/latest/miniconda.html
conda create -n rie python=3.8
conda activate rie

Requirements

  • rembg = 2.0.30
  • torch = 1.11
  • torchvision = 0.12
  • opencv-python = 4.7.0.72
  • scikit-learn = 1.2.1
  • scikit-image = 0.19.3
  • scipy = 1.9.3
  • imutils = 0.5.4
  • PyYAML = 6.0
  • einops = 0.6.0
  • gdown = 4.6.4
pip install -r requirements.txt

Note: CUDA is required if you want to use GPU. You can follow my instructions here

Usage

One-time run

Modify the configurations in config.yaml then run:

python run.py

With CLI:

python run.py -h
  • -i: Image path or Folder path
  • -o: Output folder path
  • -g: Which gpu to run | 0 for cpu | -1 for all (Default: -1)
  • -mp: Maximum of cpu can use | -1 for 80% of your cpu (Default: -1)

Caution: Using 100% of your cpu may crash your system!

Example:

python run.py -i data/test/test_1.jpg -o output -g 0 -mp 10

More configurations can access in config.yaml

Each step run

1. Remove background

  • Remove the image background
  • Input and output folder can be modify in background_remove.py
  • Note: only run with folder input

Execute:

python background_remove.py

2. Rotate

  • Rotate horizontal, invert and align straight
  • Input and output folder can be modify in rotate.py
  • Note: only run with folder input

Execute:

python rotate.py

3. Extract information

  • Extract the receipt information
  • Input and output can be modify in extract_info.py
  • Note: only run with single image input, if you want extract a folder please use run.py

Execute:

python extract_info.py

Results

ex_1

Citations

You can find the paper here: RIE

License

This project is licensed under the MIT License. See LICENSE for more details.

References

Contact

Open an issue: New issue

Mail: pthung7102002@gmail.com