While looking for a good Persian OCR tool, I found that no good open-source project supports the Persian language. So I started this project to build a simple Persian OCR and fill that gap.
What I have done:
- Optimized pytesseract for Persian by testing different configurations.
- Added image optimization for low-resolution images, which improves accuracy significantly.
- Added Persian spell-checking to improve accuracy.
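As a rough illustration of the spell-checking step, here is a minimal sketch that snaps each OCR'd word to its closest dictionary entry using Python's standard `difflib`. The word list, the function names, and the `0.7` cutoff are all assumptions made for this example; they are not taken from the project, which may use a different spell-checker entirely.

```python
import difflib

# Hypothetical mini-dictionary; a real setup would load a full Persian word list.
PERSIAN_WORDS = ["سلام", "کتاب", "مدرسه", "دانشگاه"]


def correct_word(word, vocabulary=PERSIAN_WORDS, cutoff=0.7):
    """Return the closest dictionary word, or the word unchanged if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocabulary=PERSIAN_WORDS):
    """Correct every whitespace-separated word (whitespace is normalized to single spaces)."""
    return " ".join(correct_word(w, vocabulary) for w in text.split())
```

A higher cutoff makes the correction more conservative, which is probably safer for names and numbers that are not in the dictionary.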
Of course, this project isn't perfect, and I'm still working on it to improve accuracy and speed. But I hope it gives other people like me a good base for Persian OCR.
I built this project with Python. The two most useful modules were pytesseract and OpenCV.
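To show how these two modules can fit together, here is a minimal sketch that upscales a low-resolution image with OpenCV, binarizes it, and passes it to pytesseract with the Persian language code `fas`. The helper names, the 1000-pixel target, the 4x cap, and the `--psm 6` / `--oem 3` settings are assumptions chosen for illustration, not the configuration this project settled on.

```python
def upscale_factor(width, height, min_side=1000, max_factor=4.0):
    """Scale factor that brings the shorter side up to min_side (never downscales)."""
    shortest = min(width, height)
    if shortest <= 0:
        raise ValueError("image dimensions must be positive")
    return max(1.0, min(max_factor, min_side / shortest))


def build_config(psm=6, oem=3):
    """Tesseract flags: OCR engine mode and page segmentation mode."""
    return f"--oem {oem} --psm {psm}"


def ocr_image(path, lang="fas"):
    """Read an image, clean it up, and return the recognized text."""
    # Imported lazily so the helpers above work without the OCR stack installed.
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    height, width = gray.shape[:2]
    factor = upscale_factor(width, height)
    if factor > 1.0:
        # Cubic interpolation keeps character edges smoother when enlarging.
        gray = cv2.resize(gray, None, fx=factor, fy=factor,
                          interpolation=cv2.INTER_CUBIC)
    # Otsu's method picks the binarization threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang=lang, config=build_config())
```

Keeping the scale and config logic in small pure functions makes it easy to try different settings without touching the image code.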
These are simple instructions to start using this project.
You need to install Tesseract on your device:
- Ubuntu
sudo apt-get install tesseract-ocr
You need to add the Persian language data to Tesseract:
- Ubuntu
sudo apt-get install tesseract-ocr-fas
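To confirm that the Persian data was picked up, one option is to parse the output of `tesseract --list-langs` and look for the `fas` code. The sketch below uses only the standard library; the function names are made up for this example.

```python
import shutil
import subprocess


def parse_languages(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # The first line is a header that starts with "List of available languages".
    return [line for line in lines
            if line and not line.lower().startswith("list of")]


def has_persian():
    """Return True if the tesseract binary is installed and lists 'fas'."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True)
    # Older Tesseract releases print the list to stderr, so check both streams.
    return "fas" in parse_languages(result.stdout + "\n" + result.stderr)
```

If `has_persian()` returns `False`, the language package is most likely missing or Tesseract is not on the `PATH`.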
Now that you've installed Tesseract, we can move on to Persian-OCR:
- Clone the repo
git clone https://github.com/sepehrraisi/Persian-OCR && cd Persian-OCR
- Create a virtual environment for Python and activate it:
python3 -m venv venv && source ./venv/bin/activate
- Install the Python modules listed in requirements.txt
pip install -r requirements.txt
After installing the requirements, you can use it by running the ocr.py file:
python ./ocr.py -i <inputfile> -o <outputfile>
The results are then written to the output file.
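For readers who want to build a similar command-line interface, here is a hedged sketch of how the `-i` and `-o` flags could be handled with `argparse`. This is not the actual content of ocr.py; the long option names `--input` and `--output` and both function names are assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the -i/-o flags; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Run Persian OCR on an image")
    parser.add_argument("-i", "--input", required=True,
                        help="path to the input image")
    parser.add_argument("-o", "--output", required=True,
                        help="path to the output text file")
    return parser.parse_args(argv)


def write_result(text, output_path):
    """Write recognized text as UTF-8 so Persian characters survive on any platform."""
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

Writing the file with an explicit UTF-8 encoding matters here, since the platform default encoding may not be able to represent Persian text.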
- Use pytesseract to extract text
- Improve accuracy with simple OpenCV features
- Improve accuracy by upscaling the images
- Add post-processing modules to improve accuracy
- Add modular capabilities to improve functionality
- Add Table recognition
- Multi-language Support
  - Persian
  - English
See the open issues for a full list of proposed features (and known issues).