Japanese OCR using Python to read and export text and data
- Operating environment
- OS : Windows 10
- Python : 3.10.0
- Tesseract : 5.3.3
- pyocr : 0.8.5
- Pillow (PIL) : 9.3.0
- Poppler : 23.11.0
You can use this project by cloning this repository and running it in your IDE of choice.
You will need to install the following components in order to run the code:
I recommend following this tutorial: ひつじ
Switch Tesseract's trained data from the Fast version to the Best version.
Note: download the Japanese Best models from the links below.
https://github.com/tesseract-ocr/tessdata_best/blob/main/jpn.traineddata
https://github.com/tesseract-ocr/tessdata_best/blob/main/jpn_vert.traineddata
Note: copy these files into Tesseract-OCR > tessdata, overwriting the existing files of the same name.
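Once the traineddata files are in place and pillow and pyocr are installed, a quick check like the following can confirm that Tesseract sees both Japanese models before running OCR. This is a minimal sketch, not the project's actual code: the function names `missing_languages` and `ocr_japanese` are hypothetical, and the layout value of 6 (a single uniform block of text) is an assumption that may need adjusting for your documents.

```python
def missing_languages(available, required=("jpn", "jpn_vert")):
    """Return the required Tesseract language codes that are not installed."""
    installed = set(available)
    return [lang for lang in required if lang not in installed]


def ocr_japanese(image_path, vertical=False):
    """Run OCR on one image and return the recognized text."""
    # Imported here so the helper above can be used without the OCR stack.
    import pyocr
    import pyocr.builders
    from PIL import Image

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract on PATH?")
    tool = tools[0]

    # Fail early if the jpn / jpn_vert traineddata files were not copied.
    missing = missing_languages(tool.get_available_languages())
    if missing:
        raise RuntimeError(f"Missing traineddata for: {', '.join(missing)}")

    lang = "jpn_vert" if vertical else "jpn"
    # Assumption: layout 6 treats the image as one uniform block of text.
    builder = pyocr.builders.TextBuilder(tesseract_layout=6)
    with Image.open(image_path) as image:
        return tool.image_to_string(image, lang=lang, builder=builder)
```

Calling `ocr_japanese("scan.png")` (a placeholder file name) returns the text as a string; pass `vertical=True` for vertically set text.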
pip install pillow
pip install pyocr
pip install
For PDF-to-image conversion, you will need the Poppler library.
Download the latest version of Poppler here
Instructions for PATH here
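To confirm that Poppler is reachable from PATH, the sketch below renders a PDF to PNG pages by calling Poppler's `pdftoppm` command directly. It is an illustration under stated assumptions, not necessarily how this project does the conversion: the function names, the `page` file prefix, and the 300 dpi default are all hypothetical choices.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Build the pdftoppm argument list for rendering a PDF to PNG pages."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(output_prefix)]


def pdf_to_png(pdf_path, output_dir, dpi=300):
    """Render every page of a PDF to PNG files and return their paths."""
    # shutil.which returns None when Poppler's bin folder is not on PATH.
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; add Poppler's bin folder to PATH")

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    prefix = output_dir / "page"

    # pdftoppm writes one file per page, named <prefix>-<page number>.png.
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    return sorted(output_dir.glob("page-*.png"))
```

Each returned path can then be opened with Pillow and passed to the OCR step.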
Important
This is a prototype at best; do not expect everything to work perfectly.