Kyosukez/Document-OCR

Document-OCR

Japanese OCR using Python to read and export text and data

  • Tested environment
    • OS : Windows 10
    • Python : 3.10.0
    • Tesseract : 5.3.3
    • pyocr : 0.8.5
    • Pillow (PIL) : 9.3.0
    • Poppler : 23.11.0

Usage

You can use this project by cloning this repository and running it with your IDE of choice.

You will need to install the following components in order to run the code:

Tesseract for Windows

I recommend following this tutorial:  ひつじ

Switch Tesseract's language data from the Fast version to the Best version.

Note: download the Japanese Best models from the links below.

https://github.com/tesseract-ocr/tessdata_best/blob/main/jpn.traineddata

https://github.com/tesseract-ocr/tessdata_best/blob/main/jpn_vert.traineddata

Note: overwrite the files in Tesseract-OCR > tessdata with these.
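
The overwrite step can also be scripted. The following is a minimal sketch, not part of this repository: the function name `install_best_models`, the `.fast.bak` backup suffix, and the example paths are all hypothetical, and the tessdata location depends on where Tesseract was installed on your machine.

```python
import shutil
from pathlib import Path

# The two Best model files linked above.
MODEL_FILES = ("jpn.traineddata", "jpn_vert.traineddata")


def install_best_models(download_dir, tessdata_dir, backup=True):
    """Copy the downloaded Best models over the ones in tessdata.

    Both directory arguments are assumptions about your machine.
    Returns the list of files written into tessdata_dir.
    """
    download_dir = Path(download_dir)
    tessdata_dir = Path(tessdata_dir)

    # Check every source file first so nothing is overwritten on a partial download.
    for name in MODEL_FILES:
        if not (download_dir / name).is_file():
            raise FileNotFoundError(f"{name} not found in {download_dir}")

    written = []
    for name in MODEL_FILES:
        dst = tessdata_dir / name
        if backup and dst.exists():
            # Keep the previous model next to the new one (hypothetical suffix).
            shutil.copy2(dst, dst.with_name(name + ".fast.bak"))
        shutil.copy2(download_dir / name, dst)
        written.append(dst)
    return written


# Example call (paths are placeholders, adjust to your installation):
# install_best_models(r"C:\Users\me\Downloads", r"C:\Program Files\Tesseract-OCR\tessdata")
```

Writing into the Tesseract installation folder may require running the script with administrator rights.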

pip install pillow
pip install pyocr
pip install 
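
Once Pillow and pyocr are installed, a basic OCR call might look like the sketch below. This is an illustration rather than the code in this repository: `choose_language`, `ocr_image`, and `export_text` are hypothetical helper names, and the page segmentation modes (6 for a horizontal text block, 5 for a vertical one) are an assumption about the input documents.

```python
def choose_language(available, vertical=False):
    """Pick the Japanese model name and fail early if it is not installed."""
    wanted = "jpn_vert" if vertical else "jpn"
    if wanted not in available:
        raise RuntimeError(f"'{wanted}' model not found in tessdata")
    return wanted


def ocr_image(image_path, vertical=False):
    """Run Tesseract on one image through pyocr and return the text."""
    # Imported here so choose_language can be used without the OCR stack.
    import pyocr
    import pyocr.builders
    from PIL import Image

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract on PATH?")
    tool = tools[0]

    lang = choose_language(tool.get_available_languages(), vertical)
    # Assumed layouts: 6 = single uniform block, 5 = single vertical block.
    layout = 5 if vertical else 6
    with Image.open(image_path) as img:
        return tool.image_to_string(
            img,
            lang=lang,
            builder=pyocr.builders.TextBuilder(tesseract_layout=layout),
        )


def export_text(text, out_path):
    """Write the recognized text as UTF-8 so Japanese characters survive."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)


# Example (file names are placeholders):
# export_text(ocr_image("page-1.png"), "page-1.txt")
```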

PDF-to-image conversion requires the Poppler library.

Installation

Download the latest version of Poppler here

Instructions for PATH here
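
To check that Poppler is reachable from PATH, and to see what the conversion step involves, here is a minimal sketch that calls Poppler's `pdftoppm` command-line tool directly. This is an assumption made for illustration: the project itself may use a Python wrapper around Poppler instead, and the helper names `build_pdftoppm_command` and `pdf_to_png` as well as the 300 DPI default are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm call: one PNG per page, rendered at the given DPI."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def pdf_to_png(pdf_path, out_dir, dpi=300):
    """Convert every page of a PDF to PNG and return the image paths."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; add Poppler's bin folder to PATH")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / Path(pdf_path).stem

    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)

    # pdftoppm names its output <prefix>-<page>.png with zero-padded page numbers.
    return sorted(out_dir.glob(f"{prefix.name}-*.png"))


# Example (file names are placeholders):
# pages = pdf_to_png("scan.pdf", "pages")
```

If the `RuntimeError` is raised, the PATH step above has not taken effect yet; opening a new terminal after editing PATH is usually required on Windows.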

Important

This is a prototype at best; do not expect everything to work perfectly.
