Hi. GoBooDo is great, but it would be even better if it could create searchable PDFs. I wrote the code to do this.
Description
This patch adds an `ocrPdf` method to the `createBook` class. The method runs OCR on the page images via `pytesseract` and assembles the resulting pages into a single PDF via `PyPDF2`. I used `PyPDF2` because I couldn't find a way to merge PDFs with `fpdf`.
`ocrPdf` accepts a `lang` keyword that specifies which languages the OCR should read.
Usage
An additional package, `PyPDF2`, is required: `pip3 install pypdf2`.
When `lang` in `settings.json` is empty, the program creates an unsearchable PDF without OCR (the same behavior as before the patch).
When `lang` in `settings.json` names one or more languages (e.g. `"eng+ita"`), the program creates a searchable PDF with OCR, reading the languages given in `lang` (`"eng+ita"` means the book is written in English and Italian).
Languages other than English are not included with tesseract by default; users should download the language data from tesseract-ocr and put it in the appropriate path.
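To illustrate the `lang` behavior described above, here is a minimal sketch that uses only the standard library. The helper names `parse_lang`, `missing_languages`, and `choose_method` are hypothetical and are not part of the patch; the list of installed languages is passed in by hand and is only an assumption for the example.

```python
import json


def parse_lang(lang):
    """Split a Tesseract-style language string such as "eng+ita" into codes."""
    if not lang:
        return []
    return [code for code in lang.split("+") if code]


def missing_languages(lang, installed):
    """Return the requested language codes that are not in `installed`."""
    available = set(installed)
    return [code for code in parse_lang(lang) if code not in available]


def choose_method(settings):
    """Mirror the patch's dispatch: OCR only when `lang` is non-empty."""
    return "ocrPdf" if settings.get("lang") else "makePdf"


# Hypothetical settings.json content and installed-language list.
settings = json.loads('{"global_retry_time": 30, "lang": "eng+ita"}')
print(choose_method(settings))
print(missing_languages(settings["lang"], ["eng", "osd"]))
```

In practice the installed list could probably come from `pytesseract.get_languages()`, if the installed pytesseract version provides it. A check like this would let the program report a missing language pack before OCR starts instead of failing partway through the book.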
I wrote the code to do this. However, this repository has no license, so it is illegal to modify the original code of GoBooDo (#54). I am therefore sending a patch instead.
If a license is added to this repository, or if I am allowed to modify the original code for a pull request, I'll send a pull request.
The patch is as follows. Some of the English sentences could use feedback:
diff --git a/GoBooDo.py b/GoBooDo.py
index 419fcb1..dcd072e 100644
--- a/GoBooDo.py
+++ b/GoBooDo.py
@@ -147,7 +147,10 @@ class GoBooDo:
downloadService.getImages(settings['max_retry_images']+1)
print('------------------- Creating PDF -------------------')
service = createBook(self.name, self.path)
- service.makePdf()
+ if (settings.get('lang')):
+     service.ocrPdf(lang=settings['lang'])
+ else:
+     service.makePdf()
def start(self):
try:
@@ -222,4 +225,4 @@ ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
sleep(retry_time)
else:
book = GoBooDo(args.id)
- book.start()
\ No newline at end of file
+ book.start()
diff --git a/README.md b/README.md
index 0b6b022..335244e 100644
--- a/README.md
+++ b/README.md
@@ -34,8 +34,9 @@ The configuration can be done in the settings.json and the description is as fol
"proxy_links":0, // 0 for disabling proxy when fetching page links upon reaching the limit.
"proxy_images":0, // 0 for disabling proxy when fetching page images upon reaching the limit.
"max_retry_links":1, // Max retries for fetching a link using proxies.
- "max_retry_images":1 // Max retries for a fetching a image using proxies.
- "global_retry_time": // 0 for not running GoBooDo indefinitely, the number of seconds of delay between each global retry otherwise.
+ "max_retry_images":1, // Max retries for a fetching a image using proxies.
+ "global_retry_time":30, // 0 for not running GoBooDo indefinitely, the number of seconds of delay between each global retry otherwise.
+ "lang": "" // "" to create the PDF without OCR; otherwise the languages OCR should read, e.g. "eng+ita".
}
~~~
@@ -63,8 +64,11 @@ fpdf
html5lib
tqdm
pytesseract
+pypdf2
~~~
+If you want to use OCR with languages other than English, you should download additional language data from [tesseract-ocr](https://github.com/tesseract-ocr).
+
# Features
1. Stateful : GoBooDo keeps a track of the books which are downloaded. In each subsequent iterations of operation only those those links and images are fetched which were not downloaded earlier.
2. Proxy support : Since Google limits the amount of pages accessible to each individual majorly on the basis of IP address, GoBooDo uses proxies for circumventing that limit and maximizing the number of pages that can be accessed in the preview.
diff --git a/makePDF.py b/makePDF.py
index 9fd134d..bc37cd5 100644
--- a/makePDF.py
+++ b/makePDF.py
@@ -2,6 +2,9 @@ from fpdf import FPDF
import os
from PIL import Image
from tqdm import tqdm
+from pytesseract import image_to_pdf_or_hocr
+import PyPDF2
+from io import BytesIO
class createBook:
@@ -22,4 +25,18 @@ class createBook:
             os.mkdir(os.path.join(self.path,'Output'))
         name = str(self.name[:min(10,len(self.name))]).replace(" ","")
         name = ''.join(ch for ch in name if ch.isalnum()) + ".pdf"
-        pdf.output(os.path.join(self.path,'Output',name),"F")
\ No newline at end of file
+        pdf.output(os.path.join(self.path,'Output',name),"F")
+
+    def ocrPdf(self, lang=None):
+        pdf = PyPDF2.PdfFileWriter()
+        for pagePath in tqdm(self.imageNameList):
+            with open(pagePath, 'rb') as ofile:
+                im = Image.open(ofile)
+                page = image_to_pdf_or_hocr(im, lang=lang)
+            pdf.addPage(PyPDF2.PdfFileReader(BytesIO(page)).getPage(0))
+        if not os.path.exists(os.path.join(self.path,'Output')):
+            os.mkdir(os.path.join(self.path,'Output'))
+        name = str(self.name[:min(10,len(self.name))]).replace(" ","")
+        name = ''.join(ch for ch in name if ch.isalnum()) + ".pdf"
+        with open(os.path.join(self.path,'Output',name),'wb') as ofile:
+            pdf.write(ofile)
diff --git a/requirements.txt b/requirements.txt
index 5dbea7b..bcc11f9 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -4,4 +4,5 @@ Pillow
fpdf
html5lib
tqdm
-pytesseract
\ No newline at end of file
+pytesseract
+pypdf2
diff --git a/settings.json b/settings.json
index 2548700..c29e77c 100644
--- a/settings.json
+++ b/settings.json
@@ -6,5 +6,6 @@
"proxy_images": 0,
"max_retry_links": 1,
"max_retry_images": 1,
- "global_retry_time": 30
-}
\ No newline at end of file
+ "global_retry_time": 30,
+ "lang": ""
+}
For more information about OCR, see the Tesseract documentation.
Code
The part of this patch that I wrote is released under CC0.