[Feature/Patch] Create searchable PDF with tesseract #58

Open
minamotorin opened this issue Sep 4, 2021 · 2 comments

Comments

@minamotorin

Hi. GoBooDo is great, but it would be even better if it could create searchable PDFs. I wrote the code to do this.

Description

This patch adds an ocrPdf method to the createBook class. The method applies OCR to the page images via pytesseract and assembles the resulting pages into one PDF via PyPDF2. I couldn't find a way to merge PDFs with fpdf.

ocrPdf takes a lang keyword argument that specifies which languages the OCR should read.

Usage

An additional package, PyPDF2, is required: pip3 install pypdf2.

  • When lang in settings.json is empty, the program creates an unsearchable PDF without OCR (the same as before the patch).
  • When lang in settings.json is set to one or more languages (e.g. "eng+ita"), the program creates a searchable PDF with OCR. The OCR reads the languages given in lang ("eng+ita" means the book is written in English and Italian).
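
The two cases above can be sketched as follows. This is a minimal illustration, not code from the patch: parse_ocr_languages and choose_pdf_method are hypothetical helper names, and the sketch assumes that a missing lang key should behave like an empty one.

```python
import json


def parse_ocr_languages(settings):
    """Return the tesseract language codes in settings["lang"], or [] if OCR is off.

    Hypothetical helper: a missing or empty "lang" is treated as "no OCR".
    """
    lang = (settings.get("lang") or "").strip()
    # "eng+ita" -> ["eng", "ita"]; empty pieces are dropped.
    return [code for code in lang.split("+") if code]


def choose_pdf_method(settings):
    """Name the createBook method that would be called for these settings."""
    return "ocrPdf" if parse_ocr_languages(settings) else "makePdf"


if __name__ == "__main__":
    settings = json.loads('{"global_retry_time": 30, "lang": "eng+ita"}')
    print(parse_ocr_languages(settings), choose_pdf_method(settings))
```

Splitting the string is only for inspection; the patch itself passes the lang string to pytesseract unchanged.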

Languages other than English are not included with tesseract by default; users should download the language data from tesseract-ocr and put it in the appropriate path.

For more information about OCR, see the tesseract documentation.
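
One way to check for missing language data before running OCR is sketched below. It relies on tesseract's convention of one <code>.traineddata file per language; the helper name missing_language_data is hypothetical, and the tessdata directory is an assumption that the caller must supply, since its location depends on how tesseract was installed.

```python
from pathlib import Path


def missing_language_data(lang, tessdata_dir):
    """Return the codes in `lang` (e.g. "eng+ita") with no .traineddata file.

    `tessdata_dir` is assumed to be the directory tesseract loads its
    language data from; this sketch does not try to locate it.
    """
    tessdata = Path(tessdata_dir)
    codes = [code for code in lang.split("+") if code]
    return [
        code
        for code in codes
        if not (tessdata / f"{code}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # "tessdata" is a placeholder path used only for illustration.
    print(missing_language_data("eng+ita", "tessdata"))
```

An empty result means every requested language has a data file in that directory.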

Code

I wrote the code to do this. However, this repository has no license, so it is illegal to modify the original code of GoBooDo (#54). I am therefore sending a patch instead.

If a license is added to this repository, or if I am allowed to modify the original code for a pull request, I'll send a pull request.

The patch is as follows. Some of the English sentences in it could use feedback:

diff --git a/GoBooDo.py b/GoBooDo.py
index 419fcb1..dcd072e 100644
--- a/GoBooDo.py
+++ b/GoBooDo.py
@@ -147,7 +147,10 @@ class  GoBooDo:
         downloadService.getImages(settings['max_retry_images']+1)
         print('------------------- Creating PDF -------------------')
         service = createBook(self.name, self.path)
-        service.makePdf()
+        if (settings.get('lang')):
+            service.ocrPdf(lang=settings['lang'])
+        else:
+            service.makePdf()
 
     def start(self):
         try:
@@ -222,4 +225,4 @@ ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
             sleep(retry_time)
     else:
         book = GoBooDo(args.id)
-        book.start()
\ No newline at end of file
+        book.start()
diff --git a/README.md b/README.md
index 0b6b022..335244e 100644
--- a/README.md
+++ b/README.md
@@ -34,8 +34,9 @@ The configuration can be done in the settings.json and the description is as fol
   "proxy_links":0,   // 0 for disabling proxy when fetching page links upon reaching the limit.
   "proxy_images":0,  // 0 for disabling proxy when fetching  page images upon reaching the limit.
   "max_retry_links":1, // Max retries for fetching a link using proxies.
-  "max_retry_images":1 // Max retries for a fetching a image using proxies.
-  "global_retry_time": // 0 for not running GoBooDo indefinitely, the number of seconds of delay between each global retry otherwise.
+  "max_retry_images":1, // Max retries for a fetching a image using proxies.
+  "global_retry_time":30, // 0 for not running GoBooDo indefinitely, the number of seconds of delay between each global retry otherwise.
+  "lang": "" // "" to create the PDF without OCR; otherwise the languages for OCR to read, e.g. "eng+ita".
 }
 ~~~
 
@@ -63,8 +64,11 @@ fpdf
 html5lib
 tqdm
 pytesseract
+pypdf2
 ~~~
 
+If you want to use OCR with languages other than English, you should download additional language data from [tesseract-ocr](https://github.com/tesseract-ocr).
+
 # Features 
 1. Stateful : GoBooDo keeps a track of the books which are downloaded. In each subsequent iterations of operation only those those links and images are fetched which were not downloaded earlier.
 2. Proxy support : Since Google limits the amount of pages accessible to each individual majorly on the basis of IP address, GoBooDo uses proxies for circumventing that limit and maximizing the number of pages that can be accessed in the preview.
diff --git a/makePDF.py b/makePDF.py
index 9fd134d..bc37cd5 100644
--- a/makePDF.py
+++ b/makePDF.py
@@ -2,6 +2,9 @@ from fpdf import FPDF
 import os
 from PIL import Image
 from tqdm import tqdm
+from pytesseract import image_to_pdf_or_hocr
+import PyPDF2
+from io import BytesIO
 
 class createBook:
 
@@ -22,4 +25,18 @@ class createBook:
             os.mkdir(os.path.join(self.path,'Output'))
         name = str(self.name[:min(10,len(self.name))]).replace(" ","")
         name = ''.join(ch for ch in name if ch.isalnum()) + ".pdf"
-        pdf.output(os.path.join(self.path,'Output',name),"F")
\ No newline at end of file
+        pdf.output(os.path.join(self.path,'Output',name),"F")
+
+    def ocrPdf(self, lang=None):
+        pdf = PyPDF2.PdfFileWriter()
+        for pagePath in tqdm(self.imageNameList):
+            with open(pagePath, 'rb') as ofile:
+                im = Image.open(ofile)
+                page = image_to_pdf_or_hocr(im, lang=lang)
+            pdf.addPage(PyPDF2.PdfFileReader(BytesIO(page)).getPage(0))
+        if not os.path.exists(os.path.join(self.path,'Output')):
+            os.mkdir(os.path.join(self.path,'Output'))
+        name = str(self.name[:min(10,len(self.name))]).replace(" ","")
+        name = ''.join(ch for ch in name if ch.isalnum()) + ".pdf"
+        with open(os.path.join(self.path,'Output',name),'wb') as ofile:
+            pdf.write(ofile)
diff --git a/requirements.txt b/requirements.txt
index 5dbea7b..bcc11f9 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -4,4 +4,5 @@ Pillow
 fpdf
 html5lib
 tqdm
-pytesseract
\ No newline at end of file
+pytesseract
+pypdf2
diff --git a/settings.json b/settings.json
index 2548700..c29e77c 100644
--- a/settings.json
+++ b/settings.json
@@ -6,5 +6,6 @@
     "proxy_images": 0,
     "max_retry_links": 1,
     "max_retry_images": 1,
-    "global_retry_time": 30
-}
\ No newline at end of file
+    "global_retry_time": 30,
+    "lang": ""
+}
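
In the patch, ocrPdf repeats the output-name logic already present in makePdf. A possible way to share it is sketched below; output_pdf_name and output_pdf_path are hypothetical helpers that are not part of GoBooDo, written on the assumption that the naming rule (first 10 characters of the title, alphanumerics only, plus ".pdf") should stay exactly as it is.

```python
import os


def output_pdf_name(title, max_chars=10):
    """Build the PDF file name from the leading characters of the book title.

    Mirrors the rule used in makePDF.py: take the first `max_chars`
    characters and keep only alphanumerics (which also drops spaces).
    """
    prefix = title[:max_chars]
    return "".join(ch for ch in prefix if ch.isalnum()) + ".pdf"


def output_pdf_path(book_dir, title, max_chars=10):
    """Return the full output path, creating the Output directory if needed."""
    out_dir = os.path.join(book_dir, "Output")
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    return os.path.join(out_dir, output_pdf_name(title, max_chars))
```

Both makePdf and ocrPdf could then obtain the path with output_pdf_path(self.path, self.name).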

The parts of this patch that I wrote are released under CC0.

@minamotorin minamotorin changed the title [Feature] Create seachable PDF with tesseract [Feature/Patch] Create seachable PDF with tesseract Sep 4, 2021
@vaibhavk97
Owner

Hello @minamotorin, thanks for your interest in the development of GoBooDo. The license has been updated; please consider submitting a pull request.

@minamotorin
Author

@vaibhavk97 Thanks for the response! I'll submit a pull request!
