[Feature]: progress callback for ocrmypdf (background usage) #1451

Closed
QuentinFuxa opened this issue Dec 31, 2024 · 5 comments
@QuentinFuxa
Contributor

Describe the proposed feature

Description
Hi! I’d like to request a feature that provides a progress callback mechanism. Specifically, I’m interested in being able to run ocrmypdf.ocr() in the background (e.g. via FastAPI in a separate thread) and receive real-time updates on its progress.

I noticed there was some discussion in issue #511, mentioning a plugin approach and a link to plugins docs, but that link seems to be broken now.

Proposal

  • Add an optional parameter, for example progress_callback, to ocrmypdf.ocr(...).
  • Whenever OCRmyPDF processes a page or updates progress, it calls this callback with data such as:

```python
# The parameters would need to be better defined, based on what is actually available.
def progress_callback(step: str, percent: float, message: str):
    # For example, update an external progress tracker...
    ...
```

This would facilitate showing progress in real time without parsing logs or CLI output.
If you think this could be useful, I'm open to working on it. Alternatively, if you think implementing it via the plugin system is a better fit, I’m open to working on that too.

Why this is useful

  • Many users (myself included, across several projects) run OCRmyPDF on the server side (e.g. with Flask/FastAPI) and want to provide live progress updates to clients.
  • A built-in callback or plugin system for progress would be more robust than manually parsing logging or stdout.

I’m happy to contribute a pull request if this aligns with the project’s vision. Let me know what you think, and thank you for creating such a useful tool!

@QuentinFuxa
Contributor Author

Here is an example of how it could be integrated into an API server:

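```python
import threading
import uuid

import ocrmypdf
from fastapi import File, Form, UploadFile

# UPLOAD_FOLDER, ocr_router and logger are defined elsewhere in the application.
# The upload is assumed to have been saved to input_pdf_path already.

progress_dict = {}  # job_id -> percent complete (-1 on error)

@ocr_router.post("/ocr-pdf")
async def ocr_pdf(file: UploadFile = File(...), name: str = Form(None)):
    # Fall back to the uploaded filename when no name field is sent.
    name = (name or file.filename).rsplit(".", 1)[0]

    input_pdf_path = f"{UPLOAD_FOLDER}/{name}.pdf"
    output_pdf_path = f"{UPLOAD_FOLDER}/{name}_ocr.pdf"
    job_id = str(uuid.uuid4())

    def _background_ocr():
        try:
            progress_dict[job_id] = 0.0

            def my_progress_callback(step, percent, message):
                progress_dict[job_id] = percent

            # progress_callback is the parameter proposed in this issue,
            # not an existing argument of ocrmypdf.ocr().
            ocrmypdf.ocr(
                input_pdf_path,
                output_pdf_path,
                force_ocr=True,
                progress_callback=my_progress_callback,
            )

            progress_dict[job_id] = 100.0

        except Exception as e:
            logger.exception(f"OCR error: {e}")
            progress_dict[job_id] = -1

    # Run OCR in a separate thread so the request returns immediately.
    thread = threading.Thread(target=_background_ocr)
    thread.start()
    return {"job_id": job_id}
```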
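
The example above keeps progress in a bare module-level dict and signals failure with -1. As a minimal sketch of one alternative (not part of OCRmyPDF; `ProgressRegistry` and `run_in_background` are hypothetical names), the bookkeeping could live in a small lock-protected registry that records an explicit status and error message. The `work` callable stands in for whatever actually runs the OCR and reports a percentage.

```python
import threading
import uuid


class ProgressRegistry:
    """Thread-safe store of per-job progress (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}

    def create(self):
        """Register a new job and return its id."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"status": "running", "percent": 0.0}
        return job_id

    def update(self, job_id, percent):
        """Record progress, clamped to the range 0..100."""
        with self._lock:
            self._jobs[job_id]["percent"] = max(0.0, min(100.0, float(percent)))

    def finish(self, job_id, error=None):
        """Mark the job as done, or as failed with an error message."""
        with self._lock:
            job = self._jobs[job_id]
            if error is None:
                job["status"] = "done"
                job["percent"] = 100.0
            else:
                job["status"] = "failed"
                job["error"] = str(error)

    def snapshot(self, job_id):
        """Return a copy of the job state, or None for an unknown id."""
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job is not None else None


def run_in_background(registry, work):
    """Run work(report) in a thread; report(percent) updates the registry."""
    job_id = registry.create()

    def _runner():
        try:
            work(lambda percent: registry.update(job_id, percent))
            registry.finish(job_id)
        except Exception as exc:
            registry.finish(job_id, error=exc)

    thread = threading.Thread(target=_runner, daemon=True)
    thread.start()
    return job_id, thread
```

A status endpoint would then only need to return `registry.snapshot(job_id)`, answering with a 404 when it is `None`.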

@jbarlow83
Collaborator

This functionality already exists. You can replace the standard progress bar with something that notifies another progress monitor.

Over here, give or take:

https://github.com/ocrmypdf/OCRmyPDF/blob/main/src/ocrmypdf/pluginspec.py#L143
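
For readers who want a starting point, here is a rough sketch of what such a replacement might look like. It assumes the hook at that location is `get_progressbar_class` and that the returned class is constructed with keyword arguments such as `total`, `desc`, `unit` and `disable`, used as a context manager, and advanced through `update(n, completed=...)`. Those signatures are my reading of pluginspec.py and should be checked against the installed version. `DictProgressBar` and `progress_state` are hypothetical names.

```python
import threading

# Shared state that another component (e.g. a web endpoint) could read.
progress_state = {"desc": None, "total": None, "completed": 0}
_state_lock = threading.Lock()


class DictProgressBar:
    """Records progress in progress_state instead of drawing a bar.

    The method signatures are assumptions based on the ProgressBar
    protocol in pluginspec.py; verify them before relying on this.
    """

    def __init__(self, *, total=None, desc=None, unit=None, disable=False, **kwargs):
        self.total = total
        self.desc = desc
        self.unit = unit
        self.disable = disable
        self.completed = 0

    def __enter__(self):
        self._publish()
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions from the pipeline

    def update(self, n=1, *, completed=None):
        # Either set an absolute value or advance by n.
        if completed is not None:
            self.completed = completed
        else:
            self.completed += n
        self._publish()

    def _publish(self):
        if self.disable:
            return
        with _state_lock:
            progress_state.update(
                desc=self.desc, total=self.total, completed=self.completed
            )


try:
    from ocrmypdf import hookimpl

    @hookimpl
    def get_progressbar_class():
        # Assumed hook name; tells OCRmyPDF to use our class.
        return DictProgressBar
except ImportError:
    # OCRmyPDF is not installed; the class can still be exercised alone.
    pass
```

If the assumptions hold, saving this as a module and passing its path through the `plugins` argument of `ocrmypdf.ocr()` should be enough to activate it. How the state is shared with the rest of the application depends on how the plugin module gets loaded.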

@github-actions github-actions bot removed the triage Issue needs triage label Dec 31, 2024
@jbarlow83
Collaborator

So yes, I think it would be better to use the existing hook for this. Something seems to be wrong with the documentation build - if you look at the source files for the documentation there is a more detailed explanation.

@QuentinFuxa
Contributor Author

Thank you!
For the documentation, autofunction works when building locally using sphinx-build 7.4.7.

image

@QuentinFuxa
Contributor Author

For anybody interested in how to use the progress bar in a frontend: here I demonstrate how to use the plugin, store the progress, and send it to an HTML page using REST endpoints.
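
As a small illustration of the "send it to an HTML page" step (this is not the linked demo, and `progress_payload` is a hypothetical helper), the progress-bar state could be turned into a JSON body that a REST endpoint returns on each poll. It assumes the state consists of a stage description, a completed count and a total that may be unknown.

```python
import json


def progress_payload(desc, completed, total):
    """Build a JSON string describing the current progress.

    percent is None when total is missing or zero, since no
    meaningful ratio can be computed in that case.
    """
    if total:
        percent = round(min(100.0, 100.0 * completed / total), 1)
    else:
        percent = None
    return json.dumps(
        {"stage": desc, "completed": completed, "total": total, "percent": percent}
    )
```

The page can poll the endpoint on a timer and set the width of a progress element from `percent`, showing an indeterminate indicator while it is `null`.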


2 participants