feat: Add PPTX notes slides #474

Open
wants to merge 1 commit into
base: main

maciejwie

Presenter notes are a valuable part of a PowerPoint presentation and are worth extracting. Docling uses the python-pptx library to parse PowerPoint .pptx files; the library supports reading presenter notes, which are stored as notes slides.
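For readers unfamiliar with notes slides, here is a minimal sketch of reading presenter notes with python-pptx. It is not the code in this PR; the helper names `iter_presenter_notes` and `load_presenter_notes` are made up for illustration. It relies only on the python-pptx attributes `has_notes_slide`, `notes_slide`, and `notes_text_frame`.

```python
from typing import Iterator, Tuple


def iter_presenter_notes(prs) -> Iterator[Tuple[int, str]]:
    """Yield (1-based slide number, notes text) for slides that have notes.

    `prs` is expected to be a python-pptx Presentation object.
    """
    for number, slide in enumerate(prs.slides, start=1):
        # Check has_notes_slide first: reading slide.notes_slide on a slide
        # without notes makes python-pptx create an empty notes slide.
        if not slide.has_notes_slide:
            continue
        frame = slide.notes_slide.notes_text_frame
        # notes_text_frame is None when the notes slide has no notes placeholder.
        if frame is None:
            continue
        text = frame.text.strip()
        if text:
            yield number, text


def load_presenter_notes(path):
    """Hypothetical convenience wrapper: open a .pptx file and collect its notes."""
    # Third-party dependency (pip install python-pptx), imported lazily so the
    # iterator above can be exercised without it.
    from pptx import Presentation

    return list(iter_presenter_notes(Presentation(path)))
```

Calling `load_presenter_notes("deck.pptx")` (the file name is a placeholder) would return a list of `(slide_number, notes_text)` pairs, skipping slides whose notes are missing or empty.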

Issue resolved by this Pull Request:
Resolves #473

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Presenter notes may have useful information and should also be extracted.

Signed-off-by: Maciej Wieczorek <maciej@wieczorek.co>

mergify bot commented Nov 29, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
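As a rough illustration of why the title check passes, the sketch below applies the same pattern to a few titles in Python. It assumes the `~=` condition behaves like a regular-expression search; the function name `is_conventional` is hypothetical and not part of Mergify.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


for title in ("feat: Add PPTX notes slides", "Add PPTX notes slides"):
    print(f"{title!r}: {is_conventional(title)}")
```

The title of this PR starts with `feat:`, so it satisfies the rule; the same title without the prefix would not.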

@PeterStaar-IBM
Contributor

@maciejwie I like your PR a lot. I just think we need a way to distinguish between the regular text and the (invisible) text/notes.

My proposal is to first merge this (DS4SD/docling-core#80) and then update this PR to explicitly tag it as InvisibleTextItem with the correct label.

@maciejwie
Author

Hi @PeterStaar-IBM, sounds good. I'm subscribed to that PR and see there's still some discussion about it, and when it gets merged I will update this one.

@PeterStaar-IBM
Contributor

@maciejwie I want to move fast on this. We are currently merging a few big performance improvements (10x faster PDF parsing, improved layout postprocessing, and GPU acceleration). Once that is done, we will update this one together with another PR: basically, we would like to add the author notes as part of the furniture (yes, this is the correct term: https://en.wikipedia.org/wiki/Page_layout, I was also surprised).

Development

Successfully merging this pull request may close these issues.

Adding Powerpoint notes slides
2 participants