-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Build with pdf2htmlEX #380
Conversation
I'm gonna go with the conclusion that conan with msvc-1939 is broken:
Do we need msvc-1939 support for odr.core? Freetype is not our recipe, so this may eventually be solved without our intervention. We could set up a nice test case and report the issue to conan though |
Fontforge fails to compile with msvc-19.40. I've checked Fontforge and they do actually ship Windows binaries, although I'm unsure if they compile with msvc. Not sure if this is worth pursuing, conan-odr-index should be the first place where we add extra compilers or windows support. Disabling msvc compiler here. |
We don't really use the msvc build but I like to keep it because it picks on stuff the other compilers don't. Generally I think |
Brought back the msvc compilers, 19.39 and 19.40, just for good measure. Anyway, I would need some help integrating on the c++ side. With the current code on this PR, I assume the integration should happen somewhere in src/odr/html.cpp: Html html::translate(const PdfFile &pdf_file, const std::string &output_path,
const HtmlConfig &config) {
fs::create_directories(output_path);
#if WITH_PDF2HTMLEX
return pdf2htmlex_wrapper_which_returns_html_object(pdf_file, output_path, config);
#else
return internal::html::translate_pdf_file(pdf_file, output_path, config);
#endif
} Any other ideas? |
I think this makes a lot of sense yes. I don't think we need this runtime switchable. Even if we can change this in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know what we can do about the resource files @ViliusSutkus89 ? I guess conan can shit these to the platform and we can figure out something there?
I have some mechanism to transform resources into code so we don't need extra files. But I think that would need a rework to make it also work for pdf2htmlex
Resource files are ugly, I gather them in odr.droid/app/CMakeLists.txt and then tell gradle that asset merging has a dependency on CMake - https://github.com/opendocument-app/OpenDocument.droid/blob/d26811deef04dbba6d70b97c01967ec72557d641/app/build.gradle#L120 . I doubt it's possible to make the gathering somehow nicer. Runtime side could be nicer though. Currently we extract them right before calling pdf2htmlEX. Would be possible to move that to app init, but it would make app init extra slow, so I don't really want to go that route. Proper implementation would be to read assets in C++ directly from APK, using AssetManager, but I can't recall right now what was wrong with that route. Maybe it requires a high min API level. It definitely requires having access to the JNI Env object and AssetManager, obtained from Android's Context. These objects are bound to the specific Java thread, which does JNI, pdf2htmlEX doesn't do threading, I'm unsure about poppler. Anyway, this implementation requires some code. |
# Conflicts: # .github/workflows/build_test.yml # .github/workflows/publish.yml
# Conflicts: # conanfile.py
I've finally figured out why the tests were failing, turns out I've forgot to feed test VMs with ~/.conan2/p, which contains assets. Reference outputs are now generated and committed into appropriate repos. Naturally, I've had to run into another problem. Why is actions/checkout getting old submodule revisions? Do I need to manually update git revisions somewhere? Checkout with submodules should just work, right?
.gitmodules does not list any revisions:
OpenDocument.test-private
508.... is the third most recent commit in that repo. Commited on Dec 30, 2021. There are two more recents commits, made in summer of 2022. OpenDocument.testopendocument-app/OpenDocument.test checks the correct revision,
OpenDocument.test-private.output
OpenDocument.test.output
|
I think for now it would be fine to smoke test the two libraries. Just picking one input file for each of them and translate it to a random directly checking that it does not fail. For the ref comparison I would try to integrate it into the existing translation mechanism first |
We are way past the smoke test now. Ref testing is already implemented. It's just that a fresh git clone with recursive submodules clones wrong reference outputs and that's why the CI fails. Just checked and realized that a local git clone also has this behavior, so it's not GitHub actions issue, more of a git issue |
Turns out I have to update git submodules in the root project too |
Another issue:
Still no idea what that means |
Not sure if mixing the test file output between integrated and not integrated parts is a good idea. I see that adding the new API is the easiest solution for now but the new tests couple quite tightly to the new api and the existing test repos. |
I could put pdf2htmlEX/wvWare reference outputs in |
loaded_page_settling_time = 0 | ||
|
||
# Selenium doesn't like when we try to screenshot <body> element of documents generated by pdf2htmlEX | ||
if 'output-pdf2htmlEX' in url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this looks ugly, but do we have a way to do this properly? I could add --pdf2htmlEX-workaround argument, but it would also need to be added to scripts/compare_output.py and pass through a bunch of functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is not great - do you know if the screenshot will somehow fail or give some deterministic broken result? I wondered if we could have a fallback instead of trying to detect who generated the PDF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Selenium fails to screenshot the element of pdf2htmlEX's htmls. An exception is thrown.
File "/home/runner/work/OpenDocument.core/OpenDocument.core/test/scripts/html_render_diff.py", line 23, in screenshot
png = body.screenshot_as_png
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 325, in screenshot_as_png
return b64decode(self.screenshot_as_base64.encode("ascii"))
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 314, in screenshot_as_base64
return self._execute(Command.ELEMENT_SCREENSHOT)["value"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webelement.py", line 394, in _execute
return self._parent.execute(command, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/webdriver.py", line 347, in execute
self.error_handler.check_response(response)
File "/opt/hostedtoolcache/Python/3.12.5/x64/lib/python3.12/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: [Exception... "Failure" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: chrome://remote/content/shared/Capture.sys.mjs :: capture.canvas :: line 137" data: no]
Stacktrace:
capture.canvas@chrome://remote/content/shared/Capture.sys.mjs:137:62
takeScreenshot@chrome://remote/content/marionette/actors/MarionetteCommandsParent.sys.mjs:292:37
I've tried to solve it properly, but couldn't figure it out. Maybe because the generated body has too big dimensions? It works when I screenshot the inner <div id="page-container">
element
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we catch the exception and then fall back to your solution?
…ATADIR setting mechanism
loaded_page_settling_time = 0 | ||
|
||
# Selenium doesn't like when we try to screenshot <body> element of documents generated by pdf2htmlEX | ||
if 'output-pdf2htmlEX' in url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is not great - do you know if the screenshot will somehow fail or give some deterministic broken result? I wondered if we could have a fallback instead of trying to detect who generated the PDF
Let's get this in! I will try to follow up on some of my suggestions here #387 and will revisit the screenshotting there as well |
WIP. Currently just linking. Need to add actual usage