Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

epub: detect scripted content and improve mathml, svg detection #1659

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

xworld21
Copy link
Contributor

There are situations where an EPUB might contain scripted content (e.g. an interactive SVG, or embedding MathJax). Scripted content must be reported in the manifest with the scripted property.

The PR cleans up the existing code setting the mathml and svg properties, plus it sets the scripted property when needed. This also means parsing the SVG files to find if they contain scripts or MathML content.

(Unfortunately this causes a trivial conflict with #1643.)

push @properties, 'svg' if $doc->findnode('//*[local-name() = "svg"]');
push @properties, 'scripted' if $doc->findnode($xpath_scripted . ' | ' . $xpath_forms);
push @properties, 'mathml' if $doc->findnode($xpath_mathml);
push @properties, 'svg' if $doc->findnode($xpath_svg);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes assume we'll only ever have xhtml in epub. It's at least something to watch out for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is easy to explain: the EPUB specs (both 2 and 3 as far as I know) require HTML5 in XML serialisation, with the appropriate namespaces. So this code should be ok, and more accurate, actually.

my $properties = join(" ", @properties);
$file_item->setAttribute('properties', $properties) if $properties; } }

}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is basically what we'd want to determine whether to use img or object, as discussed in #1444, right? That suggests that it might be better higher up the chain, such as Post::Graphics?

Copy link
Contributor Author

@xworld21 xworld21 Apr 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The img/object distinction is a bit more subtle because the <img> tag disables every external resource as well. I'd say this code is specific to EPUB (the spec just wants to know if there are scripts and MathML content*).

* Actually, one should also declare external resources, e.g. web fonts not embedded in the file. This requires parsing the CSS content of the image, so it seems quite hard to do reliably.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the harm in simply assuming that any svg image contains mathml and scripts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per spec, the mathml and scripted properties "MUST be set if and only if the criterion specified in the description is met", so the output would be invalid. To my memory, epubcheck was flagging the properties as errors if no mathml/scripts were in the file.

@@ -15,6 +15,25 @@ use warnings;
use File::Find qw(find);
use URI::file;

# HTML5 namespaces
our $XPATH = $LaTeXML::Post::Document::XPATH;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timing of this is off; it will set $XPATH to the current value when LaTeXML::Post::Manifest::Epub is loaded, rather than when the document is being packaged. That will probably work for a one-off conversion because LaTeXML has defered but will not (necessarily) be correct in more general (eg. daemon) contexts.

The problem is that we haven't got a clear clean API for managing namespaces so that we can declare these mappings once, and use them consistently throughout conversion & postprocessing (ideally, they should only make it into the otuput document if they're needed).

Short of that, it would be best to move these registerNS into one of the Epub methods; probably initialize.

I'm trying to get some of these PR's moving again. @dginev do you have any comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved most of the $XPATH code to initialize. Once you reach initialize, XPATH should already exists, right?

@xworld21
Copy link
Contributor Author

Updated to work with latest Epub.pm updates on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants