-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added masterminds/html5 #831
Conversation
'height: auto' => ['<img src="logo.png" alt="" style="height: auto;">'], | ||
'width: auto' => ['<img src="logo.png" alt="" style="width: auto;">'], | ||
'height: auto' => ['<img src="logo.png" alt style="height: auto;">'], | ||
'width: auto' => ['<img src="logo.png" alt style="width: auto;">'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this valid HTML?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on the DOCTYPE. It's not valid XHTML, it's valid HTML:
Attributes are placed inside the start tag, and consist of a name and a value, separated by an "=" character. The attribute value can remain unquoted if it doesn't contain ASCII whitespace or any of " ' ` = < or >. Otherwise, it has to be quoted using either single or double quotes. The value, along with the "=" character, can be omitted altogether if the value is the empty string.
<!-- empty attributes -->
<input name=address disabled>
<input name=address disabled="">
<!-- attributes with a value -->
<input name=address maxlength=200>
<input name=address maxlength='200'>
<input name=address maxlength="200">
https://html.spec.whatwg.org/multipage/introduction.html#intro-early-example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no reason to change this, the RNIB recommendation is alt=""
. We can add a separate test for alt
. We should not change the tests to allow the code to pass; we should fix the code to pass the tests - unless the tests are demonstrably wrong ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve looked into this a bit more.
Earlier versions of the HTML5 spec permitted the empty attribute syntax only for Boolean attributes. Later versions still don’t explicitly state that it is allowed for non-Boolean attributes, but they no longer say it is not (the specific clause, “This syntax is permitted only for boolean attributes,” has been removed).
Given that email clients are some way behind in their support for the specs (if they weren’t there’d be no need for Emogrifier), and many are probably still parsing HTML as HTML 4.01 or earlier, I would be reticent about outputting non-Boolean attributes with the empty attribute syntax, as it may cause issues on some clients.
Masterminds/html5-php
actually has rules defining non-Boolean attributes to be output with =""
but these do not kick in when the disable_html_ns
option is set. But if that option is not set, an xmlns
attribute is set on the root html
element and our XPath expressions fail to match elements by type.
It would be possible to extend OutputRules
to achieve the desired behaviour (which could also help in other cases, e.g. preventing "\r"
in the output on Windows); to do so would also require extending HTML5
to replace the save
method to instantiate an extended OutputRules
.
I note that the test that has picked this up is unrelated, which suggests some more tests in this area may be pertinent – although in some ways they would really be testing the behaviour of a third-party component (ext-dom
or Masterminds\html5-ptp
), albeit indirectly.
PS. DOMDocument
preserves whether non-Boolean or unrecognized attributes were specified with an explicit empty string or empty attribute syntax, apparently through DOMAttr
having a DOMText
child or not. Recognized Boolean attributes are always output with the empty attribute syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Masterminds/html5-php actually has rules defining non-Boolean attributes to be output with ="" but these do not kick in when the disable_html_ns option is set.
How much did you dig into that? Enough to submit a PR to them? Doesn't seem too obvious to me how disable_html_ns
affects the code from a quick glance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OutputRules::nonBooleanAttribute
decides this.
If there is no namespace, it always returns false
, which means the attribute is considered Boolean and we get the empty attribute syntax.
However, if there is a namespace, as I'm sure you've found, our XPath expressions always fail because, e.g. xmlns:p
!= p
...
Hi @bytestream, thanks for the PR! ❤️ Could you please fix the PHPMD warning? Thanks! |
Thanks for checking! I'll look at this soon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good and substantial. Thanks for putting the effort in. I haven't been through with a fine-toothed comb, the new class I have barely looked at at all.
I have made a few comments.
I also wonder if we need some more test cases - some that are HTML4 or XHTML and some HTML5?
'height: auto' => ['<img src="logo.png" alt="" style="height: auto;">'], | ||
'width: auto' => ['<img src="logo.png" alt="" style="width: auto;">'], | ||
'height: auto' => ['<img src="logo.png" alt style="height: auto;">'], | ||
'width: auto' => ['<img src="logo.png" alt style="width: auto;">'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no reason to change this, the RNIB recommendation is alt=""
. We can add a separate test for alt
. We should not change the tests to allow the code to pass; we should fix the code to pass the tests - unless the tests are demonstrably wrong ;)
Incidentally, PHP8 promises some DOM improvements: https://wiki.php.net/rfc/dom_living_standard_api |
Yes, please. Particularly, I'd strongly prefer we get the new parser class in as a separate PR first (with good and complete tests). |
We have some tests already in I also agree with breaking this down into smaller PRs, and the new class seems a good place to start. |
In tests for `AbstractHtmlProcessor`, ignore whitespace differences in the outer `<html>` element. This will help with #831. Also added test that the `<body>` content passed to `fromHtml()` is preserved and returned by `render()`.
In tests for `AbstractHtmlProcessor`, ignore whitespace differences in the outer `<html>` element. This will help with #831. Also added test that the `<body>` content passed to `fromHtml()` is preserved and returned by `render()`.
In tests for `AbstractHtmlProcessor`, ignore whitespace differences in the outer `<html>` element. This will help with #831. Also added test that the `<body>` content passed to `fromHtml()` is preserved and returned by `render()`.
In tests for `AbstractHtmlProcessor`, ignore whitespace differences in the outer `<html>` element. This will help with #831. Also add test that the `<body>` content passed to `fromHtml()` is preserved and returned by `render()`.
Ensure that the DOCTYPE declaration consists of uppercase `DOCTYPE` and lowercase root element name (`html`). This is done when the `DOMDocument` is created from an HTML source. Once the `DOMDocument` has been created, the `DOMDocumentType` cannot be changed, so the document type declaration must be manipulated (if necessary) in the HTML beforehand. (Since only HTML documents are supported, the declaration is only normalized when the root element name is HTML, in whatever case - the precise specification for any element name involves lists of various Unicode character ranges which it would be superfluous to allow for and try to match. PHP's `DOMDocument`/`libxml` itself will output the `DOCTYPE` keyword in uppercase in any case.) This normalization is consistent with the relevant part of the [polyglot markup specification]( https://dev.w3.org/html5/html-polyglot/html-polyglot.html#doctype ). While polyglot markup is primarily intended for serialization of HTML as XML (we don't actually support outputting as XHTML), is also recommended for maximum interoperability and robustness when rendering HTML. This also makes the output consistent with that of `Masterminds/html5-php` and would eliminate the need to change associated tests specifically for #831.
Ensure that the DOCTYPE declaration consists of uppercase `DOCTYPE` and lowercase root element name (`html`). This is done when the `DOMDocument` is created from an HTML source. Once the `DOMDocument` has been created, the `DOMDocumentType` cannot be changed, so the document type declaration must be manipulated (if necessary) in the HTML beforehand. (Since only HTML documents are supported, the declaration is only normalized when the root element name is HTML, in whatever case - the precise specification for any element name involves lists of various Unicode character ranges which it would be superfluous to allow for and try to match. PHP's `DOMDocument`/`libxml` itself will output the `DOCTYPE` keyword in uppercase in any case.) This normalization is consistent with the relevant part of the [polyglot markup specification]( https://dev.w3.org/html5/html-polyglot/html-polyglot.html#doctype ). While polyglot markup is primarily intended for serialization of HTML as XML (we don't actually support outputting as XHTML), is also recommended for maximum interoperability and robustness when rendering HTML. This also makes the output consistent with that of `Masterminds/html5-php` and would eliminate the need to change associated tests specifically for #831. Closes #858.
Ensure that the DOCTYPE declaration consists of uppercase `DOCTYPE` and lowercase root element name (`html`). This is done when the `DOMDocument` is created from an HTML source. Once the `DOMDocument` has been created, the `DOMDocumentType` cannot be changed, so the document type declaration must be manipulated (if necessary) in the HTML beforehand. (Since only HTML documents are supported, the declaration is only normalized when the root element name is HTML, in whatever case - the precise specification for any element name involves lists of various Unicode character ranges which it would be superfluous to allow for and try to match. PHP's `DOMDocument`/`libxml` itself will output the `DOCTYPE` keyword in uppercase in any case.) This normalization is consistent with the relevant part of the [polyglot markup specification]( https://dev.w3.org/html5/html-polyglot/html-polyglot.html#doctype ). While polyglot markup is primarily intended for serialization of HTML as XML (we don't actually support outputting as XHTML), is also recommended for maximum interoperability and robustness when rendering HTML. This also makes the output consistent with that of `Masterminds/html5-php` and would eliminate the need to change associated tests specifically for #831. Closes #858.
@oliverklee @JakeQZ what needs to happen to get this moving again? |
@bytestream I propose adding the new class (without using it) and unit tests for it in a separate PR, and then (after that PR is merged) we can make this PR about using the new class. |
I've sent it upstream. Lets see if they accept it there, otherwise I'll send another PR here. |
Hi @bytestream, thanks for following-up on this. I’ve committed a couple of changes to address code changes that were required to the tests as part of this PR. The remaining issue pertains to the empty attribute syntax, for which I have created #867 for discussion of the best way forward. I’m also wondering about testing both the HTML5 code paths and HTML4 code paths (i.e. when the DOCTYPE is explicitly HTML4 or earlier) with the same set of tests (that we already have), as these will diverge with the introduction of |
I'll look at sending the new class your way in a separate PR tomorrow. Maintainers of |
As opposed to well-formed and perfectly valid HTML? |
Yes, that sounds feasible, and I would be okay with that. However, I'd like to change this to a configuration option (or some other deliberate way to trigger this, e.g. by injecting the HTML parser). I wouldn't like Emogrifier automatically change its behavior in a project if some other dependency pulls in
Even better: We can have different tests for both cases, or a flexible way to test this (which might be too complex - I haven't fully though this through). |
I agree. I think I alluded to that concern in my previous comment, and this suggestion would satisfactorily allay that concern for me.
I have now remembered what the issue with PHPUnit was that meant that the simple change I referenced above (to run the |
@JakeQZ it sounds like https://phpunit.readthedocs.io/en/9.2/annotations.html#requires is what you're after? If it needs to be done for individual cases within a
Just an public function enableHtml5()
{
if (! class_exists(Html5::class)) { throw new Exception("Add masterminds/html5 as a project dependency"); }
$this->html5 = true;
} |
Any feedback on the above? |
ping @JakeQZ |
Not sure what you are waiting for from me here. I thought we had agreed for this to be implemented as an option that must be specifically enabled (i.e. not just a There are a few testing-related bits and bobs we still ought to implement, e.g. in the area of valueless attributes, but I don't think the lack of those is necessarily blocking this – it just could make it slightly more awkward because we would have to disable some tests for the Masterminds case. But we would have to disable some tests for the Masterminds case anyway.
Oh, sorry, were you waiting for me to reply to this? I kind of read it as a comment, and thought "that sounds good" - to myself :/ Um, that sounds good. If it works, and solves the dual testing issue with the ability to vary the provided data based on with/without Masterminds (which we need because, AFAICS, we need to drop specific datasets depending on whether Masterminds is used), then, yes please, let's go for it. I'm prepared to accept that certain tests will fail in the Masterminds case (you could also add tests that would fail conversely), as long as it's clear, and the issues involved are minor if not trivial or non-consequental (as I think most probably are). Look forward to a PR... :) |
PS. You could probably split this into two PRs - the first one being for the dual tests, with the option to enable Masterminds which does nothing in that first PR (other than to allow the tests to run both with and without that option set)... PPS. I've added #921, which might have some relevance for the testing side of this - whatever they are doing, i.e. the approach they are taking, to combine data providers might also suggest a way of implementing re-running |
Any comments on what I just pushed? This is the simplest way of doing it. I've added I suspect what you really wanted to see is the below. If this is the approach you want then let me know but this is significantly more work... AbstractHtmlProcessor: public function __construct(?HTML5 $parser = null)
{
$this->html5 = $parser;
} AbstractTest: public function fooTest(): void
{
$this->instance->doSomething();
} DomDocumentTest: public function setUp(): void
{
$this->instance = new HtmlProcessor();
} Html5Test: public function setUp(): void
{
$this->instance = new HtmlProcessor(
new HTML5([/* opts */])
);
} |
I like the solution to the testing issue, by moving it to the Composer / GitHub Action level to run the tests twice. I hadn't thought of that, but it should make it much easier to do. In terms of interface for usage, I was thinking that this would be a configration level option, and could be implemented similarly to how it is done in PHPWord for these sort of options, rather than an extra parameter. @oliverklee, WDYT? You are right that $domDocument = CssInliner::fromHtml($html)->inlineCss($css)->getDomDocument();
HtmlPruner::fromDomDocument($domDocument)->removeElementsWithDisplayNone();
$html = CssToAttributeConverter::fromDomDocument($domDocument)
->convertCssToVisualAttributes()->render(); becomes $htmlProcessor = CssInliner::fromHtml($html)->inlineCss($css);
HtmlPruner::fromHtmlProcessor($htmlProcessor)->removeElementsWithDisplayNone();
$html = CssToAttributeConverter::fromHtmlProcessor($htmlProcessor)
->convertCssToVisualAttributes()->render(); Such a change could be implmented as a separate PR ahead of other changes. When it comes to merging to main (hopefully soon!), it would be good to split this up in some reasonable logical way into several separate changesets, so it is more manageable and any issues can be more easily isolated. |
This sounds better than
because
It would mean |
I don't follow. The two are independent. There still needs to be a means of deciding whether to use |
As I mentioned I think |
I see your point, though if only using I can also see that having an optional Boolean parameter to |
I think so |
In that case I would be generally in favour of this approach. (This is proposed as an optional paramter with a default I have thus far only glanced through the changes, and not used a fine-toothed comb. If there is a way of breaking things down into separate PRs, that would be helpful, though my previous suggestion in this department would be no longer valid (if we are not going to have a global configuration option). As mentioned, the only thing I think needs to be added is an @oliverklee WDYT? |
I'm really sorry for being so late for the party. :-( My thoughts on this:
What do you think? |
I agree in priciniple but fear the changes you're suggesting require a lot of work.
The env var is only really there to minimise the numbers of changes required to get the test suite to run on masterminds/html5
I may be overcomplicating this but I don't understand the concept of an abstract parser in this case. The entire library is based around |
I will switch the default git branch from |
Resolves #828