-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proper normalize attribute value normalization #379
base: master
Are you sure you want to change the base?
Conversation
e45064f
to
401bb77
Compare
Suggestions :
TODO:
|
401bb77
to
9307786
Compare
538e5cd
to
ac7b67b
Compare
Codecov Report
@@ Coverage Diff @@
## master #379 +/- ##
==========================================
+ Coverage 61.37% 61.48% +0.10%
==========================================
Files 20 20
Lines 10157 10229 +72
==========================================
+ Hits 6234 6289 +55
- Misses 3923 3940 +17
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
I plan to continue working on this over the next week or two |
f206a71
to
1a138d6
Compare
@Mingun Questions: Currently we have the functions
|
|
I'm basically going off of the lack of any kind of discussion of attribute value normalization in the HTML living spec, and this discussion on stackoverflow https://html.spec.whatwg.org/multipage/dom.html#attributes
I agree
Can they? It looks like all these fields are private. But, I feel like this is still the best option. There is already The unfortunate thing will be, that the code between the two is almost the same, just enough so that it will be really annoying to duplicate. |
08c0eea
to
00a37a0
Compare
src/events/attributes.rs
Outdated
let codepoint = escapei::parse_number(entity, idx..end)?; | ||
escapei::push_utf8(&mut normalized, codepoint); | ||
} else if let Some(value) = custom_entities.and_then(|hm| hm.get(entity)) { | ||
// TODO: recursively apply entity substitution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Mingun Does the normal unescape()
function need to do this as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so
src/events/attributes.rs
Outdated
/// This will allocate unless the raw attribute value does not require normalization. | ||
/// | ||
/// See also [`normalized_value_with_custom_entities()`](#method.normalized_value_with_custom_entities) | ||
pub fn normalized_value(&'a self) -> Result<Cow<'a, [u8]>, EscapeError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, it will be valuable to add examples here. Actually, probably you can convert your new tests to doc examples and that will be enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we should add a decoder parameter to that (and similar) functions and decode first before normalization.
Maybe also try another approach: introduce a new type of always UTF-8 encoded attributes and add a Attribute::decode(&self) -> Result<Utf8Attribute>
function.
Instead of completely new type we could try to use const bool
generic parameter (stable since 1.51, current MSRV is 1.41.1, from memchr
)
src/events/attributes.rs
Outdated
let codepoint = escapei::parse_number(entity, idx..end)?; | ||
escapei::push_utf8(&mut normalized, codepoint); | ||
} else if let Some(value) = custom_entities.and_then(|hm| hm.get(entity)) { | ||
// TODO: recursively apply entity substitution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can they? It looks like all these fields are private.
No, they can't. I confused with the ability to disable with_checks
in the middle of an iteration.
But, I feel like this is still the best option. There is already
attributes()
andhtml_attributes()
, it's only a matter of changing the types.
Ok, then let's do that.
The unfortunate thing will be, that the code between the two is almost the same, just enough so that it will be really annoying to duplicate.
If you talk about make_normalized_value
, you can convert it to a free function and use it in both API methods.
c0f0577
to
34c2f72
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've fixed the algorithm and other noted things in my branch, but I do not yet consider it as finished.
benches/microbenches.rs
Outdated
|
||
group.bench_function("noop_long", |b| { | ||
b.iter(|| { | ||
criterion::black_box(unescape("just a bit of text without any entities")).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe made long really long? A 1KB at least
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would hope that 1kb attribute values isn't a common thing. But you're right that it should be a bit longer.
34c2f72
to
3874791
Compare
I did a cursory review of the changes and it looks fine. Would you prefer the commits squashed or kept separate? I'll address everything else tomorrow. |
I prefer to kept separate. It is somehow psychologically uncomfortable for review when one commit changes more than ~200 lines in each (or just several) files, even if half of them -- new tests. ¯_(ツ)_/¯ I think, that at least separating normalization method to it's own commit would be a good idea. This is pretty isolated thing which, however, is big enough. That commit needed to be updated:
|
7f55cd8
to
add31b6
Compare
add31b6
to
a72a441
Compare
f317d76
to
69a1934
Compare
69a1934
to
ff42db2
Compare
ff42db2
to
deed851
Compare
deed851
to
def940d
Compare
def940d
to
5817baf
Compare
How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider). Should we consider keeping the state in
Also I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached. Should that be two separate errors in your opinion, or one error? |
@Mingun ^ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider).
Most naturally would be have a new option in reader::Config
. That mean, we need somehow propagate it to the actual method. Some methods already takes Reader
, so it will simple for them. Maybe we just need to start from only those methods and add shortcuts only when them explicitly will requested.
Actually, I already though about storing Decoder
in the attribute itself (but because currently Attribute
is a struct with public fields it will be a breaking change and I not very like the idea of making that new decoder
field public, because it is implementation detail. Maybe in the end we will store already decoded data)
Also I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached.
Yes, of course we should return error as soon as we found loop or if recursion limit was exceeded.
Should that be two separate errors in your opinion, or one error?
Two different. libxml2 also have two different errors, as you could notice from your link: one is "Detected an entity reference loop", other is "Maximum entity nesting depth exceeded"
Obviously I haven't gotten around to finishing this If you feel inclined to do so, I wouldn't mind if you picked it up. If not, I'll get around to it eventually, I just haven't been doing a lot of coding in my free time and what I have done, was on a different project. |
9776b10
to
d3f5fbe
Compare
d3f5fbe
to
c2d2fbd
Compare
c2d2fbd
to
5c0d5d6
Compare
closes #371