Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow � values #496

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft

Allow � values #496

wants to merge 1 commit into from

Conversation

sashka
Copy link
Contributor

@sashka sashka commented Oct 26, 2022

Do not return error on parsing preperly escaped zero values, which are perfectly valid. Add simple test to check it.

Do not return error on parsing preperly escaped zero values.
Add simple test to check it.
@dralley
Copy link
Collaborator

dralley commented Oct 26, 2022

@codecov-commenter
Copy link

Codecov Report

Merging #496 (aa932fd) into master (0c6eeb0) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #496      +/-   ##
==========================================
- Coverage   61.70%   61.67%   -0.04%     
==========================================
  Files          33       33              
  Lines       15343    15337       -6     
==========================================
- Hits         9468     9459       -9     
- Misses       5875     5878       +3     
Flag Coverage Δ
unittests 61.67% <100.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/escapei.rs 12.40% <100.00%> (-0.14%) ⬇️
src/de/mod.rs 66.95% <0.00%> (-0.29%) ⬇️
src/lib.rs 13.17% <0.00%> (-0.08%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@sashka
Copy link
Contributor Author

sashka commented Oct 26, 2022

According to https://www.w3.org/TR/REC-xml/ the value of &#x0; has never stated invalid.
Got it: in TR XML 1.1 there's a statement "#x0 is still forbidden both directly and as a character reference."

I'll close this PR and will vendor a fork for internal needs then.

@sashka sashka closed this Oct 26, 2022
@dralley
Copy link
Collaborator

dralley commented Oct 26, 2022

Do your internal needs require handling a lot of documents with that issue? Can you speak about the source & purpose of said documents? The "API concerns" I have to imagine are related to C-strings (with null termination) and not strictly relevant to Rust. However, there would be nothing preventing Rust from reading such documents, even though we shouldn't allow them to be output.

(That is not to say that quick-xml should or will support them, but it could be discussed. I don't know what other XML libraries are doing but I imagine Java and C# ones may not always implement the checks since they use UTF-16 encodings)

@sashka
Copy link
Contributor Author

sashka commented Oct 26, 2022

There are lots of huge CPU hardware-related documents (dozens of gigabytes), and there are properly escaped nulls. Minor example of the same kind is Apple's keyboard layout XML.

In both cases I have to read those XMLs exclusively by Rust, and I never write XML, so in my practice I would never face an error on attempt to write NULL.

On the other hand, I may put read-write for NULL under a feature flag to allow this use case when a user (like me or my engineers) sure they needs that.

@dralley
Copy link
Collaborator

dralley commented Oct 27, 2022

@Mingun (and @tafia), what are your thoughts about allowing nominally specification-breaking, but widespread behavior like this?

Is this something that might be acceptable under a feature flag?

@Mingun
Copy link
Collaborator

Mingun commented Oct 27, 2022

Yes, I actually does not understand, why XML forbid character references to some characters in the first place, and I'm glad to allow them. This ability under a feature flag would be welcome.

@sashka, can you add a properly documented feature flag (say, allow-restricted-chars) and add tests for all restricted characters (as defined in the rule RestrictedChar in https://www.w3.org/TR/xml11/#charsets) + &#0;? If we actually already allow some of that ranges (I looks like we allow), then I think we should forbid they usage without a feature flag to be standard compatible. Then rename EntityWithNull to RestrictedChar, add a doc reference to the standard. We use XML 1.1

@Mingun Mingun reopened this Oct 27, 2022
@dralley
Copy link
Collaborator

dralley commented Oct 27, 2022

Let's open an issue and collect a bit more information before we merge anything - for instance, what do other XML libraries do.

It looks like nokogiri may have at one point just stripped these characters out instead of raising an error. However, attempting to write XML with those characters will raise a fatal error.

On the other hand it looks like both Python's ElementTree and libxml2 raise a hard error even while parsing

In [24]: xml = """<?xml version="1.0"?>
    ...: <data>
    ...:     <country>foobar &#0;</country>
    ...: </data>
    ...: """

In [25]: libxml2.parseDoc(xml)
Entity: line 3: parser error : xmlParseCharRef: invalid xmlChar value 0
    <country>foobar &#0;</country>
                        ^
In [3]: ET.fromstring(xml)
Traceback (most recent call last):

  Input In [3] in <cell line: 1>
    ET.fromstring(xml)

ParseError: reference to invalid character number: line 3, column 20

I believe Python's XML parsing is based on Expat, so between that and libxml2, that covers 2 of the most-used XML libraries.

Are the NUL characters in these documents "load-bearing"? Would they be unsafe to strip out?

@Mingun
Copy link
Collaborator

Mingun commented Oct 27, 2022

The following behaviors are possible:

  1. emit hard error (= stop parsing / writing and return error)
  2. silently skip invalid characters (or at least log them, but we currently does not depend on any logger)
  3. accept invalid characters

Each behavior could be applied to a raw or an escaped form of characters.

I think, that behavior 2 is an anti-pattern and we should not allow it. I cannot imagine a situation where you would need to silently change the data that you operate. If you for whatever reason need that, then you can done that manually in Writer API and we can add an API in serde Serializer for that.

The behaviors 1 and 3 could be chosen either via feature flag, or via a run-time option. I'll agree with any of the options or even a combination of them.

I didn't have situations where I needed a dynamic choice of this option, so the feature flag should be enough in the first stage and easier to implement.

The choice between support only escaped form, or both raw and escaped form also could be leaved as implementation details in the first stage. Ah, and I suspect, that currently we forbid only escaped form of &#0;, but not raw \0.

There is also a technical question for a feature flag: it should allow restricted chars, forbidden without them, or it should restrict chars, allowed without them? In both cases the flag is additive is some sense, but I think that the allowing behavior is better. Of course, we could have both flags, but does not allow to enable them simultaneously, but I think that is overkill for now.

Any thoughts?

@Mingun
Copy link
Collaborator

Mingun commented Oct 27, 2022

Let's open an issue and collect a bit more information before we merge anything - for instance, what do other XML libraries do.

Actually, you've already open it -- #368.

@sashka
Copy link
Contributor Author

sashka commented Oct 27, 2022

Surprisingly, the only restricted character which is actually restricted is the escaped NULL:

fn test_restricted_chars_unescape() {
    assert_eq!(unescape("\u{0}").unwrap(), "\u{0}");
    // assert_eq!(unescape("&#x0;").unwrap(), "\u{0}"); // <-- this assert will fail with version 0.26.0

    assert_eq!(unescape("\u{1}").unwrap(), "\u{1}");
    assert_eq!(unescape("&#x1;").unwrap(), "\u{1}");
    assert_eq!(unescape("\u{2}").unwrap(), "\u{2}");
    assert_eq!(unescape("&#x2;").unwrap(), "\u{2}");
    assert_eq!(unescape("\u{3}").unwrap(), "\u{3}");
    assert_eq!(unescape("&#x3;").unwrap(), "\u{3}");
    assert_eq!(unescape("\u{4}").unwrap(), "\u{4}");
    assert_eq!(unescape("&#x4;").unwrap(), "\u{4}");
    assert_eq!(unescape("\u{5}").unwrap(), "\u{5}");
    assert_eq!(unescape("&#x5;").unwrap(), "\u{5}");
    assert_eq!(unescape("\u{6}").unwrap(), "\u{6}");
    assert_eq!(unescape("&#x6;").unwrap(), "\u{6}");
    assert_eq!(unescape("\u{7}").unwrap(), "\u{7}");
    assert_eq!(unescape("&#x7;").unwrap(), "\u{7}");
    assert_eq!(unescape("\u{8}").unwrap(), "\u{8}");
    assert_eq!(unescape("&#x8;").unwrap(), "\u{8}");

    assert_eq!(unescape("\u{b}").unwrap(), "\u{b}");
    assert_eq!(unescape("&#xB;").unwrap(), "\u{b}");
    assert_eq!(unescape("\u{c}").unwrap(), "\u{c}");
    assert_eq!(unescape("&#xC;").unwrap(), "\u{c}");

    assert_eq!(unescape("\u{e}").unwrap(), "\u{e}");
    assert_eq!(unescape("&#xE;").unwrap(), "\u{e}");
    assert_eq!(unescape("\u{f}").unwrap(), "\u{f}");
    assert_eq!(unescape("&#xF;").unwrap(), "\u{f}");
    assert_eq!(unescape("\u{10}").unwrap(), "\u{10}");
    assert_eq!(unescape("&#x10;").unwrap(), "\u{10}");
    assert_eq!(unescape("\u{11}").unwrap(), "\u{11}");
    assert_eq!(unescape("&#x11;").unwrap(), "\u{11}");
    assert_eq!(unescape("\u{12}").unwrap(), "\u{12}");
    assert_eq!(unescape("&#x12;").unwrap(), "\u{12}");
    assert_eq!(unescape("\u{13}").unwrap(), "\u{13}");
    assert_eq!(unescape("&#x13;").unwrap(), "\u{13}");
    assert_eq!(unescape("\u{14}").unwrap(), "\u{14}");
    assert_eq!(unescape("&#x14;").unwrap(), "\u{14}");
    assert_eq!(unescape("\u{15}").unwrap(), "\u{15}");
    assert_eq!(unescape("&#x15;").unwrap(), "\u{15}");
    assert_eq!(unescape("\u{16}").unwrap(), "\u{16}");
    assert_eq!(unescape("&#x16;").unwrap(), "\u{16}");
    assert_eq!(unescape("\u{17}").unwrap(), "\u{17}");
    assert_eq!(unescape("&#x17;").unwrap(), "\u{17}");
    assert_eq!(unescape("\u{18}").unwrap(), "\u{18}");
    assert_eq!(unescape("&#x18;").unwrap(), "\u{18}");
    assert_eq!(unescape("\u{19}").unwrap(), "\u{19}");
    assert_eq!(unescape("&#x19;").unwrap(), "\u{19}");
    assert_eq!(unescape("\u{1a}").unwrap(), "\u{1a}");
    assert_eq!(unescape("&#x1A;").unwrap(), "\u{1a}");
    assert_eq!(unescape("\u{1b}").unwrap(), "\u{1b}");
    assert_eq!(unescape("&#x1B;").unwrap(), "\u{1b}");
    assert_eq!(unescape("\u{1c}").unwrap(), "\u{1c}");
    assert_eq!(unescape("&#x1C;").unwrap(), "\u{1c}");
    assert_eq!(unescape("\u{1d}").unwrap(), "\u{1d}");
    assert_eq!(unescape("&#x1D;").unwrap(), "\u{1d}");
    assert_eq!(unescape("\u{1e}").unwrap(), "\u{1e}");
    assert_eq!(unescape("&#x1E;").unwrap(), "\u{1e}");
    assert_eq!(unescape("\u{1f}").unwrap(), "\u{1f}");
    assert_eq!(unescape("&#x1F;").unwrap(), "\u{1f}");

    assert_eq!(unescape("\u{7f}").unwrap(), "\u{7f}");
    assert_eq!(unescape("&#x7F;").unwrap(), "\u{7f}");
    assert_eq!(unescape("\u{80}").unwrap(), "\u{80}");
    assert_eq!(unescape("&#x80;").unwrap(), "\u{80}");
    assert_eq!(unescape("\u{81}").unwrap(), "\u{81}");
    assert_eq!(unescape("&#x81;").unwrap(), "\u{81}");
    assert_eq!(unescape("\u{82}").unwrap(), "\u{82}");
    assert_eq!(unescape("&#x82;").unwrap(), "\u{82}");
    assert_eq!(unescape("\u{83}").unwrap(), "\u{83}");
    assert_eq!(unescape("&#x83;").unwrap(), "\u{83}");
    assert_eq!(unescape("\u{84}").unwrap(), "\u{84}");
    assert_eq!(unescape("&#x84;").unwrap(), "\u{84}");

    assert_eq!(unescape("\u{86}").unwrap(), "\u{86}");
    assert_eq!(unescape("&#x86;").unwrap(), "\u{86}");
    assert_eq!(unescape("\u{87}").unwrap(), "\u{87}");
    assert_eq!(unescape("&#x87;").unwrap(), "\u{87}");
    assert_eq!(unescape("\u{88}").unwrap(), "\u{88}");
    assert_eq!(unescape("&#x88;").unwrap(), "\u{88}");
    assert_eq!(unescape("\u{89}").unwrap(), "\u{89}");
    assert_eq!(unescape("&#x89;").unwrap(), "\u{89}");
    assert_eq!(unescape("\u{8a}").unwrap(), "\u{8a}");
    assert_eq!(unescape("&#x8A;").unwrap(), "\u{8a}");
    assert_eq!(unescape("\u{8b}").unwrap(), "\u{8b}");
    assert_eq!(unescape("&#x8B;").unwrap(), "\u{8b}");
    assert_eq!(unescape("\u{8c}").unwrap(), "\u{8c}");
    assert_eq!(unescape("&#x8C;").unwrap(), "\u{8c}");
    assert_eq!(unescape("\u{8d}").unwrap(), "\u{8d}");
    assert_eq!(unescape("&#x8D;").unwrap(), "\u{8d}");
    assert_eq!(unescape("\u{8e}").unwrap(), "\u{8e}");
    assert_eq!(unescape("&#x8E;").unwrap(), "\u{8e}");
    assert_eq!(unescape("\u{8f}").unwrap(), "\u{8f}");
    assert_eq!(unescape("&#x8F;").unwrap(), "\u{8f}");
    assert_eq!(unescape("\u{90}").unwrap(), "\u{90}");
    assert_eq!(unescape("&#x90;").unwrap(), "\u{90}");
    assert_eq!(unescape("\u{91}").unwrap(), "\u{91}");
    assert_eq!(unescape("&#x91;").unwrap(), "\u{91}");
    assert_eq!(unescape("\u{92}").unwrap(), "\u{92}");
    assert_eq!(unescape("&#x92;").unwrap(), "\u{92}");
    assert_eq!(unescape("\u{93}").unwrap(), "\u{93}");
    assert_eq!(unescape("&#x93;").unwrap(), "\u{93}");
    assert_eq!(unescape("\u{94}").unwrap(), "\u{94}");
    assert_eq!(unescape("&#x94;").unwrap(), "\u{94}");
    assert_eq!(unescape("\u{95}").unwrap(), "\u{95}");
    assert_eq!(unescape("&#x95;").unwrap(), "\u{95}");
    assert_eq!(unescape("\u{96}").unwrap(), "\u{96}");
    assert_eq!(unescape("&#x96;").unwrap(), "\u{96}");
    assert_eq!(unescape("\u{97}").unwrap(), "\u{97}");
    assert_eq!(unescape("&#x97;").unwrap(), "\u{97}");
    assert_eq!(unescape("\u{98}").unwrap(), "\u{98}");
    assert_eq!(unescape("&#x98;").unwrap(), "\u{98}");
    assert_eq!(unescape("\u{99}").unwrap(), "\u{99}");
    assert_eq!(unescape("&#x99;").unwrap(), "\u{99}");
    assert_eq!(unescape("\u{9a}").unwrap(), "\u{9a}");
    assert_eq!(unescape("&#x9A;").unwrap(), "\u{9a}");
    assert_eq!(unescape("\u{9b}").unwrap(), "\u{9b}");
    assert_eq!(unescape("&#x9B;").unwrap(), "\u{9b}");
    assert_eq!(unescape("\u{9c}").unwrap(), "\u{9c}");
    assert_eq!(unescape("&#x9C;").unwrap(), "\u{9c}");
    assert_eq!(unescape("\u{9d}").unwrap(), "\u{9d}");
    assert_eq!(unescape("&#x9D;").unwrap(), "\u{9d}");
    assert_eq!(unescape("\u{9e}").unwrap(), "\u{9e}");
    assert_eq!(unescape("&#x9E;").unwrap(), "\u{9e}");
    assert_eq!(unescape("\u{9f}").unwrap(), "\u{9f}");
    assert_eq!(unescape("&#x9F;").unwrap(), "\u{9f}");
}

@dralley
Copy link
Collaborator

dralley commented Oct 27, 2022

It appears that C#'s dotnet XML library has a flag for this: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreadersettings.checkcharacters?view=net-6.0

Same for writing: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlwritersettings.checkcharacters?view=net-6.0

Surprisingly, the only restricted character which is actually restricted is the escaped NULL:

I'm not surprised given I filed #368. It's something we needed to improve anyways. Do you have any avenue to report the invalid XML? Obviously that's not an immediate solution, I am just curious.

@sashka
Copy link
Contributor Author

sashka commented Oct 27, 2022

I'll keep all the work for allow-restricted-chars right there: https://github.com/sashka/quick-xml/tree/allow-restricted-chars

Do you have any avenue to report the invalid XML? Obviously that's not an immediate solution, I am just curious.

No, vendors don't care about general XML validity as soon as their XML files are not for public distribution, and their own tools process their own files perfectly.

@dralley
Copy link
Collaborator

dralley commented Oct 28, 2022

The following behaviors are possible:

  • emit hard error (= stop parsing / writing and return error)
  • silently skip invalid characters (or at least log them, but we currently does not depend on any logger)
  • accept invalid characters

Each behavior could be applied to a raw or an escaped form of characters.

I think, that behavior 2 is an anti-pattern and we should not allow it. I cannot imagine a situation where you would need to silently change the data that you operate. If you for whatever reason need that, then you can done that manually in Writer API and we can add an API in serde Serializer for that.

Agree that we shouldn't skip over invalid data. Seems like a terrible approach.

The behaviors 1 and 3 could be chosen either via feature flag, or via a run-time option. I'll agree with any of the options or even a combination of them.

I didn't have situations where I needed a dynamic choice of this option, so the feature flag should be enough in the first stage and easier to implement.

The choice between support only escaped form, or both raw and escaped form also could be leaved as implementation details in the first stage. Ah, and I suspect, that currently we forbid only escaped form of �, but not raw \0.

There is also a technical question for a feature flag: it should allow restricted chars, forbidden without them, or it should restrict chars, allowed without them? In both cases the flag is additive is some sense, but I think that the allowing behavior is better. Of course, we could have both flags, but does not allow to enable them simultaneously, but I think that is overkill for now.

Any thoughts?

An initial conservative approach could be to just provide an advanced version of the various escape / unescape functions which would allow you to configure the behavior either way at runtime. I think I would prefer that approach over a feature flag.

Agreed that we're still missing the checks for these characters in the raw form and that we ought to have them.

Reading through the MS docs again, it appears that CheckCharacters applies only to character entities. Names and raw text still need to be completely valid and contain no illegal characters no matter what - and it's checked always. @sashka I assume all of the illegal characters are at least encoded as character references and not present in raw form? If that's the case then copying that approach would be sufficient. And as the configuration would only apply to (un)escaping, there wouldn't be issues with needing a pseudo-global configuration.

@dralley
Copy link
Collaborator

dralley commented Nov 9, 2022

Does that sound right @sashka ?

@sashka
Copy link
Contributor Author

sashka commented Nov 10, 2022

@dralley, yes, it sounds right for me. I'm working on proper reading mechanics now.

@dralley
Copy link
Collaborator

dralley commented Jul 10, 2023

@sashka Do you plan to continue with this?

@dralley dralley marked this pull request as draft July 10, 2023 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants