Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atom: use correct xml:base for decoded elements #222

Merged
merged 3 commits into from
Mar 1, 2024

Conversation

cristoper
Copy link
Contributor

The issue is that goxpp's DecodeElement() pop's the BaseStack after unmarshaling an element -- as it should to keep the BaseStack in sync with the current document scope. But that means the atom parser needs to keep a reference to the current base before calling DecodeElement() so that it can correctly resolve URLs.

Without this fix, elements with xml:base attributes will be erroneously resolved with the parent xml:base.

This fix is a little awkward with all of its saving and restoring of BaseStacks. A simpler solution would be to simply get the xml:base attribute from the decoded element and resolve against that. However, that would require changes to goxpp: either change XmlBaseResolveUrl() to accept a string rather than acting as a method, or maybe adding a separate ResolveUrl() function that takes a base and a relative URL that gofeed could use.

In order to keep tracking xml:base correctly, the goxpp's
`DecodeElement` pops the BaseStack if the start element added a base (if
any).

That means the atom parser needs keep track of the base *before* calling
`DecodeElement` to use for resolving relative URLs within the decoded
element.

Without this fix, elements with xml:base attributes will be erroneously
resolved with the parent xml:base.
@cristoper cristoper mentioned this pull request Feb 24, 2024
@mmcdole
Copy link
Owner

mmcdole commented Feb 28, 2024

This fix is a little awkward with all of its saving and restoring of BaseStacks. A simpler solution would be to simply get the xml:base attribute from the decoded element and resolve against that. However, that would require changes to goxpp: either change XmlBaseResolveUrl() to accept a string rather than acting as a method, or maybe adding a separate ResolveUrl() function that takes a base and a relative URL that gofeed could use.

Maybe this makes sense @cristoper ?

You are suggesting that in goxpp we have:

func (p *XMLPullParser) ResolveUrl(baseURL, relativeUrl string) (string, error) {
    // Logic to resolve relativeUrl against baseURL + the current BaseStack URLs?
}

Then in Atom Parser, something like this?:

func (ap *Parser) parseAtomText(p *xpp.XMLPullParser) (string, error) {
    var text struct {
        Type     string `xml:"type,attr"`
        Mode     string `xml:"mode,attr"`
        Base     string `xml:"base,attr"` // Added to capture xml:base attribute
        InnerXML string `xml:",innerxml"`
    }

    // DecodeElement is used to unmarshal the element
    err := p.DecodeElement(&text)
    if err != nil {
        return "", err
    }

    // Use the captured xml:base if present; otherwise, use the current base URL
    baseURL := text.Base
    if baseURL == "" {
        baseURL = p.GetCurrentBaseURL() // Fallback to current base URL from the parser
    }

    // Resolve URLs in InnerXML using the determined base URL
    resolvedInnerXML, err := p.ResolveURL(baseURL, text.InnerXML)
    if err != nil {
        return "", err
    }

    return strings.TrimSpace(resolvedInnerXML), nil
}

@cristoper
Copy link
Contributor Author

Yes, that's approximately what I had in mind. I just pushed an update to the PR which does similar but copies XmlBaseResolveUrl to gofeed so that we don't have to change goxpp.

However, if you'd prefer not duplicating code from goxpp we could instead make the change there (even by introducing a new function so we don't make any backward-incompatible change to the goxpp API).

This provides an equivalent fix that doesn't do any inelegant swapping
out of the BaseStack. It also doesn't change `goxpp`'s public API by
essentially copying `XmlBaseResolveUrl` to `gofeed`.
cristoper added a commit to cristoper/gofeed that referenced this pull request Feb 29, 2024
PR mmcdole#220 introduced a failing test for detecting images in the "content"
element. It should instead be testing the "content:encoded" element. But
that uncovered an issue with how extensions were being detected (the
"content" namespace was being detected as an extension namespace).

As a more robust way of checking for the "content" namespace, this PR
exposes `shared.PrefixForNamspace()` as a public function so it can be
used in the rss parser. This provides a solution to PR mmcdole#211 (and
includes @JLugagne's test case from that PR).

Once the fixes to xml:base handling in mmcdole#222 are merged, this should fix
the remaining failing test reported in mmcdole#210.
cristoper added a commit to cristoper/gofeed that referenced this pull request Feb 29, 2024
PR mmcdole#220 introduced a failing test for detecting images in the "content"
element. It should instead be testing the "content:encoded" element. But
that uncovered an issue with how extensions were being detected (the
"content" namespace was being detected as an extension namespace).

As a more robust way of checking for the "content" namespace, this PR
exposes `shared.PrefixForNamspace()` as a public function so it can be
used in the rss parser. This provides a solution to PR mmcdole#211 (and
includes a test based on @JLugagne's test case from that PR).

Once the fixes to xml:base handling in mmcdole#222 are merged, this should fix
the remaining failing test reported in mmcdole#210.
cristoper added a commit to cristoper/gofeed that referenced this pull request Feb 29, 2024
PR mmcdole#220 introduced a failing test for detecting images in the "content"
element. It should instead be testing the "content:encoded" element. But
that uncovered an issue with how extensions were being detected (the
"content" namespace was being detected as an extension namespace).

As a more robust way of checking for the "content" namespace, this PR
exposes `shared.PrefixForNamspace()` as a public function so it can be
used in the rss parser. This should also fix PR mmcdole#211 (and includes
@JLugagne's test case from that PR).

Once the fixes to xml:base handling in mmcdole#222 are merged, this should fix
the remaining failing test reported in mmcdole#210.
mmcdole pushed a commit that referenced this pull request Mar 1, 2024
PR #220 introduced a failing test for detecting images in the "content"
element. It should instead be testing the "content:encoded" element. But
that uncovered an issue with how extensions were being detected (the
"content" namespace was being detected as an extension namespace).

As a more robust way of checking for the "content" namespace, this PR
exposes `shared.PrefixForNamspace()` as a public function so it can be
used in the rss parser. This should also fix PR #211 (and includes
@JLugagne's test case from that PR).

Once the fixes to xml:base handling in #222 are merged, this should fix
the remaining failing test reported in #210.
Copy link
Owner

@mmcdole mmcdole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super clean, love this approach.

@mmcdole mmcdole merged commit 9455e2b into mmcdole:master Mar 1, 2024
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants