url_absolute fails with spaces in url #401

Open
jonthegeek opened this issue Aug 23, 2023 · 0 comments
Labels: bug (an unexpected problem or unintended behavior), url 👑

Comments

@jonthegeek

xml_attr(x, "href") returns URLs un-encoded if that is how they appear in the source, and url_absolute() then returns NA for those URLs.

url <- "/filename with spaces.pdf" 
xml2::url_absolute(
  url,
  base = "https://example.com/"
)
#> [1] NA
xml2::url_absolute(
  utils::URLencode(url),
  base = "https://example.com/"
)
#> [1] "https://example.com/filename%20with%20spaces.pdf"

Created on 2023-08-23 with reprex v2.0.2

url_absolute() silently returns NA when the URL contains spaces. It should at least warn the user, though it might be preferable to handle the encoding directly.
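
Until that happens, the workaround from the reprex can be wrapped up. This is only a rough sketch: `encode_url()` and `url_absolute_encoded()` are hypothetical helper names, not part of xml2 or rvest, and the sketch assumes that percent-encoding with utils::URLencode() before resolving is acceptable for the URLs in question.

```r
# Hypothetical helpers; neither function exists in xml2 or rvest.

# Percent-encode each non-missing URL, leaving NA entries untouched.
encode_url <- function(x) {
  encoded <- x
  ok <- !is.na(x)
  encoded[ok] <- vapply(
    x[ok],
    utils::URLencode,
    character(1),
    USE.NAMES = FALSE
  )
  encoded
}

# Encode first, then resolve against `base`. Warn instead of silently
# returning NA when a non-missing input still cannot be resolved.
url_absolute_encoded <- function(x, base) {
  out <- xml2::url_absolute(encode_url(x), base = base)
  failed <- !is.na(x) & is.na(out)
  if (any(failed)) {
    warning(
      sum(failed), " URL(s) could not be resolved and were returned as NA.",
      call. = FALSE
    )
  }
  out
}
```

Calling `url_absolute_encoded("/filename with spaces.pdf", base = "https://example.com/")` runs the same two steps as the second call in the reprex above. As far as I can tell, utils::URLencode() with its default `repeated = FALSE` leaves a URL unchanged if it already contains a percent-encoded sequence, so hrefs that are already encoded should not be double-encoded.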

This is where I found it in the wild:

base_url <- "https://www.copyright.gov/fair-use/fair-index.html"

pdf_urls <-
  rvest::read_html(base_url) |> 
  rvest::html_element("table") |> 
  rvest::html_elements("tr>td:first-of-type>a:first-of-type") |>
  rvest::html_attr("href")

pdf_urls[[10]] |> 
  rvest::url_absolute(base_url)
#> [1] NA

pdf_urls[[10]] |> 
  utils::URLencode() |> 
  rvest::url_absolute(base_url)
#> [1] "https://www.copyright.gov/fair-use/summaries/ONeil%20v.%20Ratajkowski%20No.%2019%20CIV.%209769%20(S.D.N.Y.%202021).pdf"

Created on 2023-08-23 with reprex v2.0.2

hadley added the bug (an unexpected problem or unintended behavior) and url 👑 labels on Oct 30, 2023