`read_html()` doesn't report parsing failure on very very long lines #440

hadley · 2024-02-27T14:23:07Z

library(xml2)

path <- tempfile()

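# One 12-million-character line, longer than the 10,000,000-character
# truncation point reported in tidyverse/rvest#399
long <- paste0("start", strrep("x", 12e6), "end")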
nchar(long)
#> [1] 12000008

cat(
  "<html><body>\n<script type=\"application/json\">",
  long,
  "</script>\n</body></html>\n",
  file = path,
  sep = ""
)

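# read_html() returns without any warning or error
html <- read_html(path)
# read_xml() on the same file does report a problem
xml <- read_xml(path)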
#> Warning in read_xml.character(path): xmlSAX2Characters: huge text nod [2]
#> Error in read_xml.character(path): Extra content at the end of the document [5]

Created on 2024-02-27 with reprex v2.1.0

From tidyverse/rvest#399
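
Until `read_html()` reports the failure itself, one way to notice a silently shortened node is to check the parsed text against something known about the input. The sketch below is only an illustration: `script_ends_with()` is a hypothetical helper, not part of xml2, and it assumes the payload sits in the first `<script>` element and ends with a known marker, as in the reprex above. It uses a small file so it runs quickly; the same call could be pointed at the 12e6-character file.

```r
library(xml2)

# Hypothetical helper (not part of xml2): parse an HTML file and check that
# the text of the first <script> element still ends with a known marker.
script_ends_with <- function(path, marker) {
  html <- read_html(path)
  node <- xml_find_first(html, "//script")
  # xml_find_first() returns an "xml_missing" node when nothing matches
  if (inherits(node, "xml_missing")) {
    return(FALSE)
  }
  endsWith(xml_text(node), marker)
}

# Small file for illustration; the reprex uses a 12e6-character payload.
small_path <- tempfile(fileext = ".html")
cat(
  "<html><body>\n<script type=\"application/json\">",
  "start", strrep("x", 100), "end",
  "</script>\n</body></html>\n",
  file = small_path,
  sep = ""
)

script_ends_with(small_path, "end")
```

A `FALSE` result on a file known to end with the marker would suggest the node was truncated or dropped during parsing, even though `read_html()` raised nothing.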

hadley mentioned this issue Feb 27, 2024

Long lines truncated at 10,000,000 chars. tidyverse/rvest#399

Closed

hadley added the bug (an unexpected problem or unintended behavior) label Oct 21, 2024
