Fix additional < followed by characters and EOF issues (#728)
This fixes these two cases:

* "<some thing thing" where "thing" is repeated twice which kicks up a
  parser error because it thinks it's a duplicated attribute
* "<some thing thing2 " where the space at the end causes an
  expected-end-of-tag-but-got-eof parser error to pop up

In both of these cases, we want the data to be treated as character
data--not a tag.
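
As a rough illustration of what "character data, not a tag" means for the output, the sketch below uses only the standard library's `html.escape`. It is not bleach's code path, and the `CASES` name is made up for this example. The input/expected pairs are the ones this commit adds to the tests.

```python
import html

# Input/expected pairs taken from the tests added in this commit.
CASES = [
    ("<some thing thing", "&lt;some thing thing"),
    ("<some thing thing2 ", "&lt;some thing thing2 "),
]

for text, expected in CASES:
    # Character data is serialized with "<" escaped, so the input survives
    # as visible text instead of being parsed as a tag.
    escaped = html.escape(text, quote=False)
    print(repr(text), "->", repr(escaped), escaped == expected)
```

The point is only that the leading `<` ends up as `&lt;` and the rest of the text, including a trailing space, is left alone.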
willkg committed Oct 25, 2024
1 parent 648a97d commit 8ee9fbd
Showing 2 changed files with 16 additions and 3 deletions.
15 changes: 12 additions & 3 deletions bleach/html5lib_shim.py
@@ -396,16 +396,25 @@ def __iter__(self):
# name that abruptly ends, but we should treat that like
# character data
yield {"type": TAG_TOKEN_TYPE_CHARACTERS, "data": self.stream.get_tag()}

elif last_error_token["data"] in (
+ "duplicate-attribute",
"eof-in-attribute-name",
"eof-in-attribute-value-no-quotes",
+ "expected-end-of-tag-but-got-eof",
):
# Handle the case where the text being parsed ends with <
- # followed by a series of characters and then space and then
- # more characters. It's treated as a tag name followed by an
+ # followed by characters and then space and then:
+ #
+ # * more characters
+ # * more characters repeated with a space between (e.g. "abc abc")
+ # * more characters and then a space and then an EOF (e.g. "abc def ")
+ #
+ # These cases are treated as a tag name followed by an
# attribute that abruptly ends, but we should treat that like
- # character data.
+ # character data instead.
yield {"type": TAG_TOKEN_TYPE_CHARACTERS, "data": self.stream.get_tag()}

else:
yield last_error_token
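
The branch above can be sketched in isolation as follows. This is a simplified stand-in under stated assumptions, not bleach's tokenizer: `recover_trailing_tag`, `pending_tag_text`, `EOF_TAG_ERRORS`, and the two type constants are hypothetical names, and tokens are plain dicts. Only the four error codes come from the diff.

```python
# Hypothetical stand-ins for the token type constants; not bleach's values.
CHARACTERS = "Characters"
PARSE_ERROR = "ParseError"

# The four parser error codes listed in the diff above.
EOF_TAG_ERRORS = frozenset({
    "duplicate-attribute",
    "eof-in-attribute-name",
    "eof-in-attribute-value-no-quotes",
    "expected-end-of-tag-but-got-eof",
})


def recover_trailing_tag(tokens, pending_tag_text):
    """Yield tokens, turning a final matching parse error into character data.

    pending_tag_text stands in for whatever raw text the tokenizer had
    consumed for the unfinished tag (self.stream.get_tag() in the diff).
    """
    last_error = None
    for token in tokens:
        if last_error is not None:
            # The held error was not the last token: pass it through as-is.
            yield last_error
            last_error = None
        if token["type"] == PARSE_ERROR:
            # Hold the error until we know whether it is the final token.
            last_error = token
        else:
            yield token
    if last_error is not None:
        if last_error["data"] in EOF_TAG_ERRORS:
            # Input ended inside what looked like a tag: emit it as text.
            yield {"type": CHARACTERS, "data": pending_tag_text}
        else:
            yield last_error


if __name__ == "__main__":
    stream = [
        {"type": CHARACTERS, "data": "hi "},
        {"type": PARSE_ERROR, "data": "duplicate-attribute"},
    ]
    for tok in recover_trailing_tag(stream, "<some thing thing"):
        print(tok)
```

Holding the error back until the end of the stream is what lets the sketch distinguish an error at EOF from one in the middle of the input, which is passed through unchanged.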

4 changes: 4 additions & 0 deletions tests/test_clean.py
@@ -167,6 +167,10 @@ def test_bare_entities_get_escaped_correctly(text, expected):
("<some thing", "&lt;some thing"),
# this is an eof-in-attribute-value-no-quotes parser error
("<some thing=foo", "&lt;some thing=foo"),
+ # this is a duplicate-attribute parser error
+ ("<some thing thing", "&lt;some thing thing"),
+ # this is an expected-end-of-tag-but-got-eof parser error
+ ("<some thing thing2 ", "&lt;some thing thing2 "),
],
)
def test_lessthan_escaping(text, expected):