Light cleanup on html_render_diff.py #145

Draft · wants to merge 7 commits into main

Conversation

@Mr0grog (Member) commented May 2, 2023

There’s a stupendous amount of cleanup work I’ve been meaning to circle back and do in html_render_diff.py for YEARS. This is a start.

I plan to keep this PR relatively narrowly focused on refactoring, removing/replacing vestigial code, and style fixes. I’m not going to make significant behavior changes here (e.g. potentially changing the “spacer” concept, which needs a major rethink). Changes like that need a lot more careful consideration and testing, and I need time to get my head back into this space in order to do that well.

Work in progress. Still a little more to do here, although I don’t want to bite off too much. I want to:

  • Look at removing _limit_spacers() and merging it into _insert_spacers() (good performance gain to be had there; see the first sketch after this list).

  • We tokenize in two steps that could potentially be one: flatten_el() turns a DOM tree into an iterator of (token_type, extra_info) tuples. Then fixup_chunks() turns those tuples into token objects to be diffed.

    # Then we split the document into text chunks for each tag, word, and end tag:
    chunks = flatten_el(body_el, skip_tag=True, include_hrefs=include_hrefs)
    # Finally re-joining them into token objects:
    return fixup_chunks(chunks, comparator)

    This was done back in the day when we used tokenization from inside lxml and added our own extras on top. We no longer do that, and it’s possible the code may be clearer if we combine these steps. It also might not be! I want to prototype this and see how it feels. (Important: I don’t think there’s a big performance gain to be had here, since the first step produces an iterator that we use pretty efficiently. We could probably reduce some object allocations for the tuples, though.) See the second sketch after this list.

  • Consider fixing the double parsing we do when tokenizing…

    diff_elements() calls _diffable_fragment(), which turns a soup DOM into a string, and passes that to _htmldiff():

    metadata, raw_diffs = _htmldiff(_diffable_fragment(old),
                                    _diffable_fragment(new),
                                    comparator,
                                    include)

    Then _htmldiff() calls tokenize() on that string:

    old_tokens = tokenize(old, comparator)
    new_tokens = tokenize(new, comparator)

    Then tokenize() calls parse_html() on the string to make it an etree DOM:

    if etree.iselement(html):
        body_el = html
    else:
        body_el = parse_html(html)
    # Then we split the document into text chunks for each tag, word, and end tag:
    chunks = flatten_el(body_el, skip_tag=True, include_hrefs=include_hrefs)

    One complexity here is that we start by dealing with a Beautiful Soup DOM, then the lower-level code is designed to deal with an etree DOM. The structure of these two is not the same, and that may introduce issues. etree leaves spaces mostly untouched, but Beautiful Soup does some space cleanup. etree elements also have a text and a tail, while Beautiful Soup deals with a mix of tags and strings (and other node types). This may introduce enough complexity that it should be tackled separately. Needs examination. (See the third sketch after this list.)
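
Roughly what I have in mind for the first item. This is only a sketch: SpacerToken, needs_spacer(), and MAX_SPACERS are stand-ins for whatever the real code uses, and the actual insertion logic is more involved.

    # Hypothetical sketch: cap the number of spacers while inserting them,
    # rather than building the full token list and stripping the excess in a
    # separate _limit_spacers() pass afterward.
    def _insert_spacers(tokens, max_spacers=MAX_SPACERS):
        count = 0
        for token in tokens:
            yield token
            if count < max_spacers and needs_spacer(token):
                count += 1
                yield SpacerToken()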
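
For the second item, combining the steps might look something like a single generator that walks the etree DOM and yields token objects directly. Again just a sketch, not the real implementation: make_token(), start_tag(), and end_tag() are placeholders for the actual token constructors, and href/whitespace handling is elided. (split_words() is the word splitter we inherited from lxml’s differ.)

    # Hypothetical sketch: one pass over the tree, yielding token objects
    # directly instead of (token_type, extra_info) tuples for fixup_chunks()
    # to convert afterward.
    def tokenize_el(el, comparator):
        for word in split_words(el.text or ''):
            yield make_token(word, comparator)
        for child in el:
            yield start_tag(child)
            yield from tokenize_el(child, comparator)
            yield end_tag(child)
            # In etree, text following a child element hangs off that child's tail.
            for word in split_words(child.tail or ''):
                yield make_token(word, comparator)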
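
And for the third item, one option is to tokenize the Beautiful Soup tree directly so the fragment never round-trips through a string and a second parse. This sketch glosses over the whitespace differences noted above, and the chunk shapes only loosely mirror flatten_el()’s real output.

    # Hypothetical sketch: emit flatten_el()-style chunks straight from a
    # Beautiful Soup node, skipping the serialize-and-reparse round trip.
    # Special elements (hrefs, imgs, etc.) are not handled here.
    from bs4 import NavigableString, Tag

    def flatten_soup_el(el, skip_tag=False):
        if not skip_tag:
            yield ('start_tag', el)
        for child in el.children:
            if isinstance(child, NavigableString):
                for word in str(child).split():
                    yield ('word', word)
            elif isinstance(child, Tag):
                yield from flatten_soup_el(child)
        if not skip_tag:
            yield ('end_tag', el)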

Trim spaces around docstring text, use consistent style for multiline docstrings.

In diffs that went over the maximum number of spacers, it turns out that the `_limit_spacers()` function stripped out important tag information! This fixes the issue, but introduces some performance overhead. To handle that, a follow-on change should consider:
1. Moving the spacer-limiting logic into `customize_tokens()` so we don't even create too many spacers in the first place.
2. Revisiting the whole spacer approach entirely. There may be better options now.

This resolves some not-useful warnings about invalid escapes that we were getting. Nothing should be escaped in here in the first place; it's a pure JavaScript string with no substitutions or dynamic values.

There's a big TODO about removing this when we finally fully forked lxml.html's differ. That happened a long time ago, and we did in fact make the changes that turned this into effectively wasted iteration/dead code. I ran a few tests over a variety of big and small diffs to make sure the code being removed here really doesn't do anything anymore, and that seems to be the case. Reading the logic, it also seems like this should be entirely vestigial and never wind up actually changing the tokens.

The only thing this function was doing was replacing `href_token` instances with `MinimalHrefToken`. We did this at a time when we were using parts of the tokenization internals from lxml instead of fully forking it. We have long since fully forked it, however, and we should just be creating `MinimalHrefToken` where we want them in the first place instead of looping through and replacing other tokens with them (see the sketch after these notes).

This is now more accurate to what the function does.
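
A sketch of that suggested follow-up, assuming the `('href', url)` chunk shape and the `pre_tags`/`trailing_whitespace` keyword arguments from lxml's original `fixup_chunks()` still apply to our fork:

    # Hypothetical sketch: build MinimalHrefToken directly when an href chunk
    # is turned into a token, instead of swapping out href_token instances in
    # a later replacement pass.
    def _token_for_href_chunk(chunk, pre_tags):
        assert chunk[0] == 'href'
        return MinimalHrefToken(chunk[1], pre_tags=pre_tags,
                                trailing_whitespace=' ')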
@Mr0grog (Member Author) commented May 2, 2023

Re: removing _limit_spacers(). This has a pretty big impact on DOMs with too many spacers, but not much of an impact otherwise. I was expecting the extra iteration, instance checking, etc. to be kind of expensive on large DOMs (that don’t have too many spacers), but it turns out they aren’t (I guess those DOMs just aren’t large enough to matter in the first place?). BUT once you start making too many spacers, this change has an extremely noticeable performance benefit.

So, there’s some value from that change, but only in the most extreme cases.

On the other hand, this suggests that a future where we remove the spacers altogether is also a future with much better overall performance than I’d expected. (That said, the actual diffing still takes the majority of the running time.)
