Light cleanup on html_render_diff.py #145
base: main
Conversation
Trim spaces around docstring text, use consistent style for multiline docstrings.
In diffs that went over the maximum number of spacers, it turns out that the `_limit_spacers()` function stripped out important tag information! This fixes the issue, but introduces some performance overhead. To handle that, a follow-on change should consider:

1. Moving the spacer-limiting logic into `customize_tokens()` so we don't even create too many spacers in the first place.
2. Revisiting the whole spacer approach; there may be better approaches now.
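To illustrate option 1, here's a minimal sketch of capping spacers at insertion time instead of trimming them afterward. The function name and `max_spacers` parameter are illustrative only; the real `_insert_spacers()`/`_limit_spacers()` in this module work on actual diff tokens and have different signatures.

```python
# A sketch of option 1 above: cap spacers as they are inserted, so there
# is never an oversized list to trim after the fact. Names here are
# illustrative, not the module's real API.
SPACER = object()  # stand-in for the real spacer token

def insert_spacers_capped(tokens, max_spacers):
    """Yield tokens with a spacer before each one, up to `max_spacers`.

    Because excess spacers are never created, there is no later trimming
    step that could accidentally strip information from real tokens.
    """
    inserted = 0
    for token in tokens:
        if inserted < max_spacers:
            yield SPACER
            inserted += 1
        yield token

result = list(insert_spacers_capped(['a', 'b', 'c', 'd', 'e'], max_spacers=2))
assert result.count(SPACER) == 2
assert [t for t in result if t is not SPACER] == ['a', 'b', 'c', 'd', 'e']
```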
This resolves some not-useful warnings about invalid escapes that we were getting. Nothing should be escaped in here in the first place; it's a pure JavaScript string with no substitutions or dynamic values.
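For reference, the fix is just the `r` prefix on the string literal: in a non-raw literal, JavaScript regex escapes like `\s` are invalid Python escape sequences and trigger warnings at compile time, while a raw string passes the backslash through untouched. A small demonstration (the regex is a made-up example, not the actual script in this module):

```python
import warnings

# Compiling a non-raw literal containing \s produces an "invalid escape
# sequence" warning (SyntaxWarning on recent Pythons, DeprecationWarning
# on older ones):
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    compile("'\\s'", '<example>', 'eval')  # the source text is the literal '\s'
assert any(issubclass(w.category, (SyntaxWarning, DeprecationWarning))
           for w in caught)

# The raw-string version is warning-free and keeps the backslash intact:
script = r"text.replace(/\s+/g, ' ')"
assert '\\s' in script  # a real backslash followed by 's'
```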
There's a big TODO about removing this when we finally fully forked lxml's differ. That happened a long time ago, and we did in fact make the changes that turned this into effectively wasted iteration/dead code. I ran a few tests over a variety of big and small diffs to make sure the code being removed here really doesn't do anything anymore, and that seems to be the case. Reading the logic, it also seems like this should be entirely vestigial, and should never actually wind up changing the tokens.
The only thing this function was doing was replacing `href_token` instances with `MinimalHrefToken`. We did this at a time when we were using parts of the tokenization internals from lxml instead of fully forking it. We have long since fully forked it, however, and we should just be creating `MinimalHrefToken` where we want them in the first place instead of looping through and replacing other tokens with them.
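The shape of that change can be sketched like this (the classes here are simplified stand-ins; the real `href_token` and `MinimalHrefToken` in this module carry more state):

```python
# Simplified stand-ins for the differ's token classes.
class Token(str):
    """Base token: a chunk of diffable text."""

class MinimalHrefToken(Token):
    """An href token that renders only minimal link information."""

# Before: build generic tokens everywhere, then loop over the result and
# swap href tokens for MinimalHrefToken after the fact.
# After: pick the right class at creation time and skip the extra pass.
def make_token(text, is_href=False):
    return MinimalHrefToken(text) if is_href else Token(text)

tokens = [make_token('some text'), make_token('https://example.com', is_href=True)]
assert isinstance(tokens[1], MinimalHrefToken)
assert not isinstance(tokens[0], MinimalHrefToken)
```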
This is now more accurate to what the function does.
Re: removing: So, there’s some value from that change, but only in the most extreme cases. On the other hand, this suggests that a future where we remove the spacers altogether is also a future with much better overall performance than I’d expected. (That said, the actual diffing still takes the majority of the running time.)
There’s a stupendous amount of cleanup work I’ve been meaning to circle back and do in `html_render_diff.py` for YEARS. This is a start. I plan to keep this PR relatively narrowly focused on refactoring, removing/replacing vestigial code, and style fixes. I’m not going to make significant behavior changes here (e.g. potentially changing the “spacer” concept, which needs a major rethink). Changes like that need a lot more careful consideration and testing, and I need time to get my head back into this space in order to do that well.
Work in progress. Still a little more to do here, although I don’t want to bite off too much. I want to:
- Look at removing `_limit_spacers()` and merging it into `_insert_spacers()` (good performance gain to be had there).
- We tokenize in two steps that could potentially be one: `flatten_el` turns a DOM tree into an iterator of `(token_type, extra_info)` tuples. Then, `fixup_chunks()` turns those tuples into token objects to be diffed. (See web-monitoring-diff/web_monitoring_diff/html_render_diff.py, lines 767 to 770 at cb359af.) This was done back in the day when we used tokenization from inside lxml and added our own extras on top. We no longer do that, and it’s possible the code may be clearer if we combine these steps. It also might not be! I want to prototype this and see how it feels. (Important: I don’t think there’s a big performance gain to be had here, since the first step produces an iterator that we use pretty efficiently. We could probably reduce some object allocations for the tuples, though.)
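As a rough sketch of the two-step shape described above (the function names match this module, but the bodies here are simplified stand-ins operating on a plain dict "tree" rather than a real DOM):

```python
# Step 1: walk the tree, yielding (token_type, extra_info) tuples.
# The real flatten_el walks an etree DOM; a nested dict stands in here.
def flatten_el(el):
    yield ('start_tag', el['tag'])
    for child in el.get('children', []):
        if isinstance(child, str):
            yield ('text', child)
        else:
            yield from flatten_el(child)
    yield ('end_tag', el['tag'])

# Step 2: turn the tuples into token objects to be diffed
# (plain strings here; the real fixup_chunks builds token classes).
def fixup_chunks(chunks):
    return [f'{kind}:{info}' for kind, info in chunks]

tree = {'tag': 'p', 'children': ['hi', {'tag': 'b', 'children': ['bold']}]}
tokens = fixup_chunks(flatten_el(tree))
assert tokens[0] == 'start_tag:p'
assert tokens[1] == 'text:hi'
assert tokens[-1] == 'end_tag:p'
```

Merging the steps would mean `flatten_el` yields token objects directly, skipping the intermediate tuple allocations.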
- Consider fixing the double parsing we do when tokenizing… `diff_elements()` calls `_diffable_fragment()`, which turns a soup DOM into a string, and passes that to `_htmldiff()` (web-monitoring-diff/web_monitoring_diff/html_render_diff.py, lines 518 to 521 at cb359af). Then `_htmldiff()` calls `tokenize()` on that string (lines 555 to 556). Then `tokenize()` calls `parse_html()` on the string to make it an etree DOM (lines 763 to 768).
One complexity here is that we start by dealing with a beautifulsoup DOM, then the lower-level code is designed to deal with an etree DOM. The structure of these two is not the same, and that may introduce issues. etree leaves spaces mostly untouched, but beautifulsoup does some space cleanup. etree elements also have `text` and `tail` attributes, while beautifulsoup deals with a mix of tags and strings (and other node types). This may introduce enough complexity that it should be tackled separately. Needs examination.