Light cleanup on html_render_diff.py #145

Draft · wants to merge 7 commits into main

Conversation

@Mr0grog (Member) commented May 2, 2023

There’s a stupendous amount of cleanup work I’ve been meaning to circle back and do in html_render_diff.py for YEARS. This is a start.

I plan to keep this PR relatively narrowly focused on refactoring, removing/replacing vestigial code, and style fixes. I’m not going to make significant behavior changes here (e.g. potentially changing the “spacer” concept, which needs a major rethink). Changes like that need a lot more careful consideration and testing, and I need time to get my head back into this space in order to do that well.

Work in progress. Still a little more to do here, although I don’t want to bite off too much. I want to:

  • Look at removing _limit_spacers() and merging it into _insert_spacers() (good performance gain to be had there; see the first sketch after this list).

  • We tokenize in two steps that could potentially be one: flatten_el() turns a DOM tree into an iterator of (token_type, extra_info) tuples. Then fixup_chunks() turns those tuples into token objects to be diffed.

    # Then we split the document into text chunks for each tag, word, and end tag:
    chunks = flatten_el(body_el, skip_tag=True, include_hrefs=include_hrefs)
    # Finally re-joining them into token objects:
    return fixup_chunks(chunks, comparator)

    This was done back in the day when we used tokenization from inside lxml and added our own extras on top. We no longer do that, and it’s possible the code may be clearer if we combine these steps. It also might not be! I want to prototype this and see how it feels. (Important: I don’t think there’s a big performance gain to be had here, since the first step produces an iterator that we use pretty efficiently. We could probably reduce some object allocations for the tuples, though.) See the second sketch after this list.

  • Consider fixing the double parsing we do when tokenizing…

    diff_elements() calls _diffable_fragment(), which turns a soup DOM into a string, and passes that to _htmldiff():

    metadata, raw_diffs = _htmldiff(_diffable_fragment(old),
                                    _diffable_fragment(new),
                                    comparator,
                                    include)

    Then _htmldiff() calls tokenize() on that string:

    old_tokens = tokenize(old, comparator)
    new_tokens = tokenize(new, comparator)

    Then tokenize() calls parse_html() on the string to make it an etree DOM:

    if etree.iselement(html):
        body_el = html
    else:
        body_el = parse_html(html)
    # Then we split the document into text chunks for each tag, word, and end tag:
    chunks = flatten_el(body_el, skip_tag=True, include_hrefs=include_hrefs)

    One complexity here is that we start by dealing with a Beautiful Soup DOM, then the lower-level code is designed to deal with an etree DOM. The structure of these two is not the same, and that may introduce issues. etree leaves spaces mostly untouched, but Beautiful Soup does some space cleanup. etree elements also have a text and a tail, while Beautiful Soup deals with a mix of tags and strings (and other node types). This may introduce enough complexity that it should be tackled separately. Needs examination. (See the third sketch after this list.)
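
Roughly what I have in mind for the first item. This is only a sketch: SpacerToken, needs_spacer(), and MAX_SPACERS are stand-ins for whatever the real code uses, and the actual insertion logic is more involved.

    # Hypothetical sketch: cap the number of spacers while inserting them,
    # rather than building the full token list and stripping the excess in a
    # separate _limit_spacers() pass afterward.
    def _insert_spacers(tokens, max_spacers=MAX_SPACERS):
        count = 0
        for token in tokens:
            yield token
            if count < max_spacers and needs_spacer(token):
                count += 1
                yield SpacerToken()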
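
For the second item, combining the steps might look something like a single generator that walks the etree DOM and yields token objects directly. Again just a sketch, not the real implementation: make_token(), start_tag(), and end_tag() are placeholders for the actual token constructors, and href/whitespace handling is elided. (split_words() is the word splitter we inherited from lxml’s differ.)

    # Hypothetical sketch: one pass over the tree, yielding token objects
    # directly instead of (token_type, extra_info) tuples for fixup_chunks()
    # to convert afterward.
    def tokenize_el(el, comparator):
        for word in split_words(el.text or ''):
            yield make_token(word, comparator)
        for child in el:
            yield start_tag(child)
            yield from tokenize_el(child, comparator)
            yield end_tag(child)
            # In etree, text following a child element hangs off that child's tail.
            for word in split_words(child.tail or ''):
                yield make_token(word, comparator)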
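
And for the third item, one option is to tokenize the Beautiful Soup tree directly so the fragment never round-trips through a string and a second parse. This sketch glosses over the whitespace differences noted above, and the chunk shapes only loosely mirror flatten_el()’s real output.

    # Hypothetical sketch: emit flatten_el()-style chunks straight from a
    # Beautiful Soup node, skipping the serialize-and-reparse round trip.
    # Special elements (hrefs, imgs, etc.) are not handled here.
    from bs4 import NavigableString, Tag

    def flatten_soup_el(el, skip_tag=False):
        if not skip_tag:
            yield ('start_tag', el)
        for child in el.children:
            if isinstance(child, NavigableString):
                for word in str(child).split():
                    yield ('word', word)
            elif isinstance(child, Tag):
                yield from flatten_soup_el(child)
        if not skip_tag:
            yield ('end_tag', el)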

Trim spaces around docstring text, use consistent style for multiline docstrings.

In diffs that went over the maximum number of spacers, it turns out that the `_limit_spacers()` function stripped out important tag information! This fixes the issue, but introduces some performance overhead. To handle that, a follow-on change should consider:
1. Moving the spacer-limiting logic into `customize_tokens()` so we don't even create too many spacers in the first place.
2. Revisiting the whole spacer approach entirely. There may be better options now.

This resolves some not-useful warnings about invalid escapes that we were getting. Nothing should be escaped in here in the first place; it's a pure JavaScript string with no substitutions or dynamic values.

There's a big TODO about removing this when we finally fully forked lxml.html's differ. That happened a long time ago, and we did in fact make the changes that turned this into effectively wasted iteration/dead code. I ran a few tests over a variety of big and small diffs to make sure the code being removed here really doesn't do anything anymore, and that seems to be the case. Reading the logic, it also seems like this should be entirely vestigial and never wind up actually changing the tokens.

The only thing this function was doing was replacing `href_token` instances with `MinimalHrefToken`. We did this at a time when we were using parts of the tokenization internals from lxml instead of fully forking it. We have long since fully forked it, however, and we should just be creating `MinimalHrefToken` where we want them in the first place instead of looping through and replacing other tokens with them (see the sketch after these notes).

This is now more accurate to what the function does.
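
A sketch of that suggested follow-up, assuming the `('href', url)` chunk shape and the `pre_tags`/`trailing_whitespace` keyword arguments from lxml's original `fixup_chunks()` still apply to our fork:

    # Hypothetical sketch: build MinimalHrefToken directly when an href chunk
    # is turned into a token, instead of swapping out href_token instances in
    # a later replacement pass.
    def _token_for_href_chunk(chunk, pre_tags):
        assert chunk[0] == 'href'
        return MinimalHrefToken(chunk[1], pre_tags=pre_tags,
                                trailing_whitespace=' ')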
@Mr0grog (Member Author) commented May 2, 2023

Re: removing _limit_spacers(). This has a pretty big impact on DOMs with too many spacers, but not much of an impact otherwise. I was expecting the extra iteration, instance checking, etc. to be kind of expensive on large DOMs (that don’t have too many spacers), but it turns out they aren’t (I guess those DOMs just aren’t large enough to matter in the first place?). BUT once you start making too many spacers, this change has an extremely noticeable performance benefit.

So, there’s some value from that change, but only in the most extreme cases.

On the other hand, this suggests that a future where we remove the spacers altogether is also a future with much better overall performance than I’d expected. (That said, the actual diffing still takes the majority of the running time.)
