Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ensure child node is a child of parent before removing #1438

Closed
wants to merge 1 commit into from

Conversation

shama
Copy link

@shama shama commented Aug 24, 2023

Fixes GH-1401

If current isn't a child of parent calling parent.removeChild(current) will result in the error: Failed to execute 'removeChild' on 'Node': The node to be removed is not a child of this node. This commonly occurs when the DOM is externally modified by a browser, such as when translating a page. The nextSibling may not be a child node of parent.

current.parentNode?.removeChild(current) ensures we are only removing a child node of a parent node.

@NullVoxPopuli
Copy link
Contributor

Thank you for submitting!!!! <3 🥳

@chancancode
Copy link
Contributor

This is a very targeted fix and may have other unintended consequences or just masks the problem even more. The root cause is usually that some external extensions or scripts messed with the parts of the DOM that glimmer-vm is managing/in charge of. In general that really is unsound and could cause things to fail in numerous ways. Maybe we can try to fail more silently as this PR tries to do, but if anything else goes wrong the error will probably be even more confusing

@@ -65,7 +65,6 @@ export function move(bounds: Bounds, reference: Nullable<SimpleNode>): Nullable<
}

export function clear(bounds: Bounds): Nullable<SimpleNode> {
let parent = bounds.parentElement();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep this

@@ -75,7 +74,7 @@ export function clear(bounds: Bounds): Nullable<SimpleNode> {
while (true) {
let next = current.nextSibling;

parent.removeChild(current);
current.parentNode?.removeChild(current);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

..and then in here, guard with if (parent === current.parentNode). In the else branch, we may want to add two different development mode warnings for the case where it is null and for the case it is not null but different (ie the node has been moved)

Fixes glimmerjsGH-1401

If `current` isn't a child of `parent` calling `parent.removeChild(current)` will result in the error: `Failed to execute 'removeChild' on 'Node': The node to be removed is not a child of this node`. This commonly occurs when the DOM is externally modified by a browser, such as when translating a page. The `nextSibling` may not be a child node of `parent`.
@shama
Copy link
Author

shama commented Aug 25, 2023

@chancancode Thank you for the review! I agree with everything you said, it is a targeted fix. I'm happy to do more research into edge cases this change might cause too.

FWIW, in our Ember app (app.ynab.com), since upgrading Ember and adopting Glimmer in the last 3 months, we've seen this error over 76k+ times. It requires each user to refresh and figure out which extension or browser usage causes the error. From our perspective, it seems like a very common use case for end users of Glimmer to externally modify the DOM. Having Glimmer handle that condition would be immensely helpful. Our Ember app is 9+ years old and has a pretty robust test suite. It might not mean anything but all our tests are passing with this change but I'd be happy to do more exploratory testing to ensure this change doesn't have unintended consequences. I appreciate your time.

@lifeart
Copy link
Contributor

lifeart commented Aug 25, 2023

@chancancode I agree with This is a very targeted fix, but, since we are living in real world and dozens of extensions / browsers do modify DOM for it's own purposes, we have to at least, to try to not fail (IMHO). Because it's likely not a case where users with such cases use real "SPA" approach, and possible memory leaks may be "healed" by app reloading (same as we have to do now once glimmer-vm fails)

we could have special error handler for it, to be able (in background) send "mutated" node to logging apps to figure out source of breakage.

@NullVoxPopuli
Copy link
Contributor

I'm wanting to run this against a couple additional test suites -- but it looks like I first need to update glimmer in ember, here: emberjs/ember.js#20530

@lifeart
Copy link
Contributor

lifeart commented Aug 25, 2023

I had idea (long time ago) to introduce addon to track DOM mutations outside re-renders, using https://developer.mozilla.org/en-US/docs/Web/API/MutationObserver api, and emit errors/messages if DOM mutated outside of ember's (glimmer's) render loop.

@shama
Copy link
Author

shama commented Sep 18, 2023

FWIW, we've been running this fix in production in app.ynab.com via yarn patch for the past month without any issues and our users are no longer hitting this error.

@NullVoxPopuli
Copy link
Contributor

apologies I've been so slow to get this shipped :(

I think the changes look good, but I emberjs is already behind glimmer, and I've had it on my todo to get it up to date -- but work has been bogging me down with 🙃 tasks

@gilest
Copy link

gilest commented Oct 25, 2023

FWIW, we've been running this fix in production in app.ynab.com via yarn patch for the past month

Mind sharing more about how you achieved this? I thought @glimmer/vm was bundled with ember-source or something...?

Ah – I'm guessing you patched in ember-source/dist

@shama
Copy link
Author

shama commented Oct 25, 2023

@gilest We used yarn patch but patch-package is a similar tool if you're not using yarn.

I ran yarn patch ember-source@npm:5.3.0 which copies the package to a temporary folder. I edited all the instances (in dist/dependencies/@glimmer/runtime.js and dist/ember.debug.js):

-    parent.removeChild(current);
+    if (parent === current.parentNode) {
+      current.parentNode.removeChild(current);
+    }

Then ran yarn patch-commit -s /tmp/path/to/your/changes/.

Then did the same thing with yarn patch @glimmer/runtime. After that, yarn install will also apply those patches from the .yarn/patches/ folder.

@gilest
Copy link

gilest commented Oct 25, 2023

Interesting. Tried the simplified version of the patch with an ember-source 4.12 app using Fastboot.

-    parent.removeChild(current);
+    if (parent === current.parentNode) {
+      current.parentNode.removeChild(current);
+    }

Before patch

On a translated page (Google Chrome), clicking a link to initiate a transition throws the reported error:

Failed to execute 'removeChild' on 'Node': The node to be removed is not a child of this node

After patch

Following the same reproduction throws in the clear function but at the beginning of the loop instead:

Cannot read properties of null (reading 'nextSibling')

So the node count or node references still get out of sync once page DOM is maniuplated (at least in this reproduction).

@chancancode
Copy link
Contributor

Yeah again I am to the extent that we can be a little less anal and silently ignore errors that “didn’t end up mattering”, it seems ok.

But I think it really is unrealistic to expect that this can be “fixed”. I am not trying to be difficult here but let’s try to engage on the actual issue here.

If Google translate or any other extensions just go in and mutate the DOM without regard then these tools are just not designed to play nicely with SPAs period. I don’t imagine they work on apps like Google Maps/Earth or Notion either?

If there are documentation or even blog posts describing the heuristics of how these tools choose to mutate the DOM and how we are expected to work with them nicely then let’s enumerate what those constraints are and add tests for them specifically.

But it is impossible to just say “let’s be more resilient against external things mutating our DOM”, it’s just way too underspecified. Imagine if someone is randomly mutating variables in your code. How do you be resilient against that? You can’t.

Now let’s talk specifics about what is going on here. When rendering, say, an {{#if}}…{{/if}} block. We need to remember the boundaries (start node and end node) of that block so when the condition changes, we can remove that chunk from the DOM. Sometimes (and maybe even usually) the bounties are text nodes.

It seems like when translating, Google translate have the habit of removing/merging/replacing the text nodes rather than mutating their values in place. Sometimes you get lucky and those text nodes are in between rangers we care about so nobody cares.

But if they remove a text node that we happen to be tracking as a boundary, then you are in for a bad time.

Imagine it looking like this:

Result:

{{#if this.isLoading}}[START]
  loading… ({{this.est}} seconds left)
[END]{{/if}}

Please be patient.

When the condition changes from true to false, we know we need to

let current = bounds.start;

while (true) {
  parent.removeChild(current);
  if (current === bounds.end) break;
  current = current.nextChild;
}

Now what if someone messed with the DOM so either the start or end node, are already gone when the condition flips?

Well let’s see, if the start node is gone we will immediately get an error, which currently breaks the app. We can try to silently discard the error. But then we wouldn’t have cleared the DOM and the screen would be stale and stuck on the loading thing anyway and probably cause other undefined behavior elsewhere (like destructors never running).

If the start is there but the end is gone? Well then we won’t error the same way but we will clear too much DOM and eventually reach the end of the parent’s child nodes list. Which will cause other errors down the road elsewhere again.

Sometimes things work out because the {{/if}} happens to be at the end of the parent anyway, or for any number of reasons. Or perhaps the user clicked away to another page and you need to clear a bigger piece of DOM higher up anyway.

But just as likely maybe all you have accomplished is that you make the error go away on your bugsnag dashboard, but your user is now staring at a subtly broken app anyway.

Again I want to stress that I am not unwilling to fix this because I don’t understand the problem or the pain. But I hope you see why this problem is fundamental because of the broken behavior of whatever is doing the DOM mutation and you really can’t fix it on our side.

One thing can and should do is add better ways to recover from errors, like #1462. But even then, all that would do is to allow you to capture this error higher up, clear that bigger section of the DOM and render an error message in place of it when this does happen.

And in fact, in light of that, I think suppressing the error here is actually bad, because it would propagate the error condition elsewhere rather than letting you catch and recover from it. So I think we will have to revert this whenever that lands.

In which case, maybe using patch package is a better interim solution? I personably wouldn’t use that word, but if it works in the specific variant of the issue you are running into, that’s great. But I really hope everyone on this thread and finding this thread in the future can take a few minutes to try and understand the interaction before deciding how to proceed.

@chancancode
Copy link
Contributor

And - to be extra clear - this is a very generic symptom (node unexpectedly gone when it comes time to remove it) with many possible causes. If we find a case where this is caused by our internal bookkeeping being messed up or the tracking has gone awry somehow, it’s a very different issue and we should definitely fix that. And that’s why this assertion is here. But I don’t think that’s what happening in this case.

@gilest
Copy link

gilest commented Oct 26, 2023

Appreciate your response @chancancode. My findings support what you were suggesting (not simple or even maybe possible to fix) which is why I posted them.

@chancancode
Copy link
Contributor

cross-linking @gilest's workaround on #1401 for context and visibility

@shama
Copy link
Author

shama commented Oct 26, 2023

@chancancode Thank you for the detailed and thoughtful response. Very much appreciated and that makes sense why you wouldn't want to suppress the error. Thanks for linking to that error recovery PR. It's nice to know that's on the horizon as that is ultimately what I'm after here.

The problem I was trying to fix was when our users hit this error, the app stops working all together and there isn't a way for us to recovery without telling users to refresh the page. For our app, ignoring this error has been ok through all the exploratory testing we've done but I'm sure there are some edge cases where it's leaving lingering nodes in the app. I can see how that could be more problematic for other apps though and why we wouldn't want to implement this PR.

I'm happy to close this PR and just continue using yarn patch to ignore the error. Given your response, I'm going to consider a yarn patch for our app that tries to let us handle this error instead. Maybe we could give our user's a notice that a 3rd party extension or translation is likely externally modifying app and issues might occur.

I appreciate your time!

@shama shama closed this Oct 26, 2023
@shama shama deleted the shama-patch-1-1 branch October 26, 2023 03:42
@chancancode
Copy link
Contributor

chancancode commented Oct 26, 2023

@shama Sounds good, I think it's good that your patch and the yarn patch instruction is up here so others can try and see if it helps their particular case.

In Skylight we do have code for rendering an "red box of doom" on the page that uses vanilla DOM code instead of hbs/glimmer ("Sorry something went wrong [Reload]"), that we call when we detect unrecoverable rendering errors.

Or.. you can try to do a hacked version of error recovery thing, the official version need to do this much more carefully and make sure we don't corrupt any internal states, call destructors, etc. But if you are looking for an imperfect solution in the meantime, you may be able to do a good enough version.

Internally the error recovery code, does something verrrrrry loosely like this:

const BOUNDARIES = new WeakMap();

export function didError(element) {
  let boundary;
  
  while (!boundary && element) {
    boundary = BOUNDARIES.get(element);
    if (boundary) {
      break;
    } else {
      element = element.parentElement;
    }
  }
  
  if (boundary) {
    setImmediate(() => boundary.noError = true);
  }
}

export default class ErrorBoundary {
  @tracked noError = true;

  <template>{{#if this.noError}}<div {{did-insert this.register}} {{will-destroy this.unregister}}>{{yield}}</div>{{else}}{{yield to="else"}}{{/if}}</template>

  register = (element) => BOUNDARIES.set(element, this);

  unregister = () => BOUNDARIES.delete(element);
}

Because the {{#if this.noError}}...{{else}}only has exactly one div as a child (it's important that there are no whitespace around it, otherwise you'll be dealing with text nodes again), it's hopefully more likely to survive these random DOM mutation , and hopefully the VM's internal state is coherent enough that, on the next tick it can at least clear out that bigger chunk of DOM above where you ran into trouble.

This code is definitely not robust and it's still very deep in the undefined-behavior-land, but it sounds like we are already there 🤷🏼

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed to execute 'removeChild' on 'Node': The node to be removed is not a child of this node.
5 participants