Avoid infinite loops due to corrupted flow graphs in some cases and improve resumption error handling #349

Merged: 3 commits into jenkinsci:master on Aug 13, 2024

Conversation

dwnusbaum (Member) commented Aug 8, 2024

Related to #347. This PR addresses two main issues:

  1. Infinite loops in StandardGraphLookupView.bruteForceScanForEnclosingBlock and LinearBlockHoppingScanner are now detected and result in an exception being thrown (a rough sketch of this style of loop detection follows this list). This does not fix all possible infinite loops with flow graph iteration APIs, but these are the only two cases I saw where the infinite loop happens internally rather than the caller itself iterating over nodes infinitely. Loops in StandardGraphLookupView.bruteForceScanForEnclosingBlock via PlaceholderTask.getAffinityKey brick the build queue, and loops in LinearBlockHoppingScanner brick the CpsVmExecutorService for the build in question.
  2. When we try to resume a Pipeline but run into something that prevents steps from resuming, we should do what we can to ensure the build does not hang forever waiting for steps to resume.
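
As an illustration only (not the actual diff in this PR), the style of loop detection described in item 1 can be sketched as a parent walk that tracks visited node IDs and fails fast on a revisit. The walkToEnclosingBlockStart helper and the exception message below are hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

import org.jenkinsci.plugins.workflow.graph.BlockStartNode;
import org.jenkinsci.plugins.workflow.graph.FlowNode;

class LoopGuardedScan {
    /**
     * Walks backwards through first parents looking for an enclosing block start.
     * Throws instead of spinning forever if corrupted parent pointers form a cycle.
     */
    static BlockStartNode walkToEnclosingBlockStart(FlowNode start) {
        Set<String> visitedIds = new HashSet<>();
        FlowNode current = start;
        while (current != null) {
            if (!visitedIds.add(current.getId())) {
                // Revisiting a node means the graph has a cycle; fail fast instead of
                // hanging the build queue or the CpsVmExecutorService.
                throw new IllegalStateException("Cycle in flow graph at node " + current.getId());
            }
            if (current instanceof BlockStartNode) {
                return (BlockStartNode) current;
            }
            current = current.getParents().isEmpty() ? null : current.getParents().get(0);
        }
        return null; // reached the start of the graph without finding an enclosing block
    }
}
```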

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

…kupView and LinearBlockHoppingScanner and improve resumption error handling
@@ -362,20 +362,31 @@ private static final class ParallelResumer {
nodes.put(n, se);
} else {
LOGGER.warning(() -> "Could not find FlowNode for " + se + " so it will not be resumed");
// TODO: Should we call se.getContext().onFailure()?
dwnusbaum (Member, Author) commented:
I will try to create tests for all of these cases to see what happens.

dwnusbaum (Member, Author) commented Aug 9, 2024:
Well, it's kind of interesting. Specifically in the case of a missing or corrupt FlowNode, there are two cases that I see:

  1. If FlowExecution.heads contains the bad node, then everything gets handled in CpsFlowExecution.onLoad, we never read program.dat, we create placeholder nodes, the Pipeline fails, great.
  2. If FlowExecution.heads does not contain the bad node, then the Pipeline loads, we attempt to resume it, various warnings get logged, and it hangs forever. The problem is that this call in CpsStepContext throws right away, so the CpsThread never resumes with the outcome, and the Pipeline hangs. It seems like this log message should at least be bumped to WARNING because any errors there are likely to be the proximate cause of a build hanging forever, but I wonder if we should also attempt to call CpsFlowExecution.croak or CpsVmExecutorService.reportProblem in that catch block.

So long story short, it seems like there is no point trying to do anything special in these cases in this plugin for now.
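
For reference, acting on the TODO above would look roughly like the following hypothetical sketch (the failStepWithMissingNode helper is illustrative, not code from this PR): fail the step's context explicitly so the build does not wait forever for a step that can never resume.

```java
import java.io.IOException;

import org.jenkinsci.plugins.workflow.steps.StepExecution;

class MissingNodeHandling {
    /**
     * Instead of only logging that the FlowNode is missing, complete the step
     * with an error so the overall build can finish rather than hang.
     */
    static void failStepWithMissingNode(StepExecution se) {
        se.getContext().onFailure(
                new IOException("Could not find FlowNode for " + se + " so it will not be resumed"));
    }
}
```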

dwnusbaum (Member, Author) commented:
jenkinsci/workflow-cps-plugin#916 updates the relevant log message to WARNING so that we have an idea if the problematic situation is ever occurring in practice.

}
}
} catch (Exception e) {
// TODO: Should we instead just try to resume steps with no respect to topological order?
dwnusbaum (Member, Author) commented:
Not sure if this would be better or worse than killing all of the steps. If you end up here, something is pretty wrong with your flow graph. The main causes I can think of are if a BlockEndNode.startId points to a node that either doesn't exist, is not a BlockStartNode, or (the new case I just saw) points to a BlockStartNode with a newer ID than the end node, creating a cycle in the graph.

The question though is whether flow graph corruption necessarily means that trying to resume the steps is pointless, or if there is a decent chance that if we fall back to synchronous resumption of each step execution in order, the build might be able to complete successfully.
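
To make those corruption modes concrete, here is a hypothetical sanity check over a BlockEndNode; the checkEndNode helper is illustrative, not code from this PR, and it assumes the numeric node IDs that CpsFlowExecution assigns:

```java
import org.jenkinsci.plugins.workflow.graph.BlockEndNode;
import org.jenkinsci.plugins.workflow.graph.BlockStartNode;

class FlowGraphSanity {
    /** Fails fast on the corruption modes above instead of looping during iteration later. */
    static void checkEndNode(BlockEndNode<?> end) {
        BlockStartNode start;
        try {
            start = end.getStartNode(); // resolves startId via the owning execution
        } catch (RuntimeException x) {
            throw new IllegalStateException("startId of " + end.getId() + " does not resolve to a BlockStartNode", x);
        }
        if (start == null) {
            throw new IllegalStateException("startId of " + end.getId() + " points to a node that does not exist");
        }
        // Assumes numeric node IDs, as assigned by CpsFlowExecution.
        if (Integer.parseInt(start.getId()) >= Integer.parseInt(end.getId())) {
            // A start node "newer" than its end node creates a cycle when walking the graph.
            throw new IllegalStateException(end.getId() + " closes a block that starts after it: " + start.getId());
        }
    }
}
```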

Member commented:
Ah I was just looking for #221 the other day as a reference, because its original motivating use case is obsolete as of jenkinsci/ssh-agent-plugin#152. Of course there might still be other steps which legitimately need context from an enclosing step in their onResume, but I am not aware of any offhand. So I think it would be reasonable to fall back to resuming all steps in random order. I would not expect the build to complete successfully, but it might be able to e.g. release external resources more gracefully.

dwnusbaum (Member, Author) commented Aug 9, 2024:
c6b4fa9 adjusts things so we resume all step executions as long as they have a FlowNode. #349 (comment) discusses the cases where the FlowNode is missing and/or corrupted.
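
In rough terms, the fallback amounts to something like the sketch below (not the literal c6b4fa9 change): resume every step execution that still has a FlowNode without regard to topological order, logging and continuing on failure. It assumes a nodes map like the one in ParallelResumer and the public StepExecution.onResume() hook.

```java
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.jenkinsci.plugins.workflow.graph.FlowNode;
import org.jenkinsci.plugins.workflow.steps.StepExecution;

class UnorderedResumer {
    private static final Logger LOGGER = Logger.getLogger(UnorderedResumer.class.getName());

    /** Resumes every step that still has a FlowNode, ignoring topological order. */
    static void resumeIgnoringOrder(Map<FlowNode, StepExecution> nodes) {
        for (Map.Entry<FlowNode, StepExecution> entry : nodes.entrySet()) {
            StepExecution se = entry.getValue();
            try {
                se.onResume();
            } catch (Throwable t) {
                // Log and keep going so one bad step does not prevent the others from resuming.
                LOGGER.log(Level.WARNING, t, () -> "Failed to resume " + se + " for " + entry.getKey());
            }
        }
    }
}
```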

jglick added the bug label on Aug 9, 2024
dwnusbaum merged commit ee415d9 into jenkinsci:master on Aug 13, 2024
16 checks passed
dwnusbaum deleted the infinite-loop-memory-leak-2 branch on August 13, 2024 19:09