Avoid infinite loops due to corrupted flow graphs in some cases and improve resumption error handling #349

Merged: 3 commits into jenkinsci:master on Aug 13, 2024

Conversation

dwnusbaum (Member) commented Aug 8, 2024

Related to #347. This PR addresses two main issues:

  1. Infinite loops in StandardGraphLookupView.bruteForceScanForEnclosingBlock and LinearBlockHoppingScanner are now detected and result in an exception being thrown (a rough sketch of this style of loop detection follows this list). This does not fix all possible infinite loops with flow graph iteration APIs, but these are the only two cases I saw where the infinite loop happens internally rather than the caller itself iterating over nodes infinitely. Loops in StandardGraphLookupView.bruteForceScanForEnclosingBlock via PlaceholderTask.getAffinityKey brick the build queue, and loops in LinearBlockHoppingScanner brick the CpsVmExecutorService for the build in question.
  2. When we try to resume a Pipeline but run into something that prevents steps from resuming, we should do what we can to ensure the build does not hang forever waiting for steps to resume.
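
As an illustration only (not the actual diff in this PR), the style of loop detection described in item 1 can be sketched as a parent walk that tracks visited node IDs and fails fast on a revisit. The walkToEnclosingBlockStart helper and the exception message below are hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

import org.jenkinsci.plugins.workflow.graph.BlockStartNode;
import org.jenkinsci.plugins.workflow.graph.FlowNode;

class LoopGuardedScan {
    /**
     * Walks backwards through first parents looking for an enclosing block start.
     * Throws instead of spinning forever if corrupted parent pointers form a cycle.
     */
    static BlockStartNode walkToEnclosingBlockStart(FlowNode start) {
        Set<String> visitedIds = new HashSet<>();
        FlowNode current = start;
        while (current != null) {
            if (!visitedIds.add(current.getId())) {
                // Revisiting a node means the graph has a cycle; fail fast instead of
                // hanging the build queue or the CpsVmExecutorService.
                throw new IllegalStateException("Cycle in flow graph at node " + current.getId());
            }
            if (current instanceof BlockStartNode) {
                return (BlockStartNode) current;
            }
            current = current.getParents().isEmpty() ? null : current.getParents().get(0);
        }
        return null; // reached the start of the graph without finding an enclosing block
    }
}
```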

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

…kupView and LinearBlockHoppingScanner and improve resumption error handling
@@ -362,20 +362,31 @@ private static final class ParallelResumer {
nodes.put(n, se);
} else {
LOGGER.warning(() -> "Could not find FlowNode for " + se + " so it will not be resumed");
// TODO: Should we call se.getContext().onFailure()?
dwnusbaum (Member, Author) commented:
I will try to create tests for all of these cases to see what happens.

dwnusbaum (Member, Author) commented Aug 9, 2024:
Well, it's kind of interesting. Specifically in the case of a missing or corrupt FlowNode, there are two cases that I see:

  1. If FlowExecution.heads contains the bad node, then everything gets handled in CpsFlowExecution.onLoad, we never read program.dat, we create placeholder nodes, the Pipeline fails, great.
  2. If FlowExecution.heads does not contain the bad node, then the Pipeline loads, we attempt to resume it, various warnings get logged, and it hangs forever. The problem is that this call in CpsStepContext throws right away, so the CpsThread never resumes with the outcome, and the Pipeline hangs. It seems like this log message should at least be bumped to WARNING because any errors there are likely to be the proximate cause of a build hanging forever, but I wonder if we should also attempt to call CpsFlowExecution.croak or CpsVmExecutorService.reportProblem in that catch block.

So long story short, it seems like there is no point trying to do anything special in these cases in this plugin for now.
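
For reference, acting on the TODO above would look roughly like the following hypothetical sketch (the failStepWithMissingNode helper is illustrative, not code from this PR): fail the step's context explicitly so the build does not wait forever for a step that can never resume.

```java
import java.io.IOException;

import org.jenkinsci.plugins.workflow.steps.StepExecution;

class MissingNodeHandling {
    /**
     * Instead of only logging that the FlowNode is missing, complete the step
     * with an error so the overall build can finish rather than hang.
     */
    static void failStepWithMissingNode(StepExecution se) {
        se.getContext().onFailure(
                new IOException("Could not find FlowNode for " + se + " so it will not be resumed"));
    }
}
```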

dwnusbaum (Member, Author) commented:
jenkinsci/workflow-cps-plugin#916 updates the relevant log message to WARNING so that we have an idea if the problematic situation is ever occurring in practice.

}
}
} catch (Exception e) {
// TODO: Should we instead just try to resume steps with no respect to topological order?
dwnusbaum (Member, Author) commented:
Not sure if this would be better or worse than killing all of the steps. If you end up here, something is pretty wrong with your flow graph. The main causes I can think of are if a BlockEndNode.startId points to a node that either doesn't exist, is not a BlockStartNode, or (the new case I just saw) points to a BlockStartNode with a newer ID than the end node, creating a cycle in the graph.

The question though is whether flow graph corruption necessarily means that trying to resume the steps is pointless, or if there is a decent chance that if we fall back to synchronous resumption of each step execution in order, the build might be able to complete successfully.
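
To make those corruption modes concrete, here is a hypothetical sanity check over a BlockEndNode; the checkEndNode helper is illustrative, not code from this PR, and it assumes the numeric node IDs that CpsFlowExecution assigns:

```java
import org.jenkinsci.plugins.workflow.graph.BlockEndNode;
import org.jenkinsci.plugins.workflow.graph.BlockStartNode;

class FlowGraphSanity {
    /** Fails fast on the corruption modes above instead of looping during iteration later. */
    static void checkEndNode(BlockEndNode<?> end) {
        BlockStartNode start;
        try {
            start = end.getStartNode(); // resolves startId via the owning execution
        } catch (RuntimeException x) {
            throw new IllegalStateException("startId of " + end.getId() + " does not resolve to a BlockStartNode", x);
        }
        if (start == null) {
            throw new IllegalStateException("startId of " + end.getId() + " points to a node that does not exist");
        }
        // Assumes numeric node IDs, as assigned by CpsFlowExecution.
        if (Integer.parseInt(start.getId()) >= Integer.parseInt(end.getId())) {
            // A start node "newer" than its end node creates a cycle when walking the graph.
            throw new IllegalStateException(end.getId() + " closes a block that starts after it: " + start.getId());
        }
    }
}
```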

Member commented:
Ah I was just looking for #221 the other day as a reference, because its original motivating use case is obsolete as of jenkinsci/ssh-agent-plugin#152. Of course there might still be other steps which legitimately need context from an enclosing step in their onResume, but I am not aware of any offhand. So I think it would be reasonable to fall back to resuming all steps in random order. I would not expect the build to complete successfully, but it might be able to e.g. release external resources more gracefully.

dwnusbaum (Member, Author) commented Aug 9, 2024:
c6b4fa9 adjusts things so we resume all step executions as long as they have a FlowNode. #349 (comment) discusses the cases where the FlowNode is missing and/or corrupted.
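
In rough terms, the fallback amounts to something like the sketch below (not the literal c6b4fa9 change): resume every step execution that still has a FlowNode without regard to topological order, logging and continuing on failure. It assumes a nodes map like the one in ParallelResumer and the public StepExecution.onResume() hook.

```java
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.jenkinsci.plugins.workflow.graph.FlowNode;
import org.jenkinsci.plugins.workflow.steps.StepExecution;

class UnorderedResumer {
    private static final Logger LOGGER = Logger.getLogger(UnorderedResumer.class.getName());

    /** Resumes every step that still has a FlowNode, ignoring topological order. */
    static void resumeIgnoringOrder(Map<FlowNode, StepExecution> nodes) {
        for (Map.Entry<FlowNode, StepExecution> entry : nodes.entrySet()) {
            StepExecution se = entry.getValue();
            try {
                se.onResume();
            } catch (Throwable t) {
                // Log and keep going so one bad step does not prevent the others from resuming.
                LOGGER.log(Level.WARNING, t, () -> "Failed to resume " + se + " for " + entry.getKey());
            }
        }
    }
}
```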

jglick added the bug label on Aug 9, 2024
dwnusbaum merged commit ee415d9 into jenkinsci:master on Aug 13, 2024
16 checks passed
dwnusbaum deleted the infinite-loop-memory-leak-2 branch on August 13, 2024 19:09