Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TracingRiverMarshallerFactory to diagnose StackOverflowError deserializing program.dat #239

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jglick
Copy link
Member

@jglick jglick commented Sep 28, 2023

The test reproduces a stack trace very similar to what was observed in a Jenkins controller which was unable to resume a particular pipeline:

…
at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader$TracingRiverMarshallerFactory$1.doReadNewObject(RiverReader.java:391)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:298)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:246)
at org.jboss.marshalling.river.RiverUnmarshaller.readFields(RiverUnmarshaller.java:1918)
at org.jboss.marshalling.river.RiverUnmarshaller.doInitSerializable(RiverUnmarshaller.java:1831)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadNewObject(RiverUnmarshaller.java:1421)
at org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader$TracingRiverMarshallerFactory$1.doReadNewObject(RiverReader.java:391)
…

I am not yet sure if the information being captured here can actually be used to diagnose the issue, but it is worth a try.

Copy link
Member

@dwnusbaum dwnusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@@ -151,7 +153,7 @@ public ListenableFuture<Unmarshaller> restorePickles(Collection<ListenableFuture
config.setClassResolver(new SimpleClassResolver(classLoader));
//config.setSerializabilityChecker(new SerializabilityCheckerImpl());
config.setObjectResolver(combine(evr, ownerResolver));
Unmarshaller eu = new RiverMarshallerFactory().createUnmarshaller(config);
Unmarshaller eu = new TracingRiverMarshallerFactory().createUnmarshaller(config);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overridden methods will increase the stack depth by 1 for each nested object, so if issues like JENKINS-52966 are not due to cycles but just deeply nested objects because of the way Declarative is designed (which seems quite possible given user comments in that issue), this will make the problem worse. Could we somehow use the old code path by default, but then whenever we hit a StackOverflowError re-deserialize things with TracingRiverMarshallerFactory enabled? Or maybe only enable tracing if a system property is set?

Copy link
Member Author

@jglick jglick Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. If this PR does turn out to be capable of diagnosing an issue in this area, we would need to decide how to activate it. Besides the issue you mention, it will add some heap usage when loading normal programs.

@Test public void stackOverflow() throws Throwable {
rr.then(r -> {
WorkflowJob p = r.createProject(WorkflowJob.class, "p");
p.setDefinition(new CpsFlowDefinition("class R {Object f}; def x = new R(); for (int i = 0; i < 1000; i++) {def y = new R(); y.f = x; x = y}; semaphore 'wait'", true));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test fails for me locally because there is no StackOverflowError. It seems the minimum number of iterations needed in my environment is around 1774, so maybe bump this?

Suggested change
p.setDefinition(new CpsFlowDefinition("class R {Object f}; def x = new R(); for (int i = 0; i < 1000; i++) {def y = new R(); y.f = x; x = y}; semaphore 'wait'", true));
p.setDefinition(new CpsFlowDefinition("class R {Object f}; def x = new R(); for (int i = 0; i < 2000; i++) {def y = new R(); y.f = x; x = y}; semaphore 'wait'", true));

There's kind of a fine line though, because if you go much farther then I start to get a StackOverflowError during serialization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I had to tune the number to get a SOE during deser but not ser. Not sure how to make a test of this reliable. and maybe it should just be @Ignored and left there for use when experimenting.

Comment on lines +392 to +395
} catch (StackOverflowError x) {
// Will cause a StreamCorruptionError eventually, but not before we capture the parent objects:
o = null;
err = x;
Copy link
Member

@dwnusbaum dwnusbaum Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From some local testing, I think swallowing the StackOverflowError like this also has the side effect of making the standard tracing mechanism in JBoss marshalling work, at least to some extent, because it seems to have around 100 fewer entries than the custom trace added in this PR. For example, along with the new trace I also see the following, which does not exist without the PR:

Caused by: an exception which occurred:
	in field com.cloudbees.groovy.cps.impl.BlockScopeEnv.types
	in object com.cloudbees.groovy.cps.impl.BlockScopeEnv@2ffb8577
	in object of type com.cloudbees.groovy.cps.impl.BlockScopeEnv
	in field com.cloudbees.groovy.cps.impl.ProxyEnv.parent
	in object com.cloudbees.groovy.cps.impl.TryBlockEnv@64d0aa4f
	in object of type com.cloudbees.groovy.cps.impl.TryBlockEnv
	in field com.cloudbees.groovy.cps.impl.ProxyEnv.parent
	in object com.cloudbees.groovy.cps.impl.BlockScopeEnv@17735b9f
	in object of type com.cloudbees.groovy.cps.impl.BlockScopeEnv
	in field com.cloudbees.groovy.cps.impl.ProxyEnv.parent
        ...

At one point I tried to augment the CPS Env objects to carry a SourceLocation that we could then use to augment JBoss Marshalling trace information to make it easier to understand where the objects are coming from, but I never managed to get things to work in a way that was useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that trace is due to insertion of null into a place where an Object is expected, so a secondary error that we intentionally generate as the price of being able to continue walking back up the stack and collecting returning values from readObject. AFAICT it tells you nothing about the primary StackOverflowError.

The basic problem is that given the protected methods available in RiverUnmarshaller (readFields does not count because one of its args is package-scope!) all we can do is record what readObject successfully returns, but we cannot intercept what class it is starting to load. So the only thing we can do is attempt to force hundreds of readObject stack frames to return normally, noting the class name being loaded in each, before going back and marking the whole deserialization as a failure. This is clearly inferior to what you would be able to get from, say, a parser event listener with some kind of startElement method.

@dwnusbaum
Copy link
Member

dwnusbaum commented Sep 28, 2023

Just for the record, here is a workflow-cps patch that can be used to deserialize CpsThreadGroup XML dumps, then serialize and deserialize them again using JBoss Marshalling to try to explore issues like this. To actually use it you will have to manually clean up a few things in the XML dump, account for libraries and other aspects of the specific Pipeline in question, and adjust the stack size modifications for your environment, but this is a useful starting point:

Patch
diff --git a/plugin/pom.xml b/plugin/pom.xml
index 913e2d21..871a1477 100644
--- a/plugin/pom.xml
+++ b/plugin/pom.xml
@@ -44,6 +44,7 @@
         <no-test-jar>false</no-test-jar>
         <node.version>18.15.0</node.version>
         <yarn.version>1.22.19</yarn.version>
+        <argLine>-Xss512k -Xms768M -Xmx768M -XX:+HeapDumpOnOutOfMemoryError -XX:+TieredCompilation -XX:TieredStopAtLevel=1 @{jenkins.addOpens} @{jenkins.insaneHook}</argLine>
     </properties>
     <dependencyManagement>
         <dependencies>
@@ -72,6 +73,7 @@
         <dependency>
             <groupId>org.jenkins-ci.plugins.workflow</groupId>
             <artifactId>workflow-support</artifactId>
+            <version>866.v23ff92793deb_</version>
         </dependency>
         <dependency>
             <groupId>org.jenkins-ci.plugins.workflow</groupId>
@@ -114,6 +116,7 @@
             <groupId>org.jenkins-ci.plugins.workflow</groupId>
             <artifactId>workflow-support</artifactId>
             <classifier>tests</classifier>
+            <version>866.v23ff92793deb_</version>
             <scope>test</scope>
         </dependency>
         <dependency>
@@ -152,6 +155,11 @@
             <artifactId>pipeline-input-step</artifactId>
             <scope>test</scope>
         </dependency>
+        <dependency>
+            <groupId>org.jenkinsci.plugins</groupId>
+            <artifactId>pipeline-model-definition</artifactId>
+            <scope>test</scope>
+        </dependency>
         <dependency>
             <groupId>org.jenkins-ci.plugins</groupId>
             <artifactId>junit</artifactId>
diff --git a/plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java b/plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
index c021db45..54971079 100644
--- a/plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
+++ b/plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
@@ -570,7 +570,7 @@ public class CpsFlowExecution extends FlowExecution implements BlockableResume {
         return LoggingInvoker.create(isSandbox());
     }
 
-    private CpsScript parseScript() throws IOException {
+    CpsScript parseScript() throws IOException {
         // classloader hierarchy. See doc/classloader.md
         CpsScript s;
         try {
diff --git a/plugin/src/test/java/org/jenkinsci/plugins/workflow/cps/ProgramSerializationUtilTest.java b/plugin/src/test/java/org/jenkinsci/plugins/workflow/cps/ProgramSerializationUtilTest.java
new file mode 100644
index 00000000..f8e2bbce
--- /dev/null
+++ b/plugin/src/test/java/org/jenkinsci/plugins/workflow/cps/ProgramSerializationUtilTest.java
@@ -0,0 +1,99 @@
+package org.jenkinsci.plugins.workflow.cps;
+
+import com.thoughtworks.xstream.XStream;
+import hudson.util.XStream2;
+import java.io.File;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.regex.Pattern;
+import org.jenkinsci.plugins.workflow.job.WorkflowJob;
+import org.jenkinsci.plugins.workflow.job.WorkflowRun;
+import org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+import org.jvnet.hudson.test.JenkinsRule;
+
+/*
+ * The MIT License
+ *
+ * Copyright 2023 CloudBees, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+public class ProgramSerializationUtilTest {
+    @Rule public JenkinsRule r = new JenkinsRule();
+    @Rule public TemporaryFolder temp = new TemporaryFolder();
+
+    @Test
+    public void deserializeXmlDumpThenSerializeAndDeserializeProgramDotDat() throws Exception {
+        // Note that the XML dump fails to escape control characters, e.g. `<init>` for contructors and `<anonymous>`,
+        // so they need to be fixed manually.
+        // Also DescribablePickleFactory (cloudbees-workflow-template) breaks the dump, you have to delete the `<xml>`
+        // tag and in some cases the parent of that tag as well. `ExecutorStepExecution.step` needed similar but
+        // slightly different massaging for unknown reasons.
+        Path xmlDump = Paths.get("program.xml");
+        // Sacrificial execution
+        WorkflowJob p = r.createProject(WorkflowJob.class);
+        // We have to use Declarative so that `ModelInterpreter.groovy` and related classes exist in the script's classloader.
+        p.setDefinition(new CpsFlowDefinition("pipeline { agent none \n stages { stage('test') { steps { echo 'Hello, world!' } } } }", true));
+        WorkflowRun b = r.buildAndAssertSuccess(p);
+        CpsFlowExecution e = (CpsFlowExecution) b.getExecution();
+
+        CpsThreadGroup ctg = new CpsThreadGroup(e);
+        CpsFlowExecution.PROGRAM_STATE_SERIALIZATION.set(e);
+        XStream xs = new XStream();
+        // Awkward, but we need a classloader with `ModelInterpreter.groovy` loaded.
+        xs.setClassLoader(e.parseScript().getClass().getClassLoader());
+        // Beware XXE!
+        xs.allowTypesByRegExp(new Pattern[] { Pattern.compile(".*") });
+        // Add mocks for all library vars/. Search for `serialization="custom"` in the XML file as a starting point.
+        // TODO: Set up a script to do this dynamically using a build folder, which is going to be necessary anyway for
+        // people using src/.
+        addAliasesForLibraryVars(xs, "var1", "var2");
+        xs.alias("WorkflowScript", CpsScriptImpl.class);
+        xs.unmarshal(XStream2.getDefaultDriver().createReader(xmlDump.toFile()), ctg);
+        //System.out.println(ctg.asXml()); // Sanity check
+        File tempFile = temp.newFile();
+        CpsVmExecutorService.CURRENT.set(ctg);
+        ctg.saveProgram(tempFile);
+        RiverReader r = new RiverReader(tempFile, e.parseScript().getClass().getClassLoader(), b.asFlowExecutionOwner());
+        CpsThreadGroup ctg2 = (CpsThreadGroup) r.restorePickles(new ArrayList<>()).get().readObject();
+        //System.out.println(ctg2.asXml()); // Sanity check
+    }
+
+    private static void addAliasesForLibraryVars(XStream xs, String... vars) {
+        for (String var : vars) {
+            xs.alias(var, CpsScriptImpl.class);
+        }
+    }
+
+    // Note: For complex scripts you may need to adjust this class. Search for `<WorkflowScript` in the XML file to see what is needed.
+    private static class CpsScriptImpl extends CpsScript {
+        public CpsScriptImpl() throws Exception {
+            super();
+        }
+        @Override
+        public Object run() {
+            throw new AssertionError("unused");
+        }
+    }
+}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants