-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TracingRiverMarshallerFactory
to diagnose StackOverflowError
deserializing program.dat
#239
base: master
Are you sure you want to change the base?
Conversation
…rializing `program.dat`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
@@ -151,7 +153,7 @@ public ListenableFuture<Unmarshaller> restorePickles(Collection<ListenableFuture | |||
config.setClassResolver(new SimpleClassResolver(classLoader)); | |||
//config.setSerializabilityChecker(new SerializabilityCheckerImpl()); | |||
config.setObjectResolver(combine(evr, ownerResolver)); | |||
Unmarshaller eu = new RiverMarshallerFactory().createUnmarshaller(config); | |||
Unmarshaller eu = new TracingRiverMarshallerFactory().createUnmarshaller(config); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overridden methods will increase the stack depth by 1 for each nested object, so if issues like JENKINS-52966 are not due to cycles but just deeply nested objects because of the way Declarative is designed (which seems quite possible given user comments in that issue), this will make the problem worse. Could we somehow use the old code path by default, but then whenever we hit a StackOverflowError
re-deserialize things with TracingRiverMarshallerFactory
enabled? Or maybe only enable tracing if a system property is set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. If this PR does turn out to be capable of diagnosing an issue in this area, we would need to decide how to activate it. Besides the issue you mention, it will add some heap usage when loading normal programs.
@Test public void stackOverflow() throws Throwable { | ||
rr.then(r -> { | ||
WorkflowJob p = r.createProject(WorkflowJob.class, "p"); | ||
p.setDefinition(new CpsFlowDefinition("class R {Object f}; def x = new R(); for (int i = 0; i < 1000; i++) {def y = new R(); y.f = x; x = y}; semaphore 'wait'", true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test fails for me locally because there is no StackOverflowError
. It seems the minimum number of iterations needed in my environment is around 1774, so maybe bump this?
p.setDefinition(new CpsFlowDefinition("class R {Object f}; def x = new R(); for (int i = 0; i < 1000; i++) {def y = new R(); y.f = x; x = y}; semaphore 'wait'", true)); | |
p.setDefinition(new CpsFlowDefinition("class R {Object f}; def x = new R(); for (int i = 0; i < 2000; i++) {def y = new R(); y.f = x; x = y}; semaphore 'wait'", true)); |
There's kind of a fine line though, because if you go much farther then I start to get a StackOverflowError
during serialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I had to tune the number to get a SOE during deser but not ser. Not sure how to make a test of this reliable. and maybe it should just be @Ignore
d and left there for use when experimenting.
} catch (StackOverflowError x) { | ||
// Will cause a StreamCorruptionError eventually, but not before we capture the parent objects: | ||
o = null; | ||
err = x; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From some local testing, I think swallowing the StackOverflowError
like this also has the side effect of making the standard tracing mechanism in JBoss marshalling work, at least to some extent, because it seems to have around 100 fewer entries than the custom trace added in this PR. For example, along with the new trace I also see the following, which does not exist without the PR:
Caused by: an exception which occurred:
in field com.cloudbees.groovy.cps.impl.BlockScopeEnv.types
in object com.cloudbees.groovy.cps.impl.BlockScopeEnv@2ffb8577
in object of type com.cloudbees.groovy.cps.impl.BlockScopeEnv
in field com.cloudbees.groovy.cps.impl.ProxyEnv.parent
in object com.cloudbees.groovy.cps.impl.TryBlockEnv@64d0aa4f
in object of type com.cloudbees.groovy.cps.impl.TryBlockEnv
in field com.cloudbees.groovy.cps.impl.ProxyEnv.parent
in object com.cloudbees.groovy.cps.impl.BlockScopeEnv@17735b9f
in object of type com.cloudbees.groovy.cps.impl.BlockScopeEnv
in field com.cloudbees.groovy.cps.impl.ProxyEnv.parent
...
At one point I tried to augment the CPS Env
objects to carry a SourceLocation
that we could then use to augment JBoss Marshalling trace information to make it easier to understand where the objects are coming from, but I never managed to get things to work in a way that was useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that trace is due to insertion of null
into a place where an Object
is expected, so a secondary error that we intentionally generate as the price of being able to continue walking back up the stack and collecting returning values from readObject
. AFAICT it tells you nothing about the primary StackOverflowError
.
The basic problem is that given the protected
methods available in RiverUnmarshaller
(readFields
does not count because one of its args is package-scope!) all we can do is record what readObject
successfully returns, but we cannot intercept what class it is starting to load. So the only thing we can do is attempt to force hundreds of readObject
stack frames to return normally, noting the class name being loaded in each, before going back and marking the whole deserialization as a failure. This is clearly inferior to what you would be able to get from, say, a parser event listener with some kind of startElement
method.
Just for the record, here is a Patchdiff --git a/plugin/pom.xml b/plugin/pom.xml
index 913e2d21..871a1477 100644
--- a/plugin/pom.xml
+++ b/plugin/pom.xml
@@ -44,6 +44,7 @@
<no-test-jar>false</no-test-jar>
<node.version>18.15.0</node.version>
<yarn.version>1.22.19</yarn.version>
+ <argLine>-Xss512k -Xms768M -Xmx768M -XX:+HeapDumpOnOutOfMemoryError -XX:+TieredCompilation -XX:TieredStopAtLevel=1 @{jenkins.addOpens} @{jenkins.insaneHook}</argLine>
</properties>
<dependencyManagement>
<dependencies>
@@ -72,6 +73,7 @@
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-support</artifactId>
+ <version>866.v23ff92793deb_</version>
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
@@ -114,6 +116,7 @@
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-support</artifactId>
<classifier>tests</classifier>
+ <version>866.v23ff92793deb_</version>
<scope>test</scope>
</dependency>
<dependency>
@@ -152,6 +155,11 @@
<artifactId>pipeline-input-step</artifactId>
<scope>test</scope>
</dependency>
+ <dependency>
+ <groupId>org.jenkinsci.plugins</groupId>
+ <artifactId>pipeline-model-definition</artifactId>
+ <scope>test</scope>
+ </dependency>
<dependency>
<groupId>org.jenkins-ci.plugins</groupId>
<artifactId>junit</artifactId>
diff --git a/plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java b/plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
index c021db45..54971079 100644
--- a/plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
+++ b/plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
@@ -570,7 +570,7 @@ public class CpsFlowExecution extends FlowExecution implements BlockableResume {
return LoggingInvoker.create(isSandbox());
}
- private CpsScript parseScript() throws IOException {
+ CpsScript parseScript() throws IOException {
// classloader hierarchy. See doc/classloader.md
CpsScript s;
try {
diff --git a/plugin/src/test/java/org/jenkinsci/plugins/workflow/cps/ProgramSerializationUtilTest.java b/plugin/src/test/java/org/jenkinsci/plugins/workflow/cps/ProgramSerializationUtilTest.java
new file mode 100644
index 00000000..f8e2bbce
--- /dev/null
+++ b/plugin/src/test/java/org/jenkinsci/plugins/workflow/cps/ProgramSerializationUtilTest.java
@@ -0,0 +1,99 @@
+package org.jenkinsci.plugins.workflow.cps;
+
+import com.thoughtworks.xstream.XStream;
+import hudson.util.XStream2;
+import java.io.File;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.regex.Pattern;
+import org.jenkinsci.plugins.workflow.job.WorkflowJob;
+import org.jenkinsci.plugins.workflow.job.WorkflowRun;
+import org.jenkinsci.plugins.workflow.support.pickles.serialization.RiverReader;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+import org.jvnet.hudson.test.JenkinsRule;
+
+/*
+ * The MIT License
+ *
+ * Copyright 2023 CloudBees, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+public class ProgramSerializationUtilTest {
+ @Rule public JenkinsRule r = new JenkinsRule();
+ @Rule public TemporaryFolder temp = new TemporaryFolder();
+
+ @Test
+ public void deserializeXmlDumpThenSerializeAndDeserializeProgramDotDat() throws Exception {
+ // Note that the XML dump fails to escape control characters, e.g. `<init>` for contructors and `<anonymous>`,
+ // so they need to be fixed manually.
+ // Also DescribablePickleFactory (cloudbees-workflow-template) breaks the dump, you have to delete the `<xml>`
+ // tag and in some cases the parent of that tag as well. `ExecutorStepExecution.step` needed similar but
+ // slightly different massaging for unknown reasons.
+ Path xmlDump = Paths.get("program.xml");
+ // Sacrificial execution
+ WorkflowJob p = r.createProject(WorkflowJob.class);
+ // We have to use Declarative so that `ModelInterpreter.groovy` and related classes exist in the script's classloader.
+ p.setDefinition(new CpsFlowDefinition("pipeline { agent none \n stages { stage('test') { steps { echo 'Hello, world!' } } } }", true));
+ WorkflowRun b = r.buildAndAssertSuccess(p);
+ CpsFlowExecution e = (CpsFlowExecution) b.getExecution();
+
+ CpsThreadGroup ctg = new CpsThreadGroup(e);
+ CpsFlowExecution.PROGRAM_STATE_SERIALIZATION.set(e);
+ XStream xs = new XStream();
+ // Awkward, but we need a classloader with `ModelInterpreter.groovy` loaded.
+ xs.setClassLoader(e.parseScript().getClass().getClassLoader());
+ // Beware XXE!
+ xs.allowTypesByRegExp(new Pattern[] { Pattern.compile(".*") });
+ // Add mocks for all library vars/. Search for `serialization="custom"` in the XML file as a starting point.
+ // TODO: Set up a script to do this dynamically using a build folder, which is going to be necessary anyway for
+ // people using src/.
+ addAliasesForLibraryVars(xs, "var1", "var2");
+ xs.alias("WorkflowScript", CpsScriptImpl.class);
+ xs.unmarshal(XStream2.getDefaultDriver().createReader(xmlDump.toFile()), ctg);
+ //System.out.println(ctg.asXml()); // Sanity check
+ File tempFile = temp.newFile();
+ CpsVmExecutorService.CURRENT.set(ctg);
+ ctg.saveProgram(tempFile);
+ RiverReader r = new RiverReader(tempFile, e.parseScript().getClass().getClassLoader(), b.asFlowExecutionOwner());
+ CpsThreadGroup ctg2 = (CpsThreadGroup) r.restorePickles(new ArrayList<>()).get().readObject();
+ //System.out.println(ctg2.asXml()); // Sanity check
+ }
+
+ private static void addAliasesForLibraryVars(XStream xs, String... vars) {
+ for (String var : vars) {
+ xs.alias(var, CpsScriptImpl.class);
+ }
+ }
+
+ // Note: For complex scripts you may need to adjust this class. Search for `<WorkflowScript` in the XML file to see what is needed.
+ private static class CpsScriptImpl extends CpsScript {
+ public CpsScriptImpl() throws Exception {
+ super();
+ }
+ @Override
+ public Object run() {
+ throw new AssertionError("unused");
+ }
+ }
+}
|
The test reproduces a stack trace very similar to what was observed in a Jenkins controller which was unable to resume a particular pipeline:
I am not yet sure if the information being captured here can actually be used to diagnose the issue, but it is worth a try.