-
Notifications
You must be signed in to change notification settings - Fork 45
On the trail of the ERREXIT bug in OpenBSD
This issue seems like a promising gateway drug to get me over my terror at the scope of the problem we're tackling, and get me moving again. So I've been spelunking down the sources of /bin/sh
.
The sources for /bin/sh
are at /usr/src/bin/ksh
(which runs in Bourne shell mode if argv[0]
is called 'sh').
main.c: set -e
sets FERREXIT
. FERREXIT
is checked in unwind()
to decide on the exit status of a command.
When I instrument all calls to unwind()
while reproducing the bug, I find that:
- when y0.c is newer than y0.o (everything working correctly), control gets to a call to
unwind(LERROR)
at the end ofexecute()
(exec.c) - when y0.c is older than y0.o (buggy), control instead ends up at a call to
unwind(LEXIT)
at the final call-site inshell()
(main.c)
In the buggy case, it fails to get to unwind(LEXIT)
because *xerrok
is 1.
I think I see the problem. First some background:
-
execute()
is a recursive function. Calling it on a list of commands results in recursive calls for each command, calling it on a pipeline or&&
or||
of commands results in recursive calls for each pipe stage or branch. - One of the arguments to
execute()
isint* xerrok
which ripples through (almost) all recursive calls.
The problem, I think, is this fragment of code:
case TOR:
case TAND:
rv = execute(t->left, XERROK, xerrok);
if ((rv == 0) == (t->type == TAND))
rv = execute(t->right, flags & XERROK, xerrok); // <=== one of the recursive calls
else {
flags |= XERROK;
if (xerrok)
*xerrok = 1; // <=== disable `set -e` for *all* future execution!
}
break;
Instrumenting some printf
s confirms that *xerrok
is set even in subsequent statements in a for
loop, after the &&
compound has completed.
Took me a while to figure out where in the world xerrok
is being allocated. Answer: it's internal to execute()
at the line:
if (xerrok == NULL)
xerrok = &dummy;
So execute()
is always called with NULL xerrok
at the top-most level.
Why do we even need this variable? Now creating a few test scripts to try to answer that..
Another place where clues may be found is the CVS history.
The xerrok
parameter was introduced in revision 1.49 of exec.c (jaredy, Jan 2009). Since then it's been tweaked just once, in revision 1.50 (millert, Jun 2013), fixing a bug in precisely the TAND
case I've been looking at. That gives me some confidence there may still be a bug here.
Commit 1.50 makes mention of regression tests, and I remember that Stephen showed me where they were: /usr/src/regress/bin/ksh
.
You run the regression tests by running:
/usr/src $ make regression-tests
I'm seeing some failures in my OpenBSD 5.9 release. Maybe because I'm running on zsh?
In any case, make
output shows that the command for running ksh regression tests is:
/usr/bin/perl /usr/src/regress/bin/ksh/th -s /usr/src/regress/bin/ksh -p /bin/ksh -C pdksh,sh,ksh,posix,posix-upu -T /usr/src/regress/bin/ksh/obj
Looks like no other files were modified alongside exec.c rev 1.50.
I did uncover a related change from 2 days earlier while searching for it. After some reorg in Dec 2013, the current location of set -e
tests is obsd-regress.t
(just search for ' -e'). Looks like there's no tests combining for
loops with set -e
.
Ok, I have a short command to run the relevant regression test that currently passes from /usr/src/regress/bin/ksh
:
perl th -s obsd-regress.t -p /usr/obj/bin/ksh/ksh -C sh -T /usr/src/regress/bin/ksh/obj
(-T
seems to require an absolute path. Is it because obj
is a symlink?)
Nice prompt email response from Todd Miller who added xerrok
(millert above). He concurs that this seems like a bug.
A little spelunking into the th
test-handler script in Perl yields a failing regression test:
name: and-list-error-3
description:
Test combination of && and for loop in -e mode
# file foo0 is absent; foo1 is present
file-setup: file 644 "foo1"
# file bar0 is present; bar1 is absent
file-setup: file 644 "bar0"
arguments: !-e!
# first iteration of for loop errors at first branch of &&
# execution should continue in spite of -e mode
# second iteration of for loop errors at second branch of &&
# -e mode should trigger exit
stdin:
for f in 0 1
do
test -f foo$f && test -f bar$f
done
echo "should not print"
expected-exit: e != 0
---
When I rip out the xerrok
parameter entirely from the ksh
codebase the above regression test passes. In fact, only two tests fail out of 11 tests for set -e
: seterror-1
and seterror-2
. Both have to do with running an &&
chain inside an if
. Looks like the effort to fix if
broke for
loops.
Now looking more closely at the jaredy (Jared Yanovich) commit from Jan 2009.
Ok, I'm not smart enough for this. Ideas:
- Rip out
xerrok
entirely.seterror-1
andseterror-2
fail. - Reset
*xerrok
after each command inTFOR
. All tests pass(!) but it's too big a sledgehammer; this test fails:
name: seterror-9
description:
The -e flag within a for statement should terminate && or || chains on
failure in rightmost command.
stdin:
for f in 0; do true && false; fi
true
arguments: !-e!
expected-exit: e != 0
- Replace the
TOR
/TAND
case with a fragment modeled onTLIST
; instead of recursing on each command, iterate through all the commands:
case TOR:
case TAND:
while (t->type == TOR || t->type == TAND) {
rv = execute(t->left, XERROK, NULL); // NULL => xerrok: `and-list-error-3` fails; seterror-1 and `seterror-2` pass
if ((rv == 0) == (t->type == TOR)) {
flags |= XERROK;
*xerrok = 1;
goto Break;
}
t = t->right;
}
rv = execute(t, flags & XERROK, xerrok);
This passes my new regression test and-list-error-3
, but fails seterror-1
, seterror-2
and my new seterror-9
above. It feels closest to an elegant solution, but I need somehow to influence the current unwind before restoring the variable introduced in the previous stackframe.
One thing we have achieved: if (xerrok)
inside execute()
is utterly unnecessary. xerrok
is always set if it's NULL on entry.
Miscellaneous annoyance: Motherfucking CVS, where I can't see all the changes made together. I can't revert or stash changes in a group, so juggling these different ideas creates a context switch and causes me to lose state.
A new morning, a new plan of attack.
A couple of functions have been useful to run from my base camp at /usr/src
as part of my debug loop:
$ build() { cd bin/ksh; make; RET=$?; popd; return $RET; }
$ run_test() { cd regress/bin/ksh; perl th -s obsd-regress.t -p /usr/obj/bin/ksh/ksh -C sh -T /usr/src/regress/bin/ksh/obj; popd; }
Now I can run build && run_test
to try out any changes I make.
Solved. Turns out the answer is this change to for
loops:
if (t->type == TFOR) {
+ save_xerrok = *xerrok;
while (*ap != NULL) {
setstr(global(t->str), *ap++, KSH_UNWIND_ERROR);
+ /* undo xerrok in all iterations except the last */
+ *xerrok = save_xerrok;
rv = execute(t->left, flags & XERROK, xerrok);
}
+ /* ripple xerrok set at final iteration of loop */
}
Now investigating if we need to similarly change while
and until
loops as well..
Realized after sending out the patch today that I'd never recompiled the entire userspace with my change. Tried that -- and the build turned out to be corrupted yet again. Creating a new server from scratch. This time create a snapshot before messing with it.
Yes, the patch is fine. Accepted split up into multiple commits: 1 2 3 4 5 6. Major commits are 3 for the fix and 5 for the regression test.