On the trail of the ERREXIT bug in OpenBSD

This issue seems like a promising gateway drug to get me over my terror at the scope of the problem we're tackling, and get me moving again. So I've been spelunking down the sources of /bin/sh.

The sources for /bin/sh are at /usr/src/bin/ksh (which runs in Bourne shell mode if argv[0] is called 'sh').

main.c: set -e sets FERREXIT. FERREXIT is checked in unwind() to decide on the exit status of a command.

When I instrument all calls to unwind() while reproducing the bug, I find that:

when y0.c is newer than y0.o (everything working correctly), control gets to a call to unwind(LERROR) at the end of execute() (exec.c)
when y0.c is older than y0.o (buggy), control instead ends up at a call to unwind(LEXIT) at the final call-site in shell() (main.c)

In the buggy case, it fails to get to unwind(LEXIT) because *xerrok is 1.

I think I see the problem. First some background:

execute() is a recursive function. Calling it on a list of commands results in recursive calls for each command, calling it on a pipeline or && or || of commands results in recursive calls for each pipe stage or branch.
One of the arguments to execute() is int* xerrok which ripples through (almost) all recursive calls.

The problem, I think, is this fragment of code:

case TOR:
case TAND:
  rv = execute(t->left, XERROK, xerrok);
  if ((rv == 0) == (t->type == TAND))
    rv = execute(t->right, flags & XERROK, xerrok);  // <=== one of the recursive calls
  else {
    flags |= XERROK;
    if (xerrok)
      *xerrok = 1;  // <=== disable `set -e` for *all* future execution!
  }
  break;

Instrumenting some printfs confirms that *xerrok is set even in subsequent statements in a for loop, after the && compound has completed.

Took me a while to figure out where in the world xerrok is being allocated. Answer: it's internal to execute() at the line:

if (xerrok == NULL)
  xerrok = &dummy;

So execute() is always called with NULL xerrok at the top-most level.

Why do we even need this variable? Now creating a few test scripts to try to answer that..

Another place where clues may be found is the CVS history.

The xerrok parameter was introduced in revision 1.49 of exec.c (jaredy, Jan 2009). Since then it's been tweaked just once, in revision 1.50 (millert, Jun 2013), fixing a bug in precisely the TAND case I've been looking at. That gives me some confidence there may still be a bug here.

Commit 1.50 makes mention of regression tests, and I remember that Stephen showed me where they were: /usr/src/regress/bin/ksh.

You run the regression tests by running:

/usr/src $ make regression-tests

I'm seeing some failures in my OpenBSD 5.9 release. Maybe because I'm running on zsh?

In any case, make output shows that the command for running ksh regression tests is:

/usr/bin/perl /usr/src/regress/bin/ksh/th -s /usr/src/regress/bin/ksh -p /bin/ksh -C pdksh,sh,ksh,posix,posix-upu -T /usr/src/regress/bin/ksh/obj

Looks like no other files were modified alongside exec.c rev 1.50.

I did uncover a related change from 2 days earlier while searching for it. After some reorg in Dec 2013, the current location of set -e tests is obsd-regress.t (just search for ' -e'). Looks like there's no tests combining for loops with set -e.

Ok, I have a short command to run the relevant regression test that currently passes from /usr/src/regress/bin/ksh:

perl th -s obsd-regress.t -p /usr/obj/bin/ksh/ksh -C sh -T /usr/src/regress/bin/ksh/obj

(-T seems to require an absolute path. Is it because obj is a symlink?)

Nice prompt email response from Todd Miller who added xerrok (millert above). He concurs that this seems like a bug.

A little spelunking into the th test-handler script in Perl yields a failing regression test:

name: and-list-error-3
description:
  Test combination of && and for loop in -e mode
# file foo0 is absent; foo1 is present
file-setup: file 644 "foo1"
# file bar0 is present; bar1 is absent
file-setup: file 644 "bar0"
arguments: !-e!
# first iteration of for loop errors at first branch of &&
#   execution should continue in spite of -e mode
# second iteration of for loop errors at second branch of &&
#   -e mode should trigger exit
stdin:
  for f in 0 1
  do
    test -f foo$f && test -f bar$f
  done
  echo "should not print"
expected-exit: e != 0
---

When I rip out the xerrok parameter entirely from the ksh codebase the above regression test passes. In fact, only two tests fail out of 11 tests for set -e: seterror-1 and seterror-2. Both have to do with running an && chain inside an if. Looks like the effort to fix if broke for loops.

Now looking more closely at the jaredy (Jared Yanovich) commit from Jan 2009.

Ok, I'm not smart enough for this. Ideas:

Rip out xerrok entirely. seterror-1 and seterror-2 fail.
Reset *xerrok after each command in TFOR. All tests pass(!) but it's too big a sledgehammer; this test fails:

name: seterror-9
description:
  The -e flag within a for statement should terminate && or || chains on
  failure in rightmost command.
stdin:
  for f in 0; do true && false; fi
  true
arguments: !-e!
expected-exit: e != 0

Replace the TOR/TAND case with a fragment modeled on TLIST; instead of recursing on each command, iterate through all the commands:

case TOR:
case TAND:
  while (t->type == TOR || t->type == TAND) {
    rv = execute(t->left, XERROK, NULL);  // NULL => xerrok: `and-list-error-3` fails; seterror-1 and `seterror-2` pass
    if ((rv == 0) == (t->type == TOR)) {
      flags |= XERROK;
      *xerrok = 1;
      goto Break;
    }
    t = t->right;
  }
  rv = execute(t, flags & XERROK, xerrok);

This passes my new regression test and-list-error-3, but fails seterror-1, seterror-2 and my new seterror-9 above. It feels closest to an elegant solution, but I need somehow to influence the current unwind before restoring the variable introduced in the previous stackframe.

One thing we have achieved: if (xerrok) inside execute() is utterly unnecessary. xerrok is always set if it's NULL on entry.

Miscellaneous annoyance: Motherfucking CVS, where I can't see all the changes made together. I can't revert or stash changes in a group, so juggling these different ideas creates a context switch and causes me to lose state.

A new morning, a new plan of attack.

A couple of functions have been useful to run from my base camp at /usr/src as part of my debug loop:

$ build() { cd bin/ksh; make; RET=$?; popd; return $RET; }
$ run_test() { cd regress/bin/ksh; perl th -s obsd-regress.t -p /usr/obj/bin/ksh/ksh -C sh -T /usr/src/regress/bin/ksh/obj; popd; }

Now I can run build && run_test to try out any changes I make.

Solved. Turns out the answer is this change to for loops:

    if (t->type == TFOR) {
+     save_xerrok = *xerrok;
      while (*ap != NULL) {
        setstr(global(t->str), *ap++, KSH_UNWIND_ERROR);
+       /* undo xerrok in all iterations except the last */
+       *xerrok = save_xerrok;
        rv = execute(t->left, flags & XERROK, xerrok);
      }
+     /* ripple xerrok set at final iteration of loop */
    }

Now investigating if we need to similarly change while and until loops as well..

Realized after sending out the patch today that I'd never recompiled the entire userspace with my change. Tried that -- and the build turned out to be corrupted yet again. Creating a new server from scratch. This time create a snapshot before messing with it.

Yes, the patch is fine. Accepted split up into multiple commits: 1 2 3 4 5 6. Major commits are 3 for the fix and 5 for the regression test.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

On the trail of the ERREXIT bug in OpenBSD

Clone this wiki locally