-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bin/shtests io leaves processes running #213
Comments
I'm not having any luck reproducing this. Tried Slackware and CentOS, both x86_64, on VirtualBox VMs. Tried replacing the login shell with ksh. Tried compiling with and without vmalloc. No change, no hanging processes, everything seems normal. @JohnoKing, @hyenias, any ideas? |
I haven't been able to reproduce the hanging bug. The closest I've gotten is by running
|
I'm open to anything about how to test for this. If it's the OS rather than ksh, wtf about the OS could it be? All three VM's have ksh u+m as their login shell, all use the same rc files. That shouldn't matter, but I put it out there as data. These are Fusion VM's. A fresh checkout of the repo does this. It's additive, each time I'm all the way back to d89ef0f and haven't found it to stop happening. I had not been running u+m on Linux at all until very recently, I was using 93u+ 20120801 on CentOS. I only noticed it because the tty assigned to the shell was increasing when I killed the terminal window and launched a new one. |
The only ones that hang around for you are the shcomp tests, for me it's all three posix/utf8/shcomp. I let it sit for ten minutes, they're still there.
|
Determined something else. If I ssh into the VM, and do the tests, the processes do not appear to hang. I get the same behavior @JohnoKing does. But running the test in gnome-terminal (apparently on all three of them, it's something called If I run the test in a simple xterm, the presentation is the same as in #213 (comment) again. Momentary processes, then they're gone. |
(sigh) Of course, I wonder what problem they think they're fixing with this client/server terminal architecture. I suppose yes, this issue is a problem. Is it a Interesting: https://eklitzke.org/gnome-terminal-server |
As far as I know urxvt and mlterm are unrelated to gnome or VTE, and both can/do also run as client/server, though I don't have much experience with them and I don't know under which conditions. Perhaps it's worth trying them too. |
I've managed to reproduce the bug in an Ubuntu 20.04 virtual machine using gnome-terminal, urxvt and xterm:
The bug does not occur in ksh93u+ but it can be reproduced in ksh93u+m. I couldn't get the bug to occur on bare metal. |
This bug was introduced in commit ab5dedd. I can't get it to occur in the latest commit if ksh/src/cmd/ksh93/include/defs.h Lines 51 to 52 in d4adc8f
Also of note is that the bug behaves differently in ab5dedd. |
Using a process substitution triggers the bug: $ cat ./procsub.sh
func() {
true
}
/bin/true <(true) >(true)
true <(true) >(true)
func <(true) >(true)
ps ax | grep ksh | grep -v grep | wc -l
$ arch/*/bin/ksh ./procsub.sh
8
$ arch/*/bin/ksh ./procsub.sh
14
$ arch/*/bin/ksh ./procsub.sh
20
$ arch/*/bin/ksh ./procsub.sh
26 |
After running
The repeated Lines 85 to 94 in d4adc8f
In my case, if I modify the following check to look for PID 658 the bug goes away in my Ubuntu VM: Line 89 in d4adc8f
This indicates |
I can't reproduce the bug using the patch below. @posguy99 does this diff fix the bug on your end? (edit: diff updated multiple times): diff --git a/src/cmd/ksh93/include/defs.h b/src/cmd/ksh93/include/defs.h
index 41b1729..f1b4310 100644
--- a/src/cmd/ksh93/include/defs.h
+++ b/src/cmd/ksh93/include/defs.h
@@ -138,6 +138,7 @@ struct shared
pid_t pid; /* $$, the main shell's PID (invariable) */
pid_t ppid; /* $PPID, the main shell's parent's PID */
pid_t current_pid; /* ${.sh.pid}, PID of current ksh process (updates when subshell forks) */
+ pid_t parent_pid; /* PID of the parent shell or subshell (used in process substitutions) */
int realsubshell; /* ${.sh.subshell}, actual subshell level (including virtual and forked) */
unsigned char sigruntime[2];
Namval_t *bltin_nodes;
diff --git a/src/cmd/ksh93/sh/init.c b/src/cmd/ksh93/sh/init.c
index bc226e9..5d5c3d2 100644
--- a/src/cmd/ksh93/sh/init.c
+++ b/src/cmd/ksh93/sh/init.c
@@ -1238,7 +1238,7 @@ Shell_t *sh_init(register int argc,register char *argv[], Shinit_f userinit)
sh_regress_init(shp);
#endif
shgd = sh_newof(0,struct shared,1,0);
- shgd->current_pid = shgd->pid = getpid();
+ shgd->parent_pid = shgd->current_pid = shgd->pid = getpid();
shgd->ppid = getppid();
shgd->userid=getuid();
shgd->euserid=geteuid();
@@ -1634,7 +1634,7 @@ int sh_reinit(char *argv[])
shp->errtrap = 0;
shp->end_fn = 0;
/* update ${.sh.pid}, $$, $PPID */
- shgd->current_pid = shgd->pid = getpid();
+ shgd->parent_pid = shgd->current_pid = shgd->pid = getpid();
shgd->ppid = getppid();
return(1);
}
diff --git a/src/cmd/ksh93/sh/subshell.c b/src/cmd/ksh93/sh/subshell.c
index 28d1c40..ef59cbc 100644
--- a/src/cmd/ksh93/sh/subshell.c
+++ b/src/cmd/ksh93/sh/subshell.c
@@ -215,6 +215,7 @@ void sh_subfork(void)
sp->subpid=0;
shp->st.trapcom[0] = (comsub==2 ? NULL : trap);
shp->savesig = 0;
+ shgd->parent_pid = shgd->current_pid;
/* sh_fork() increases ${.sh.subshell} but we forked an existing virtual subshell, so undo */
shgd->realsubshell--;
}
diff --git a/src/cmd/ksh93/sh/xec.c b/src/cmd/ksh93/sh/xec.c
index c116e34..a11d6d9 100644
--- a/src/cmd/ksh93/sh/xec.c
+++ b/src/cmd/ksh93/sh/xec.c
@@ -85,8 +85,7 @@ struct funenv
static void fifo_check(void *handle)
{
Shell_t *shp = (Shell_t*)handle;
- pid_t pid = getppid();
- if(pid==1)
+ if(kill(shp->gd->parent_pid,0) < 0)
{
unlink(shp->fifo);
sh_done(shp,0);
@@ -1894,7 +1893,7 @@ int sh_exec(register const Shnode_t *t, int flags)
_sh_fork(shp,pid,0,0);
if(pid==0)
{
- shgd->current_pid = getpid();
+ shgd->parent_pid = shgd->current_pid = getpid();
sh_exec(t->par.partre,flags);
shp->st.trapcom[0]=0;
sh_done(shp,0);
diff --git a/src/cmd/ksh93/tests/io.sh b/src/cmd/ksh93/tests/io.sh
index ace60fe..c900b4f 100755
--- a/src/cmd/ksh93/tests/io.sh
+++ b/src/cmd/ksh93/tests/io.sh
@@ -715,5 +715,18 @@ got=$(export tmp; "$SHELL" -ec \
(redirect {v}>$tmp/v.out; echo ok2 >&$v) 2>/dev/null
[[ -r $tmp/v.out && $(<$tmp/v.out) == ok2 ]] || err_exit 'redirect {varname}>file not working in a subshell'
+# ======
+# Test for process substitution infinite looping
+procsub_pid="$(
+ ulimit -t unlimited # fork the subshell
+ true >(true)
+ echo "$!"
+)"
+sleep 2 # wait for process to close (long wait to avoid false failures)
+
+kill -0 $procsub_pid 2> /dev/null &&
+ kill -KILL $procsub_pid && # don't leave around what is effectively a zombie process
+ err_exit "process substitutions loop after parent shell finishes"
+
# ======
exit $((Errors<125?Errors:125)) |
Thank you for tracing that. Glad it's not the same bug as #67 after all. Instead, it's systemd breaking UNIX. No big shocker there…
|
src/cmd/ksh93/sh/xec.c: - An infinite loop may occur if an orphaned process from a process substitution isn't adopted by PID 1. Rather than close the FIFO when the parent process is init, check for the existence of the the parent shell or parent forked subshell by using kill(2). src/cmd/ksh93/include/defs.h, src/cmd/ksh93/sh/init.c, src/cmd/ksh93/sh/subshell.c: - Add a new 'parent_pid' variable to store the correct PPID. The PPID cannot be obtained after creating the process substitution (as that causes intermittent looping) and 'shgd->current_pid' can't be used as it's the process substitution's PID. As it's only used for FIFO process substitutions, hide it behind !SHOPT_DEVFD. src/cmd/ksh93/tests/io.sh: - Test for the infinite looping bug by using a forked command substitution. Fixes: ksh93#213
Given this comment, is the patch under test the one in #213 (comment) , or some other patch? I want to test the right one. :) Edit: tested the patch in #213 (comment), the issue does not occur on Mint. |
The original patch I posted used |
Update: tested the patch from #213 (comment) in all three Linux VMs. Mint, Ubuntu, CentOS... the issue does not occur in any of them. |
Come on @McDutchie , aren't you a believer in LennartOS? |
Well, I went public web surfing looking up what a process substitution was and found the following test describing an issue returning spaces before and after its output which messed up calls to tar due to the extra spaces around it. I regret that I cannot find where I read it from as my history and further attempts to find it have failed. The following test creates an orphaned process 100% of the time on multiple terminal clients tried whether bare metal or ssh'ed across multiple boxes: Ubuntu 20.04, macOS, and FreeBSD 12.2. $ echo $KSH_VERSION
Version AJMv 93u+m/1.0.0-alpha+37c9ac27 2021-03-01
$ echo foo<(:)bar
foo /tmp/ksh.f060mj5 bar
$ jobs -l
[1] + 109506 Running echo foo<(:)bar
$ kill 109506
$ jobs -l
$ exit
~ $ /bin/ksh
$ echo $KSH_VERSION
Version AJM 93u+ 2012-08-01
$ echo foo<(:)bar
foo /dev/fd/4 bar
$ jobs -l
$ I will now work on updating and testing current #214. |
Confirmed, even with #213 (comment) applied, the orphan process is created. |
Confirmed as well. The fix I applied isn't correct as it causes $ true <(true)
$ wait
# Freezes The patch below takes a different approach. Rather than wait for the parent shell to close, diff --git a/src/cmd/ksh93/sh/xec.c b/src/cmd/ksh93/sh/xec.c
index c116e341..19f68bd3 100644
--- a/src/cmd/ksh93/sh/xec.c
+++ b/src/cmd/ksh93/sh/xec.c
@@ -85,12 +85,8 @@ struct funenv
static void fifo_check(void *handle)
{
Shell_t *shp = (Shell_t*)handle;
- pid_t pid = getppid();
- if(pid==1)
- {
- unlink(shp->fifo);
- sh_done(shp,0);
- }
+ unlink(shp->fifo);
+ sh_done(shp,0);
}
#endif /* !SHOPT_DEVFD */
This fixes the |
I was coming to the same conclusion. I'm not a fan of adding the extra global flag for this specific issue, either. There should be no need for any timer or check. Once the system has a file descriptor handle on a file, the file can be unlinked and it will remain open and working until it is closed. That is fundamental to all UNIX-like systems. The same is true for a FIFO, which is not even really a file -- it's just a pipe with a directory entry in the file system. Once I think the whole Also, there was no check for whether Please try this patch. diff --git a/src/cmd/ksh93/sh/xec.c b/src/cmd/ksh93/sh/xec.c
index c116e341..14d4f59e 100644
--- a/src/cmd/ksh93/sh/xec.c
+++ b/src/cmd/ksh93/sh/xec.c
@@ -81,19 +81,6 @@ struct funenv
/* ======== command execution ========*/
-#if !SHOPT_DEVFD
- static void fifo_check(void *handle)
- {
- Shell_t *shp = (Shell_t*)handle;
- pid_t pid = getppid();
- if(pid==1)
- {
- unlink(shp->fifo);
- sh_done(shp,0);
- }
- }
-#endif /* !SHOPT_DEVFD */
-
#if _lib_getrusage
/* getrusage tends to have higher precision */
static void get_cpu_times(Shell_t *shp, struct timeval *tv_usr, struct timeval *tv_sys)
@@ -1704,16 +1691,20 @@ int sh_exec(register const Shnode_t *t, int flags)
#if !SHOPT_DEVFD
if(shp->fifo && (type&(FPIN|FPOU)))
{
- int fn,fd = (type&FPIN)?0:1;
- void *fifo_timer=sh_timeradd(500,1,fifo_check,(void*)shp);
+ int fn, fd, save_errno;
+ fd = (type & FPIN) ? 0 : 1;
fn = sh_open(shp->fifo,fd?O_WRONLY:O_RDONLY);
- timerdel(fifo_timer);
- sh_iorenumber(shp,fn,fd);
- sh_close(fn);
- sh_delay(.001,0);
+ save_errno = errno;
unlink(shp->fifo);
free(shp->fifo);
shp->fifo = 0;
+ if(fn<0)
+ {
+ errno = save_errno;
+ errormsg(SH_DICT,ERROR_SYSTEM|ERROR_PANIC,"process substitution: FIFO open failed");
+ }
+ sh_iorenumber(shp,fn,fd);
+ sh_close(fn);
type &= ~(FPIN|FPOU);
}
#endif /* !SHOPT_DEVFD */ |
By the way, not that it matters if we are getting rid of |
That patch causes the regression tests added in #214 to fail:
|
Hmm. It does that with my patch, as well. Darn. |
The error check I added does suggest the reason for the freeze:
|
I think @JohnoKing's patch is correct. Works for me. However, I'd also like to add the error check to |
As you all have already confirmed, it works as far as I know as well. That's two fixes for the price of one--excellent. BTW, I used the following method to acquire an updated patch file to use with |
sh_open() blocks infinitely because the command |
I found a way to break the current PR. Reproducer: $ fn() { sleep 1; cat "$1"; }
$ fn <(echo hi)
cat: /var/folders/nx/k3vcpmhj2mv05p9mhrwf8_9c0000gr/T/ksh.f2tqq6n: No such file or directory (expected output: Because the parent PID check was removed, the FIFO was deleted after 500ms (half a second), so the function cannot access it after sleeping for one second. Back to the drawing board. Sorry. |
OK, stepping back and analysing the situation... the failed attempts have given us useful info. With my more radical patch, that removes the whole timer altogether, the reproducer above works fine. Which is not surprising because And waiting forever is normally what we want, if the command actually uses the FIFO file name argument and opens it. It is allowed to take an indeterminate amount of time to get around to doing that, though. And it may not ever do it at all, such as when passing it as an argument to The problem is, Instead, then, maybe it will work to:
I think this should make it work while avoiding a global flag. Thoughts? |
Ah yes, of course we cannot add a new argument to |
I tried that already and it doesn't work correctly. It reintroduces the original bug (#213 (comment)) if placed right before
|
Confirmed. Turns out this is not actually a new problem. It was not introduced by your original patch but was there ever since ksh 93u+m switched to the FIFO method in ab5dedd. So it's an original bug in that method. Of course it should still be fixed somehow, but at this point I don't know how to go about that. |
In the master branch FIFOs stay open after being opened by a process substitution. Reading from the FIFO closes it: $ echo <(true)
/tmp/ksh.f2ikp4r
$ [ -e /tmp/ksh.f2ikp4r ]; echo $?
0
$ cat /tmp/ksh.f2ikp4r
$ [ -e /tmp/ksh.f2ikp4r ]; echo $?
1 Perhaps this bug could be fixed by removing the FIFO when it's no longer needed (i.e., when the command finishes running). |
I'll note this is can be done without reintroducing #213 (comment) (although it can't fix the |
That would be progress, at least. I've pushed a commit to the PR that I believe implements this in a more specific manner. The io.sh[743] |
Everything is working correctly in the new commit (besides the |
When bash uses the FIFO method and the command doesn't use the filename argument passed by the command substitution, it hangs, too. This is on AIX, which does not have
|
I suspect the remaining regression is either not feasible to fix or not worth bothering with, as it is inherent to using FIFOs. If the parent process never uses the FIFO and never terminates either, then the child is going to block on opening the FIFO forever -- that is just how FIFOs work. Trying to get around that might get very hairy, particularly as you can pass any number of process substitutions to one command. I'm going to sleep on it and come back with my thoughts tomorrow. The original bug is now fixed, so tomorrow I might decide to merge the PR minus the failing test. |
I remember similar issues with the Solaris SMF which allows for different 'restarter' processes that take over part of the old job of |
I may have been too pessimistic in my message yesterday. It was in fact rather hairy, but I think I've got something working that properly stops unused process substitutions from lingering around. I've added another commit to the PR. See the commit message for the explanation of the new code. Please test, everyone. |
@hyenias wrote:
With the latest commit added to PR #214, the reproducer in your comment no longer leaves a lingering process. Can you confirm? As for the added spaces, that is indeed a difference from bash and zsh, but I'm not sure if it should be called a bug or not. Process substitutions are replaced by file names and it generally doesn't make much sense to concatenate those with other strings. OTOH I could also see an argument that if there isn't a space, ksh has no business inserting one. |
@JohnoKing found that reproducer I used at att#1487. I thought its case made sense to me but I have zero experience with process substitution. I go test now. |
It does make sense actually. What happens is not exactly that a space is inserted but that edit: Now filed as #215. |
No trouble with any of my attempts including using the I was wondering what could make that 50ms timer to be exceeded during the process substitution launch before it uses the FIFO. Perhaps network latency and/or remote program startup? Maybe launching a larger program than just a little /bin/* command perhaps a python3 script or a java app just might take longer than 50ms to respond. If so, perhaps if that hard coded 50ms could be tunable via a variable would allow some more latency/startup cushion in a particular case. |
The timer is not a timeout, it's the interval for the check if the child process forked by the process substitution should still be waiting for the parent to open the pipe. It does that by checking if the parent process still exists. That check was done every 500ms, it's now done every 50ms. I shortened the interval because computers are 1000s of times faster than when this was written and still won't even register the overhead on the radar, and it leaves any hanging process open for that much shorter. |
Thank you for the clarification. And with that, everything seems good to me. |
I've tested the latest commit and everything is working on my end. The |
I've tested here on Catalina, which doesn't break anything, and on Mint, which is no longer broken when running
The test from @hyenas also works now:
Edit: Yes, the first time I posted, I thought the echo reproducer failed. It helps to run the right ksh binary. :( |
The wait freeze no longer occurs as well.
|
Sounds great, thanks for testing everyone. I'll merge the pull request now. |
Huh? Why are all these processes being left? What are they?
This is happening on Linux. I have a Mint VM, as well as Ubuntu and CentOS. They all do it. They hang around forever.
For once, macOS does not do it. So far, anyway.
The text was updated successfully, but these errors were encountered: