Avoid waiting on a single test for output #72

Open · wants to merge 1 commit into base: master
Conversation

ColMelvin commented Feb 9, 2018

When running Perl's prove with multiple jobs (-j #), the running jobs can
get stuck waiting on a single test to complete. This occurs when a test
outputs to STDERR, but does not output to STDOUT for some time after that.
The parser iterator gets stuck waiting for can_read() on that one test,
instead of looking at all of the running tests.

Avoid this slowdown by returning, potentially early, to ensure nothing
holds up the other tests.


To reproduce the problem, create a directory and populate it with the following files:

00_blocker.t:

use Test::More tests => 1;
print STDERR "Error message\n";
sleep 10;
pass();

01_blocked.t:

use Test::More tests => 1;
select undef, undef, undef, 0.009;  # Wait 9ms
pass();

Then make 98 copies of the blocked test, to emphasize the time difference:

 for i in $(seq -w 2 99); do ln -s 01_blocked.t ${i}_blocked.t; done
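The setup steps above can be scripted in one pass. A sketch, assuming a POSIX shell with `seq`; the directory name `tap-starvation` is made up for illustration:

```shell
# Create a scratch directory with one blocker test and 99 blocked tests.
mkdir tap-starvation && cd tap-starvation

cat > 00_blocker.t <<'EOT'
use Test::More tests => 1;
print STDERR "Error message\n";
sleep 10;
pass();
EOT

cat > 01_blocked.t <<'EOT'
use Test::More tests => 1;
select undef, undef, undef, 0.009;  # Wait 9ms
pass();
EOT

# 98 symlinked copies, giving 99 "blocked" tests in total
for i in $(seq -w 2 99); do ln -s 01_blocked.t ${i}_blocked.t; done
```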

Run prove *_blocked.t. On my system, this takes about 7 wallclock seconds.

Run prove -j 2 *_blocked.t. With the parallelization of similar files, we would expect this to take about half the time. And, indeed, on my system it takes about 3 wallclock seconds.

If we were to add 00_blocker.t to the parallelized prove -j 2 run, we would expect it to start first and, with its 10 second sleep, it would force the other 99 tests to run sequentially. Thus, on my system it would take 10 seconds to run (as the max of 10 and 7 seconds).

Running prove --timer -j 2 *.t before this fix actually takes 13 seconds (or 10 + 3 seconds) on my system, because of the blocking nature. However, after the fix, it runs for 10 seconds, as expected.

@ribasushi commented:

@ColMelvin does your change also happen to fix this deadlock? https://rt.cpan.org/Ticket/Display.html?id=108390#txn-1551927

ColMelvin (Author) commented:

@ribasushi It would appear so:

Before:

HARNESS_OPTIONS=j2 perl -MTest::Harness -e 'Test::Harness::runtests( qw( ./test.t test.t ) )'; echo $?


27245 acquiring lock

Killed
137

After:

HARNESS_OPTIONS=j2 perl -MTest::Harness -e 'Test::Harness::runtests( qw( ./test.t test.t ) )'; echo $?


27101 acquiring lock

===(    1036;0  1/?  1035/? )===========================================

27102 acquiring lock

test.t .... ok                                                          
./test.t .. ok      
All tests successful.
Files=2, Tests=8200,  1 wallclock secs ( 0.59 usr  0.01 sys +  0.02 cusr  0.00 csys =  0.62 CPU)
Result: PASS
0

Leont (Member) commented Feb 10, 2018

Interesting. I wouldn't have expected a workaround to be that simple :-o

@ribasushi commented:

@Leont what's the next step to get this workaround in?

Leont (Member) commented Feb 20, 2018

what's the next step to get this workaround in?

The main reason why I haven't merged this yet is that I want to understand why it works, it kind of shouldn't. A quick dive through the source suggests that it will essentially insert an "unknown" token into the stream, and that will usually be ignored. I've been lacking tuits lately for further investigation (TAP::Harness being a maze of lambdas spread over different classes doesn't help here), let alone targeted testing.

bulk88 commented Feb 21, 2018

I agree with Leont: there will be a gazillion unknown-token objects of class "TAP::Parser::Result::Unknown" created, and a large rise in CPU/memory usage.

If my reading of the code is correct, that select() call is ONLY between the STDOUT and STDERR of one proc, not a select() across the STD FHs of multiple procs. If STDERR gets data, the Process.pm loop will block on STDERR or STDOUT again; you can't switch/leave the loop to another proc's FHs until the "current" proc outputs a STDOUT line.

It seems to me (I didn't step-debug it) that in

https://github.com/Perl-Toolchain-Gang/Test-Harness/blob/master/lib/TAP/Parser/Multiplexer.pm#L133

a select across ALL procs' ERR and OUT handles is done first, then another select is done between ONE proc's ERR and OUT. If one proc's ERR fires, currently it will call select in a loop for that one proc until that proc's OUT fires. In reality, Multiplexer.pm should reissue the global select, something like jumping from

my $result = $parser->next;

back to

unless (@ready) {

if ->next() saw STDERR. Some API design is required, since parser->next() is supposed to create/process a TAP line, not process an empty string, with the "unknown token" problem you identified.

=head3 C<get_select_handles>

Get a list of file handles which can be passed to C<select> to
determine the readiness of this parser.

=cut

sub get_select_handles { shift->_iterator->get_select_handles }

If I read it literally, a parser isn't "ready" if ERR fires, only if OUT fires. Maybe get_select_handles should only return OUT, not OUT and ERR (I can't test this myself), but the parser should pump its OUT and ERR together: for each OUT, try to pump ERR if anything is on ERR. However, I can imagine a deadlock where a test proc filled the OS kernel buffer with ERR data and hung in a write() to ERR, while perl is sleeping in select on that proc's OUT; there is nothing in OUT, and won't be until the harness proc drains ERR, unfreezing the test proc. Maybe the solution is not to have one IO::Handle ERR per test proc, but to have test procs inherit the harness's ERR unless the "merge" option is used.

bulk88 commented Feb 21, 2018

Thinking about the design, I think the API solution is: all iterator objects must have a method, will_block or ready or really_ready or can_read (a bad name, confusing with IO::Select); this method is inherited/forwarded by the Parser object to the Multiplexer. Each time IO::Select returns a parser, the FH that fired is passed to the parser, and a true/false is returned; if true, call _next():

return sub {

        # Drain all the non-selectable parsers first
        if (@$avid) {
            my ( $parser, $stash ) = @{ $avid->[0] };
            my $result = $parser->next;
            shift @$avid unless defined $result;
            return ( $parser, $stash, $result );
        }

      PULL:
        unless (@ready) {
            return unless $sel->count;
            @ready = $sel->can_read;
        }

        my ( $h, $parser, $stash, @handles ) = @{ shift @ready };
        goto PULL unless $parser->really_ready($h);
        my $result = $parser->next;

# forwarded from the parser to its iterator
sub really_ready { shift->_iterator->really_ready(@_) }

# in the iterator
sub really_ready {
    my $self = shift;
    my $fh   = shift;
    if ( $fh == $self->{err} ) {
        my $got = sysread $fh, my ($chunk), $self->{chunk_size};
        if ( !$got ) {    # 0 = EOF, undef = error; stop selecting on it?
            my $sel = $self->{sel};
            $sel->remove($fh);
            return 0;
        }
        print STDERR $chunk;    # echo STDERR
        return 0;
    }
    else {
        return 1;
    }
}

edit:

Ultimately, thinking about what I never finished at https://github.com/bulk88/Test-Harness/blob/bulk88/w32_async/lib/TAP/Parser/Multiplexer.pm#L154, each iterator must have a method that can be tickled for event-loop purposes from one central loop, but the API must ensure that you don't always have a "result" from a tickle event dispatching to an iterator. An example of a tickle without a result is a partial line on STDOUT without a newline. Currently in Process.pm, if a select fires on STDOUT and the STDOUT has a partial line (AFAIK OS console pipes trigger the select ready bit in units of one or more newlines at a time, not once per char/byte, so this may not be observable in the real world), the inner per-proc select in Process.pm will execute goto REDO and block on that one proc until it emits the rest of the line, starving other procs, just like a STDERR starves other procs. The sysread() should happen in the tickle method of the iterator, and the _next method of the iterator should ONLY fetch STDOUT lines already buffered in Perl land and queued in the iterator object, the way I did it in my branch.

Theoretically, if there are multiple event loops from different classes of Iterator in the same harness proc, the tickle should be a non-blocking poll as an API design, so you can poll an iterator object by itself every x milliseconds with a usleep() between polls. Yes, it's wasted CPU, but you sometimes CAN'T integrate two different event loops into the same "put proc to sleep" call, so sleep timers/timeouts and non-blocking polling are the only way to service both event sources from one future, yet-unwritten Multiplexer.pm. If the Multiplexer has special awareness of Iterator classes and can fuse them together into one wait-till-data syscall (select, but as a more generic concept), then no data from any test proc means no CPU time in the harness proc. After the wake-up from the fused wait-till-data() call in the Multiplexer, either one particular parser/iterator's tickle is called, or ALL iterators have their tickles called (say the data-ready event was an OS signal handler; it's unknown which iterator fired the signal, so all iterators are tickled).

ColMelvin (Author) commented:

@bulk88 Yep, the use of IO::Select->can_read in two different places is the issue. If Iterator::Process's can_read blocks, then the Multiplexer never gets to call its own can_read, so the other processes get blocked writing to their pipes. By forcing Iterator::Process not to block on its can_read, we avoid the starvation in the Multiplexer.

ColMelvin (Author) commented Feb 21, 2018

We have to return to avoid the $self->_finish call at the end of Iterator::Process. And returning undef indicates we are done with the Iterator, so it gets removed.

Could we change the system to ignore the empty string as a special case, instead of creating TAP::Parser::Result::Unknown objects? For example, we could modify TAP::Parser::Grammar::tokenize() to try again if it gets an empty string. I haven't tested it, but it seems like it would work.

my $line = '';
$line = $self->{iterator}->next while (defined $line && length $line == 0);

bulk88 commented Feb 21, 2018

Here is a stack trace of how Iterator::next is reached from Multiplexer::next.

 at ../lib/TAP/Parser/Iterator/Process.pm line 289, <GEN21> line 47.
        TAP::Parser::Iterator::Process::__ANON__() called at ../lib/TAP/Parser/Iterator/Process.pm line 310
        TAP::Parser::Iterator::Process::next_raw(TAP::Parser::Iterator::Process=HASH(0x21ec1ac)) called at ../lib/TAP/Parser/Iterator.pm line 66
        TAP::Parser::Iterator::next(TAP::Parser::Iterator::Process=HASH(0x21ec1ac)) called at ../lib/TAP/Parser/Grammar.pm line 288
        TAP::Parser::Grammar::tokenize(TAP::Parser::Grammar=HASH(0x21ecafc)) called at ../lib/TAP/Parser.pm line 1459
        eval {...} called at ../lib/TAP/Parser.pm line 1459
        TAP::Parser::__ANON__() called at ../lib/TAP/Parser.pm line 315
        TAP::Parser::next(TAP::Parser=HASH(0x1f6dcbc)) called at ../lib/TAP/Parser/Multiplexer.pm line 128
        TAP::Parser::Multiplexer::__ANON__() called at ../lib/TAP/Parser/Multiplexer.pm line 183
        TAP::Parser::Multiplexer::next(TAP::Parser::Multiplexer=HASH(0x215454c)) called at ../lib/TAP/Harness.pm line 624
        TAP::Harness::_aggregate_parallel(TAP::Harness=HASH(0x1ee88bc), TAP::Parser::Aggregator=HASH(0x20d1d34), TAP::Parser::Scheduler=HASH(0x21548ac)) called at ../lib/TAP/Harness.pm line 739
        TAP::Harness::aggregate_tests(TAP::Harness=HASH(0x1ee88bc), TAP::Parser::Aggregator=HASH(0x20d1d34), "base/cond.t", "base/if.t", "base/lex.t", "base/num.t", "base/pat.t", "base/rs.t", ...) called at ../lib/TAP/Harness.pm line 558
        TAP::Harness::__ANON__() called at ../lib/TAP/Harness.pm line 571
        TAP::Harness::runtests(TAP::Harness=HASH(0x1ee88bc), "base/cond.t", "base/if.t", "base/lex.t", "base/num.t", "base/pat.t", "base/rs.t", "base/term.t", ...) called at harness line 303
 at ../lib/TAP/Parser/Iterator/Process.pm line 289, <GEN21> line 48.
$line = $self->{iterator}->next while (defined $line && length $line == 0);

wouldn't that just block again on the same two ERR/OUT FHs? You are still inside one Parser, which is bound to one Grammar, which is bound to one Iterator. If the Parser (really the Iterator object) doesn't return control to the Multiplexer, or the Multiplexer never calls into a Parser/Grammar/Iterator sandwich, you can't switch test procs. Except, and this is probably a joke, if you want the "deep recursion" hell of having an Iterator that can't return a line call back into the Multiplexer to pop another FD/Parser off the IO::Select, then the Multiplexer dispatches to a 2nd Parser, and potentially you get one (or a fixed number of) stack frames for every frozen Iterator (AKA Grammar AKA Parser) in the harness proc :-D

Leont (Member) commented Feb 21, 2018

If my reading of the code is correct, that select() call is ONLY between the STDOUT and STDERR of one proc, not a select() across the STD FHs of multiple procs. If STDERR gets data, the Process.pm loop will block on STDERR or STDOUT again; you can't switch/leave the loop to another proc's FHs until the "current" proc outputs a STDOUT line.

Correct.

The interesting question would be "why does it do this in the first place?". The git archeology is confusing (6641692 and fe53db5).

I'm wondering if just ignoring STDERR wouldn't be a more obvious fix.

@ribasushi commented:

@bulk88

but I can imagine a deadlock where a test proc filled the OS kernel buffer with ERR data and hung in a write() to ERR, but perl is sleeping in select on that proc's OUT

You do not have to imagine it - I wrote a self-contained test for exactly this years ago, and referenced it higher up: https://rt.cpan.org/Ticket/Display.html?id=108390#txn-1551927

Neither is this a hypothetical - this is actual code that had to be written to stop CI ( and -jN installs ) from hanging randomly: https://metacpan.org/source/RIBASUSHI/DBIx-Class-0.082841/t/lib/DBICTest/Util.pm#L74-100
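The shape of that deadlock is easy to reproduce outside the harness. A minimal sketch, assuming a POSIX shell with /dev/zero; the function name `writer` is made up for illustration. The child floods STDERR and only then prints to STDOUT, so a parent that drains only STDOUT never sees the line once the STDERR pipe buffer fills:

```shell
# A process that writes ~1 MiB to stderr (far more than a kernel pipe
# buffer holds), then prints a single line on stdout.
writer() {
  dd if=/dev/zero bs=1024 count=1024 2>/dev/null | tr '\0' 'e' >&2
  echo "done on stdout"
}

# Safe: stderr is drained, so the stdout line arrives immediately.
writer 2>/dev/null

# Deadlock shape: if stderr instead went into a pipe nobody reads,
# writer would block in write(2) once the buffer fills and the stdout
# line would never arrive -- e.g. (bash process substitution, hangs):
#   writer 2> >(sleep 1000) | cat
```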

The reason I am highlighting this: please get together and implement a solution - either the one proposed or something more palatable. It doesn't have to be pretty. Nor terribly efficient. It has to work.

@ColMelvin
Copy link
Author

IMHO, software should be, in decreasing order of importance:

  1. Reliable
  2. Correct
  3. Maintainable
  4. Performant

At present, the code is unreliable and incorrect vis-à-vis this issue. With this pull-request applied, it is reliable, though its correctness is arguable and the performance takes a hit.

Iff the performance hit credibly reduces reliability (e.g. because of the OOM killer), we should close this pull-request without applying it. Otherwise, I argue this fix should be applied and a second fix can be added to improve performance. This avoids any delays and reduces the pressure for getting out a cleaner fix (thereby allowing the developer to focus on the cleanliness of the fix, rather than rushing to get a fix out). If reliability issues are discovered down the line, this change is small enough to be reverted with a minimal chance of conflicts.

That said, I cannot commit to providing performance improvements. I know that can be an issue for FOSS projects sometimes.

ColMelvin (Author) commented:

@bulk88 Good catch on the Grammar thing. I guess we're still stuck with TAP::Parser::Result::Unknown objects :(.

Other than filtering out ::Result::Unknown objects with 0 length in TAP::Harness, long after they are created, I have no ideas that might improve performance with minimal work.

Leont (Member) commented Feb 21, 2018

I'm wondering if just ignoring STDERR wouldn't be a more obvious fix.

I just pushed a different fix based on this idea to a branch.

Leont self-assigned this May 4, 2018