Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use an Asynchronous Event Loop in SbyJob (rebased) #108

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

edbordin
Copy link
Contributor

@edbordin edbordin commented Aug 9, 2020

This is mostly just a rebased version of #39. On @whitequark's request I have also replaced the use of taskkill on Windows with a console process group that is terminated using a ctrl+break signal. I verified using SysInternals ProcessMonitor that this causes the child processes to exit with code 0xC000013A which means they received the signal.

cc: @cr1901 as the original author

Copy link
Member

@nakengelhardt nakengelhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know there are a lot of considerations that went into SbyJob/SbyTask that I am not fully aware of and that you'll need to wait for review from @clairexen on, but just testing it out I did notice that the error handling is now significantly less graceful, e.g. if a solver is not installed:

on master:

make test_demo2
cd sbysrc && python3 sby.py -f demo2.sby
SBY 12:57:53 [demo2] Removing directory 'demo2'.
SBY 12:57:53 [demo2] Writing 'demo2/src/demo.v'.
SBY 12:57:53 [demo2] engine_0: aiger suprove
SBY 12:57:53 [demo2] engine_1: aiger avy
SBY 12:57:53 [demo2] nomem: starting process "cd demo2/src; yosys -ql ../model/design_nomem.log ../model/design_nomem.ys"
SBY 12:57:53 [demo2] nomem: finished (returncode=0)
SBY 12:57:53 [demo2] aig: starting process "cd demo2/model; yosys -ql design_aiger.log design_aiger.ys"
SBY 12:57:53 [demo2] aig: finished (returncode=0)
SBY 12:57:53 [demo2] engine_0: starting process "cd demo2; suprove model/design_aiger.aig"
SBY 12:57:53 [demo2] engine_1: starting process "cd demo2; avy --cex - model/design_aiger.aig"
SBY 12:57:53 [demo2] engine_0: finished (returncode=127)
SBY 12:57:53 [demo2] engine_0: COMMAND NOT FOUND. ERROR.
SBY 12:57:53 [demo2] engine_1: finished (returncode=0)
SBY 12:57:53 [demo2] engine_1: Status returned by engine: PASS
SBY 12:57:53 [demo2] summary: Elapsed clock time [H:MM:SS (secs)]: 0:00:00 (0)
SBY 12:57:53 [demo2] summary: Elapsed process time [H:MM:SS (secs)]: 0:00:00 (0)
SBY 12:57:53 [demo2] summary: engine_1 (aiger avy) returned PASS
SBY 12:57:53 [demo2] DONE (ERROR, rc=16)
make: *** [test_demo2] Error 16

on async (after merging master to get the new tests):

make test_demo2
cd sbysrc && python3 sby.py -f demo2.sby
SBY 12:59:13 [demo2] Removing directory 'demo2'.
SBY 12:59:13 [demo2] Writing 'demo2/src/demo.v'.
SBY 12:59:13 [demo2] engine_0: aiger suprove
SBY 12:59:13 [demo2] engine_1: aiger avy
SBY 12:59:13 [demo2] nomem: starting process "cd demo2/src; yosys -ql ../model/design_nomem.log ../model/design_nomem.ys"
SBY 12:59:13 [demo2] nomem: finished (returncode=0)
SBY 12:59:13 [demo2] aig: starting process "cd demo2/model; yosys -ql design_aiger.log design_aiger.ys"
SBY 12:59:13 [demo2] aig: finished (returncode=0)
SBY 12:59:13 [demo2] engine_0: starting process "cd demo2; suprove model/design_aiger.aig"
SBY 12:59:13 [demo2] engine_1: starting process "cd demo2; avy --cex - model/design_aiger.aig"
SBY 12:59:13 [demo2] engine_0: finished (returncode=127)
Traceback (most recent call last):
  File "sby.py", line 430, in <module>
    retcode |= run_job(t)
  File "sby.py", line 388, in run_job
    job.run(setupmode)
  File "/Users/nak/Source/SymbiYosys/sbysrc/sby_core.py", line 691, in run
    self.taskloop()
  File "/Users/nak/Source/SymbiYosys/sbysrc/sby_core.py", line 267, in taskloop
    loop.run_until_complete(poll_fut)
  File "/Users/nak/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
  File "/Users/nak/Source/SymbiYosys/sbysrc/sby_core.py", line 302, in task_poller
    await task.shutdown_and_notify()
  File "/Users/nak/Source/SymbiYosys/sbysrc/sby_core.py", line 186, in shutdown_and_notify
    self.handle_exit(self.p.returncode)
  File "/Users/nak/Source/SymbiYosys/sbysrc/sby_core.py", line 109, in handle_exit
    self.exit_callback(retcode)
  File "/Users/nak/Source/SymbiYosys/sbysrc/sby_engine_aiger.py", line 81, in exit_callback
    assert retcode == 0
AssertionError
make: *** [test_demo2] Error 1

From the latter it really isn't clear what the problem is at all.

I also remember that we ran into https://bugs.python.org/issue23548 a few times in CI when processes launched in asyncio errored, and subcommands failing is a very frequent and expected thing to happen in sby. Is this accounted for here?

@edbordin
Copy link
Contributor Author

It might be a couple days before I can dig into this again, but the first issue with missing tools definitely looks like a regression introduced by me. The handling of exit code 127 was added since the original PR was authored and although I tried to incorporate it, it looks like it will need to be adapted.

I haven't run into the second issue but it looks like the atexit handler here might be enough to mitigate it on older versions of python.

I am assuming the CI infrastructure is internal to Symbiotic EDA? I wouldn't blame you for wanting some unit tests for this but wouldn't want to reinvent the wheel. Also some of these errors are probably race conditions and hard to reproduce...

FWIW I was having issues with Yices crashing on OS X with python 3.8 (which uses the same event loop as linux) and here are two example logs:

2020-07-22T11:18:27.3795240Z SBY 11:18:27 [cover] Removing directory 'cover'.
2020-07-22T11:18:27.3796110Z SBY 11:18:27 [cover] Copy 'cover.sv' to 'cover/src/cover.sv'.
2020-07-22T11:18:27.3810530Z SBY 11:18:27 [cover] engine_0: smtbmc
2020-07-22T11:18:27.3846430Z SBY 11:18:27 [cover] base: starting process "cd cover/src; yosys -ql ../model/design.log ../model/design.ys"
2020-07-22T11:18:27.4315680Z SBY 11:18:27 [cover] base: finished (returncode=0)
2020-07-22T11:18:27.4317070Z SBY 11:18:27 [cover] smt2: starting process "cd cover/model; yosys -ql design_smt2.log design_smt2.ys"
2020-07-22T11:18:27.4468120Z SBY 11:18:27 [cover] smt2: finished (returncode=0)
2020-07-22T11:18:27.4469830Z SBY 11:18:27 [cover] engine_0: starting process "cd cover; yosys-smtbmc --presat --unroll -c --noprogress -t 20  --append 0 --dump-vcd engine_0/trace%.vcd --dump-vlogtb engine_0/trace%_tb.v --dump-smtc engine_0/trace%.smtc model/design_smt2.smt2"
2020-07-22T11:18:27.5208620Z SBY 11:18:27 [cover] engine_0: ##   0:00:00  Solver: yices
2020-07-22T11:18:27.5274760Z SBY 11:18:27 [cover] engine_0: Traceback (most recent call last):
2020-07-22T11:18:27.5281100Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/yosys-smtbmc", line 1365, in <module>
2020-07-22T11:18:27.5282940Z SBY 11:18:27 [cover] engine_0: smt_assert_antecedent("(|%s_is| s0)" % (topmod))
2020-07-22T11:18:27.5284390Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/yosys-smtbmc", line 1149, in smt_assert_antecedent
2020-07-22T11:18:27.5284990Z SBY 11:18:27 [cover] engine_0: smt.write("(assert %s)" % expr)
2020-07-22T11:18:27.5285930Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 441, in write
2020-07-22T11:18:27.5286940Z SBY 11:18:27 [cover] engine_0: stmt = self.unparse(self.unroll_stmt(s))
2020-07-22T11:18:27.5288130Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 297, in unroll_stmt
2020-07-22T11:18:27.5288700Z SBY 11:18:27 [cover] engine_0: stmt = [self.unroll_stmt(s) for s in stmt]
2020-07-22T11:18:27.5289650Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 297, in <listcomp>
2020-07-22T11:18:27.5290200Z SBY 11:18:27 [cover] engine_0: stmt = [self.unroll_stmt(s) for s in stmt]
2020-07-22T11:18:27.5291150Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 327, in unroll_stmt
2020-07-22T11:18:27.5291720Z SBY 11:18:27 [cover] engine_0: self.write(self.unparse(decl), unroll=False)
2020-07-22T11:18:27.5292630Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 474, in write
2020-07-22T11:18:27.5293170Z SBY 11:18:27 [cover] engine_0: self.p_write(stmt + "\n", True)
2020-07-22T11:18:27.5294060Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 358, in p_write
2020-07-22T11:18:27.5294600Z SBY 11:18:27 [cover] engine_0: if flush: self.p.stdin.flush()
2020-07-22T11:18:27.5295000Z SBY 11:18:27 [cover] engine_0: BrokenPipeError: [Errno 32] Broken pipe
2020-07-22T11:18:27.5330800Z SBY 11:18:27 [cover] engine_0: finished (returncode=1)
2020-07-22T11:18:27.5332370Z SBY 11:18:27 [cover] ERROR: engine_0: Engine terminated without status.
2020-07-22T11:18:27.5336050Z SBY 11:18:27 [cover] DONE (ERROR, rc=16)
2020-07-22T11:18:27.5449760Z 
2020-07-22T11:18:27.5545270Z ##[error]Bash exited with code '16'.
2020-07-22T11:28:44.8130880Z SBY 11:28:44 [cover] Removing directory 'cover'.
2020-07-22T11:28:44.8131790Z SBY 11:28:44 [cover] Copy 'cover.sv' to 'cover/src/cover.sv'.
2020-07-22T11:28:44.8321440Z SBY 11:28:44 [cover] engine_0: smtbmc
2020-07-22T11:28:44.8639690Z SBY 11:28:44 [cover] base: starting process "cd cover/src; yosys -ql ../model/design.log ../model/design.ys"
2020-07-22T11:28:45.0847470Z SBY 11:28:45 [cover] base: finished (returncode=0)
2020-07-22T11:28:45.0849010Z SBY 11:28:45 [cover] smt2: starting process "cd cover/model; yosys -ql design_smt2.log design_smt2.ys"
2020-07-22T11:28:45.1195680Z SBY 11:28:45 [cover] smt2: finished (returncode=0)
2020-07-22T11:28:45.1197430Z SBY 11:28:45 [cover] engine_0: starting process "cd cover; yosys-smtbmc --presat --unroll -c --noprogress -t 20  --append 0 --dump-vcd engine_0/trace%.vcd --dump-vlogtb engine_0/trace%_tb.v --dump-smtc engine_0/trace%.smtc model/design_smt2.smt2"
2020-07-22T11:28:45.2781990Z SBY 11:28:45 [cover] engine_0: ##   0:00:00  Solver: yices
2020-07-22T11:28:45.2829950Z SBY 11:28:45 [cover] engine_0: ##   0:00:00  Checking cover reachability in step 0..
2020-07-22T11:28:45.3661690Z SBY 11:28:45 [cover] engine_0: ##   0:00:00  Unexpected EOF response from solver.
2020-07-22T11:28:45.3719500Z SBY 11:28:45 [cover] engine_0: finished (returncode=1)
2020-07-22T11:28:45.3779920Z SBY 11:28:45 [cover] ERROR: engine_0: Engine terminated without status.
2020-07-22T11:28:45.3892720Z SBY 11:28:45 [cover] DONE (ERROR, rc=16)
2020-07-22T11:28:45.4223840Z 
2020-07-22T11:28:45.4299320Z ##[error]Bash exited with code '16'.

@nakengelhardt
Copy link
Member

Yes, looking at it more closely it's just a question of moving the check for returncode == 127 before the call to self.handle_exit on line 186. I've had improving the error handling on my to-do list for an embarrassingly long time, that's why I was a bit sensitive to it. Those error traces you posted do look fine, it's relatively clear what the problem is.

We have added some tests to this repo the other day, which we also use in CI. If you merge recent master you can use make test now, and that should cover some ground (but they're not unit tests, and they will miss things). It also tries out all the solvers, so if you don't have some of them installed, you may have to call the tests individually that only use solvers you have, as there's no skipping mechanism yet. For the bug I linked, I don't really know how to reproduce it either - it was thrown by the CI runner code, which happens to use asyncio, not in the sby code itself, which doesn't yet. It might be a nonissue, I was just wondering if it was considered.

I'll be honest though, it's going to be an uphill battle getting this PR merged, as there's a lot of platforms to consider (every flavor of linux from ancient CentOS distributions to cutting-edge Arch, MacOS, several ways of launching things on Windows...) and we don't have the structure in place to test a major rewrite like this on all of them. The current implementation is at a point where we're mostly sure that it works most of the time on most of these platforms only because we got enough bug reports from users that told us when it didn't work, and it's an unappealing proposition to make our customers go through that again. Do I understand the description of the original PR correctly that the problem this addresses is just the performance impact of not running multiple solvers in parallel on windows?

@cr1901
Copy link
Contributor

cr1901 commented Aug 20, 2020

Do I understand the description of the original PR correctly that the problem this addresses is just the performance impact of not running multiple solvers in parallel on windows?

I can't remember if multiple solvers in parallel worked in Windows before this PR. When I said "parallel" in the original PR, that was in reference to the intended select functionality.

The performance impact comes from select returning immediately on Windows because selecting on file descriptors isn't actually supported. Because select returns immediately, the Windows version of sby busy-waits for all events to complete, rather than blocking in select until either an event arrives or a timeout is reached. I think there should be parity in resource usage between Windows and Unix here.

From what I remember, sby's main loop is basically: If an event arrives, handle event. If timeout is reached, check to see if we should bail, and if not, wait for another event. IIRC, the async rewrite removes the "check to see if we should bail part", and thus the need for any looping logic, by making the Python event loop decide when to bail.

@nakengelhardt
Copy link
Member

Ah, I see. So it's still working correctly before this change though, just suboptimally because it wastes one core busy-looping, right?

@cr1901
Copy link
Contributor

cr1901 commented Aug 22, 2020

@nakengelhardt I apologize, it looks like I misremembered a bit. The busy-waiting is a problem, but by "parallel" it looks like I was referring to this comment in a previous PR:

However, the support is limited to one task per job, because the poll loop relies on select.

Unfortunately, I don't remember the difference between SbyJob and SbyTask anymore (Job is per solver, and Task is "unit of work associated with a Job?") .

At the time, what I meant by this is "if you're doing prove mode, which will run BMC and k-induction, sby on Linux will run BMC and k-induction at the same time for a given solver- i.e. in parallel. On Windows, BMC will run first and then k-induction will run- i.e. sequentially." While sby does order Jobs (or is it Tasks?) to run in the proper order, BMC and k-induction do not depend on each other, so they can run at the same time. FWIW, I can confirm this problem still exists on Windows as of the current HEAD (b172357).

I didn't write a good bug report- if I did, I would described this behavior in more detail other than "the culprit is select doesn't work on fds on Windows". At this time, I don't remember why select's behavior on Windows breaks running jobs in parallel. The async rewrite fixed this behavior as well as the "busy waiting pinning a core" problem, at the cost of... well, other issues as described in this PR and #39.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants