qstat returns no information about batch job #323

Open
cms21 opened this issue Mar 3, 2023 · 4 comments

cms21 commented Mar 3, 2023

This issue is seen on Polaris with PBS Pro. Jobs submitted to the prod queue are routed to the small, medium, and large queues. If something about that routing fails, the job disappears from PBS's history even though the original qsub command succeeded. Balsam therefore assumes the batch job is queued and tries to look it up with qstat, but qstat fails. This raises an uncaught exception that crashes the site. Sample error below.

2023-02-13 04:31:20.411 | 167662 | ERROR | balsam:120] Uncaught Exception <class 'balsam.platform.scheduler.scheduler.SchedulerNonZeroReturnCode'>: qstat: Unknown Job Id 412635.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
{
    "timestamp":1676262680,
    "pbs_version":"2022.1.1.20220926110806",
    "pbs_server":"polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
}
Traceback (most recent call last):
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/util/process.py", line 17, in run
    self._run()
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/site/service/service_base.py", line 23, in _run
    self.run_cycle()
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/site/service/scheduler.py", line 154, in run_cycle
    job_log = self.scheduler.parse_logs(job.scheduler_id, job.status_info.get("submit_script", None))
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/platform/scheduler/scheduler.py", line 163, in parse_logs
    log_data = cls._parse_logs(scheduler_id, job_script_path)
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/platform/scheduler/pbs_sched.py", line 300, in _parse_logs
    stdout = scheduler_subproc(args)
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/platform/scheduler/scheduler.py", line 37, in scheduler_subproc
    raise SchedulerNonZeroReturnCode(p.stdout)
balsam.platform.scheduler.scheduler.SchedulerNonZeroReturnCode: qstat: Unknown Job Id 412635.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
{
    "timestamp":1676262680,
    "pbs_version":"2022.1.1.20220926110806",
    "pbs_server":"polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
}

cms21 commented Apr 4, 2023

Bump this in priority

cms21 commented Apr 4, 2023

Another user has encountered this on Polaris. The solution discussed so far is to parse the message that qstat returns for these jobs. It looks like this:
(2022-09-08/multirl) csimpson@polaris-login-02:~> qstat -f -x 456714
qstat: Unknown Job Id 456714.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
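
A minimal sketch of that parsing, assuming the message keeps the format shown above. The function name `parse_unknown_job_id` and the regular expression are illustrative and are not part of Balsam:

```python
import re
from typing import Optional

# Matches the error line PBS printed above, e.g.
# "qstat: Unknown Job Id 456714.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
_UNKNOWN_JOB_RE = re.compile(r"qstat: Unknown Job Id (\S+)")


def parse_unknown_job_id(qstat_output: str) -> Optional[str]:
    """Return the job id PBS reported as unknown, or None if no such message is present.

    Hypothetical helper for illustration only; not an existing Balsam function.
    """
    match = _UNKNOWN_JOB_RE.search(qstat_output)
    return match.group(1) if match else None
```

Returning the matched id rather than a bare boolean lets the caller confirm that the message refers to the job it actually asked about.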

cms21 commented Apr 4, 2023

However, the solution I gave to the user was to hack site/service/scheduler.py, which could also be a fix. The proposed change is to replace

job_log = self.scheduler.parse_logs(job.scheduler_id, job.status_info.get("submit_script", None))
with this:

try:
    job_log = self.scheduler.parse_logs(job.scheduler_id, job.status_info.get("submit_script", None))
except Exception:
    # Log the failure and move on to the next job instead of crashing the site
    logger.exception(f"Job {job.scheduler_id} not found by scheduler")
    continue

cms21 commented Apr 28, 2023

PR #345 fixes part of this issue. When a qstat query returns a non-zero return code, Balsam now checks whether the returned message contains Unknown Job Id. If it does, the batch job's state is changed to submit_failed.
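
The check described above could look roughly like the sketch below. This is not the code from PR #345: the exception class is a local stand-in for Balsam's `SchedulerNonZeroReturnCode`, and `state_for_missing_job` and its `run_qstat` callable are hypothetical names used only for illustration.

```python
from typing import Callable, Optional


class SchedulerNonZeroReturnCode(Exception):
    """Local stand-in for Balsam's exception of the same name (assumption)."""


def state_for_missing_job(run_qstat: Callable[[str], str], scheduler_id: str) -> Optional[str]:
    """Return "submit_failed" if qstat reports the job as unknown.

    Returns None when qstat succeeds, so the caller can continue with its
    normal parsing. Any other non-zero return code is re-raised unchanged.
    """
    try:
        run_qstat(scheduler_id)
    except SchedulerNonZeroReturnCode as exc:
        # The traceback above shows the exception carrying qstat's output
        if "Unknown Job Id" in str(exc):
            return "submit_failed"
        raise
    return None
```

Re-raising every other failure keeps unrelated scheduler errors visible instead of silently marking those jobs as failed.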

This does not handle the case where a Balsam site has been inactive for longer than 2 weeks and could not get information on the finished batch job before PBS purged the record. In that case further development is needed, and this PR will erroneously change the state to submit_failed. However, a user can fix the state of the batch job by hand. It's unclear how common this second case is, but it should be addressed.
