Filesystem-related scaling issues for large pipelines #461

Open
gdevenyi opened this issue Jan 24, 2022 · 4 comments

@gdevenyi

We're currently trying to submit a MAGeT.py pipeline to Niagara for processing.

MAGeT.py spins for ~2h of CPU time doing something before Niagara kills it for misbehaving on a login node. No jobs are ever submitted, and no other work gets done.

Run command

MAGeT.py --verbose --pipeline-name=ASYN-long-20220121 \
--subject-matter mousebrain \
--files inputs/*lsq6.mnc --config-file niagara-maget.cfg --queue-type slurm

Config

[Niagara]
queue-type=slurm
min-walltime=86400
max-walltime=86400
max-idle-time=3600
time-to-accept-jobs=1380
ppn=40
proc=40
mem=188
num-executors=50
greedy=True
subject-matter=mousebrain
lsq12-protocol=/project/m/mchakrav/quarantine/2019b/pydpiper/protocols/linear/Pydpiper_default_lsq12_protocol.csv
atlas-library=/home/m/mchakrav/tulste/scratch/maget-merged-long-202201/atlas/
masking-method=ANTS
registration-method=ANTS
masking-nlin-protocol=/project/m/mchakrav/quarantine/2019b/pydpiper/protocols/CIC/Pydpiper_mincANTS_SyN_0.1_Gauss_2_1_40_micron_MAGeT_one_level_MASKING.pl
nlin-protocol=/project/m/mchakrav/quarantine/2019b/pydpiper/protocols/CIC/Pydpiper_mincANTS_SyN_0.1_Gauss_2_1_40_micron_MAGeT_one_level.pl

The pipeline stages are generated:

-rw-r----- 1 gdevenyi mchakrav 237M 2022-01-24 13:19 ASYN-long-20220121_pipeline_stages.txt

However, the log never gets beyond

[2022-01-24 13:20:43.835,pydpiper.execution.pipeline,INFO] Starting pipeline daemon...

before the process is killed.

@bcdarwin
Member

bcdarwin commented Jan 25, 2022

I am running some tests now, so I'm not exactly sure which part of the code is responsible yet. Note, though, that MAGeT scales (in total operations -- it's not as bad if you only count registrations) at least like number of atlases * number of templates * number of subjects, so reducing the number of templates is probably the easiest way to bring down the overall cost. My guess is that the overall issue is some combination of redundant file accesses via pyminc and creation of the output directories, but there are some CPU-limited parts as well which could also be optimized.
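
To make that scaling concrete, here is a rough back-of-the-envelope count (a sketch only; the atlas/template/subject counts below are illustrative placeholders, not the actual sizes of this pipeline):

# Rough cost model for a MAGeT-style pipeline: registrations grow as
# atlases*templates + templates*subjects, while candidate label files
# (the "total operations" above) grow as atlases*templates*subjects.
# The counts here are illustrative placeholders only.
n_atlases, n_templates, n_subjects = 5, 21, 1000

atlas_to_template = n_atlases * n_templates        # build the template library
template_to_subject = n_templates * n_subjects     # register templates to subjects
print("registrations:", atlas_to_template + template_to_subject)   # 21105
print("label files:  ", n_atlases * n_templates * n_subjects)      # 105000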

@bcdarwin
Member

Indeed, the majority of time appears to be spent in the output_directories and create_directories utility functions.
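
(For reference, a generic way to reproduce this kind of measurement is Python's built-in profiler; the file names and invocation below are illustrative, not the exact commands used here.)

# Sketch: run pipeline construction under cProfile, then inspect the dump to
# see which functions dominate cumulative time (expecting output_directories
# and create_directories near the top). Paths and arguments are illustrative.
#
#   python -m cProfile -o maget_profile.out MAGeT.py --files inputs/*lsq6.mnc ...
#
import pstats

stats = pstats.Stats("maget_profile.out")
stats.sort_stats("cumulative").print_stats(20)   # top 20 functions by cumulative time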

@bcdarwin
Member

At some point I added --defer-directory-creation which should help with the create_directories contribution but not output_directories -- the latter is maybe a case of os.path functions doing I/O ...
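
As a general illustration of why directory handling matters at this scale on a networked filesystem (a sketch only, not pydpiper's actual implementation): deduplicating output directories in memory before touching the filesystem reduces the metadata calls from one or more per stage output to one per distinct directory.

# Sketch (not pydpiper's actual code): naive per-output mkdir/exists checks
# issue filesystem metadata calls for every stage output, which is slow on
# parallel filesystems like Niagara's. Deduplicating first keeps the I/O
# proportional to the number of *distinct* directories.
import os

def create_output_directories(output_files):
    """output_files: iterable of output paths from all pipeline stages (illustrative)."""
    dirs = {os.path.dirname(f) for f in output_files}   # dedupe in memory, no I/O
    for d in sorted(dirs):
        os.makedirs(d, exist_ok=True)                    # one call per unique directory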

@bcdarwin changed the title from "Scaling issues for large number of input subjects (>1000) in MAGeT.py 2.0.13" to "Filesystem-related scaling issues for large pipelines" on Jan 26, 2022
@gdevenyi
Author

--defer-directory-creation got us past that point and on to job submission, which then failed with:

[2022-01-31 11:02:47.794,pydpiper.execution.pipeline,ERROR] Failed launching executors from the server.
Traceback (most recent call last):
  File "/project/m/mchakrav/quarantine/2019b/pydpiper/2.0.13/install/lib/python3.6/site-packages/pydpiper-2.0.13-py3.6.egg/pydpiper/execution/pipeline.py", line 825, in launchExecutorsFromServer
    mem_needed=memNeeded, uri_file=self.exec_options.urifile)
  File "/project/m/mchakrav/quarantine/2019b/pydpiper/2.0.13/install/lib/python3.6/site-packages/pydpiper-2.0.13-py3.6.egg/pydpiper/execution/pipeline.py", line 969, in launchPipelineExecutors
    pipelineExecutor.submitToQueue(number=number)
  File "/project/m/mchakrav/quarantine/2019b/pydpiper/2.0.13/install/lib/python3.6/site-packages/pydpiper-2.0.13-py3.6.egg/pydpiper/execution/pipeline_executor.py", line 440, in submitToQueue
    raise SubmitError({ 'return' : p.returncode, 'failed_command' : submit_cmd })
pydpiper.execution.pipeline_executor.SubmitError: {'return': 1, 'failed_command': ['qbatch', '--chunksize=1', '--cores=1', '--jobname=ASYN-long-20220121-executor-2022-01-31-at-11-02-47', '-b', 'slurm', '--walltime=23:59:59', '-']}

The terminal showed the following, which I think should have been captured in the log:
sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long
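
(As an aside on the log-capture point: a generic way for the launcher to surface that message in the pipeline log would be to capture the submitter's stderr. The sketch below is illustrative only and is not pydpiper's actual submitToQueue; the function and variable names are made up.)

# Sketch (illustrative, not pydpiper's submitToQueue): capture the submitter's
# stderr so messages like "sbatch: error: ... parameter too long" end up in the
# pipeline log rather than only on the terminal. Python 3.6-compatible.
import logging
import subprocess

logger = logging.getLogger(__name__)

def submit(submit_cmd, job_script_text):
    p = subprocess.run(submit_cmd, input=job_script_text,
                       stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                       universal_newlines=True)
    if p.returncode != 0:
        logger.error("submission failed (exit %d): %s\nstderr: %s",
                     p.returncode, " ".join(submit_cmd), p.stderr.strip())
        raise RuntimeError("job submission failed")
    return p.stdout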

We're retrying with --csv-file
