Filesystem-related scaling issues for large pipelines #461

Open
gdevenyi opened this issue Jan 24, 2022 · 4 comments

@gdevenyi

We're currently trying to submit a MAGeT.py pipeline to Niagara for processing.

MAGeT.py spins for ~2h of CPU time doing something before Niagara kills it for misbehaving on a login node. No jobs are ever submitted, and no other work gets done.

Run command

MAGeT.py --verbose --pipeline-name=ASYN-long-20220121 \
--subject-matter mousebrain \
--files inputs/*lsq6.mnc --config-file niagara-maget.cfg --queue-type slurm

Config

[Niagara]
queue-type=slurm
min-walltime=86400
max-walltime=86400
max-idle-time=3600
time-to-accept-jobs=1380
ppn=40
proc=40
mem=188
num-executors=50
greedy=True
subject-matter=mousebrain
lsq12-protocol=/project/m/mchakrav/quarantine/2019b/pydpiper/protocols/linear/Pydpiper_default_lsq12_protocol.csv
atlas-library=/home/m/mchakrav/tulste/scratch/maget-merged-long-202201/atlas/
masking-method=ANTS
registration-method=ANTS
masking-nlin-protocol=/project/m/mchakrav/quarantine/2019b/pydpiper/protocols/CIC/Pydpiper_mincANTS_SyN_0.1_Gauss_2_1_40_micron_MAGeT_one_level_MASKING.pl
nlin-protocol=/project/m/mchakrav/quarantine/2019b/pydpiper/protocols/CIC/Pydpiper_mincANTS_SyN_0.1_Gauss_2_1_40_micron_MAGeT_one_level.pl

The pipeline stages are generated:

-rw-r----- 1 gdevenyi mchakrav 237M 2022-01-24 13:19 ASYN-long-20220121_pipeline_stages.txt

However, the log never gets beyond

[2022-01-24 13:20:43.835,pydpiper.execution.pipeline,INFO] Starting pipeline daemon...

before the process is killed.

@bcdarwin
Member

bcdarwin commented Jan 25, 2022

I am running some tests now, so I'm not exactly sure which part of the code is responsible yet. Note, though, that MAGeT scales (in total operations -- it's not as bad if you only count registrations) at least like number of atlases * number of templates * number of subjects, so reducing the number of templates is probably the easiest way to bring down the overall cost. My guess is that the overall issue is some combination of redundant file accesses via pyminc and creation of the output directories, but there are some CPU-limited parts as well which could also be optimized.
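
To make that scaling concrete, here is a rough back-of-the-envelope count (a sketch only; the atlas/template/subject counts below are illustrative placeholders, not the actual sizes of this pipeline):

# Rough cost model for a MAGeT-style pipeline: registrations grow as
# atlases*templates + templates*subjects, while candidate label files
# (the "total operations" above) grow as atlases*templates*subjects.
# The counts here are illustrative placeholders only.
n_atlases, n_templates, n_subjects = 5, 21, 1000

atlas_to_template = n_atlases * n_templates        # build the template library
template_to_subject = n_templates * n_subjects     # register templates to subjects
print("registrations:", atlas_to_template + template_to_subject)   # 21105
print("label files:  ", n_atlases * n_templates * n_subjects)      # 105000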

@bcdarwin
Member

Indeed, the majority of time appears to be spent in the output_directories and create_directories utility functions.
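
(For reference, a generic way to reproduce this kind of measurement is Python's built-in profiler; the file names and invocation below are illustrative, not the exact commands used here.)

# Sketch: run pipeline construction under cProfile, then inspect the dump to
# see which functions dominate cumulative time (expecting output_directories
# and create_directories near the top). Paths and arguments are illustrative.
#
#   python -m cProfile -o maget_profile.out MAGeT.py --files inputs/*lsq6.mnc ...
#
import pstats

stats = pstats.Stats("maget_profile.out")
stats.sort_stats("cumulative").print_stats(20)   # top 20 functions by cumulative time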

@bcdarwin
Member

At some point I added --defer-directory-creation which should help with the create_directories contribution but not output_directories -- the latter is maybe a case of os.path functions doing I/O ...
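
As a general illustration of why directory handling matters at this scale on a networked filesystem (a sketch only, not pydpiper's actual implementation): deduplicating output directories in memory before touching the filesystem reduces the metadata calls from one or more per stage output to one per distinct directory.

# Sketch (not pydpiper's actual code): naive per-output mkdir/exists checks
# issue filesystem metadata calls for every stage output, which is slow on
# parallel filesystems like Niagara's. Deduplicating first keeps the I/O
# proportional to the number of *distinct* directories.
import os

def create_output_directories(output_files):
    """output_files: iterable of output paths from all pipeline stages (illustrative)."""
    dirs = {os.path.dirname(f) for f in output_files}   # dedupe in memory, no I/O
    for d in sorted(dirs):
        os.makedirs(d, exist_ok=True)                    # one call per unique directory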

@bcdarwin changed the title from "Scaling issues for large number of input subjects (>1000) in MAGeT.py 2.0.13" to "Filesystem-related scaling issues for large pipelines" on Jan 26, 2022
@gdevenyi
Author

--defer-directory-creation got us past that point and on to job submission, which then failed with:

[2022-01-31 11:02:47.794,pydpiper.execution.pipeline,ERROR] Failed launching executors from the server.
Traceback (most recent call last):
  File "/project/m/mchakrav/quarantine/2019b/pydpiper/2.0.13/install/lib/python3.6/site-packages/pydpiper-2.0.13-py3.6.egg/pydpiper/execution/pipeline.py", line 825, in launchExecutorsFromServer
    mem_needed=memNeeded, uri_file=self.exec_options.urifile)
  File "/project/m/mchakrav/quarantine/2019b/pydpiper/2.0.13/install/lib/python3.6/site-packages/pydpiper-2.0.13-py3.6.egg/pydpiper/execution/pipeline.py", line 969, in launchPipelineExecutors
    pipelineExecutor.submitToQueue(number=number)
  File "/project/m/mchakrav/quarantine/2019b/pydpiper/2.0.13/install/lib/python3.6/site-packages/pydpiper-2.0.13-py3.6.egg/pydpiper/execution/pipeline_executor.py", line 440, in submitToQueue
    raise SubmitError({ 'return' : p.returncode, 'failed_command' : submit_cmd })
pydpiper.execution.pipeline_executor.SubmitError: {'return': 1, 'failed_command': ['qbatch', '--chunksize=1', '--cores=1', '--jobname=ASYN-long-20220121-executor-2022-01-31-at-11-02-47', '-b', 'slurm', '--walltime=23:59:59', '-']}

The terminal showed the following, which I think should have been captured in the log:
sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long
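
(As an aside on the log-capture point: a generic way for the launcher to surface that message in the pipeline log would be to capture the submitter's stderr. The sketch below is illustrative only and is not pydpiper's actual submitToQueue; the function and variable names are made up.)

# Sketch (illustrative, not pydpiper's submitToQueue): capture the submitter's
# stderr so messages like "sbatch: error: ... parameter too long" end up in the
# pipeline log rather than only on the terminal. Python 3.6-compatible.
import logging
import subprocess

logger = logging.getLogger(__name__)

def submit(submit_cmd, job_script_text):
    p = subprocess.run(submit_cmd, input=job_script_text,
                       stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                       universal_newlines=True)
    if p.returncode != 0:
        logger.error("submission failed (exit %d): %s\nstderr: %s",
                     p.returncode, " ".join(submit_cmd), p.stderr.strip())
        raise RuntimeError("job submission failed")
    return p.stdout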

We're retrying with --csv-file
