# Eagle Workflow Troubleshooting

*Rajendra Adhikari edited this page Nov 21, 2024*
The requested walltime determines which Eagle nodes your job can land on:

- <= 4 hours: short nodes, which might get scheduled sooner (depending on how busy Eagle's short nodes are relative to the regular nodes) and finish quicker.
- <= 48 hours: standard nodes.
- <= 240 hours: long nodes.

No run is allowed to take more than 240 hours of walltime. Visit the HPC website for canonical information, such as the maximum number of nodes per user.
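The queue selection implied by these limits can be sketched as a small helper. This is illustrative only (the function name is hypothetical, and in practice the scheduler assigns nodes based on the walltime you request):

```python
def eagle_queue(walltime_hours: float) -> str:
    """Return the Eagle node type implied by the requested walltime.

    Hypothetical helper; the scheduler does this for you.
    """
    if walltime_hours <= 4:
        return "short"     # might schedule sooner, depending on load
    if walltime_hours <= 48:
        return "standard"
    if walltime_hours <= 240:
        return "long"
    raise ValueError("No run may exceed 240 hours of walltime")
```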
- `n_datapoints` is the number of buildings:
  - if `resample: false`, downselecting during sampling will reduce this number
  - if `resample: true`, this is roughly the number of buildings after the downselect
- `n_jobs` is the number of Eagle nodes
- `36` is the number of processor cores in each node
- MF homes take much longer than SF homes: a `minutes_per_sim` of 2 is appropriate for single-family (SF) homes, while 30 is appropriate for multifamily (MF) homes
- `TimeOut` errors can be caused by the sample including MF homes that you aren't expecting. Downselecting in your yml file may avoid this.
An upper limit is:

```
AUs = 3 * ((n_datapoints * n_upgrades * minutes_per_sim) / cores_per_node
           + sampling.time
           + postprocessing.time * (postprocessing.n_workers + 1)) / minutes_per_hour
```
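The upper-limit formula can be written as a function. This is a minimal sketch: `au_upper_limit` and its argument names are illustrative, and it assumes `sampling.time` and `postprocessing.time` are given in minutes, consistent with the formula above.

```python
MINUTES_PER_HOUR = 60
CORES_PER_NODE = 36  # processor cores per Eagle node

def au_upper_limit(n_datapoints, n_upgrades, minutes_per_sim,
                   sampling_minutes, postprocessing_minutes, n_workers):
    """Upper-limit AU estimate, mirroring the formula above."""
    # Simulation node-minutes, spread across the cores of each node
    node_minutes = (n_datapoints * n_upgrades * minutes_per_sim) / CORES_PER_NODE
    node_minutes += sampling_minutes
    node_minutes += postprocessing_minutes * (n_workers + 1)
    # 3 AUs are charged per node-hour
    return 3 * node_minutes / MINUTES_PER_HOUR
```

For example, 10,000 datapoints with one upgrade at 2 minutes per sim, an hour of sampling, and 2 hours of postprocessing with 2 workers comes out to roughly 49 AUs.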
To get more accurate estimates, try the following:

- Look at run results from similar, successful runs. The `job.out`, `sampling.out`, and `postprocessing.out` files have the elapsed time (in minutes) in their last few lines.
- `job.out` files are split fairly evenly, so looking at 1-2 of them and scaling up, then adding the sampling and postprocessing time, should work fine.
- Running `grep -E "^real\s+[0-9]+m" job.out-*` in the directory with the output files will return the elapsed time from all job files.
- For a different-sized run of similar complexity, scale the results by the total number of simulations.
- 1 AU = 3 * (total elapsed time in hours)
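The grep output above can also be summed programmatically. This is a sketch with hypothetical helper names; it parses the `real XmY.Zs` lines that `time` prints and applies the 3-AUs-per-hour rate from the bullet above.

```python
import re

def total_elapsed_minutes(job_out_texts):
    """Sum the `real  XmY.Zs` elapsed-time lines across job.out file contents."""
    total = 0.0
    for text in job_out_texts:
        # Same pattern the grep above matches, capturing minutes and seconds
        match = re.search(r"^real\s+(\d+)m([\d.]+)s", text, re.MULTILINE)
        if match:
            total += int(match.group(1)) + float(match.group(2)) / 60
    return total

def estimated_aus(total_elapsed_hours):
    """1 AU = 3 * (total elapsed time in hours)."""
    return 3 * total_elapsed_hours
```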
The same relationship can be rearranged to solve for walltime, `n_datapoints`, or `n_jobs`:

```
walltime(hours) = math.ceil((n_datapoints * n_upgrades * minutes_per_sim) / (n_jobs * minutes_per_hour * cores_per_node))
n_datapoints    = math.floor((walltime(hours) * minutes_per_hour * cores_per_node * n_jobs) / (minutes_per_sim * n_upgrades))
n_jobs          = math.floor((n_datapoints * n_upgrades * minutes_per_sim) / (walltime(hours) * minutes_per_hour * cores_per_node))
```
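These rearrangements translate directly into Python. The function names and constants are illustrative; the `floor` in the `n_jobs` expression mirrors the formula above.

```python
import math

MINUTES_PER_HOUR = 60
CORES_PER_NODE = 36  # processor cores per Eagle node

def walltime_hours(n_datapoints, n_upgrades, minutes_per_sim, n_jobs):
    # Total simulation minutes spread over n_jobs nodes of 36 cores each
    return math.ceil((n_datapoints * n_upgrades * minutes_per_sim)
                     / (n_jobs * MINUTES_PER_HOUR * CORES_PER_NODE))

def max_datapoints(walltime, n_upgrades, minutes_per_sim, n_jobs):
    # Largest sample that fits in the given walltime
    return math.floor((walltime * MINUTES_PER_HOUR * CORES_PER_NODE * n_jobs)
                      / (minutes_per_sim * n_upgrades))

def n_jobs_needed(n_datapoints, n_upgrades, minutes_per_sim, walltime):
    # Node count implied by the sample size and walltime
    return math.floor((n_datapoints * n_upgrades * minutes_per_sim)
                      / (walltime * MINUTES_PER_HOUR * CORES_PER_NODE))
```

For example, 10,000 datapoints with one upgrade at 2 `minutes_per_sim` on 5 nodes needs a 2-hour walltime.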