Eagle Workflow Troubleshooting

Rajendra Adhikari edited this page Nov 21, 2024 · 25 revisions

Optimizing YML for simulation time

Wall Time thresholds:

  • <= 4 hours: Short nodes. These may get you scheduled sooner (depending on how busy the Eagle short nodes are relative to the regular nodes), so the run finishes sooner.
  • <= 48 hours: Standard nodes.
  • <= 240 hours: Long nodes.

No run is allowed to take more than 240 hours Wall Time. Visit the HPC website for canonical info, such as max nodes/user.
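
As a minimal sketch, the thresholds above can be written as a small lookup. The function name `node_class_for` and the returned labels are hypothetical and only mirror the wording of the list; they are not actual scheduler partition names.

```python
def node_class_for(walltime_hours: float) -> str:
    """Map a requested wall time (hours) to the node class listed above.

    The labels are illustrative only, not real partition names.
    """
    if walltime_hours <= 0:
        raise ValueError("walltime_hours must be positive")
    if walltime_hours <= 4:
        return "short"
    if walltime_hours <= 48:
        return "standard"
    if walltime_hours <= 240:
        return "long"
    # No run is allowed more than 240 hours of wall time.
    raise ValueError("wall time exceeds the 240 hour maximum")


print(node_class_for(3))    # within the short threshold
print(node_class_for(47))   # within the standard threshold
```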

  • n_datapoints is the number of buildings.
    • If resample: false, the downselect applied during sampling reduces this number.
    • If resample: true, this is roughly the number of buildings after the downselect.
  • n_jobs is the number of Eagle nodes.
  • 36 is the number of processor cores in each node (cores_per_node in the formulas below).
  • MF homes take much longer to simulate than SF homes. A minutes_per_sim of 2 is appropriate for SF; a minutes_per_sim of 30 is appropriate for MF.
    • TimeOut errors can occur when sampling includes MF homes that you aren't expecting. Downselecting in your yml file may avoid this.

Calculating AU usage:

An upper limit is: AUs = 3 * ((n_datapoints * n_upgrades * minutes_per_sim) / cores_per_node + sampling.time + postprocessing.time * (postprocessing.n_workers + 1)) / minutes_per_hour
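
The upper limit can be sketched in Python as follows. The function and argument names are hypothetical. The sketch assumes that sampling.time and postprocessing.time are given in minutes, since the formula divides the whole sum by minutes_per_hour.

```python
CORES_PER_NODE = 36
MINUTES_PER_HOUR = 60


def au_upper_limit(n_datapoints, n_upgrades, minutes_per_sim,
                   sampling_minutes, postprocessing_minutes,
                   postprocessing_n_workers):
    """Upper limit on AUs, following the formula above term by term."""
    # Node-minutes spent on simulations, spread over the cores of a node.
    sim_node_minutes = (n_datapoints * n_upgrades * minutes_per_sim) / CORES_PER_NODE
    # Postprocessing occupies n_workers + 1 nodes for its duration.
    post_node_minutes = postprocessing_minutes * (postprocessing_n_workers + 1)
    total_node_minutes = sim_node_minutes + sampling_minutes + post_node_minutes
    return 3 * total_node_minutes / MINUTES_PER_HOUR


# Illustrative inputs only: 100,000 buildings, 5 upgrades, SF-style 2 minutes_per_sim.
estimate = au_upper_limit(100_000, 5, 2,
                          sampling_minutes=60,
                          postprocessing_minutes=120,
                          postprocessing_n_workers=2)
print(f"{estimate:.1f} AUs")
```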

To get more accurate estimates, try the following:

  • Look at run results from similar, successful runs. The job.out, sampling.out, and postprocessing.out files will have elapsed time in the last few lines (in minutes).
  • job.out files are split fairly evenly, so looking at one or two, scaling up, and then adding sampling and postprocessing should work fine.
  • grep -E "^real\s+[0-9]+m" job.out-* in the directory with the output files will return the elapsed time from all job files
  • For a different sized run of similar complexity, scale results by the total number of simulations.
  • AUs used = 3 * (total elapsed time in hours)
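
A minimal sketch of turning the grep output into an AU figure. It assumes the matched lines look like `real<TAB>12m34.567s`, optionally prefixed with the file name as grep prints when searching several files; that format is an assumption and is not confirmed by this page. The function names are hypothetical.

```python
import re

# Matches e.g. "job.out-1:real\t90m0.000s" or "real    12m34.567s".
REAL_LINE = re.compile(r"real\s+(\d+)m(\d+(?:\.\d+)?)s")


def total_elapsed_hours(grep_output: str) -> float:
    """Sum the elapsed 'real' times found in the grep output, in hours."""
    total_minutes = 0.0
    for line in grep_output.splitlines():
        match = REAL_LINE.search(line)
        if match:
            total_minutes += int(match.group(1)) + float(match.group(2)) / 60
    return total_minutes / 60


def aus_from_elapsed(hours: float) -> float:
    """Apply the factor of 3 from the list above."""
    return 3 * hours


sample = "job.out-1:real\t90m0.000s\njob.out-2:real\t30m30.000s\n"
hours = total_elapsed_hours(sample)
print(f"{hours:.3f} hours, {aus_from_elapsed(hours):.3f} AUs")
```

Sampling and postprocessing times would still need to be added to the total, as the list above notes.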

Calculating WallTime:

walltime(hours) = math.ceil((n_datapoints * n_upgrades * minutes_per_sim) / (n_jobs * minutes_per_hour * cores_per_node))
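
The same calculation as a runnable sketch, with a worked example. The function name is hypothetical and the inputs are illustrative values only.

```python
import math

MINUTES_PER_HOUR = 60


def walltime_hours(n_datapoints, n_upgrades, minutes_per_sim, n_jobs,
                   cores_per_node=36):
    """Wall time in whole hours, rounded up, per the formula above."""
    total_sim_minutes = n_datapoints * n_upgrades * minutes_per_sim
    node_capacity_per_hour = n_jobs * MINUTES_PER_HOUR * cores_per_node
    return math.ceil(total_sim_minutes / node_capacity_per_hour)


# 100,000 buildings, 5 upgrades, 2 minutes_per_sim, 10 nodes:
# 1,000,000 / (10 * 60 * 36) = 46.3, which rounds up to 47 hours.
print(walltime_hours(100_000, 5, 2, n_jobs=10))
```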

Largest n_datapoints to not exceed a given WallTime:

n_datapoints = math.floor((walltime(hours) * minutes_per_hour * cores_per_node * n_jobs) / (minutes_per_sim * n_upgrades))
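
A short sketch of this inverse calculation; the function name is hypothetical and the inputs are illustrative.

```python
import math

MINUTES_PER_HOUR = 60


def max_datapoints(walltime_limit_hours, n_jobs, n_upgrades, minutes_per_sim,
                   cores_per_node=36):
    """Largest n_datapoints that stays within the wall time, per the formula above."""
    available_core_minutes = (walltime_limit_hours * MINUTES_PER_HOUR
                              * cores_per_node * n_jobs)
    minutes_per_datapoint = minutes_per_sim * n_upgrades
    return math.floor(available_core_minutes / minutes_per_datapoint)


# 48 hours on 10 nodes, 5 upgrades at 2 minutes_per_sim:
# (48 * 60 * 36 * 10) / (2 * 5) = 103,680 buildings.
print(max_datapoints(48, n_jobs=10, n_upgrades=5, minutes_per_sim=2))
```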

Smallest n_jobs that keeps the run within a given WallTime (adding nodes shortens the run, so this is a lower bound and is rounded up):

n_jobs = math.ceil((n_datapoints * n_upgrades * minutes_per_sim) / (walltime(hours) * minutes_per_hour * cores_per_node))
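
One way to sidestep rounding questions is to check candidate node counts directly against the wall time limit. This is a sketch under stated assumptions: the function names and the `max_nodes` cap are hypothetical, and the real per-user node limit should come from the HPC website.

```python
MINUTES_PER_HOUR = 60


def required_hours(n_datapoints, n_upgrades, minutes_per_sim, n_jobs,
                   cores_per_node=36):
    """Unrounded hours needed to run all simulations on n_jobs nodes."""
    total_sim_minutes = n_datapoints * n_upgrades * minutes_per_sim
    return total_sim_minutes / (n_jobs * MINUTES_PER_HOUR * cores_per_node)


def first_n_jobs_that_fits(n_datapoints, n_upgrades, minutes_per_sim,
                           walltime_limit_hours, max_nodes=1000):
    """Scan node counts upward and return the first that fits, or None."""
    for n_jobs in range(1, max_nodes + 1):
        hours = required_hours(n_datapoints, n_upgrades, minutes_per_sim, n_jobs)
        if hours <= walltime_limit_hours:
            return n_jobs
    return None


# 100,000 buildings, 5 upgrades, 2 minutes_per_sim, 48 hour limit:
# 9 nodes need about 51.4 hours, 10 nodes need about 46.3 hours.
print(first_n_jobs_that_fits(100_000, 5, 2, walltime_limit_hours=48))
```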