Problem: with the current dependency specification, there is no way to declare what should happen when an upstream job fails. We simply assume that if any job in a dependency chain fails, all subsequent jobs should be canceled.
There are a couple of use cases where, if a job with dependents fails, its "downstream" jobs may still want to run, depending on the particular failure mode.
For example, in the discussions around our MCEM paper, we discussed proactively handling failures using job dependencies: submit a chain of jobs that depend on one another, and then submit a "mesh relaxation" job that runs only if the main simulation throws a "mesh tangled" exception. For illustration purposes:
                         +--------------+
                         |Sim Preprocess|
                         +-------+------+
                                 |
                                 |
                                 v
                          +------+------+
                          |3D Simulation|
                          +-------------+
                          |             |
    Mesh Tangled Exception|             |
                          v             |
            +-------------+-+           |
            |Mesh Relaxation|           |
            +-------------+-+           |
                          |             |
                          v             |
        +-----------------+---+         |
        |3D Simulation Restart|         |
        +-----------------+---+         |
                          |             |
                          |             |
                          v             v
                        +-+-------------+--+
                        |Sim Postprocessing|
                        +------------------+
Data staging is another use case. A job may want to run even if its stage-in fails (it can just fall back to reading directly from the PFS). Likewise, if the stage-out of a previous job fails, the dependent jobs may still want to run (for example, the data to stage out was only cached data and can be regenerated if missing).
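To make the request concrete, here is a rough sketch of what outcome-conditioned dependencies could look like. It is only a toy model for discussion, not a proposed syntax or an existing API: the `Outcome` values, the `Job` fields (`needs`, `any_of`), and the `resolve` helper are all hypothetical names made up for this example. Each dependency lists the upstream outcomes that satisfy it, so today's behavior is the special case where every dependency accepts only success.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    MESH_TANGLED = "mesh_tangled"  # hypothetical named failure mode
    FAILED = "failed"              # any other failure
    SKIPPED = "skipped"            # dependencies not met, job never ran


@dataclass
class Job:
    name: str
    # Hypothetical: upstream job name -> set of outcomes that satisfy the dependency.
    needs: dict = field(default_factory=dict)
    # Hypothetical: if True, one satisfied dependency is enough; otherwise all are required.
    any_of: bool = False


def resolve(jobs, results):
    """Return the final outcome of each job.

    `jobs` must be listed in submission order (upstream before downstream).
    `results` maps a job name to the outcome it would produce if it ran;
    jobs not listed are assumed to succeed.
    """
    final = {}
    for job in jobs:
        met = [final[upstream] in accepted for upstream, accepted in job.needs.items()]
        if not met:
            runnable = True  # no dependencies at all
        else:
            runnable = any(met) if job.any_of else all(met)
        final[job.name] = results.get(job.name, Outcome.SUCCESS) if runnable else Outcome.SKIPPED
    return final


OK = {Outcome.SUCCESS}

# The mesh relaxation workflow from the diagram above.
workflow = [
    Job("preprocess"),
    Job("simulation", {"preprocess": OK}),
    Job("mesh_relaxation", {"simulation": {Outcome.MESH_TANGLED}}),
    Job("simulation_restart", {"mesh_relaxation": OK}),
    Job("postprocess", {"simulation": OK, "simulation_restart": OK}, any_of=True),
]

# Data staging: the compute job runs whether or not stage-in succeeded.
staging = [
    Job("stage_in"),
    Job("compute", {"stage_in": {Outcome.SUCCESS, Outcome.FAILED}}),
]

if __name__ == "__main__":
    for name, outcome in resolve(workflow, {"simulation": Outcome.MESH_TANGLED}).items():
        print(f"{name}: {outcome.value}")
```

In this model, a "mesh tangled" failure routes through the relaxation and restart jobs before postprocessing, a clean simulation skips them, and any other simulation failure still cancels everything downstream, which matches the current default.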