Job dependencies conditional on previous jobs exit code/exceptions #236

SteVwonder · 2020-04-18T03:47:03Z

Problem: with the current dependency specification, we have no option to declare what to do if a previous job fails. We just assume that if a job in a dependency chain fails, all subsequent jobs should be canceled.

There are a couple of use cases where, if a job with dependents fails, its "downstream" jobs may still want to run, depending on the particular failure mode.

For example, in the discussions around our MCEM paper, we discussed proactively handling failures using job dependencies. Submit a chain of jobs that are dependent on one another, and then submit a "mesh relaxation" job that runs if the main simulation throws a "mesh tangled" exception. For illustration purposes:

                     +--------------+
                     |Sim Preprocess|
                     +-------+------+
                             |
                             |
                             v
                      +------+------+
                      |3D Simulation|
                      +-------------+
                      |             |
Mesh Tangled Exception|             |
                      v             |
        +-------------+-+           |
        |Mesh Relaxation|           |
        +-------------+-+           |
                      |             |
                      v             |
    +-----------------+---+         |
    |3D Simulation Restart|         |
    +-----------------+---+         |
                      |             |
                      |             |
                      v             v
                    +-+-------------+--+
                    |Sim Postprocessing|
                    +------------------+
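
To make the diagram concrete, here is a minimal Python sketch of what outcome-conditional dependencies could look like. It is not an existing API: `Outcome`, `Job`, `on_success`, `on_exception`, `plan`, and the exception name `"mesh-tangled"` are all hypothetical names invented for illustration. It also assumes that a job with several incoming edges runs as soon as any one of them is satisfied, which is what the postprocessing step in the diagram needs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Outcome:
    """Hypothetical record of how a job ended."""
    success: bool
    exception: Optional[str] = None  # e.g. "mesh-tangled"


# A condition is a predicate over the predecessor's outcome.
Condition = Callable[[Outcome], bool]

SUCCESS = Outcome(success=True)


def on_success(outcome: Outcome) -> bool:
    return outcome.success


def on_exception(name: str) -> Condition:
    # Satisfied only when the predecessor failed with this specific exception.
    return lambda outcome: (not outcome.success) and outcome.exception == name


@dataclass
class Job:
    name: str
    # (predecessor name, condition) pairs; ANY satisfied edge makes the job
    # runnable (an assumption of this sketch).
    deps: List[Tuple[str, Condition]] = field(default_factory=list)


def plan(jobs: List[Job], outcomes: Dict[str, Outcome]) -> List[str]:
    """Return the names of the jobs that would run, given simulated outcomes.

    `jobs` must be in topological order. Jobs missing from `outcomes`
    are assumed to succeed.
    """
    ran: List[str] = []
    for job in jobs:
        runnable = not job.deps or any(
            pred in ran and cond(outcomes.get(pred, SUCCESS))
            for pred, cond in job.deps
        )
        if runnable:
            ran.append(job.name)
    return ran


# The workflow from the diagram above.
WORKFLOW = [
    Job("preprocess"),
    Job("simulation", [("preprocess", on_success)]),
    Job("mesh-relaxation", [("simulation", on_exception("mesh-tangled"))]),
    Job("simulation-restart", [("mesh-relaxation", on_success)]),
    Job("postprocess", [("simulation", on_success),
                        ("simulation-restart", on_success)]),
]

# The simulation fails with the exception the relaxation job is waiting for.
print(plan(WORKFLOW, {"simulation": Outcome(False, "mesh-tangled")}))
```

In this sketch, a clean run skips the relaxation and restart jobs, a "mesh tangled" failure routes through them before postprocessing, and any other failure of the simulation stops the chain, which matches the current cancel-everything behavior.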

Data staging is another use case. A job may want to run even if its stage-in fails (it can just fall back to reading directly from the PFS). Or, if the stage-out of a previous job fails, the dependent jobs may still want to run (maybe the data to stage out was just cached data that can be regenerated if missing).
