Job dependencies conditional on previous jobs exit code/exceptions #236

Open
SteVwonder opened this issue Apr 18, 2020 · 0 comments

Problem: with the current dependency specification, there is no way to declare what should happen when a previous job fails. We simply assume that if any job in a dependency chain fails, all subsequent jobs should be canceled.

There are a couple of use cases where, if a job with dependents fails, its "downstream" jobs may still want to run, depending on the particular failure mode.

For example, in the discussions around our MCEM paper, we talked about proactively handling failures using job dependencies: submit a chain of jobs that depend on one another, and also submit a "mesh relaxation" job that runs only if the main simulation throws a "mesh tangled" exception. For illustration purposes:

                     +--------------+
                     |Sim Preprocess|
                     +-------+------+
                             |
                             |
                             v
                      +------+------+
                      |3D Simulation|
                      +-------------+
                      |             |
Mesh Tangled Exception|             |
                      v             |
        +-------------+-+           |
        |Mesh Relaxation|           |
        +-------------+-+           |
                      |             |
                      v             |
    +-----------------+---+         |
    |3D Simulation Restart|         |
    +-----------------+---+         |
                      |             |
                      |             |
                      v             v
                    +-+-------------+--+
                    |Sim Postprocessing|
                    +------------------+
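
To make the diagram concrete, here is a minimal sketch in Python of how such conditional dependencies could be evaluated. Everything in it is hypothetical and only for discussion: the `Outcome`, `Dep`, and `Job` types, the condition strings (`"success"`, `"failure"`, `"any"`, `"exception:<type>"`), the `"mesh-tangled"` exception name, and the `require="any"` option are assumptions, not an existing scheduler API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    status: str                      # "completed", "failed", or "canceled"
    exception: Optional[str] = None  # hypothetical exception type, e.g. "mesh-tangled"


@dataclass(frozen=True)
class Dep:
    upstream: str
    on: str = "success"  # "success", "failure", "any", or "exception:<type>"

    def check(self, outcomes):
        """Return True (satisfied), False (never satisfiable), or None (pending)."""
        out = outcomes.get(self.upstream)
        if out is None:
            return None               # upstream has not finished yet
        if out.status == "canceled":
            return False              # upstream never ran
        if self.on == "any":
            return True
        if self.on == "success":
            return out.status == "completed"
        if self.on == "failure":
            return out.status == "failed"
        if self.on.startswith("exception:"):
            return out.exception == self.on.split(":", 1)[1]
        raise ValueError(f"unknown condition: {self.on}")


@dataclass
class Job:
    name: str
    deps: list = field(default_factory=list)
    require: str = "all"  # "all" deps must hold, or "any" one of them


def decide(job, outcomes):
    """Return "run", "cancel", or "wait" for a job given the known outcomes."""
    results = [dep.check(outcomes) for dep in job.deps]
    if job.require == "any" and job.deps:
        if any(r is True for r in results):
            return "run"
        return "cancel" if all(r is False for r in results) else "wait"
    if any(r is False for r in results):
        return "cancel"
    return "run" if all(r is True for r in results) else "wait"


def simulate(jobs, results):
    """Return the names of jobs that run, in order.

    `results` maps a job name to the Outcome it would produce if run;
    jobs missing from `results` are assumed to complete successfully.
    """
    outcomes, order = {}, []
    progressed = True
    while progressed:
        progressed = False
        for job in jobs:
            if job.name in outcomes:
                continue
            decision = decide(job, outcomes)
            if decision == "run":
                outcomes[job.name] = results.get(job.name, Outcome("completed"))
                order.append(job.name)
                progressed = True
            elif decision == "cancel":
                outcomes[job.name] = Outcome("canceled")
                progressed = True
    return order


# The workflow from the diagram above.
WORKFLOW = [
    Job("preprocess"),
    Job("simulation", [Dep("preprocess")]),
    Job("relaxation", [Dep("simulation", "exception:mesh-tangled")]),
    Job("restart", [Dep("relaxation")]),
    Job("postprocess", [Dep("simulation"), Dep("restart")], require="any"),
]
```

With these assumptions, `simulate(WORKFLOW, {"simulation": Outcome("failed", "mesh-tangled")})` runs all five jobs, a clean simulation skips the relaxation and restart jobs, and any other simulation failure cancels everything downstream. Note that the postprocessing job needs "any of" semantics over its two incoming edges, which is something the specification would also have to express.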

Data staging is another use case. A job may want to run even if its stage-in fails (it can just fall back to reading directly from the PFS). Or, if the stage-out of a previous job fails, the dependent jobs may still want to run (maybe the data to stage out was just cached data that can be regenerated if missing).
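
The two staging cases could be expressed as small policy functions like the ones below. This is only a sketch: the status strings, the function names, and the `data_is_regenerable` flag are made up for illustration and are not part of any current dependency specification.

```python
from typing import Optional


def pick_input(stage_in_status: str, staged_path: str, pfs_path: str) -> Optional[str]:
    """Return the path the job should read from, or None if it should not run."""
    if stage_in_status == "completed":
        return staged_path  # stage-in worked: read the staged copy
    if stage_in_status == "failed":
        return pfs_path     # tolerated failure: read directly from the PFS
    return None             # canceled or unknown status: do not run


def dependents_may_run(stage_out_status: str, data_is_regenerable: bool) -> bool:
    """Decide whether jobs that depend on a stage-out job may still run."""
    if stage_out_status == "completed":
        return True
    # A failed stage-out is acceptable only if the data can be regenerated.
    return stage_out_status == "failed" and data_is_regenerable
```

The point of both functions is the same: the decision depends on how the upstream job ended, not just on whether it ended, so the dependency specification needs a way to carry that information.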
