From 7bac645ac58b970f5bc08af64dc24ddc335e793d Mon Sep 17 00:00:00 2001
From: Prashant Sharma
Date: Fri, 7 May 2021 16:07:24 +0530
Subject: [PATCH] TEP-0065: Retry failed tasks on demand in a pipeline

KFP's use case.

Co-authored-by: Tommy Li
---
 teps/0065-retry-failed-tasks-on-demand.md | 262 ++++++++++++++++++++++
 teps/README.md                            |   1 +
 2 files changed, 263 insertions(+)
 create mode 100644 teps/0065-retry-failed-tasks-on-demand.md

diff --git a/teps/0065-retry-failed-tasks-on-demand.md b/teps/0065-retry-failed-tasks-on-demand.md
new file mode 100644
index 000000000..b8d896635
--- /dev/null
+++ b/teps/0065-retry-failed-tasks-on-demand.md
@@ -0,0 +1,262 @@
+---
+status: proposed
+title: Retry failed tasks on-demand in a pipeline
+creation-date: '2021-05-07'
+last-updated: '2021-05-07'
+authors:
+- '@Tomcli'
+- '@ScrapCodes'
+---
+
+# TEP-0065: Retry failed tasks on-demand in a pipeline
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+  - [Use Cases (optional)](#use-cases-optional)
+- [Requirements](#requirements)
+- [Proposal](#proposal)
+  - [Notes/Caveats (optional)](#notescaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+  - [User Experience (optional)](#user-experience-optional)
+  - [Performance (optional)](#performance-optional)
+- [Design Details](#design-details)
+- [Test Plan](#test-plan)
+- [Design Evaluation](#design-evaluation)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
+- [Upgrade & Migration Strategy (optional)](#upgrade--migration-strategy-optional)
+- [References (optional)](#references-optional)
+
+## Summary
+
+Presently, a pipeline has a `retry` mechanism, which a pipeline
+author can configure when creating a `Pipeline` or a
+`PipelineRun`.
+In this TEP, we explore the benefits of adding a new
+`retry` mechanism that allows a user to retry a failed `PipelineRun`
+on demand. A failed `PipelineRun` may have some or all of its tasks failed;
+a retry would rerun only the failed tasks and skip the tasks that
+completed successfully.
+
+This will be opt-in behaviour for pipelines and tasks: a pipeline or
+task author will be able to declare whether their pipeline or task
+supports a retry.
+
+## Motivation
+
+**Optimal use of cluster resources.**
+
+The ability to `retry` failed tasks is especially useful where `tekton` is a
+backend for running machine learning pipelines. A machine learning pipeline
+may consist of tasks that move large amounts of data and then train ML models.
+All of this can be very resource consuming, and without a retry a user must
+start the entire pipeline over. Sometimes the failure is due to a
+temporary service outage. For example, after the model is trained, a task
+reporting the metrics fails because of a temporary service outage. A retry
+after some time could easily fix it.
+
+A pipeline may be defined with various tasks, and some tasks might move a
+large amount of data and incur cost. This `retry` mechanism has substantial
+value where each task of the pipeline consumes significant computing resources,
+e.g. when `tekton` is used as a backend for ML pipelines.
+
+_Why do we need a new `retry` mechanism when we already support retry in
+`Pipeline` tasks?_
+
+The existing `retries` field can only be defined when the pipeline is
+created. This is not suitable for use cases where manual intervention
+is necessary to decide whether a rerun is required.
+For example, if a service outage is causing a particular task to fail, then
+retrying `n` times won't help unless we wait for the service to come back
+and then retry. For such manual interventions, we need an on-demand `retry`
+mechanism.
+
+As another, contrived example: if a `Pipeline` represents a CI/CD job, its
+tasks might represent a test suite, a stress test and benchmarks. We then need
+a way to know whether a failure was due to a regression, to flakiness
+of the jobs themselves, or to a temporary service outage. In this case, simply
+retrying `n` times does not help with optimal resource consumption.
+
+### Goals
+
+1. Explore the merits and demerits of a new mechanism for retrying,
+   on demand, _only a failed_ pipeline.
+2. A pipeline may have failed either because some of its tasks failed or
+   because a user requested cancellation. Retry only the failed/canceled tasks
+   of a failed `PipelineRun`.
+
+### Non-Goals
+
+1. Retrying successful pipeline runs, or anything other than a failed
+   pipeline/task run.
+2. Changing the existing retry mechanism.
+3. Managing checkpointing of pipeline state, workspaces, etc. A `PipelineRun`'s
+   state stored in etcd is used as is.
+4. Determining a failed task's dependencies, i.e. figuring out which
+   dependent tasks are needed to rerun the failed task.
+
+### Use Cases (optional)
+
+1. A `PipelineRun` can be very resource consuming and is sometimes susceptible
+   to failure due to transient conditions, for example an outage of a
+   particular service. In such cases, retrying `n` times is not enough;
+   a manual invocation of retry is required.
+
+2. It will be possible to cancel (e.g. for preemption) any running
+   `PipelineRun` and resume it at a later point.
+
+## Requirements
+
+## Proposal
+
+### Notes/Caveats (optional)
+
+1. What happens if the pipeline has `finally` tasks that do the cleanup?
+
+   For example, a clean-up step in `finally` deletes a cluster. For
+   cases such as this, the pipeline author can define the pipeline so that it
+   does not support a manual retry. Or, if the support is a requirement,
+   redesign the finally task so that the clean-up is not done if the pipeline
+   failed.
+
+2.
+   What happens if the failed task depends on the side effect of another task?
+   For example, in a simple pipeline `(A) ---> (B)`, (A) may create some
+   "side effect" state in the test cluster that will not be there if we execute
+   (B) alone. To overcome these challenges, we could implement this as a kind of
+   opt-in behaviour: a pipeline or task author will have the ability to
+   declare whether their task or pipeline supports a `retry`.
+
+### Risks and Mitigations
+
+### User Experience (optional)
+
+### Performance (optional)
+
+## Design Details
+
+## Test Plan
+
+## Design Evaluation
+
+## Drawbacks
+
+## Alternatives
+
+## Infrastructure Needed (optional)
+
+## Upgrade & Migration Strategy (optional)
+
+## References (optional)
+
diff --git a/teps/README.md b/teps/README.md
index f822e31de..bce28a414 100644
--- a/teps/README.md
+++ b/teps/README.md
@@ -185,3 +185,4 @@ This is the complete list of Tekton teps:
 |[TEP-0059](0059-skip-guarded-task-only.md) | Skip Guarded Task Only | proposed | 2021-03-24 |
 |[TEP-0061](0061-allow-custom-task-to-be-embedded-in-pipeline.md) | Allow custom task to be embedded in pipeline | implementable | 2021-04-28 |
 |[TEP-0063](0063-workspace-dependencies.md) | Workspace Dependencies | proposed | 2021-04-23 |
+|[TEP-0065](0065-retry-failed-tasks-on-demand.md) | Retry failed tasks on-demand in a pipeline | proposed | 2021-05-07 |