-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TEP-0065: Retry failed tasks on demand in a pipeline #422
Conversation
@ScrapCodes: GitHub didn't allow me to request PR reviews from the following users: Tomcli. Note that only tektoncd members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
a32168c
to
2721ae0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ScrapCodes I updated the use cases to give a little bit more background for KFP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ScrapCodes Thanks for the TEP. This is a really tricky problem/subject and one that vaires a lot depending on the users and the pipeline they run.
Running naively only the Task
in a Pipeline
that failed is a really britle and error-prone I think.
Let's take a example, and let's say I have a pipeline that build an image from source-code, provision a cluster, push the image to the cluster internal registry, run some test, and clean the cluster at the end (in a finaly
). If the "run some test" task failed, re-running only that task would fail instantaneously as, the cluster is gone, we did not provision a new one, …
How do we know which task we need to re-run in addition to the failed one ?
We could think of adding a way to declare such required dependency, so that we can re-run all the required task before the one that failed and that we want to re-run. This, however, add more complexity and verbosity to the API, for something that "might" happen from time to time.
Taking the example of kubernetes/kubernetes
(from the TEP), I understand the need, but as commented, the /retest
re-running only the failed jobs is something that is handled on its own, by a specific component. In addition, to compare prow's /retest
with Tekton we need to make some assumption: /retest
re-run jobs (failed one) with the assumption (it's a given) that a job stands on its own, aka does not depend on anything else than itself. If we were to compare, each prow
job should be seen either as a single TaskRun (running one-off task) or a PipelineRun (running a pipeline). And thus, one kubernetes/kubernetes
, if you have 10 jobs (that generate 10 ci status), you would have 10 pipelines ; re-running the failed one works just fine as each pipeline are standing on their own. Even better, the component that handles /rerun
for prow
could be re-used for tekton pipelines as well (and even if not using prow
, a component that handle this, could work with a lot of different construct in addition to tekton pipelines).
TEP, we are exploring the benefits of adding a new API `retry` which will allow a | ||
user to "on-demand", retry a failed pipeline run. In other words, | ||
`rerun failed tests` from CI/CD world. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we mean by a new API here ? In tektoncd/pipeline
or tektoncd/triggers
, … we don't really control what we expose, it's a standard REST API, completely managed by k8s api server. In this TEP, do we want to create a new component or do we want to add new fields to the current object in the current components ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, we would like to focus on the problem statement itself, so will change this. It is not yet clear, what it would look like yet, it can be covered as part of proposal/design discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can implement it similar to "cancel" and "pending" on pipelinerun where we change the spec.status from failed to retry to trigger this action.
For example, At present `/retest` at kubernetes/kubernetes repo reruns only | ||
the failed jobs, a new api for retrying failed `pipelineRun` will give out | ||
of the box support. Hope they use `tektoncd` as backend for their CI at some | ||
point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is achieve in kubernetes/kubernetes
(and in any other project using prow ) by a prow plugin, and not directly by the "controller" of the jobs. What this means is that it is not defined in any job configuration by the user.
Without this support, at present a pull-request author or reviewer has to | ||
individually, mark tests for rerun. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really follow this too. As commented a bit, this is something that can (should probably) be handled externally from pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change this as per feedback !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this TEP!
I think it's a very interesting discussion to have, and things discussed here might be relevant for other features too, such as checkpointing of pipelines partial state and partial execution of pipelines.
Some comments inline.
For example, At present `/retest` at kubernetes/kubernetes repo reruns only | ||
the failed jobs, a new api for retrying failed `pipelineRun` will give out | ||
of the box support. Hope they use `tektoncd` as backend for their CI at some | ||
point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do use Tekton to run some of our CI job in Tekton today.
And we do support retest of a single CI job through Tekton today.
A failure in any task in a pipeline today causes the entire pipeline to fail and to stop, so we need dedicated pipelines for each CI jobs to keep the CI jobs independent from each other. Besides, this approach helps with the notifications to GitHub.
This proposal might work nicely together with TEP 0050 - task failure, since that would allow a failed task to not fail an entire pipeline.
2. A pipeline may either have failed due to some failures in the tasks or may | ||
be user invoked cancel request. Retry only the failed/canceled tasks for a | ||
failed `pipelineRun`. | ||
3. Document the feature in the tekton documentation, explaining the use cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation is definitely something we need to do, and should be part of the TEP.
I don't think it's a goal of this work though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove redundant goal
|
||
Presently, a pipeline has a mechanism for `retry` a task, which a pipeline author | ||
can configure at the time of creation of a `Pipeline` or a `PipelineRun`. In this | ||
TEP, we are exploring the benefits of adding a new API `retry` which will allow a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Whether this will require a new API or not feels like a design decision / implementation detail. I think we should focus first on identifying the functionality that we want to provide, goals and use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix this, instead of explicitly stating API, we can say mechanism. And at the time of design discussion, we can define what that mechanism will look like precisely.
user to "on-demand", retry a failed pipeline run. In other words, | ||
`rerun failed tests` from CI/CD world. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is 100% fitting with the way we run pipelines in Tekton.
If a task in a pipeline fails, all tasks currently running finish running and then no new task is scheduled.
In this example:
A (ok) ---> B (skipped)
C (failed) ---> D (skipped)
My expectation for "retry failed pipeline" would be that (A) is skipped because it was successful before, so (B) and (C) start right away, and (D) is executed if (C) is successful.
to flakiness of jobs itself. In this case, simply retrying `n` number of times | ||
does not seem to help with optimal resource consumption. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For CI use cases at least, unless we're 100% sure of that the different parts of the pipeline are perfectly independent, retrying a part of a pipeline might be a not a good practise, as it could lead to success for a pipeline that would normally have failed.
In case of a simple pipeline (A) ---> (B)
, (A) may create some "side-effect" state in the test cluster that will not be there if we execute (B) alone.
I think this would be very useful for pipelines that tasks that consume a lot of resources (CPU, network bandwidth or else). I suspect this might be implemented as an opt-in behaviour (even in the long run) for pipelines, as persisting the intermediate pipeline status might be expensive, both in terms of execution time as well as storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
retrying a part of a pipeline might be a not a good practise, as it could lead to success for a pipeline that would normally have failed.
This could happen, if the pipeline author writes tasks unaware that they could be retried on a failure later. E.g. in your example, if (B) has a if
check, that relied on the side effect that (A) produced and if A's side effect has somehow vanished. Then (B)'s course of execution that leads to success could be changed.
It is possible, that new course of action is desirable by the actor who retried
. In other words, the responsibility of outcome, is shared between, the author of the pipeline/task and the actor who retried
.
tektoncd
's liability is limited to naively retrying failed tasks, if that leads to failure due to missing dependencies etc.. then, those errors are reported as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Therefore, opt-in behaviour makes perfect sense.
Allow setting the status of failed pipeline to something like: | ||
[TEP-0015 Pending](0015-pending-pipeline.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate a bit more on this? I'm not sure understand how this would work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more suited for our discussion during design, may be we can drop this section for now (i.e. current stage of discussing the problem statement).
1. Retry of successful pipeline runs or anything other than a failed pipeline | ||
run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that implementing this will require us to implement features like checkpointing of a pipeline, that might be used to re-run a pipeline or parts of a pipeline even if it did not fail, but I think it's fine to restrict the focus of the TEP to the failed case only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Focus of our TEP is retry failed tasks only.
3. Discuss checkpointing of pipeline state or workspaces, etc. A `pipelineRun`'s | ||
state stored in etcd is used as is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this feature to work, the pipeline and taskruns in the retry execution will need access to the state of the previous run, so in case any of the state is deleted, the re-run will fail:
- writable workspaces needs to be still available:
- those provisioned by tekton will stick around until the pipelinerun is not deleted
- those provided by users are not controller by tekton
- results will be store in etcd in the original pipelinerun
Even workspaces + results might not be enough though, since the initial tasks or a pipeline might setup state outside of the workspace (e.g. create a test cluster, provision a test service), and that state will be usually cleaned-up by the finally
part of the pipeline. If the URI of the resources provisioned by initial tasks is exposed as results, those URIs can be passed to the new execution and if still valid the new execution could succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We put checkpoint as non-goal for now because we understand storing the state of the pipeline could be very complicated. Thus in this TEP we want to focus on how to manually retry the same pipelinerun just similar to how users cancel the pipelinerun.
If the information in etcd is not enough to retry the pipelinerun, then we may consider adding checkpoint as one of the requirements.
Thank you @vdemeester :)
I guess if a pipeline has a clean up in the finally, then rerun is ought to fail with suitable error. And that is expected, as a outcome of attempting a retry.
Similarly, at present, a task may miss an input, which was let's say the output of the previous task. Then, it is expected that the retry will fail as well. Idea of having a retry as on-demand, is a user makes a manual judgement and invokes a retry. It is part of Non goal, to manage the state of the failed pipeline by tektoncd.
This is correct, somehow, I could not communicate, that this support is to be used to invoke Somehow some pipelines are very resource consuming, for example they may be running some stress tests or benchmarks and may or may not have interdependency between each other. The manual invocation of Starting a pipeline all over again from the last run's configuration is already supported by tektoncd cli (tkn pipeline start --use-pipelinerun string ). |
/kind tep |
I will update the TEP soon, on the basis of the discussions above. |
/assign @afrittoli |
@vdemeester Would you like to be assigned to this TEP? |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hello @afrittoli and @vdemeester , thank you for reviewing the TEP and helping with the feedback, all of that was helpful in steering the direction that this TEP will move. I have tried to apply the feedback and updated the TEP, please take a look again? |
/assign |
hey @ScrapCodes !! I want to +1 to @afrittoli 's point about how this is similar to partial pipeline execution (#422 (review)) Partial pipeline execution (tektoncd/pipeline#50) is one of our oldest issues and is in our roadmap and @jerop previously created a proposal around disabling a task in a pipeline which at the time we tabled b/c not enough folks were convinced that we needed to tackle the underlying problem, but im wondering if being able to partially execute a pipeline would solve the problem you describe. for example imagine a scenario like this:
so TL;DR i agree with the problem statement you are describing, but also im thinking we could include the requirements and use cases from tektoncd/pipeline#50 and the disabling a task in a pipeline proposal (it even mentions a use case in the problem statement which sounds very much like yours!! "When running a very long Pipeline, there may be a transient failure and the user may want to only rerun the Tasks that failed..."), what do you think? |
Hello @bobcatfish, thanks for linking that discussion(shame on me, that I was unable to find it), it does seem to be a prior art 😄, and use cases can be borrowed too. AFAIU, there is one subtle difference in our requirements, when we say partial pipeline execution or checkpointing, it does seem that tekton's liability is slightly more than just naively rerunning a task. That brings in the issue of what happens to all the other things(pods,pvcs,... which may have vanished even if pipeline state is intact) that a tasks depends on, when it is started at a later time than the rest of the pipeline. Managing all that is beyond our scope, because a task can be written in so many ways. However, in this TEP's case, e.g. a pipeline is running a task which "supports retry" , and it fails. Since, it supports retry the liability is of the "pipeline/task" author to ensure that it can be retried on a failure. In this case, tekton has a limited liability of rerunning the task naively. In this way, we can support a reasonably common use case, without solving a very hard problem. Do you think this subtle difference make sense?
I may be wrong in my understanding, just sharing so that I can be corrected. Since you have already worked and invested a lot of thoughts on this, would you and/or @jerop like to co author this? |
Oh, I see, |
@afrittoli Would like to take another look! Even if your overall stance remain unchanged, you can resolve those comments that are clarified 😄 |
I am still not convinced this is at the pipelinerun level we need this, even at all tbh. This feature can be "done" through well written tasks and / or might not even be needed in most cases. Point being, the question we have to ask ourselves is how much do we want to add complexity to the Pipeline/PipelineRun reconciler for what gain ? (how many user will benefit from this, how hard will it be to use this feature, will there be requirement for this to work correctly, …)
The more I think about it, the more I am convinced this could / should be tackle by a higher construct, something like Tekton Workflows #464 for example. I'd rather keep Pipeline as simple and flexible as it can be, and build on top 👼🏼 . |
haha me too @afrittoli - i was surprised to realize how taskrun retries were implemented XD
im guessing it was just b/c it was easy - i dont remember realizing at the time how this was actually being implemented. (this isnt the only case of taskrun behavior being added just to support being executed via pipelinerun, i.e. a hidden coupling b/w taskruns and pipelineruns, for example tektoncd/pipeline#1069)
I like that idea a lot @vdemeester !! maybe implementing this in workflows is the way to go. For the questions you raised re. how this functionality would work, I'd like to point again toward tektoncd/pipeline#50 and the disabling/partial execution proposal - we'd address these questions by adding an interface for explicitly disabling tasks in a pipeline, with the ability to explicitly provide params and workspaces as needed.
@ScrapCodes the reason we need partial execution is b/c this is the feature that this proposal is requesting. From my understanding (let me know if this is wrong) this proposal is saying: The proposal is combining these together into one feature request, where the functionality in (a) is implicit - i.e. the user says "retry this pipelinerun from the point it fails at" and then gets (a) automatically. I'm suggesting we start by adding the functionality in (a) as part of the pipelinerun interface - because we're going to need it regardless of whether or not this is a feature of pipelineruns - and if this going to be a feature of a layer above pipelineruns (such as workspaces), we absolutely need it to be explicit. For example, let's say we want to use this feature for this example pipeline. The pipeline looks kinda like this:
Let's say we had 'retry from failure' for this pipeline, and build-skaffold-web failed, but build-skaffold-app succeeded. The PipelineRun would have a state like this:
What would happen when the task is re-run from the failure point? The resulting run would only partially execute the pipeline:
Let me know if this still isn't clear: what im trying to say is that the requirements for this feature include the requirements for partial execution, so we're going to need that feature either way. |
@ScrapCodes Maybe we can break it down into multiple smaller TEPs. In our original TEP, one of the key features we want to get out of this TEP is to retry individual task on demand in a pipeline. It seems like the current task retry is not a very good implementation, so we can use a new TEP to focus on that. As of some other features like partial pipeline execution would be helpful but it could get very complicated on figure out the correct I/O if the partial pipeline is running on a new pipelineRun. So I would suggest let's break it down into these steps.
From KFP-Tekton, we care more about 1 and 2 from above because most of our use cases are to retry one or two tasks on demand. 3 and 4 is more like nice to have to improve the user experience, but we don't really need it in most scenarios. |
My understanding is to begin with 3. and then work on 1 & 2. Soon, I will start the TEP for it. Thanks @jerop and @bobcatfish, since you have already worked on it, would you like to co-author / author? (@bobcatfish has already indicated her preference.) 😄 |
Sounds like a good approach to me @ScrapCodes !! @Tomcli I'm a bit less clear on definitely pursuing 1 + 2 as you described but like you said maybe they warrant their own TEPs and I'll be able to understand the need a bit better if we pursue them separately. @ScrapCodes I'm very happy to co-author as long as I'm not driving it haha XD If you want to collaborate in a Google doc or via hackmd I'm happy to contribute! |
Added comments to: Disabling a Task in a Pipeline |
Thanks @bobcatfish, 1 + 2 is the building blocks for how to retry a taskRun on demand. There are some discussions on how the taskRun retry could be done better, so we think is better to discuss that on a separate TEP. Once we agreed on how taskRun retry should be done and the corresponding reconcile logics, we can apply them to 3 + 4 since pipelineRun retry needs to retry the failed taskRun on demand. |
@ScrapCodes @Tomcli is this TEP on hold as we explore TEP-0077: Partial pipeline execution, or are you pursuing both of them separately? |
/hold |
Yes we are breaking down this TEP into smaller pieces, so we can keep the TEP on hold until we solve all the prerequisites. |
on hold while we persue TEP-0077 |
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing. |
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing. |
@ScrapCodes are you still actively working on this TEP? Or should we close this for now and re-open it later? |
@dibyom , I am currently not pursuing this TEP. Happy to close it for now. |
Proposing only the problem statement.
/cc @Tomcli @afrittoli