Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0065: Retry failed tasks on demand in a pipeline #422

Closed
wants to merge 2 commits into from

Conversation

ScrapCodes
Copy link
Contributor

Proposing only the problem statement.

/cc @Tomcli @afrittoli

@tekton-robot
Copy link
Contributor

@ScrapCodes: GitHub didn't allow me to request PR reviews from the following users: Tomcli.

Note that only tektoncd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Proposing only the problem statement.

/cc @Tomcli @afrittoli

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot requested a review from afrittoli May 7, 2021 10:46
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 7, 2021
@ScrapCodes ScrapCodes force-pushed the tep-65 branch 4 times, most recently from a32168c to 2721ae0 Compare May 7, 2021 12:30
Copy link
Contributor

@Tomcli Tomcli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ScrapCodes I updated the use cases to give a little bit more background for KFP.

teps/0065-retry-failed-tasks-on-demand.md Outdated Show resolved Hide resolved
teps/0065-retry-failed-tasks-on-demand.md Outdated Show resolved Hide resolved
teps/0065-retry-failed-tasks-on-demand.md Outdated Show resolved Hide resolved
Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScrapCodes Thanks for the TEP. This is a really tricky problem/subject and one that vaires a lot depending on the users and the pipeline they run.

Running naively only the Task in a Pipeline that failed is a really britle and error-prone I think.

Let's take a example, and let's say I have a pipeline that build an image from source-code, provision a cluster, push the image to the cluster internal registry, run some test, and clean the cluster at the end (in a finaly). If the "run some test" task failed, re-running only that task would fail instantaneously as, the cluster is gone, we did not provision a new one, …

How do we know which task we need to re-run in addition to the failed one ?
We could think of adding a way to declare such required dependency, so that we can re-run all the required task before the one that failed and that we want to re-run. This, however, add more complexity and verbosity to the API, for something that "might" happen from time to time.

Taking the example of kubernetes/kubernetes (from the TEP), I understand the need, but as commented, the /retest re-running only the failed jobs is something that is handled on its own, by a specific component. In addition, to compare prow's /retest with Tekton we need to make some assumption: /retest re-run jobs (failed one) with the assumption (it's a given) that a job stands on its own, aka does not depend on anything else than itself. If we were to compare, each prow job should be seen either as a single TaskRun (running one-off task) or a PipelineRun (running a pipeline). And thus, one kubernetes/kubernetes, if you have 10 jobs (that generate 10 ci status), you would have 10 pipelines ; re-running the failed one works just fine as each pipeline are standing on their own. Even better, the component that handles /rerun for prow could be re-used for tekton pipelines as well (and even if not using prow, a component that handle this, could work with a lot of different construct in addition to tekton pipelines).

Comment on lines 39 to 41
TEP, we are exploring the benefits of adding a new API `retry` which will allow a
user to "on-demand", retry a failed pipeline run. In other words,
`rerun failed tests` from CI/CD world.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we mean by a new API here ? In tektoncd/pipeline or tektoncd/triggers, … we don't really control what we expose, it's a standard REST API, completely managed by k8s api server. In this TEP, do we want to create a new component or do we want to add new fields to the current object in the current components ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, we would like to focus on the problem statement itself, so will change this. It is not yet clear, what it would look like yet, it can be covered as part of proposal/design discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can implement it similar to "cancel" and "pending" on pipelinerun where we change the spec.status from failed to retry to trigger this action.

Comment on lines 58 to 61
For example, At present `/retest` at kubernetes/kubernetes repo reruns only
the failed jobs, a new api for retrying failed `pipelineRun` will give out
of the box support. Hope they use `tektoncd` as backend for their CI at some
point.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is achieve in kubernetes/kubernetes (and in any other project using prow ) by a prow plugin, and not directly by the "controller" of the jobs. What this means is that it is not defined in any job configuration by the user.

Comment on lines 63 to 64
Without this support, at present a pull-request author or reviewer has to
individually, mark tests for rerun.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really follow this too. As commented a bit, this is something that can (should probably) be handled externally from pipeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change this as per feedback !

Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this TEP!
I think it's a very interesting discussion to have, and things discussed here might be relevant for other features too, such as checkpointing of pipelines partial state and partial execution of pipelines.
Some comments inline.

Comment on lines 58 to 61
For example, At present `/retest` at kubernetes/kubernetes repo reruns only
the failed jobs, a new api for retrying failed `pipelineRun` will give out
of the box support. Hope they use `tektoncd` as backend for their CI at some
point.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do use Tekton to run some of our CI job in Tekton today.
And we do support retest of a single CI job through Tekton today.

A failure in any task in a pipeline today causes the entire pipeline to fail and to stop, so we need dedicated pipelines for each CI jobs to keep the CI jobs independent from each other. Besides, this approach helps with the notifications to GitHub.

This proposal might work nicely together with TEP 0050 - task failure, since that would allow a failed task to not fail an entire pipeline.

2. A pipeline may either have failed due to some failures in the tasks or may
be user invoked cancel request. Retry only the failed/canceled tasks for a
failed `pipelineRun`.
3. Document the feature in the tekton documentation, explaining the use cases.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation is definitely something we need to do, and should be part of the TEP.
I don't think it's a goal of this work though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove redundant goal


Presently, a pipeline has a mechanism for `retry` a task, which a pipeline author
can configure at the time of creation of a `Pipeline` or a `PipelineRun`. In this
TEP, we are exploring the benefits of adding a new API `retry` which will allow a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Whether this will require a new API or not feels like a design decision / implementation detail. I think we should focus first on identifying the functionality that we want to provide, goals and use cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix this, instead of explicitly stating API, we can say mechanism. And at the time of design discussion, we can define what that mechanism will look like precisely.

Comment on lines 40 to 41
user to "on-demand", retry a failed pipeline run. In other words,
`rerun failed tests` from CI/CD world.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is 100% fitting with the way we run pipelines in Tekton.
If a task in a pipeline fails, all tasks currently running finish running and then no new task is scheduled.

In this example:

A (ok) ---> B (skipped)
C (failed) ---> D (skipped)

My expectation for "retry failed pipeline" would be that (A) is skipped because it was successful before, so (B) and (C) start right away, and (D) is executed if (C) is successful.

Comment on lines 55 to 56
to flakiness of jobs itself. In this case, simply retrying `n` number of times
does not seem to help with optimal resource consumption.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For CI use cases at least, unless we're 100% sure of that the different parts of the pipeline are perfectly independent, retrying a part of a pipeline might be a not a good practise, as it could lead to success for a pipeline that would normally have failed.

In case of a simple pipeline (A) ---> (B), (A) may create some "side-effect" state in the test cluster that will not be there if we execute (B) alone.

I think this would be very useful for pipelines that tasks that consume a lot of resources (CPU, network bandwidth or else). I suspect this might be implemented as an opt-in behaviour (even in the long run) for pipelines, as persisting the intermediate pipeline status might be expensive, both in terms of execution time as well as storage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retrying a part of a pipeline might be a not a good practise, as it could lead to success for a pipeline that would normally have failed.

This could happen, if the pipeline author writes tasks unaware that they could be retried on a failure later. E.g. in your example, if (B) has a if check, that relied on the side effect that (A) produced and if A's side effect has somehow vanished. Then (B)'s course of execution that leads to success could be changed.

It is possible, that new course of action is desirable by the actor who retried. In other words, the responsibility of outcome, is shared between, the author of the pipeline/task and the actor who retried.

tektoncd's liability is limited to naively retrying failed tasks, if that leads to failure due to missing dependencies etc.. then, those errors are reported as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Therefore, opt-in behaviour makes perfect sense.

Comment on lines 203 to 204
Allow setting the status of failed pipeline to something like:
[TEP-0015 Pending](0015-pending-pipeline.md).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate a bit more on this? I'm not sure understand how this would work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more suited for our discussion during design, may be we can drop this section for now (i.e. current stage of discussing the problem statement).

Comment on lines 77 to 91
1. Retry of successful pipeline runs or anything other than a failed pipeline
run.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that implementing this will require us to implement features like checkpointing of a pipeline, that might be used to re-run a pipeline or parts of a pipeline even if it did not fail, but I think it's fine to restrict the focus of the TEP to the failed case only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Focus of our TEP is retry failed tasks only.

Comment on lines 80 to 94
3. Discuss checkpointing of pipeline state or workspaces, etc. A `pipelineRun`'s
state stored in etcd is used as is.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this feature to work, the pipeline and taskruns in the retry execution will need access to the state of the previous run, so in case any of the state is deleted, the re-run will fail:

  • writable workspaces needs to be still available:
    • those provisioned by tekton will stick around until the pipelinerun is not deleted
    • those provided by users are not controller by tekton
  • results will be store in etcd in the original pipelinerun

Even workspaces + results might not be enough though, since the initial tasks or a pipeline might setup state outside of the workspace (e.g. create a test cluster, provision a test service), and that state will be usually cleaned-up by the finally part of the pipeline. If the URI of the resources provisioned by initial tasks is exposed as results, those URIs can be passed to the new execution and if still valid the new execution could succeed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We put checkpoint as non-goal for now because we understand storing the state of the pipeline could be very complicated. Thus in this TEP we want to focus on how to manually retry the same pipelinerun just similar to how users cancel the pipelinerun.

If the information in etcd is not enough to retry the pipelinerun, then we may consider adding checkpoint as one of the requirements.

@ScrapCodes
Copy link
Contributor Author

ScrapCodes commented May 10, 2021

@ScrapCodes Thanks for the TEP. This is a really tricky problem/subject and one that vaires a lot depending on the users and the pipeline they run.

Thank you @vdemeester :)

Running naively only the Task in a Pipeline that failed is a really britle and error-prone I think.

Let's take a example, and let's say I have a pipeline that build an image from source-code, provision a cluster, push the image to the cluster internal registry, run some test, and clean the cluster at the end (in a finaly). If the "run some test" task failed, re-running only that task would fail instantaneously as, the cluster is gone, we did not provision a new one, …

I guess if a pipeline has a clean up in the finally, then rerun is ought to fail with suitable error. And that is expected, as a outcome of attempting a retry.

How do we know which task we need to re-run in addition to the failed one ?
We could think of adding a way to declare such required dependency, so that we can re-run all the required task before the one that failed and that we want to re-run. This, however, add more complexity and verbosity to the API, for something that "might" happen from time to time.

Similarly, at present, a task may miss an input, which was let's say the output of the previous task. Then, it is expected that the retry will fail as well. Idea of having a retry as on-demand, is a user makes a manual judgement and invokes a retry. It is part of Non goal, to manage the state of the failed pipeline by tektoncd.

Taking the example of kubernetes/kubernetes (from the TEP), I understand the need, but as commented, the /retest re-running only the failed jobs is something that is handled on its own, by a specific component. In addition, to compare prow's /retest with Tekton we need to make some assumption: /retest re-run jobs (failed one) with the assumption (it's a given) that a job stands on its own, aka does not depend on anything else than itself. If we were to compare, each prow job should be seen either as a single TaskRun (running one-off task) or a PipelineRun (running a pipeline). And thus, one kubernetes/kubernetes, if you have 10 jobs (that generate 10 ci status), you would have 10 pipelines ; re-running the failed one works just fine as each pipeline are standing on their own. Even better, the component that handles /rerun for prow could be re-used for tekton pipelines as well (and even if not using prow, a component that handle this, could work with a lot of different construct in addition to tekton pipelines).

This is correct, somehow, I could not communicate, that this support is to be used to invoke retry failed tasks manually. And thus, intention is not to make somehow the failed pipeline pass. If for example, a task run is not of the type "one-off", and needs some dependencies to be met before it can begin, then it is well within the reason that it fails again if those dependencies are not met. The idea here is, suppose the failure was merely due to a fetch failure or something trivial, then the entire progress of pipeline which was very resource consuming is not lost. And because such reattempts could be unsuccessful in some cases (perpetually), does it indicate that we should not support retry at all?

Somehow some pipelines are very resource consuming, for example they may be running some stress tests or benchmarks and may or may not have interdependency between each other. The manual invocation of retrying the failed tasks can save us the resources that may have consumed, if the whole thing has to be rerun. For example, after running a set of benchmarks, losing all the progress because there was connect failures while executing the taskRun that posted the result. A retry would fix that. ( A very crude example, just to illustrate the point, actually one would normally add retry for such a step.)

Starting a pipeline all over again from the last run's configuration is already supported by tektoncd cli (tkn pipeline start --use-pipelinerun string ).

@ghost
Copy link

ghost commented May 10, 2021

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label May 10, 2021
@ScrapCodes
Copy link
Contributor Author

I will update the TEP soon, on the basis of the discussions above.

@chhsia0
Copy link
Contributor

chhsia0 commented May 10, 2021

/assign @afrittoli

@chhsia0
Copy link
Contributor

chhsia0 commented May 10, 2021

@vdemeester Would you like to be assigned to this TEP?

@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign afrittoli
You can assign the PR to them by writing /assign @afrittoli in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2021
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2021
@ScrapCodes
Copy link
Contributor Author

Hello @afrittoli and @vdemeester , thank you for reviewing the TEP and helping with the feedback, all of that was helpful in steering the direction that this TEP will move.

I have tried to apply the feedback and updated the TEP, please take a look again?

@bobcatfish
Copy link
Contributor

/assign

@bobcatfish
Copy link
Contributor

hey @ScrapCodes !! I want to +1 to @afrittoli 's point about how this is similar to partial pipeline execution (#422 (review))

Partial pipeline execution (tektoncd/pipeline#50) is one of our oldest issues and is in our roadmap and @jerop previously created a proposal around disabling a task in a pipeline which at the time we tabled b/c not enough folks were convinced that we needed to tackle the underlying problem, but im wondering if being able to partially execute a pipeline would solve the problem you describe. for example imagine a scenario like this:

  1. you run a long running pipeline with "pipelinerun bar", "task foo" fails
  2. via a command line tool (e.g. tkn) you could run something like tkn pipelinerun copy --skip-failed bar or something (design very tbd haha) which would look at the pipelinerun bar and generate a new pipelinerun with the same pipeline but disabling the failed tasks

so TL;DR i agree with the problem statement you are describing, but also im thinking we could include the requirements and use cases from tektoncd/pipeline#50 and the disabling a task in a pipeline proposal (it even mentions a use case in the problem statement which sounds very much like yours!! "When running a very long Pipeline, there may be a transient failure and the user may want to only rerun the Tasks that failed..."), what do you think?

@ScrapCodes
Copy link
Contributor Author

ScrapCodes commented May 22, 2021

Hello @bobcatfish, thanks for linking that discussion(shame on me, that I was unable to find it), it does seem to be a prior art 😄, and use cases can be borrowed too.

AFAIU, there is one subtle difference in our requirements, when we say partial pipeline execution or checkpointing, it does seem that tekton's liability is slightly more than just naively rerunning a task. That brings in the issue of what happens to all the other things(pods,pvcs,... which may have vanished even if pipeline state is intact) that a tasks depends on, when it is started at a later time than the rest of the pipeline. Managing all that is beyond our scope, because a task can be written in so many ways.

However, in this TEP's case, e.g. a pipeline is running a task which "supports retry" , and it fails. Since, it supports retry the liability is of the "pipeline/task" author to ensure that it can be retried on a failure. In this case, tekton has a limited liability of rerunning the task naively. In this way, we can support a reasonably common use case, without solving a very hard problem.

Do you think this subtle difference make sense?

Disabled task, is an interesting topic, can I suggest it is significant enough to stand on it's own as an independent TEP. I can work on it, if you like it that way. But, by keeping the surface of this TEP a bit constrained, we may support a narrower set of use cases, and at the same time easier(may be quicker, could be included in current release) to get feedback, approval and implement as well. Currently retries are already supported, we just need to support an on-demand invocation of the same. Copying existing pipeline run with some tasks disabled, can this be achieved by skipping the task with some when expression as well? For those not using tkn cli, it can be some extra work to copy all the pipelineRun tasks that failed.

I may be wrong in my understanding, just sharing so that I can be corrected. Since you have already worked and invested a lot of thoughts on this, would you and/or @jerop like to co author this?

@afrittoli
Copy link
Member

Oh, indeed, we arbitrarily delete history, I did not expect that.
I think the reasoning behind that is that the TaskRun status is stored in the PipelineRun before we reset it.

you mean, history is preserved in RetriesStatus field.

Oh, I see, RetriesStatus is part of the TaskRun itself, I got it wrong again 😅
Thank you for clarifying!

@ScrapCodes
Copy link
Contributor Author

@afrittoli Would like to take another look! Even if your overall stance remain unchanged, you can resolve those comments that are clarified 😄

@vdemeester
Copy link
Member

I think I'm convinced that it makes sense to introduce a "retry by resuming after failure" feature at the level of a pipelinerun!
[…]
I'm thinking we could still use partial execution to make this happen

I am still not convinced this is at the pipelinerun level we need this, even at all tbh. This feature can be "done" through well written tasks and / or might not even be needed in most cases. Point being, the question we have to ask ourselves is how much do we want to add complexity to the Pipeline/PipelineRun reconciler for what gain ? (how many user will benefit from this, how hard will it be to use this feature, will there be requirement for this to work correctly, …)

  • How do we ensure we re-run only the tasks that need to be run ?
  • Does this require task to be idempotent to be able to skip them ?
  • How can we mark some task as not skipable ?
  • Can the use cases this TEP solves be handled better with a different design ? multiple pipeline, event driven, … I think the ML use case could be solved that way. On multiple pipeline, event driven, etc. I see a parallel with gating of pipeline for example.
  • Related to above, are we solving this because we are writing huge pipeline that do a lot (and thus when something is flakey in the middle, all have to run again) ?

The more I think about it, the more I am convinced this could / should be tackle by a higher construct, something like Tekton Workflows #464 for example. I'd rather keep Pipeline as simple and flexible as it can be, and build on top 👼🏼 .

@bobcatfish
Copy link
Contributor

bobcatfish commented Jul 2, 2021

Oh, indeed, we arbitrarily delete history, I did not expect that.

haha me too @afrittoli - i was surprised to realize how taskrun retries were implemented XD

Perhaps @joseblas or @bobcatfish or @abayer have more context on the reason behind that choice?

im guessing it was just b/c it was easy - i dont remember realizing at the time how this was actually being implemented. (this isnt the only case of taskrun behavior being added just to support being executed via pipelinerun, i.e. a hidden coupling b/w taskruns and pipelineruns, for example tektoncd/pipeline#1069)

The more I think about it, the more I am convinced this could / should be tackle by a higher construct, something like Tekton Workflows #464 for example.

I like that idea a lot @vdemeester !! maybe implementing this in workflows is the way to go.

For the questions you raised re. how this functionality would work, I'd like to point again toward tektoncd/pipeline#50 and the disabling/partial execution proposal - we'd address these questions by adding an interface for explicitly disabling tasks in a pipeline, with the ability to explicitly provide params and workspaces as needed.

Somehow, I could not understand the benefit of having partial execution as part of the proposal. I can still add it, do you think we can add?

@ScrapCodes the reason we need partial execution is b/c this is the feature that this proposal is requesting.

From my understanding (let me know if this is wrong) this proposal is saying:
a. 'i need a way to execute just part of my pipeline (i.e. the failed tasks and everything after it)'
b. also 'i want to be able to do this within the context of one pipelinerun'

The proposal is combining these together into one feature request, where the functionality in (a) is implicit - i.e. the user says "retry this pipelinerun from the point it fails at" and then gets (a) automatically.

I'm suggesting we start by adding the functionality in (a) as part of the pipelinerun interface - because we're going to need it regardless of whether or not this is a feature of pipelineruns - and if this going to be a feature of a layer above pipelineruns (such as workspaces), we absolutely need it to be explicit.

For example, let's say we want to use this feature for this example pipeline. The pipeline looks kinda like this:

  1. fetch-from-git: provides 'output' workspace with git source
  2. skaffold-unit-tests: uses 'output' workspace
  3. build-skaffold-web & build-skaffold-app run after the unit tests, both also use the workspace and build images, providing IMAGE_DIGEST results
  4. deploy-web & deploy-app use IMAGE_DIGEST results from the build tasks (and also the output workspace)

Let's say we had 'retry from failure' for this pipeline, and build-skaffold-web failed, but build-skaffold-app succeeded. The PipelineRun would have a state like this:

  • Succeeded tasks: fetch-from-git, skaffold-unit-tests, build-skaffold-app
  • Failed tasks: build-skaffold-app
  • Tasks that didn't run: deploy-web (its a bit of a race, if build-skaffold-app was faster than build-skaffold-web, it may have run), deploy-app

What would happen when the task is re-run from the failure point?

The resulting run would only partially execute the pipeline:

  • it wouldn't run the tasks the succeeded
  • it would need to resume from a state that looks like the tasks that succeeded had all just completed, then determine what to run next (in this case, build-skaffold-app and probably deploy-web as well)
  • it would need to provide all of the results + workspace contents that were provided by the successful tasks

Let me know if this still isn't clear: what im trying to say is that the requirements for this feature include the requirements for partial execution, so we're going to need that feature either way.

@Tomcli
Copy link
Contributor

Tomcli commented Jul 3, 2021

@ScrapCodes Maybe we can break it down into multiple smaller TEPs. In our original TEP, one of the key features we want to get out of this TEP is to retry individual task on demand in a pipeline. It seems like the current task retry is not a very good implementation, so we can use a new TEP to focus on that.

As of some other features like partial pipeline execution would be helpful but it could get very complicated on figure out the correct I/O if the partial pipeline is running on a new pipelineRun. So I would suggest let's break it down into these steps.

  1. TEP for how to make individual task to retry on demand. Also figure out what should be the proper way to retry a task.
  2. TEP for updating the pipelinerun reconciler to resume the pipeline status from failed to running if a taskRun is retried on demand.
  3. TEP for partial pipeline
  4. TEP for pipeline retry where the pipeline controller is responsible for figuring out which failed tasks need to be trigger retry.

From KFP-Tekton, we care more about 1 and 2 from above because most of our use cases are to retry one or two tasks on demand. 3 and 4 is more like nice to have to improve the user experience, but we don't really need it in most scenarios.

@ScrapCodes
Copy link
Contributor Author

My understanding is to begin with 3. and then work on 1 & 2. Soon, I will start the TEP for it.

Thanks @jerop and @bobcatfish, since you have already worked on it, would you like to co-author / author? (@bobcatfish has already indicated her preference.) 😄

@bobcatfish
Copy link
Contributor

Sounds like a good approach to me @ScrapCodes !! @Tomcli I'm a bit less clear on definitely pursuing 1 + 2 as you described but like you said maybe they warrant their own TEPs and I'll be able to understand the need a bit better if we pursue them separately.

@ScrapCodes I'm very happy to co-author as long as I'm not driving it haha XD If you want to collaborate in a Google doc or via hackmd I'm happy to contribute!

@ScrapCodes
Copy link
Contributor Author

ScrapCodes commented Jul 9, 2021

Added comments to: Disabling a Task in a Pipeline

@Tomcli
Copy link
Contributor

Tomcli commented Jul 9, 2021

Sounds like a good approach to me @ScrapCodes !! @Tomcli I'm a bit less clear on definitely pursuing 1 + 2 as you described but like you said maybe they warrant their own TEPs and I'll be able to understand the need a bit better if we pursue them separately.

@ScrapCodes I'm very happy to co-author as long as I'm not driving it haha XD If you want to collaborate in a Google doc or via hackmd I'm happy to contribute!

Thanks @bobcatfish, 1 + 2 is the building blocks for how to retry a taskRun on demand. There are some discussions on how the taskRun retry could be done better, so we think is better to discuss that on a separate TEP.

Once we agreed on how taskRun retry should be done and the corresponding reconcile logics, we can apply them to 3 + 4 since pipelineRun retry needs to retry the failed taskRun on demand.

@jerop
Copy link
Member

jerop commented Jul 26, 2021

@ScrapCodes @Tomcli is this TEP on hold as we explore TEP-0077: Partial pipeline execution, or are you pursuing both of them separately?

@afrittoli
Copy link
Member

/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 26, 2021
@Tomcli
Copy link
Contributor

Tomcli commented Jul 26, 2021

@ScrapCodes @Tomcli is this TEP on hold as we explore TEP-0077: Partial pipeline execution, or are you pursuing both of them separately?

Yes we are breaking down this TEP into smaller pieces, so we can keep the TEP on hold until we solve all the prerequisites.

@pritidesai
Copy link
Member

on hold while we persue TEP-0077

@tekton-robot
Copy link
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2021
@tekton-robot
Copy link
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 30, 2021
@dibyom
Copy link
Member

dibyom commented Dec 6, 2021

@ScrapCodes are you still actively working on this TEP? Or should we close this for now and re-open it later?

@ScrapCodes
Copy link
Contributor Author

@dibyom , I am currently not pursuing this TEP. Happy to close it for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Opened
Development

Successfully merging this pull request may close these issues.

10 participants