
Configurable sandbox (pause) image #2020

Closed
rosti opened this issue Feb 6, 2020 · 34 comments
Labels
area/ecosystem kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@rosti

rosti commented Feb 6, 2020

Originally posted by @Random-Liu in #1796 (comment)

Can we make pause image repository and tag configurable?

People may build and use a different pause image. :)

Here is an example PR: https://github.com/kubernetes/kubernetes/pull/87827

Example PR by @Random-Liu kubernetes/kubernetes#87827

Moving the discussion here for the sake of keeping the v1beta3 issue relatively clean and on track.

/kind api-change
/area ecosystem
/assign

@k8s-ci-robot k8s-ci-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API area/ecosystem labels Feb 6, 2020
@rosti
Copy link
Author

rosti commented Feb 6, 2020

According to the kubeadm charter, we expect users to have configured their container runtimes properly for use in a Kubernetes cluster. As such, kubeadm refrains from starting, stopping, configuring, or in any way trying to change the configuration and state of the container runtime itself.
However, there is one notable exception to this, and that is Docker itself. That is caused by its sheer popularity and the fact that SIG Node maintains the large "docker shim" component as part of the kubelet.
At this point, kubeadm supplies the kubelet with the --pod-infra-container-image command line option to override the pause image. This makes sure the pause image can be pre-pulled before use, and that air-gapped environments don't try to access the Internet. However, this works only for Docker, as this option is used solely by the docker shim.
Other container runtimes, like containerd and CRI-O, have their own separate options for sandbox image configuration. These are not propagated by the kubelet, probably because the kubelet cannot have code to modify all the container runtimes out there. Neither does kubeadm, for that matter. In addition, CRI does not have a sandbox image override option.
Hence, I don't think that we should expose a config option in the kubeadm config for pause image override, because that option is going to be used only for image pre-pulling and the docker shim.
There is a lot of uncertainty about the latter and what path it takes, and the former is completely out of scope if kubeadm cannot ensure that the pre-pulled image is going to be used by the container runtime.
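For context, the mechanism looks roughly like this (file path per standard kubeadm installs; the exact image tag varies by release):

```shell
# Illustration: kubeadm renders the flag into the kubelet's dynamic
# environment file, but only the docker shim actually consumes it.
cat /var/lib/kubelet/kubeadm-flags.env
# KUBELET_KUBEADM_ARGS="--pod-infra-container-image=k8s.gcr.io/pause:3.1 ..."
```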

@rosti rosti mentioned this issue Feb 6, 2020
@fabriziopandini
Member

I agree that kubeadm should not enter the business of configuring CRIs, unless this is done through the kubelet.
So the questions here are:

  • should we ask SIG Node what the plans are for --pod-infra-container-image with regard to other CRIs?
  • should we, as a kubeadm team, provide support for setting --pod-infra-container-image via config only for Docker, which, as far as we know, is the most used CRI?

@ereslibre
Contributor

ereslibre commented Feb 10, 2020

I also agree that configuring the CRI is out of the scope for kubeadm.

Adding the PauseImage via config when only one runtime will honor this setting does not look like a good abstraction for our config. Maybe it's the most used one, but I'd rather add documentation instead for users who rely on Docker and require this very specific feature.

In general, we can add to the documentation that if you want to configure the pause image, you have several options:

  • No specific pause image needs

    • Docker: we will default it to ImageRepository/pause:3.1, whatever ImageRepository is based on the configuration, and whatever pause image version the current kubeadm version has in its constants.
    • Other container runtimes: we do not attempt to configure them; their default configuration is what rules -- check what their settings are on your current distribution.
  • Need to set a specific pause image

    • All container runtimes: provide documentation on how to configure the container runtime:
      • Docker: how to make sure that kubeadm won't override it with the dynamic env file? A systemd drop-in?
      • Other container runtimes: check their specific documentation on how to set the pause image before calling kubeadm.
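A sketch of the systemd drop-in route mentioned above, for Docker (unit name and the KUBELET_EXTRA_ARGS convention are assumed from kubeadm's standard kubelet service file; the image is a placeholder -- verify on your distribution):

```shell
# Sketch only: make the kubelet use a custom pause image without kubeadm's
# generated env file winning; check unit/variable names on your system.
mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' > /etc/systemd/system/kubelet.service.d/20-pause-image.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--pod-infra-container-image=registry.example.com/pause:3.1"
EOF
systemctl daemon-reload
systemctl restart kubelet
```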

@rosti
Author

rosti commented Feb 10, 2020

  • should we ask SIG Node what the plans are for --pod-infra-container-image with regard to other CRIs?

This command line flag is tightly coupled to the docker shim. So I suspect that its long term life is going to be connected to that (rather than the kubelet itself).
Is the docker shim moving out? Is kubelet getting in on the business of configuring more than just Docker? I do think that the answers to those questions can only be given by SIG Node.

  • should we, as a kubeadm team, provide support for setting --pod-infra-container-image via config only for Docker, which, as far as we know, is the most used CRI?

We already do that through the kubelet's docker shim command line.

Also, I don't see a clear user benefit here (in kubeadm managing the sandbox image). Image pre-pull for air-gapped environments? That can be done by users well before the k8s packages are installed.

@BenTheElder
Member

Is there any way today to tell kubeadm "hey, actually I'm using this pause image"?

While I don't need kubeadm to configure the CRI, I'm aware of what image the CRI implementation is using, and I'd like the kubelet flag and the image pull to match (or just no image pulling, which seems to be possible in 1.13+ with --skip-phases=preflight).

Not configuring this properly in kubeadm indeed breaks airgapping.

Also worth noting: IIRC the kubelet internally has an image GC blacklist containing specifically the pod sandbox image, so you probably want the kubelet to know about this until we get a more flexible whitelist. (At some point I want to revive an old KEP for a configurable whitelist.)

Other than the image pulling (which is never desirable in our case) and the image GC thing, in practice the CRI implementation implements the pod sandbox, and it should be an implementation detail what this image is, without any kubeadm involvement, unless kubeadm wants to start trying to manage CRI implementations.

@BenTheElder
Member

Is the docker shim moving out? Is kubelet getting in on the business of configuring more than just Docker? I do think that the answers to those questions can only be given by SIG Node.

TBD, though there is provisional support now for building kubelet w/o dockershim, I think eventually it will move out.

Also, I don't see a clear user benefit here (in kubeadm managing the sandbox image). Image pre-pull for air-gapped environments? That can be done by users well before the k8s packages are installed.

IMHO it's mildly useful to make sure the flag aligns, but it would be just as good for kind at least to just not have kubeadm manage the pause image at all and leave us to supply this flag.

Probably the main reason this is a thing is for people configuring ImageRepository to use a mirror consistently..?

IMHO this is a somewhat questionable feature to begin with though, users could instead configure docker / containerd / cri-o to use a mirror for k8s.gcr.io at that layer.

@BenTheElder
Member

BenTheElder commented Apr 22, 2020

The other problem with NOT having a structured field for this in kubeadm config is that kubeadm also doesn't have a structured way to discover what will be used, so you can't automate aligning these cleanly when using CRI instead of dockershim.

At best you can do kubeadm config images list | grep pause and then insert that into the relevant implementation config which is pretty gross and assumes that we'll always have the substring pause in just that one image.

It's also entirely possible to implement the CRI spec without the pause image or a totally custom pause image for that matter.
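The fragility of the substring match can be seen in a small sketch (the image list below is simulated; real `kubeadm config images list` output varies by version):

```shell
# Simulated `kubeadm config images list` output; extracting the sandbox image
# by substring match only works while exactly one image contains "pause".
images='k8s.gcr.io/kube-apiserver:v1.18.0
k8s.gcr.io/kube-controller-manager:v1.18.0
k8s.gcr.io/etcd:3.4.3-0
k8s.gcr.io/pause:3.2'
pause_image=$(printf '%s\n' "$images" | grep pause)
echo "$pause_image"   # -> k8s.gcr.io/pause:3.2
```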

@neolit123
Member

neolit123 commented Apr 22, 2020

Probably the main reason this is a thing is for people configuring ImageRepository to use a mirror consistently..?

users (especially in China) have air-gap local registries where they rely on kubeadm config images to tell them what they need to prepull for a working cluster. by doing that they are also prepulling the sandbox container.

from there CR docs need to tell them where to apply this image in the CR config. from my outdated investigation this is not exactly well documented by the CRI providers.
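The air-gap pre-pull workflow described above can be sketched as follows (the registry name is a placeholder, and the docker CLI is only one option; containerd users would use ctr or crictl instead):

```shell
# Sketch: mirror every image kubeadm needs (including pause) into a local
# registry for an air-gapped cluster. "registry.local" is a placeholder.
for img in $(kubeadm config images list); do
  docker pull "$img"
  mirrored="registry.local/${img#*/}"   # strip the original registry prefix
  docker tag "$img" "$mirrored"
  docker push "$mirrored"
done
```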

70% of the kubeadm users use dockershim and "pause" is managed for them (minus upgrades)
for the rest (30%) CRI mostly works, except that we are still getting the occasional k/kubeadm k/k issues about the sandbox not being configured properly.

At best you can do kubeadm config images list | grep pause and then insert that into the relevant implementation config which is pretty gross and assumes that we'll always have the substring pause in just that one image.

1.18 kubeadm has structured output for kubeadm config images list, IIRC.
but i agree that using kubeadm to tell CRI config what is the sandbox feels backwards.

While I don't need kubeadm to configure the CRI, I'm aware of what image the CRI implementation is using, and I'd like the kubelet flag and the image pull to match (or just no image pulling, which seems to be possible in 1.13+ with --skip-phases=preflight).

--skip-phases=preflight is viable for skipping prepull.
one potential improvement is adding ImagePullPolicy to kubeadm's config.

i think the dockershim changes will tell us what is going to happen in terms of the sandbox image. the two options are:

  • like @rosti is saying above - stop managing pause and document it
    • easier for us
    • more churn for new users
    • users manage pause upgrades too
  • continue managing pause in kubeadm and pass the relevant image as a dockershim flag:
    • complicated for us (it already is + dockershim will be a new service)
    • dockershim's future state of flags/componentconfig is unknown
    • upgrading pause needs at least docs updates

@BenTheElder
Member

BenTheElder commented Apr 22, 2020

users (especially in China) have air-gap local registries where they rely on kubeadm config images to tell them what they need to prepull for a working cluster. by doing that they are also prepulling the sandbox container.

... precisely? but only if kubeadm is managing this image / flag. Do you have any additional reasons for managing this image in kubeadm?

from there CR docs need to tell them where to apply this image in the CR config. from my outdated investigation this is not exactly well documented by the CRI providers.

It is abnormal to override this in a container runtime implementing pods because the sandbox image is part of the pods. However it's not difficult to find the pod sandbox image option in containerd CRI plugin config, it's no less well documented than anything else.

https://github.com/containerd/cri/blob/master/docs/config.md#cri-plugin-config-guide

sandbox_image
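For reference, the containerd override is a one-line TOML setting (plugin section name shown for containerd 1.3+; older releases of the standalone cri plugin use `[plugins.cri]` instead):

```toml
# /etc/containerd/config.toml -- pod sandbox image override for the CRI plugin
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "k8s.gcr.io/pause:3.2"
```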

70% of the kubeadm users use dockershim and "pause" is managed for them (minus upgrades)
for the rest (30%) CRI mostly works, except that we are still getting the occasional k/kubeadm k/k issues about the sandbox not being configured properly.

I don't know what percentage kind users are, but the issue here is that not only does this not work properly, it doesn't functionally make sense.

Even for the dockershim users kubeadm only needs to be specifying this if you're overriding the image registry. Otherwise kubelet defaults this just fine, and the image pulling for pause is of questionable value.

1.18 kubeadm has structured output for kubeadm config images list, IIRC.

Unfortunately this output while "structured" contains no more data, it's still just a grab bag list of images.

i think the dockershim changes will tell us what is going to happen in terms of the sandbox image.

I see no related details there -- it's still just: in dockershim mode the pod is implemented by us and we specify this, and in CRI mode it's only relevant to the kubelet imageGC.

like @rosti is saying above - stop managing pause and document it

As I said if you don't manage this, usage of imageRepository will become problematic, users will wind up reimplementing imageRepository for this image anyhow.

The entire imageRepository concept could be obviated by setting a comparable mirror at the runtime level though.

If users override nothing "manage pause upgrades" is a no-op except when airgapped. The new kubelet / dockershim will have a new default pause image.

External CRI implementations upgrade on their own schedule and aren't tied to kubernetes versions.

@neolit123
Member

neolit123 commented Apr 22, 2020

... precisely? but only if kubeadm is managing this image / flag. Do you have any additional reasons for managing this image in kubeadm?

only prepull / air-gap, i'd say.

Is there any way today to tell kubeadm "hey, actually I'm using this pause image"?

is the request on the kind side related to kubernetes-sigs/kind#1471?

isn't --ignore-preflight-errors=all (which kind is doing, AFAIK) sufficient to stop kubeadm from pulling any images during init/join?

actually it has to be --skip-phases=preflight because preflight still pulls. so isn't that a sufficient workaround?
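For reference, the workaround as a command line (config file path is a placeholder):

```shell
# Skip the preflight phase entirely (available since kubeadm v1.13), which
# also skips the image pre-pull; "kubeadm.yaml" is a placeholder path.
kubeadm init --skip-phases=preflight --config kubeadm.yaml
```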

As I said if you don't manage this, usage of imageRepository will become problematic, users will wind up reimplementing imageRepository for this image anyhow.

that is a side effect, but we may have to tolerate this.

@neolit123
Member

if we end up adding #524 with support for a pull policy of Never, this could also skip pre-pull completely, but it will not align with the k8s pull policies.

alternatively we can break preflight down into sub-phases, so that one can skip --skip-phases=preflight/prepull

@BenTheElder
Member

the kind issue is just what brought me here; we'll need to employ workarounds for older k8s versions anyhow, but I don't think this is a good state to be in going forward with CRI becoming more common (e.g. CAP*)

actually it has to be --skip-phases=preflight because preflight still pulls. so isn't that a sufficient workaround?

yeah, which sadly 1.11 / 1.12 don't appear to have.

that still leaves us doing substring matches on the images which I guess works but feels wrong.

that is a side effect, but we may have to tolerate this.

yeah, again I might suggest documenting the simpler way: https://docs.docker.com/registry/recipes/mirror/
https://github.com/containerd/cri/blob/master/docs/registry.md
https://github.com/containers/image/blob/master/docs/containers-registries.conf.5.md (cri-o)
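As a sketch of that simpler mirror approach, for containerd's CRI plugin (the mirror endpoint is a placeholder; section names follow the containerd 1.3+ schema):

```toml
# /etc/containerd/config.toml -- redirect k8s.gcr.io pulls (including pause)
# through a local mirror instead of teaching kubeadm about the image
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."k8s.gcr.io"]
  endpoint = ["https://mirror.example.com"]
```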

@neolit123
Member

neolit123 commented Apr 22, 2020

the kind issue is just what brought me here; we'll need to employ workarounds for older k8s versions anyhow, but I don't think this is a good state to be in going forward with CRI becoming more common (e.g. CAP*)

yeah, which sadly 1.11 / 1.12 don't appear to have.

it's hard to maintain such a big skew. i'd drop support for these older versions at this point.
with the current skew (without LTS KEPs, etc.) we try to close <1.16 tickets in k/k and k/kubeadm unless the user wants to confirm the bug is present in the skew, or a developer has the time to check.

a related issue to the kind one was logged today; the user is asking why they have two pause images:
kubernetes/kubernetes#90326

i see this as a potential misconfiguration of containerd. yet this, combined with the kind issue, maybe signals that it's time for kubeadm to stop managing the sandbox image.

@BenTheElder
Member

BenTheElder commented Apr 26, 2020

i see this as a potential misconfiguration of containerd

er, the pause image is not kubernetes version specific unless we're talking about dockershim because that is itself a partial container runtime. containerd, cri-o etc. should use whatever pod sandbox image meets their needs.

for example, using the latest containerd with any kubernetes version works fine, but will use the 3.2 image, while older kubeadm will assume the 3.1 image.

you can make containerd / the CRI plugin use whatever pause image kubeadm expects, but there's not a good reason to except for the double image pulling.


FTR we are currently rewriting the containerd config to roughly match kubeadm config images list | grep pause as a minimum fix for now, but may not do this in the future.

Ideally we should defer to the CR for pod sandbox management, which probably doesn't need to involve kubeadm, except that it's going to be best practice to prevent imageGC of the sandbox image since all pods need it. (Currently this is done via the kubelet sandbox image flag even if not using dockershim; imho it should be an image GC denylist in the future.)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 24, 2020
@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2020
@neolit123 neolit123 added this to the v1.21 milestone Dec 2, 2020
@bboreham

bboreham commented Feb 3, 2021

Related: aws/eks-distro#140

@neolit123
Member

minor update here.

with the dockershim extraction supposedly happening in the next few releases, kubeadm no longer has to manage pause.
we are going to stop passing --pod-infra-container-image to the kubelet (dockershim).

but kubeadm continuing to manage the pause image will allow users to see the default pause bundled with a k8s version. this is a good idea in general, related to security fixes in pause, although it doesn't change often.

the API change requested here (which can land in v1beta3) still makes sense for pre-pull - e.g. users might choose to use the GCR pause, but configure an image repository for everything else.

@BenTheElder
Member

we are going to stop passing --pod-infra-container-image to the kubelet (dockershim).

One side effect of this flag is kubelet will never try to GC this image ever (and only this image), which may be desirable.

Kind now avoids the preflight / pull, but I still think it makes sense for other users.

I've also seen chitchat about CRI implemented without the pause image entirely / your CRI implementation will have a tested default independent of the Kubernetes release cycle, so it may make sense to decouple this entirely.

At the very least making it configurable in kubeadm would make it easier to align both on the CRI preferred version.

@neolit123
Member

One side effect of this flag is kubelet will never try to GC this image ever (and only this image), which may be desirable.

that's a good point. if the kubelet actually guarantees that via this flag we should be passing it from kubeadm for all CRs.
this reminded me of this bug:
"Kubelet garbage-collects the pause image, which might be unpullable in ContainerD"
kubernetes/kubernetes#81756

@BenTheElder
Member

Yeah that's it exactly, I investigated the sources previously and found the behavior described in that issue: the flag is not gated only to dockershim as documented and imageGC at least will skip this image. It also has a default.

I think the better approach would be a way to tell kubelet "don't ever delete these images" but I don't know if anyone has the bandwidth to get something like that through (technically easy, but features take a lot of effort to get approved). It could also arguably be done in the CRI instead (rejecting deletes?)

@pacoxu
Member

pacoxu commented Feb 4, 2021

In kubernetes/kubernetes#97438 (comment), containerd uses the sandbox_image setting in its configuration at /etc/containerd/config.toml. Should we check it during installation (configure it, or warn if needed) or before GC'ing the pause image?

@dims
Member

dims commented Feb 4, 2021

cc @adisky

@adisky

adisky commented Feb 4, 2021

As I understand it, there are two things here -- for which the same flag, pod-infra-container-image, is used:

  1. Overriding the pause image for the container runtime with the kubelet flag.
  2. Skipping eviction of the pause image by the kubelet GC.

[1] applies only to docker-shim, [2] applies to all container runtimes (possibly a missed side-effect).

I am not sure, but AFAIU we do not want [1] for all container runtimes, yet we are concerned about [2].
To solve [2], we have the following options:

  • Introduce a new configurable option to specify the pause image for the kubelet, as pod-infra-container-image will go away soon.
  • As suggested by @BenTheElder above, it could also arguably be done in the CRI instead (rejecting deletes?).

@neolit123
Member

@pacoxu

In kubernetes/kubernetes#97438 (comment), containerd uses the sandbox_image setting in its configuration at /etc/containerd/config.toml. Should we check it during installation (configure it, or warn if needed) or before GC'ing the pause image?

i don't think the kubelet would like to check the config.toml.

@neolit123
Member

@adisky @BenTheElder

similar to the cgroup driver problem, instead of passing a flag to the kubelet for feature X, the kubelet should be able to ask about the configuration at the CRI socket.

i can see that crictl info already shows the full config for containerd (inc cgroup driver and sandbox_image), but that doesn't work for cri-o and dockershim.
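For containerd this can be inspected over the CRI socket today, e.g. (socket path is distribution-dependent; the field is containerd-specific, not part of CRI):

```shell
# Query the runtime's status/config via the CRI socket; containerd reports
# its sandbox image in the verbose info blob.
crictl --runtime-endpoint unix:///run/containerd/containerd.sock info | grep sandboxImage
```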

@adisky

adisky commented Feb 4, 2021

@neolit123 Thanks for the pointer. I tried it too; containerd returns "sandboxImage":"k8s.gcr.io/pause:3.2" in the Info field of StatusResponse https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1/api.pb.go#L6054, but this is not standardized. How will the kubelet know in which parameter the container runtime passes the pause image info?

@neolit123
Member

it has to be standardized eventually; until then, the kubelet may have to accept the image that should not be GCed.

@neolit123
Member

neolit123 commented Feb 25, 2021

any objections to starting to pass --pod-infra-container-image to the kubelet for all container runtimes (not only docker) in 1.21?

as discussed above this flag tells the kubelet image GC to not remove it.

code in kubeadm is here:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/kubelet/flags.go#L82-L83

@BenTheElder
Member

I still think there should be a more flexible way to not evict critical images, but short of that any configured "don't evict this pause image" autodetected from CRI SGTM.

The desync between dockershim's / kubeadm's preferred pause image and CRI has caused us to avoid preflight entirely (so as not to break airgapping). GC we have to blanket disable for now anyhow though.

@neolit123
Member

any objections to starting to pass --pod-infra-container-image to the kubelet for all container runtimes (not only docker) in 1.21?

here is the PR:
kubernetes/kubernetes#99476

@neolit123 neolit123 modified the milestones: v1.21, v1.22 Mar 9, 2021
@neolit123 neolit123 added kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Mar 31, 2021
@neolit123
Member

neolit123 commented May 19, 2021

had a chat with @fabriziopandini about this API change for v1beta3 (which is to be released in v1.22).
neither of us is in favor of adding such a field.

the main argument was already covered in the discussions above. in the future the kubelet should stop allowing the user to configure the "pod infra container". CRI implementors and the kubelet ideally should be able to communicate what image to:

  • use for the "parent container" of pods
  • not GC in the kubelet GC loop

with respect to kubeadm image pulls:

  • in the future kubeadm might drop pre-pull of pause entirely, but ideally this should happen after the above CRI / kubelet alignment. for the time being users can continue using the kubeadm pull to pull the latest / official k8s pause.
  • @fabriziopandini and i are in favor of adding support for imagePullPolicy when pulling kubeadm images (#524) in v1beta3, which will allow users to skip image pull entirely via a policy of Never.
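For illustration, such a pull-policy knob might look like this in config (a hypothetical sketch only; #524 tracks the actual v1beta3 design, and the field name/placement may differ):

```yaml
# Hypothetical v1beta3 sketch; see #524 for the real design.
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  imagePullPolicy: Never   # skip kubeadm's image pre-pull entirely
```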

@BenTheElder
Member

Connecting the dots:

I agree that long term it probably makes sense to move towards decoupling awareness of the pause image, since it's an implementation detail of the CRI implementation at this point.
