Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign device plugin reset #747

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

SchSeba
Copy link
Collaborator

@SchSeba SchSeba commented Jul 25, 2024

This commit introduces a new redesign on how the operator resets the device plugin

  • use a general nodeSelector to avoid updating the daemonset yaml
  • reduce clusterRole to role for config-daemon pod delete option ( better security)

Copy link

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@coveralls
Copy link

coveralls commented Jul 25, 2024

Pull Request Test Coverage Report for Build 11775513002

Details

  • 86 of 118 (72.88%) changed or added relevant lines in 4 files are covered.
  • 9 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+1.4%) to 47.024%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/utils/cluster.go 45 59 76.27%
controllers/sriovnetworknodepolicy_controller.go 21 39 53.85%
Files with Coverage Reduction New Missed Lines %
controllers/helper.go 1 69.96%
api/v1/helper.go 8 75.75%
Totals Coverage Status
Change from base Build 11775378543: 1.4%
Covered Lines: 7103
Relevant Lines: 15105

💛 - Coveralls

controllers/drain_controller.go Outdated Show resolved Hide resolved
controllers/drain_controller.go Outdated Show resolved Hide resolved
controllers/drain_controller.go Outdated Show resolved Hide resolved
@SchSeba SchSeba changed the title This commit introduces a new redesign on how the operator resets the … Redesign device plugin reset Jul 25, 2024
@SchSeba SchSeba force-pushed the device_plugin_redesign branch 5 times, most recently from a3e8a58 to ba8d6ef Compare July 30, 2024 09:05
@SchSeba
Copy link
Collaborator Author

SchSeba commented Aug 19, 2024

@zeeke @adrianchiris @ykulazhenkov when you have time please take a look on this :)

Copy link
Member

@zeeke zeeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments

controllers/drain_controller.go Outdated Show resolved Hide resolved
pkg/utils/cluster.go Outdated Show resolved Hide resolved
controllers/drain_controller.go Outdated Show resolved Hide resolved
@SchSeba SchSeba force-pushed the device_plugin_redesign branch 4 times, most recently from d719a8c to 0981cec Compare September 3, 2024 13:04
Copy link
Member

@zeeke zeeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. Overall design looks good to me but we must be super sure about all of these annotation states. I'm pretty scared of having a deadlock in some production cluster :)

controllers/helper.go Outdated Show resolved Hide resolved
controllers/drain_controller_helper.go Outdated Show resolved Hide resolved
controllers/drain_controller_test.go Outdated Show resolved Hide resolved
controllers/drain_controller_helper.go Outdated Show resolved Hide resolved
pkg/consts/constants.go Outdated Show resolved Hide resolved
@SchSeba SchSeba force-pushed the device_plugin_redesign branch 3 times, most recently from 146b706 to 95e9d84 Compare October 7, 2024 13:01
@SchSeba SchSeba force-pushed the device_plugin_redesign branch 3 times, most recently from 8152db1 to e2e2ad2 Compare October 10, 2024 11:38
Copy link
Collaborator

@adrianchiris adrianchiris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review.

left with drain controller and its helper.

its hard to review added logic since a bunch of code was moved while new code added in existing functions in the same commit :\

}

if newObj.GetLabels()[key] != value {
log.Log.V(2).Info("LabelObject(): Annotate object",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Annotate -> Label

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

err := c.Patch(ctx,
newObj, patch)
if err != nil {
log.Log.Error(err, "annotateObject(): Failed to patch object")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fix func name

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -161,3 +162,40 @@ func AnnotateNode(ctx context.Context, nodeName string, key, value string, c cli

return AnnotateObject(ctx, node, key, value, c)
}

// LabelObject adds label to a kubernetes object
func LabelObject(ctx context.Context, obj client.Object, key, value string, c client.Client) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: keep it internal to the package for now ? since you only use LabelNode in controller.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@@ -228,6 +228,18 @@ func (c *openshiftContext) OpenshiftAfterCompleteDrainNode(ctx context.Context,
return false, err
}

value, exist := mcp.Annotations[consts.MachineConfigPoolPausedAnnotation]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change related to this PR ?

if not, do you prefer to submit a separate PR for it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it from this PR

pkg/daemon/daemon.go Outdated Show resolved Hide resolved
@@ -67,12 +67,17 @@ const (
MachineConfigPoolPausedAnnotationIdle = "Idle"
MachineConfigPoolPausedAnnotationPaused = "Paused"

SriovDevicePluginLabel = "sriovnetwork.openshift.io/device-plugin"
SriovDevicePluginLabelEnabled = "Enabled"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use lower case for label values ?

no special preference on my end. but i seldom see label values that are CamelCase

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -185,42 +173,17 @@ func syncPluginDaemonObjs(ctx context.Context,
data.Data["ReleaseVersion"] = os.Getenv("RELEASEVERSION")
data.Data["ResourcePrefix"] = vars.ResourcePrefix
data.Data["ImagePullSecrets"] = GetImagePullSecrets()
data.Data["NodeSelectorField"] = GetDefaultNodeSelector()
data.Data["NodeSelectorField"] = GetDefaultNodeSelectorForDevicePlugin()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we merge this with the labels we set on sriov network config daemon ?

that way we have sriov device plugin deployed wherever sriov network config daemon is deployed AND has the device-plugin enabled label.

just in case we have leftovers OR we changed the selector for config daemon.

also should we clean this label (and possibly some annot while at it) from node obj if config daemon is not targeting them ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@adrianchiris
Copy link
Collaborator

adrianchiris commented Oct 10, 2024

General comment:

to me it seems (much ?) simpler that the config daemon will be in charge of updating its own node label to evict DP pod.
in this case it just requests cordon from the drain controller.

the drain controller will then handle only what it was supposed to do originally (handle cordon/drain related operations)
which is easier to comprehend IMO.

we should discuss this.

controllers/drain_controller_helper.go Outdated Show resolved Hide resolved
pkg/daemon/daemon.go Outdated Show resolved Hide resolved
@SchSeba SchSeba force-pushed the device_plugin_redesign branch 4 times, most recently from 3d20dff to 46affb0 Compare November 7, 2024 10:07
Copy link
Member

@zeeke zeeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

great design work!

always deploy sriov network device plugin and use a label to enable or
disable it on the nodes

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants