Drift Detection and Correction for Cross-Cluster State Management #732

Open
DinaBelova opened this issue Dec 9, 2024 · 5 comments
Assignees
Labels
epic Large body of work, can be broken down into individual issues

Comments

@DinaBelova
Collaborator

Goals

Problem Statement: With the completion of Cross-Cluster State Management Templates, HMC now has centralized ServiceTemplate deployment across clusters. However, without drift detection and correction, clusters risk deviating from intended configurations over time, causing inconsistency and requiring manual interventions to restore alignment.
Epic Goal: Build upon the Cross-Cluster State Management Templates by integrating automated drift detection and correction, utilizing Sveltos to ensure clusters continuously conform to centrally defined ServiceTemplates. This system will enable automated detection, reporting, and correction of configuration drift, increasing consistency and minimizing operational load.

Major deliverables

  • Drift Detection and Correction System for ServiceTemplates:
    • Integrate with Sveltos to enable ContinuousWithDriftDetection on ServiceTemplates, with automated re-syncing to central configurations.

Who it benefits
Customer Business:

  • Ensures clusters remain consistent with central configurations, reducing unplanned downtime and operational instability.
  • Automates manual drift correction efforts, lowering resource requirements for ongoing infrastructure management.

Platform Engineering Teams:

  • Provides real-time visibility into drift events, allowing for faster resolution and reduced monitoring effort.
  • Ensures clusters stay aligned with the intended state, improving system reliability and predictability.

Mirantis:

  • Expands the HMC offering with critical drift detection and correction functionality, increasing value for customers managing complex, large-scale Kubernetes estates.
  • Reduces support demands by enabling proactive configuration management across clusters.

Acceptance criteria

  • Automated Drift Detection on All ServiceTemplate Deployments:
    • Ensure that all ServiceTemplate deployments with syncMode: ContinuousWithDriftDetection detect configuration drift and notify the central management cluster.
  • Automated and Manual Drift Correction Capabilities:
    • Verify that configuration drift can be automatically corrected, with an option for manual override on select templates.
  • Profile-Based Drift Monitoring:
    • Drift detection scope can be limited to specific clusters and namespaces through labels, ensuring targeted monitoring.

Assumptions

  • Platform Leads / Engineers will manage drift detection configurations by creating CRs in the management cluster.
  • Sveltos will be the primary tool for fulfilling drift detection and correction.
  • Customers may optionally use other systems for application configuration management, with the alternative drift detection system supporting those workflows/pipelines.

Limitations

  • Sveltos drift detection capabilities and limitations
  • Performance overhead of continuous drift monitoring at scale
  • Network latency impact on drift detection accuracy
  • Limited ability to detect configuration changes made outside ServiceTemplate system
  • Resource constraints for continuous monitoring across large cluster fleets

Out of scope

  • API or UI development for new configuration interfaces.
  • Performance optimizations for large-scale drift detection.
  • Integration with 2A observability platform until those epics are developed.
    • Centralized Reporting and Alerting of Drift Events
    • Audit Logging for Compliance Tracking

User stories
As a Platform Lead:

  • I want automated drift detection for ServiceTemplates so I can ensure configuration consistency
  • I want to receive notifications when drift occurs so I can investigate root causes
  • I want automatic correction of detected drift so configurations stay aligned with templates
  • I want to override automatic correction for specific templates when manual control is needed

As a Platform Engineer:

  • I want to view drift status across my clusters so I can identify problematic deployments
  • I want to manually trigger drift correction so I can control timing of changes
  • I want to configure drift detection scope so I can focus on critical services
  • I want to exclude certain resources from drift detection so I can allow authorized local changes
@DinaBelova DinaBelova added the epic Large body of work, can be broken down into individual issues label Dec 9, 2024
@DinaBelova DinaBelova changed the title [placeholder] Drift Detection and Correction for Cross-Cluster State Management Drift Detection and Correction for Cross-Cluster State Management Dec 9, 2024
@DinaBelova DinaBelova moved this from Todo to In Progress in Project 2A Dec 19, 2024
@DinaBelova
Collaborator Author

@wahabmk can you please add the most recent research results that you're running atm through sveltos docs?

@wahabmk
Contributor

wahabmk commented Dec 30, 2024

I couldn't find anything in the current Sveltos docs on how to watch for drift and trigger a custom notification, so I studied the code to understand how drift detection is implemented and how we could build a notification mechanism on top of it.

Sveltos Drift Detection & Correction

There are 2 ways to run the "drift-detection-manager" in Sveltos:

In both of these methods, the drift detection CRDs are installed in the managed cluster. The ResourceSummary object (which is part of these CRDs) contains a list of objects to watch for drift and is also created in the managed cluster.

At a high level, drift correction works as follows:

  1. The "drift-detection-manager" watches resources and if it detects changes in the resources (based on hash values), it updates the status of the ResourceSummary object.
  2. The change in the ResourceSummary object triggers the processResourceSummary() function in the "addon-controller", which updates the status of the associated ClusterSummary object. processResourceSummary() runs in a separate goroutine that is started when the ClusterSummaryReconciler is set up.
    • 2a. During this process, the "addon-controller" sets ClusterSummary.status.featureSummaries[].hash = nil. This is important to note because the ClusterSummary object exists in the management cluster, so we could potentially watch it for drift instead of ResourceSummary.
  3. The update to ClusterSummary triggers the ClusterSummary.Reconcile function which then re-deploys the drifted resources using the feature handler for each feature. The feature handlers are defined in the createFeatureHandlerMaps() function.
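The hash comparison in step 1 can be illustrated with a minimal sketch. It is written in Python for brevity and is not Sveltos code: how Sveltos actually computes and stores its hashes is not covered here, and `resource_hash`, `detect_drift`, and the sample objects are hypothetical names chosen for illustration.

```python
import hashlib
import json


def resource_hash(obj: dict) -> str:
    """Return a stable SHA-256 digest of a resource's content."""
    # sort_keys makes the digest independent of dict key ordering.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_drift(recorded_hashes: dict, live_objects: dict) -> list:
    """Return sorted names of resources whose live hash differs from the recorded one.

    A resource that is missing from the live set is also reported as drifted.
    """
    drifted = []
    for name, recorded in recorded_hashes.items():
        live = live_objects.get(name)
        if live is None or resource_hash(live) != recorded:
            drifted.append(name)
    return sorted(drifted)


# Hypothetical state recorded at deploy time.
deployed = {
    "ingress-nginx/controller": {"replicas": 2},
    "kyverno/admission": {"replicas": 1},
}
recorded = {name: resource_hash(obj) for name, obj in deployed.items()}

# Simulate a manual edit made directly on the managed cluster.
live = {
    "ingress-nginx/controller": {"replicas": 5},
    "kyverno/admission": {"replicas": 1},
}
print(detect_drift(recorded, live))
```

In this toy model, the non-empty result is what would correspond to the drift-detection-manager flagging a change in the ResourceSummary status.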

How we can detect drift for notifications

We could use either the ResourceSummary or ClusterSummary object to detect drift, but keep in mind that the source of truth for determining that drift occurred is ResourceSummary.status.helmResourcesChanged=true.

Using ResourceSummary to detect drift

  • HMC will create a watcher for each cluster where we want to check for drift.
  • The watcher will get the kubeconfig for the cluster and watch for changes in ResourceSummary object.
  • When ResourceSummary.status.helmResourcesChanged=true, the watcher can trigger a notification.
  • PROS:
    • Using ResourceSummary is better as it is the source of truth for determining if drift has occurred.
  • CONS:
    • More work to implement in HMC.
    • The watcher might miss the status change, depending on how it's implemented and how quickly Sveltos corrects the drift.
  • NOTE: We may be able to achieve this without making significant changes to HMC by using Sveltos HealthCheck object with a Lua script to detect changes to ResourceSummary. This object already has pre-defined notification mechanisms. See https://projectsveltos.github.io/sveltos/observability/example_crashloopbackoff_notification/ for more details.
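A rough sketch of the watcher logic described above, in Python for brevity. It assumes ResourceSummary snapshots arrive as plain dicts from some watch stream on the managed cluster; the stream itself, `watch_for_drift`, and the `notify` callback are hypothetical and not part of HMC or Sveltos. Only the `status.helmResourcesChanged` field is taken from the discussion above.

```python
from typing import Callable, Iterable


def helm_resources_changed(resource_summary: dict) -> bool:
    """True when the ResourceSummary status reports changed Helm resources."""
    status = resource_summary.get("status") or {}
    return status.get("helmResourcesChanged") is True


def watch_for_drift(events: Iterable[dict], notify: Callable[[str], None]) -> int:
    """Consume ResourceSummary snapshots and notify once per drift episode.

    Repeated snapshots with the flag still set do not re-notify; the episode
    ends when a snapshot with the flag cleared is seen.
    """
    notified = 0
    drifting = set()
    for obj in events:
        meta = obj.get("metadata") or {}
        key = f"{meta.get('namespace', '')}/{meta.get('name', '')}"
        if helm_resources_changed(obj):
            if key not in drifting:
                drifting.add(key)
                notify(key)
                notified += 1
        else:
            drifting.discard(key)
    return notified
```

Note that this only sees the snapshots it is handed. If Sveltos corrects the drift and clears the flag between two observations, the episode is missed, which is the limitation listed under CONS.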

Using ClusterSummary to detect drift

  • We are already watching for changes to ClusterSummary in the HMC controller.
  • Based on the findings in 2a) above, we can determine if a drift occurred with:
prevHash := nil
if hash == nil:
  if isReady == false:
    - can't be sure if there is a drift since cluster is not ready so ignore
  else:
    if prevHash == nil:
      - this means that it is the 1st time that resource is provisioned
      - or that the HMC has (re)started so it is observing the resource for the 1st time
      - so in either case we ignore
    else:
      if status == Provisioning:
        - drift occurred so trigger notification
prevHash := hash
  • We check for isReady == false because Sveltos sets hash = nil and status = Failed when the cluster is not ready.
  • We can use the IsClusterReadyToBeConfigured() function to check if cluster is ready.
  • PROS:
    • Potentially less work to implement in HMC.
    • Also no need to access the managed cluster as the ClusterSummary object already exists in the management cluster.
  • CONS:
    • Somewhat hacky, because it depends on an implementation detail of how Sveltos corrects detected drift.
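The pseudocode above can be turned into a small runnable sketch, shown here in Python for brevity. It assumes one tracker instance per feature summary entry, models the hash as an optional string, and takes the readiness check and status as inputs; `FeatureDriftTracker` and `observe` are hypothetical names, and the "Provisioned" status used in the usage example is an assumption (only "Provisioning" and "Failed" are mentioned above).

```python
from typing import Optional


class FeatureDriftTracker:
    """Tracks the previously observed hash of one ClusterSummary feature summary."""

    def __init__(self) -> None:
        # None means "never observed" (first provisioning or a controller restart).
        self._prev_hash: Optional[str] = None

    def observe(self, hash_value: Optional[str], cluster_ready: bool, status: str) -> bool:
        """Record one observation and return True if it indicates drift."""
        drift = (
            hash_value is None               # hash was reset
            and cluster_ready                # not reset merely because the cluster is unready
            and self._prev_hash is not None  # a real hash was seen before
            and status == "Provisioning"     # Sveltos is re-deploying the feature
        )
        self._prev_hash = hash_value
        return drift


tracker = FeatureDriftTracker()
print(tracker.observe(None, True, "Provisioning"))   # first observation, ignored
print(tracker.observe("abc", True, "Provisioned"))   # steady state
print(tracker.observe(None, True, "Provisioning"))   # hash reset after a known hash
```

Because the previous hash lives in memory, a restart of the controller loses it, so a drift that happens right after a restart goes unreported, as noted in the pseudocode.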

Using metrics exposed by Sveltos

A third option is to use the projectsveltos_total_drifts metric exposed by Sveltos to gain observability over drift. See: https://projectsveltos.github.io/sveltos/getting_started/install/grafanadashboard/#12-drifts.
NOTE: This might be suitable to do as part of the Observability Epic.
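As a rough illustration of consuming that metric without a full Prometheus setup, the Python sketch below sums all projectsveltos_total_drifts samples from scraped text in the Prometheus exposition format. The label name in the sample and the helper name `total_drifts` are hypothetical; the real label set of the metric is not described here.

```python
def total_drifts(exposition_text: str, metric: str = "projectsveltos_total_drifts") -> float:
    """Sum the values of all samples of `metric` in Prometheus text exposition format."""
    total = 0.0
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name = line.split("{", 1)[0]
            fields = line[line.rfind("}") + 1:].split()
        else:
            # name value [timestamp]
            parts = line.split()
            name, fields = parts[0], parts[1:]
        if name == metric and fields:
            total += float(fields[0])
    return total


# Hypothetical scrape output; the "cluster" label is an assumption.
SAMPLE = """
projectsveltos_total_drifts{cluster="dev-1"} 3
projectsveltos_total_drifts{cluster="dev-2"} 1
some_other_metric 7
"""
print(total_drifts(SAMPLE))
```

Comparing the total between two scrapes would indicate how many drifts were detected in the interval, assuming the metric only ever increases.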

UPDATE

Sveltos might add an option to send notifications on detected drift as part of its drift detection mechanism. See: https://projectsveltos.slack.com/archives/C046P825BBL/p1735586788580649.

@wahabmk
Contributor

wahabmk commented Jan 1, 2025

Acceptance criteria

  • Automated Drift Detection on All ServiceTemplate Deployments:
    • Ensure that all ServiceTemplate deployments with syncMode: ContinuousWithDriftDetection detect configuration drift and notify the central management cluster.

Configuring drift detection/correction and applying it to clusters will be implemented in #834.

Notification for drift is not something that Sveltos includes out of the box. The #732 (comment) describes possible approaches we could take to implement it, but we need to try them out as part of working on #835.

  • Automated and Manual Drift Correction Capabilities:
    • Verify that configuration drift can be automatically corrected, with an option for manual override on select templates.

I couldn't find any mechanism in Sveltos to manually trigger drift correction. Based on how drift correction is implemented in Sveltos (summarized in #732 (comment)), the way to manually trigger correction would be to trigger the ClusterSummaryReconciler, but that's not ideal. TODO: Asking on the Sveltos Slack might give us some suggestions.

  • Profile-Based Drift Monitoring:
    • Drift detection scope can be limited to specific clusters and namespaces through labels, ensuring targeted monitoring.

Currently, clusters are targeted for service deployment as follows: a ClusterDeployment matches exactly 1 cluster, whereas a MultiClusterService may match multiple clusters using labels. If both match a particular cluster, the services and drift configuration of the one with the higher priority are applied to that cluster.

@wahabmk
Contributor

wahabmk commented Jan 1, 2025

As a Platform Lead:

  • I want automated drift detection for ServiceTemplates so I can ensure configuration consistency #834
  • I want to receive notifications when drift occurs so I can investigate root causes #835

The options for notification or observability for detected drift are discussed in #732 (comment)

  • I want automatic correction of detected drift so configurations stay aligned with templates #834
  • I want to override automatic correction for specific templates when manual control is needed

The same comment for manually triggering drift correction as in #732 (comment).

@wahabmk
Contributor

wahabmk commented Jan 1, 2025

As a Platform Engineer:

  • I want to view drift status across my clusters so I can identify problematic deployments #835

The options for notification or observability of detected drift are discussed in #732 (comment)

  • I want to manually trigger drift correction so I can control timing of changes

Same comment for manually triggering drift correction as in #732 (comment).

  • I want to configure drift detection scope so I can focus on critical services

If this is referring to scope of drift detection as determined by labels, then same comment applies as in the last point in #732 (comment). But if this is referring to opting certain services out of drift detection, then see the comment below.

  • I want to exclude certain resources from drift detection so I can allow authorized local changes #834
  • In the current implementation, there is no way to opt out of drift detection for, say, 1 out of 3 deployed services. This is because we map ClusterDeployment -> (Sveltos) Profile and MultiClusterService -> (Sveltos) ClusterProfile, and the syncMode: ContinuousWithDriftDetection option applies to all services (which get translated to helm charts on the Sveltos objects), as described in https://projectsveltos.github.io/sveltos/features/configuration_drift/#configuration-drift.

  • We can, however, exclude certain Kubernetes objects deployed by these helm charts with "Ignore Annotation" and "Ignore Fields", if that is what this use case intends. But there is currently no option to opt a helm chart as a whole out of drift detection when syncMode: ContinuousWithDriftDetection is set.

  • One possible workaround is to have the Mirantis official helm chart (which is used by the ServiceTemplate) add the projectsveltos.io/driftDetectionIgnore annotation to all Kubernetes objects if .Values.ignoreDrift=true. Then, to exclude a particular service from drift detection, we can create the ClusterDeployment as in the YAML below. All Kubernetes objects deployed for ingress-nginx would then be ignored for drift detection. However, this workaround relies on the helm chart having been created in advance with support for .Values.ignoreDrift.

apiVersion: hmc.mirantis.com/v1alpha1
kind: ClusterDeployment
metadata:
  name: wali-dev-1
  namespace: hmc-system
spec:
  . . .
  services:
    - template: kyverno-3-2-6
      name: kyverno
      namespace: kyverno
    - template: ingress-nginx-4-11-0
      name: ingress-nginx
      namespace: ingress-nginx
      values: |
        ignoreDrift: true
    - template: cert-manager-1-16-2
      name: cert-manager
      namespace: cert-manager
  syncMode: ContinuousWithDriftDetection
. . .
  • Yet another option to achieve this would be not to map ClusterDeployment -> (Sveltos) Profile but to create a separate Sveltos Profile object for each of the services defined in ClusterDeployment. This is a large change though as it will be a fundamental change in design.
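To make the effect of the ignoreDrift workaround concrete, here is a small Python sketch that models what the chart-side conditional would do to the rendered objects. It is not a Helm template and not HMC code: `apply_ignore_drift` is a hypothetical helper, and the annotation value "ok" is an assumption about what Sveltos expects.

```python
import copy

IGNORE_ANNOTATION = "projectsveltos.io/driftDetectionIgnore"


def apply_ignore_drift(manifests: list, ignore_drift: bool, value: str = "ok") -> list:
    """Return copies of the manifests, annotated to be ignored by drift detection.

    Models a chart that adds the annotation to every object it renders
    when .Values.ignoreDrift is true. The input list is left untouched.
    """
    patched = copy.deepcopy(manifests)
    if not ignore_drift:
        return patched
    for manifest in patched:
        metadata = manifest.setdefault("metadata", {})
        # "annotations" may be absent or explicitly null in a rendered object.
        annotations = metadata.get("annotations") or {}
        annotations[IGNORE_ANNOTATION] = value
        metadata["annotations"] = annotations
    return patched


rendered = [
    {"kind": "Deployment", "metadata": {"name": "ingress-nginx-controller"}},
    {"kind": "Service", "metadata": {"name": "ingress-nginx", "annotations": None}},
]
for obj in apply_ignore_drift(rendered, ignore_drift=True):
    print(obj["kind"], obj["metadata"]["annotations"])
```

The key point is that every object of the chart must carry the annotation, so the opt-out is only as complete as the chart's templates.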

@alex-shl alex-shl added this to k0rdent Jan 3, 2025
@alex-shl alex-shl moved this to In Progress in k0rdent Jan 3, 2025
Projects
Status: In Progress
Development

No branches or pull requests

2 participants