Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add affinity to jobTemplate #2758

Merged

Conversation

takuyaymd
Copy link
Contributor

@takuyaymd takuyaymd commented Dec 2, 2024

Description

Add affinity, nodeSelector and torelations to jobTemplate in manager/index (corrector, creator, saver).

Related Issue

Versions

  • Vald Version: v1.7.14
  • Go Version: v1.23.3
  • Rust Version: v1.82.0
  • Docker Version: v27.3.1
  • Kubernetes Version: v1.31.2
  • Helm Version: v3.16.2
  • NGT Version: v2.3.4
  • Faiss Version: v1.9.0

Checklist

Special notes for your reviewer

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced deployment flexibility with new affinity configurations for corrector, creator, and saver components.
    • Introduced nodeSelector and tolerations options for more granular pod scheduling control.
    • Expanded metrics and observability properties for improved monitoring and tracing.
  • Improvements

    • Updated configuration options in the values.yaml file for better resource management and deployment strategies.
    • Enhanced job template configurability with support for affinity, nodeSelector, and tolerations.

Copy link
Contributor

coderabbitai bot commented Dec 2, 2024

📝 Walkthrough
📝 Walkthrough

Walkthrough

The changes in this pull request introduce significant enhancements to the ValdRelease custom resource and related Helm chart configurations. Key updates include the addition of affinity, nodeSelector, and tolerations properties for various components, improving pod scheduling flexibility. The Helm templates and values files have been updated to support these new configurations, allowing users to define scheduling constraints more granularly. Additionally, the discoverer field has been added to enhance integration with the discoverer service, improving operational efficiency.

Changes

File Path Change Summary
charts/vald-helm-operator/crds/valdrelease.yaml Added affinity, nodeSelector, and tolerations properties in spec.corrector, spec.creator, spec.saver, and spec.manager for ValdRelease.
charts/vald/templates/_helpers.tpl Introduced affinity, nodeSelector, and tolerations keys in the job template definition for enhanced job configuration.
charts/vald/values.schema.json Added new properties for affinity, nodeSelector, and tolerations in the corrector, creator, and saver sections of the JSON schema.
charts/vald/values.yaml Updated manager.index.corrector and manager.index.creator sections to include nodeSelector, tolerations, affinity, and a new discoverer field.

Warning

Rate limit exceeded

@takuyaymd has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 10 minutes and 7 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 59f1a66 and 1736fc7.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@takuyaymd takuyaymd force-pushed the feature/manager-index/add-affinities-to-job-fix branch from 59f1a66 to 1736fc7 Compare December 2, 2024 09:22
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
charts/vald/values.schema.json (4)

14074-14128: Schema structure follows Kubernetes conventions but could be more specific

The addition of affinity, nodeSelector, and tolerations configurations across corrector, creator, and saver components is well-structured and follows Kubernetes API conventions. However, there are opportunities for improvement:

  1. The schema uses generic { "type": "object" } for complex nested structures in affinity configurations. Consider defining these more explicitly to provide better validation and auto-completion support.
  2. The descriptions could be more detailed to guide users on proper usage.

Consider enhancing the schema with more specific definitions, for example:

 "preferredDuringSchedulingIgnoredDuringExecution": {
   "type": "array",
   "description": "node affinity preferred scheduling terms",
-  "items": { "type": "object" }
+  "items": {
+    "type": "object",
+    "required": ["weight", "preference"],
+    "properties": {
+      "weight": {
+        "type": "integer",
+        "description": "Weight associated with matching the corresponding nodeSelectorTerm"
+      },
+      "preference": {
+        "type": "object",
+        "properties": {
+          "matchExpressions": {
+            "type": "array",
+            "items": {
+              "type": "object",
+              "required": ["key", "operator"],
+              "properties": {
+                "key": { "type": "string" },
+                "operator": { "type": "string" },
+                "values": {
+                  "type": "array",
+                  "items": { "type": "string" }
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
 }

Also applies to: 15230-15233, 16452-16456, 16471-16525, 17280-17283, 18502-18506, 22246-22300, 23051-23054, 24273-24277


14074-14128: Consider documenting recommended affinity configurations

The addition of affinity configurations to index management components (corrector, creator, saver) provides valuable control over pod scheduling. However, users would benefit from guidance on recommended configurations for different scenarios.

Consider:

  1. Adding examples in the documentation for common use cases:
    • Co-locating index management jobs with specific agent pods
    • Spreading jobs across nodes for load balancing
    • Ensuring jobs run on nodes with specific hardware (e.g., SSD storage)
  2. Documenting best practices for affinity rules to prevent scheduling conflicts between components

Also applies to: 16471-16525, 22246-22300


15230-15233: Enhance nodeSelector description with examples

The nodeSelector schema is correctly implemented but could benefit from a more descriptive comment to guide users.

Consider enhancing the description:

 "nodeSelector": {
   "type": "object",
-  "description": "node selector"
+  "description": "nodeSelector is the simplest recommended form of node selection constraint. nodeSelector is a field of PodSpec that specifies a map of key-value pairs. For the pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels (it can have additional labels as well). Example: { \"disktype\": \"ssd\", \"environment\": \"production\" }"
 }

Also applies to: 17280-17283, 23051-23054


16452-16456: Define specific schema for toleration objects

The tolerations schema could be more specific to provide better validation and documentation.

Consider enhancing the schema:

 "tolerations": {
   "type": "array",
   "description": "tolerations",
-  "items": { "type": "object" }
+  "items": {
+    "type": "object",
+    "properties": {
+      "key": {
+        "type": "string",
+        "description": "The taint key that the toleration applies to"
+      },
+      "operator": {
+        "type": "string",
+        "description": "Operator represents a key's relationship to the value",
+        "enum": ["Equal", "Exists"]
+      },
+      "value": {
+        "type": "string",
+        "description": "Value is the taint value the toleration matches to"
+      },
+      "effect": {
+        "type": "string",
+        "description": "Effect indicates the taint effect to match",
+        "enum": ["NoSchedule", "PreferNoSchedule", "NoExecute"]
+      },
+      "tolerationSeconds": {
+        "type": "integer",
+        "description": "TolerationSeconds represents the period of time a toleration tolerates the taint"
+      }
+    }
+  }
 }

Also applies to: 18502-18506, 24273-24277

charts/vald-helm-operator/crds/valdrelease.yaml (1)

Line range hint 8409-14284: Consider adding schema validation examples

While the schema is technically correct, it would be helpful to add examples in the CRD documentation to demonstrate proper usage of these scheduling controls.

Example:

# Example affinity configuration
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/e2e-az-name
          operator: In
          values:
          - e2e-az1
          - e2e-az2
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3d51a12 and 59f1a66.

📒 Files selected for processing (4)
  • charts/vald-helm-operator/crds/valdrelease.yaml (9 hunks)
  • charts/vald/templates/_helpers.tpl (2 hunks)
  • charts/vald/values.schema.json (9 hunks)
  • charts/vald/values.yaml (3 hunks)
🔇 Additional comments (10)
charts/vald/values.schema.json (1)

Line range hint 14074-24277: Document security implications of scheduling configurations

The addition of scheduling controls (affinity, nodeSelector, tolerations) could have security implications if not properly configured.

Consider adding security documentation that covers:

  1. How these configurations affect pod isolation
  2. Best practices for maintaining security boundaries
  3. Recommendations for production deployments

Run the following to check for existing security documentation:

charts/vald/templates/_helpers.tpl (3)

899-902: LGTM! Clean affinity implementation

The affinity section is well-implemented using the existing vald.affinity template for consistency across components.


936-939: LGTM! Clean nodeSelector implementation

The nodeSelector section is properly implemented using toYaml for map rendering.


940-943: LGTM! Clean tolerations implementation

The tolerations section is properly implemented using toYaml for array rendering.

charts/vald/values.yaml (3)

3430-3453: LGTM! Well-structured scheduling configurations for corrector job

The scheduling configurations (nodeSelector, tolerations, affinity) are properly structured with appropriate empty defaults that can be configured per deployment needs.


3566-3589: LGTM! Well-structured scheduling configurations for creator job

The scheduling configurations mirror the corrector job structure, maintaining consistency across components.


3696-3719: LGTM! Well-structured scheduling configurations for saver job

The scheduling configurations maintain consistency with other index manager components, following the same pattern.

charts/vald-helm-operator/crds/valdrelease.yaml (3)

8409-8453: LGTM: Comprehensive affinity configuration added

The affinity configuration is well-structured and complete, including all necessary scheduling controls:

  • nodeAffinity with both preferred and required scheduling rules
  • podAffinity for co-location rules
  • podAntiAffinity for anti-co-location rules

Also applies to: 9778-9822, 13121-13165


9085-9087: LGTM: NodeSelector schema properly defined

The nodeSelector fields are correctly defined with x-kubernetes-preserve-unknown-fields to allow arbitrary node labels for flexible scheduling.

Also applies to: 10258-10260, 13599-13601


9765-9769: LGTM: Tolerations schema properly defined

The tolerations array schema is correctly defined with x-kubernetes-preserve-unknown-fields to support all Kubernetes toleration configurations.

Also applies to: 10939-10943, 14280-14284

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
charts/vald/values.schema.json (3)

14074-14128: Consider documenting recommended affinity configurations for corrector

The scheduling configurations are well-structured, but since the corrector handles index corrections, consider:

  • Adding examples in documentation for recommended affinity rules to:
    • Co-locate with related index pods for better performance
    • Ensure high availability across zones
    • Handle resource-intensive correction operations

Would you like me to help draft documentation with these recommendations?

Also applies to: 15230-15233, 16452-16456


16471-16525: Consider performance implications of creator scheduling

The scheduling configurations are well-implemented. For the creator component:

  • Consider documenting recommended nodeSelector patterns for nodes with:
    • Higher CPU/memory resources for efficient index creation
    • Local SSDs for better I/O performance

Would you like me to help draft performance-oriented configuration examples?

Also applies to: 17280-17283, 18502-18506


22246-22300: Consider data locality for saver scheduling

The scheduling configurations are well-implemented. For the saver component:

  • Consider documenting recommended affinity rules for:
    • Co-location with persistent storage
    • Ensuring index saves occur close to where data is stored
    • High availability patterns for saved indices

Would you like me to help draft data locality-oriented configuration examples?

Also applies to: 23051-23054, 24273-24277

charts/vald-helm-operator/crds/valdrelease.yaml (2)

8409-8453: LGTM! The affinity schema implementation looks correct.

The affinity configuration follows Kubernetes standards and is consistently implemented across corrector, creator, and saver components.

Consider adding description fields to document the purpose and usage of these scheduling features. For example:

 affinity:
+  description: "Scheduling constraints to control pod placement using node/pod affinity rules"
   type: object
   properties:
     nodeAffinity:
+      description: "Constraints for scheduling pods on specific nodes based on node labels"
       type: object

Also applies to: 9778-9822, 13121-13165


Line range hint 8409-14284: Consider documenting recommended scheduling patterns

These scheduling features enable powerful deployment scenarios. Consider documenting recommended patterns for:

  • Using pod affinity to co-locate related components for better performance
  • Using pod anti-affinity to spread components for high availability
  • Using nodeSelector for hardware-specific requirements (e.g., GPU nodes)
  • Using tolerations to allow deployment on specialized nodes
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 59f1a66 and 1736fc7.

📒 Files selected for processing (4)
  • charts/vald-helm-operator/crds/valdrelease.yaml (9 hunks)
  • charts/vald/templates/_helpers.tpl (2 hunks)
  • charts/vald/values.schema.json (9 hunks)
  • charts/vald/values.yaml (3 hunks)
🔇 Additional comments (10)
charts/vald/values.schema.json (2)

14074-14128: Schema structure follows Kubernetes conventions correctly

The schema implementation for affinity, nodeSelector, and tolerations:

  • Correctly matches Kubernetes API object structures
  • Maintains consistency across all components (corrector, creator, saver)
  • Includes appropriate descriptions for all fields

Also applies to: 15230-15233, 16452-16456, 16471-16525, 17280-17283, 18502-18506, 22246-22300, 23051-23054, 24273-24277


Line range hint 14074-24277: Verify schema completeness against Kubernetes API

Let's verify that all standard Kubernetes scheduling fields are included.

✅ Verification successful

All standard Kubernetes scheduling fields are properly defined in the schema

Based on the verification results, the schema includes all required Kubernetes scheduling fields:

  • nodeSelector with proper type and description
  • tolerations with proper structure and description
  • affinity with all required subfields:
    • nodeAffinity (including preferred/required scheduling terms)
    • podAffinity (including preferred/required scheduling terms)
    • podAntiAffinity (including preferred/required scheduling terms)

Each field has appropriate descriptions and follows the standard Kubernetes scheduling schema structure.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify schema completeness against Kubernetes scheduling fields

# Check if we have all standard Kubernetes scheduling fields
echo "Checking for required scheduling fields..."
required_fields=("affinity" "nodeSelector" "tolerations")
for field in "${required_fields[@]}"; do
  echo "Checking $field..."
  rg -i "\"$field\".*:.*{?\s*\"type\"" charts/vald/values.schema.json
done

# Verify affinity subfields
echo "Checking affinity subfields..."
affinity_fields=("nodeAffinity" "podAffinity" "podAntiAffinity")
for field in "${affinity_fields[@]}"; do
  echo "Checking $field..."
  rg -i "\"$field\".*:.*{?\s*\"type\"" charts/vald/values.schema.json
done

Length of output: 2095


Script:

#!/bin/bash
# Description: Verify schema completeness against Kubernetes scheduling fields with fixed regex

# Check if we have all standard Kubernetes scheduling fields
echo "Checking for required scheduling fields..."
required_fields=("affinity" "nodeSelector" "tolerations")
for field in "${required_fields[@]}"; do
  echo "Checking $field..."
  rg -i "\"${field}\".*type" charts/vald/values.schema.json
done

# Verify affinity subfields
echo "Checking affinity subfields..."
affinity_fields=("nodeAffinity" "podAffinity" "podAntiAffinity")
for field in "${affinity_fields[@]}"; do
  echo "Checking $field..."
  rg -i "\"${field}\".*type" charts/vald/values.schema.json
done

# Also check for any scheduling related properties
echo "Checking overall scheduling properties..."
rg -i "scheduling|affinity|selector|toleration" charts/vald/values.schema.json

Length of output: 15576

charts/vald/templates/_helpers.tpl (3)

899-902: LGTM: Affinity configuration properly implemented

The affinity configuration is correctly added to the job template following Kubernetes best practices. The implementation allows for flexible pod scheduling control through nodeAffinity, podAffinity, and podAntiAffinity settings.


936-939: LGTM: NodeSelector configuration properly implemented

The nodeSelector configuration is correctly added to the job template, allowing for node-level scheduling constraints.


940-943: LGTM: Tolerations configuration properly implemented

The tolerations configuration is correctly added to the job template, enabling pods to be scheduled on nodes with matching taints.

charts/vald/values.yaml (3)

3430-3453: LGTM: Corrector job scheduling configuration properly implemented

The scheduling configuration (affinity, nodeSelector, tolerations) for the corrector job is well-structured with appropriate defaults and helpful comments.


3566-3589: LGTM: Creator job scheduling configuration properly implemented

The scheduling configuration (affinity, nodeSelector, tolerations) for the creator job is well-structured with appropriate defaults and helpful comments.


3696-3719: LGTM: Saver job scheduling configuration properly implemented

The scheduling configuration (affinity, nodeSelector, tolerations) for the saver job is well-structured with appropriate defaults and helpful comments.

charts/vald-helm-operator/crds/valdrelease.yaml (2)

9085-9087: LGTM! The nodeSelector schema is properly defined.

The nodeSelector field is correctly implemented with x-kubernetes-preserve-unknown-fields to allow arbitrary label selectors.

Also applies to: 10258-10260, 13599-13601


9765-9769: LGTM! The tolerations schema follows Kubernetes standards.

The tolerations array is properly defined to allow standard Kubernetes toleration objects.

Also applies to: 10939-10943, 14280-14284

@kmrmt kmrmt merged commit 9bd9034 into vdaas:main Dec 3, 2024
172 of 176 checks passed
@takuyaymd takuyaymd deleted the feature/manager-index/add-affinities-to-job-fix branch December 3, 2024 05:07
vdaas-ci pushed a commit that referenced this pull request Dec 3, 2024
kmrmt pushed a commit that referenced this pull request Dec 3, 2024
Co-authored-by: takuya <49614391+takuyaymd@users.noreply.github.com>
This was referenced Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants