Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix minikube start failure #2771

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

kmrmt
Copy link
Contributor

@kmrmt kmrmt commented Dec 10, 2024

Description

Fix minikube start failure at readreplica e2e test.

Related Issue

Versions

  • Vald Version: v1.7.14
  • Go Version: v1.23.3
  • Rust Version: v1.82.0
  • Docker Version: v27.3.1
  • Kubernetes Version: v1.31.2
  • Helm Version: v3.16.2
  • NGT Version: v2.3.4
  • Faiss Version: v1.9.0

Checklist

Special notes for your reviewer

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced Kubernetes cluster management with new commands for volume snapshots.
    • Added configuration for a new read replica in the agent setup.
    • Introduced a new YAML file for Kind cluster setup.
    • New input parameter for GitHub Actions to specify Kind environment requirements.
    • New phony targets for managing Kind volume snapshots.
    • New command added for Docker CLI installation in the build process.
  • Improvements

    • Updated Docker installation process for better integration with the build environment.
    • Enhanced logging options for Minikube startup.
    • Increased timeout and sleep duration for pod readiness checks.
    • Improved specificity in .gitignore for Rust-related files.
    • Adjusted deployment commands for Vald read replicas.
    • Enhanced configuration for health checks and logging in the read replica setup.
  • Bug Fixes

    • Improved readiness checks for Kubernetes pods.
    • Commented out readiness checks for pod deployments in the Vald read replica workflow.
    • Updated logging level for better debugging in the read replica configuration.

Copy link
Contributor

coderabbitai bot commented Dec 10, 2024

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 golangci-lint (1.62.2)

level=warning msg="[config_reader] The configuration option run.skip-dirs is deprecated, please use issues.exclude-dirs."
level=warning msg="[config_reader] The configuration option output.format is deprecated, please use output.formats"
level=warning msg="[config_reader] The configuration option linters.govet.check-shadowing is deprecated. Please enable shadow instead, if you are not using enable-all."
level=warning msg="The linter 'exportloopref' is deprecated (since v1.60.2) due to: Since Go1.22 (loopvar) this linter is no longer relevant. Replaced by copyloopvar."
level=warning msg="The linter 'gomnd' is deprecated (since v1.58.0) due to: The linter has been renamed. Replaced by mnd."
level=warning msg="The linter 'execinquery' is deprecated (since v1.58.0) due to: The repository of the linter has been archived by the owner. "
level=error msg="[linters_context] execinquery: This linter is fully inactivated: it will not produce any reports."

📝 Walkthrough
<details>
<summary>📝 Walkthrough</summary>

## Walkthrough

This pull request introduces significant updates to the development and testing infrastructure for a Kubernetes-based project. The modifications enhance cluster management, particularly for Kind (Kubernetes in Docker) environments. Key changes include the installation of the Docker CLI, configuration for volume snapshot support, updates to GitHub Actions workflows, and preparations for read replica deployments. The changes aim to streamline the setup and testing processes across various configuration files and build scripts.

## Changes

| File | Change Summary |
|------|----------------|
| `Makefile.d/e2e.mk` | Updated readiness checks for metrics-server and agent, replaced `minikube/restart` with `kind/vs/restart`. |
| `Makefile.d/minikube.mk` | Added Docker binary prerequisite, updated Minikube start command with Docker driver and logging options, enabled metrics-server addon. |
| `Makefile.d/tools.mk` | Added new `docker-cli/install` phony target for Docker CLI installation. |
| `dockers/ci/base/Dockerfile` | Added `sudo` package and Docker CLI installation step. |
| `dockers/dev/Dockerfile` | Integrated Docker CLI installation step. |
| `Makefile.d/kind.mk` | Added volume snapshot support, new targets for Kind cluster management, Docker binary dependency. |
| `.github/actions/setup-e2e/action.yaml` | Added `require_kind` input parameter and Kind environment setup step. |
| `.github/workflows/e2e.yaml` | Updated E2E job to use Kind, added read replica deployment step. |
| `.gitignore` | Refined Rust-related ignored files and added new ignore rules. |
| `k8s/debug/kind/e2e.yaml` | New Kubernetes Kind cluster configuration file. |
| `Makefile.d/k8s.mk` | Increased sleep duration before checking read replica pod readiness. |
| `.github/helm/values/values-readreplica.yaml` | Added read replica specific name and volume configuration. |
| `.github/actions/e2e-deploy-vald-readreplica/action.yaml` | Commented out readiness checks for pod deployment. |
| `.github/actions/e2e-deploy-vald/action.yaml` | Updated conditions for waiting on pod readiness. |

## Possibly related PRs

- **#2498**: This PR introduces a new workflow for updating actions, which may relate to the overall management of GitHub Actions in the main PR.
- **#2595**: This PR focuses on automatically generating workflows, which could be relevant to the refactoring of GitHub Actions in the main PR.
- **#2596**: This PR adds a check for context cancellation, which may indirectly relate to the robustness of workflows in the main PR.
- **#2639**: This PR includes various enhancements and refactorings, which may connect with the overall improvements in the main PR.
- **#2670**: This PR refactors the Makefile to use absolute paths, which could relate to the changes in the main PR regarding path management.
- **#2673**: This PR backports changes related to the Makefile, which may connect with the refactoring efforts in the main PR.
- **#2696**: This PR introduces a string sorted topological sort, which may relate to the overall improvements in the main PR.
- **#2699**: This PR focuses on preparing for the release, which may connect with the overall enhancements in the main PR.
- **#2743**: This PR refactors Dockerfiles and updates `.gitattributes`, which may relate to the changes in the main PR.
- **#2745**: This PR backports the changes from PR #2743, which may connect with the refactoring efforts in the main PR.
- **#2773**: This PR refactors GitHub Actions, which is directly relevant to the main PR.
- **#2779**: This PR backports changes from PR #2773, making it directly related to the main PR.

## Suggested labels

`area/internal`

## Suggested reviewers

- hlts2
- vankichi

</details>

Tip

CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command @coderabbitai generate docstrings to have CodeRabbit automatically generate docstrings for your pull request. We would love to hear your feedback on Discord.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@vdaas-ci
Copy link
Collaborator

[CHATOPS:HELP] ChatOps commands.

  • 🙆‍♀️ /approve - approve
  • 🍱 /format - format codes and add licenses
  • /gen-test - generate test codes
  • 🏷️ /label - add labels
  • 🔚 2️⃣ 🔚 /label actions/e2e-deploy - run E2E deploy & integration test

Copy link

cloudflare-workers-and-pages bot commented Dec 11, 2024

Deploying vald with  Cloudflare Pages  Cloudflare Pages

Latest commit: 75d1062
Status: ✅  Deploy successful!
Preview URL: https://de39cedf.vald.pages.dev
Branch Preview URL: https://bugfix-e2e-readreplica-minik.vald.pages.dev

View logs

@kmrmt kmrmt changed the title [WIP] fix minikube start failure fix minikube start failure Dec 11, 2024
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e06133d and 0473919.

⛔ Files ignored due to path filters (1)
  • hack/docker/gen/main.go is excluded by !**/gen/**
📒 Files selected for processing (6)
  • .github/workflows/e2e.yaml (1 hunks)
  • Makefile.d/e2e.mk (1 hunks)
  • Makefile.d/minikube.mk (2 hunks)
  • Makefile.d/tools.mk (1 hunks)
  • dockers/ci/base/Dockerfile (2 hunks)
  • dockers/dev/Dockerfile (1 hunks)
🔇 Additional comments (7)
Makefile.d/minikube.mk (2)

28-30: LGTM: Explicit Docker driver configuration

The changes properly ensure Docker CLI availability and explicitly specify the Docker driver, which helps prevent driver selection issues during Minikube startup.


35-35: Verify metrics-server requirement

The addition of metrics-server addon might be related to the readreplica e2e test requirements.

✅ Verification successful

Metrics-server requirement is verified for readreplica e2e tests

Based on the analysis of the e2e test files, specifically in Makefile.d/e2e.mk, the metrics-server is indeed required for the readreplica e2e tests. The e2e/actions/run/readreplica target explicitly waits for the metrics-server to be available and ready before proceeding with the readreplica tests:

kubectl wait -n kube-system --for=condition=Available deployment/metrics-server --timeout=$(E2E_WAIT_FOR_START_TIMEOUT)
kubectl wait -n kube-system --for=condition=Ready pod -l k8s-app=metrics-server --timeout=$(E2E_WAIT_FOR_START_TIMEOUT)
kubectl wait -n kube-system --for=condition=ContainersReady pod -l k8s-app=metrics-server --timeout=$(E2E_WAIT_FOR_START_TIMEOUT)

Additionally, the readreplica configuration in k8s/readreplica/configmap.yaml shows metrics-related configurations, indicating that metrics monitoring is an integral part of the readreplica functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if readreplica tests depend on metrics-server

# Search for metrics-server usage in e2e tests
rg -l "metrics-server|metrics\.k8s\.io" --type yaml --type go | grep -i "e2e\|test"

# Check if readreplica manifests have resource requirements
rg -l "readreplica.*resources:" --type yaml

Length of output: 131


Script:

#!/bin/bash
# Let's try a broader search to understand the context better

# Search for metrics-server references in all files
rg -l "metrics-server"

# Search for any metrics-related configurations in test files
rg -l "metrics" test/ e2e/

# Look for resource requirements or monitoring in test files
rg -A 5 "resources:|monitoring" test/ e2e/

# Check if there are any readreplica test files
fd readreplica

Length of output: 732


Script:

#!/bin/bash
# Let's check the content of relevant files to understand the metrics-server requirement

# Check Makefile.d/e2e.mk for metrics-server usage context
cat Makefile.d/e2e.mk

# Check readreplica job manifests for resource requirements
rg -A 5 "resources:" charts/vald/templates/index/job/readreplica/
rg -A 5 "resources:" charts/vald-readreplica/

# Check if metrics-server is mentioned in readreplica related files
rg "metrics" k8s/readreplica/
rg "metrics" pkg/index/job/readreplica/

Length of output: 10509

dockers/ci/base/Dockerfile (2)

126-126: LGTM: Docker CLI installation

Adding docker-cli installation aligns with the Minikube prerequisites and supports the fix for Minikube start failure.


75-75: Verify sudo requirement

The addition of sudo package might be related to Docker operations. However, since we're running as root (USER root:root), verify if sudo is actually needed.

dockers/dev/Dockerfile (1)

140-140: LGTM: Consistent Docker CLI installation

Adding docker-cli installation in the dev environment maintains consistency with the CI environment and ensures Minikube works in both contexts.

Makefile.d/tools.mk (1)

264-272: LGTM: Docker CLI installation implementation

The implementation is well-structured and follows the established pattern for tool installations:

  • Uses official Docker repository
  • Handles architecture detection
  • Includes proper cleanup
  • Follows the same pattern as other tool installations
.github/workflows/e2e.yaml (1)

Line range hint 312-316: Verify the necessity of disabling k3d

The change to use Minikube exclusively (by disabling k3d) for this test might indicate an underlying issue that should be investigated.

Let's check if there are any k3d-specific configurations that might be causing issues:

✅ Verification successful

Let me gather more information about the usage of k3d and minikube in the e2e workflow.


Let me check the actual setup-k3d and setup-minikube actions to understand their configurations and potential differences.


Let me check the e2e workflow file to understand the specific test that's being modified and its requirements.


Let me check the specific test configuration and Makefile to understand the minikube setup.


The switch to Minikube for read replica test is justified and necessary

The read replica test specifically requires CSI Driver and Volume Snapshots support for testing the read replica functionality, as evidenced by the configuration in values-readreplica.yaml which uses:

  • storageClass: csi-hostpath-sc
  • snapshot_classname: "csi-hostpath-snapclass"
  • readreplica.enabled: true

Minikube is explicitly configured in minikube.mk to support these features with the commands:

minikube addons enable volumesnapshots
minikube addons enable csi-hostpath-driver

The comment in minikube.mk also confirms this: "Only use this for development related to Volume Snapshots. Usually k3d is faster."

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for k3d-specific configurations
rg -A 5 "k3d" .github/helm/values/
rg -A 5 "k3d" .github/actions/setup-e2e/

Length of output: 1407


Script:

#!/bin/bash
# Check for k3d and minikube usage in e2e workflow
rg -A 5 "setup-e2e" .github/workflows/e2e.yaml

# Check for any k3d or minikube related issues or discussions
rg -i "k3d|minikube" .github/ISSUE_TEMPLATE/
rg -i "k3d|minikube" .github/pull_request_template.md

Length of output: 2255


Script:

#!/bin/bash
# Check k3d and minikube setup actions
cat .github/actions/setup-k3d/action.yaml
cat .github/actions/setup-minikube/action.yaml

# Check for any related test failures or issues
rg -i "failed.*k3d|k3d.*failed" .github/workflows/
rg -i "failed.*minikube|minikube.*failed" .github/workflows/

Length of output: 4113


Script:

#!/bin/bash
# Check the specific e2e test configuration and its context
rg -B 10 -A 10 'require_k3d: "false"' .github/workflows/e2e.yaml

# Check if there are any minikube-specific configurations in the repository
rg -l "minikube" .

Length of output: 1049


Script:

#!/bin/bash
# Check the read replica test configuration
cat .github/helm/values/values-readreplica.yaml

# Check minikube configuration in Makefile
cat Makefile.d/minikube.mk

# Check if this is the only test using minikube
rg "require_minikube: \"true\"" .github/workflows/

Length of output: 3809

Makefile.d/e2e.mk Outdated Show resolved Hide resolved
.github/workflows/e2e.yaml Outdated Show resolved Hide resolved
@vdaas-ci
Copy link
Collaborator

Profile Report

typevald-agent-ngtvald-lb-gatewayvald-discoverervald-manager-index
cpu
heap
other images

Copy link
Contributor

coderabbitai[bot]
coderabbitai bot previously approved these changes Dec 11, 2024
@vdaas-ci
Copy link
Collaborator

Profile Report

typevald-agent-ngtvald-lb-gatewayvald-discoverervald-manager-index
cpu
heap
other images

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
.github/helm/values/values-readreplica.yaml (3)

19-19: Consider reverting debug logging after troubleshooting

While debug logging is helpful for troubleshooting the minikube start failure, it can impact performance and generate excessive logs. Consider adding a TODO to revert this to info level after the issues are resolved.


42-60: Optimize init containers configuration

The init containers configuration could be optimized:

  1. Consider using IfNotPresent instead of Always for the pull policy of init containers
  2. The configuration is repetitive and could be templated

Consider refactoring to reduce duplication, for example:

initContainers:
  defaults: &initContainerDefaults
    type: wait-for
    image: busybox:stable
    imagePullPolicy: IfNotPresent
    sleepDuration: 2
  containers:
    - <<: *initContainerDefaults
      name: wait-for-discoverer
      target: discoverer
    - <<: *initContainerDefaults
      name: wait-for-agent
      target: agent
    - <<: *initContainerDefaults
      name: wait-for-agent-readreplica
      target: agent-readreplica

Line range hint 19-90: Overall changes look well-structured

The modifications form a cohesive solution for the minikube start failure by:

  1. Enabling detailed logging for troubleshooting
  2. Adding appropriate health checks
  3. Ensuring proper startup order through init containers
  4. Configuring read replica volumes correctly

Consider documenting these startup dependencies and timing requirements in the project's operational documentation to help with future troubleshooting and configuration tuning.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d872637 and ce899a4.

📒 Files selected for processing (1)
  • .github/helm/values/values-readreplica.yaml (3 hunks)
🔇 Additional comments (2)
.github/helm/values/values-readreplica.yaml (2)

89-90: Verify volume configuration consistency

The read replica volume configuration looks correct, but let's ensure:

  1. The PVC name matches any existing claims
  2. The configuration aligns with the snapshot class configuration
✅ Verification successful

Volume configuration is consistent with project standards

The volume configuration is correctly structured and aligns with the project's patterns:

  • The volume name vald-agent-readreplica-pvc follows the established naming convention, similar to other volume configurations like vald-agent-ngt-readreplica-pvc
  • The configuration is properly integrated with the snapshot system, as evidenced by the snapshot templates in charts/vald-readreplica/templates/
  • PVC and volume snapshot configurations are consistently defined across the Helm charts
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check volume and snapshot configurations

# Search for related PVC configurations
rg -g '*.yaml' -g '*.yml' 'vald-agent-readreplica-pvc'

# Check for snapshot class references
rg -g '*.yaml' -g '*.yml' 'csi-hostpath-snapclass'

Length of output: 105


Script:

#!/bin/bash
# Let's broaden the search to find any volume-related configurations

# Search for PVC patterns in yaml files
rg -g '*.yaml' -g '*.yml' 'persistentVolumeClaim|volumeClaimTemplates'

# Search for volume name patterns
rg -g '*.yaml' -g '*.yml' 'volume_name:|volumeName:'

# Search for snapshot-related configurations
rg -g '*.yaml' -g '*.yml' 'snapshot|VolumeSnapshot'

Length of output: 2659


22-29: Verify if 120s initial delay is necessary

The 120-second initial delay for both liveness and readiness probes seems quite long. This could significantly impact:

  • Pod startup time
  • Recovery time during failures
  • Overall deployment time

Let's analyze the startup times to determine if this delay is justified:

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ce899a4 and a2ca0c2.

📒 Files selected for processing (5)
  • .github/helm/values/values-readreplica.yaml (3 hunks)
  • .github/workflows/e2e.yaml (2 hunks)
  • Makefile.d/e2e.mk (2 hunks)
  • Makefile.d/k8s.mk (5 hunks)
  • Makefile.d/minikube.mk (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • Makefile.d/minikube.mk
  • Makefile.d/e2e.mk
  • Makefile.d/k8s.mk
  • .github/helm/values/values-readreplica.yaml
🧰 Additional context used
🪛 actionlint (1.7.4)
.github/workflows/e2e.yaml

343-343: property "deploy_vald" is not defined in object type {deploy_vald_readreplica: {conclusion: string; outcome: string; outputs: {pod_name: string}}; setup_e2e: {conclusion: string; outcome: string; outputs: {helm_extra_options: string; image_tags: string}}}

(expression)

🔇 Additional comments (1)
.github/workflows/e2e.yaml (1)

318-318: LGTM: Infrastructure change from Minikube to Kind

The switch to Kind (Kubernetes in Docker) is a good solution to address the Minikube start failure mentioned in the PR objectives.

@@ -347,7 +340,7 @@ jobs:
E2E_TARGET_NAMESPACE=default \
e2e/readreplica
env:
POD_NAME: ${{ steps.deploy_vald_readreplica.outputs.POD_NAME }}
POD_NAME: ${{ steps.deploy_vald.outputs.POD_NAME }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Critical: Invalid step reference in POD_NAME environment variable

The POD_NAME environment variable references steps.deploy_vald.outputs.POD_NAME, but there is no deploy_vald step in this job. This will cause the workflow to fail.

Apply this diff to fix the reference:

-          POD_NAME: ${{ steps.deploy_vald.outputs.POD_NAME }}
+          POD_NAME: ${{ steps.deploy_vald_readreplica.outputs.POD_NAME }}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
POD_NAME: ${{ steps.deploy_vald.outputs.POD_NAME }}
POD_NAME: ${{ steps.deploy_vald_readreplica.outputs.POD_NAME }}
🧰 Tools
🪛 actionlint (1.7.4)

343-343: property "deploy_vald" is not defined in object type {deploy_vald_readreplica: {conclusion: string; outcome: string; outputs: {pod_name: string}}; setup_e2e: {conclusion: string; outcome: string; outputs: {helm_extra_options: string; image_tags: string}}}

(expression)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
.github/workflows/e2e.yaml (1)

343-343: ⚠️ Potential issue

Critical: Invalid step reference in POD_NAME environment variable

The POD_NAME environment variable references steps.deploy_vald.outputs.POD_NAME, but there is no deploy_vald step in this job. This will cause the workflow to fail.

Apply this diff to fix the reference:

-          POD_NAME: ${{ steps.deploy_vald.outputs.POD_NAME }}
+          POD_NAME: ${{ steps.deploy_vald_readreplica.outputs.POD_NAME }}
🧰 Tools
🪛 actionlint (1.7.4)

343-343: property "deploy_vald" is not defined in object type {deploy_vald_readreplica: {conclusion: string; outcome: string; outputs: {pod_name: string}}; setup_e2e: {conclusion: string; outcome: string; outputs: {helm_extra_options: string; image_tags: string}}}

(expression)

🧹 Nitpick comments (3)
.github/workflows/e2e.yaml (1)

344-347: Consider enhancing failure logging.

The failure logging step could be improved by:

  1. Including more context (e.g., pod status, events)
  2. Setting a retention period for the logs

Apply this diff to enhance the logging:

       - name: output logs on failure
         if: failure()
         run: |
-          cat ./gateway.log
+          echo "=== Pod Status ==="
+          kubectl get pods -o wide
+          echo "=== Recent Events ==="
+          kubectl get events --sort-by='.lastTimestamp'
+          echo "=== Gateway Logs ==="
+          cat ./gateway.log
Makefile.d/k8s.mk (2)

Line range hint 135-155: Consider adding validation before deployment.

The deployment target could benefit from pre-deployment validation to ensure manifest correctness.

Add this before the kubectl apply commands:

 k8s/vald/deploy: k8s/vald/manifests
+	@echo "Validating manifests..."
+	@find $(TEMP_DIR)/vald/templates -type f -name "*.yaml" -exec kubectl apply --dry-run=client -f {} \;
 	@echo "Permitting error because there's some cases nothing to apply"

Line range hint 239-270: Enhance pod readiness checks.

The read replica deployment includes good pod readiness checks but could be improved with timeout handling and status reporting.

Apply this diff to enhance the readiness checks:

 	kubectl wait --for=condition=ready pod -l app=vald-agent --timeout=600s
 	kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=vald-readreplica --timeout=600s
+	@echo "=== Deployment Status ==="
+	@kubectl get pods -l app=vald-agent -o wide
+	@kubectl get pods -l app.kubernetes.io/name=vald-readreplica -o wide
+	@if [ $$? -ne 0 ]; then \
+		echo "Deployment failed. Check pod events:"; \
+		kubectl get events --sort-by='.lastTimestamp' | grep -E "vald-agent|vald-readreplica"; \
+		exit 1; \
+	fi
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a2ca0c2 and a0c9435.

📒 Files selected for processing (3)
  • .github/actions/e2e-deploy-vald-readreplica/action.yaml (1 hunks)
  • .github/workflows/e2e.yaml (2 hunks)
  • Makefile.d/k8s.mk (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/actions/e2e-deploy-vald-readreplica/action.yaml
🧰 Additional context used
🪛 actionlint (1.7.4)
.github/workflows/e2e.yaml

343-343: property "deploy_vald" is not defined in object type {deploy_vald_readreplica: {conclusion: string; outcome: string; outputs: {pod_name: string}}; setup_e2e: {conclusion: string; outcome: string; outputs: {helm_extra_options: string; image_tags: string}}}

(expression)

🔇 Additional comments (2)
.github/workflows/e2e.yaml (1)

318-318: Verify Kind cluster configuration for read replica tests.

The change from require_minikube to require_kind needs proper Kind cluster configuration to support the read replica functionality.

✅ Verification successful

Based on the review of the setup-e2e action and helm values configuration, I can now provide a final response:

Kind cluster configuration is properly set up for read replica functionality

The change from require_minikube to require_kind is valid as:

  1. The setup-e2e action includes proper Kind cluster setup via make kind/vs/start
  2. The values-readreplica.yaml configures necessary components:
    • CSI hostpath storage class for persistent volumes
    • Snapshot support via csi-hostpath-snapclass
    • Read replica configuration with proper volume management
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify Kind cluster configuration for read replica support.

# Check if Kind cluster configuration exists
fd -t f "kind.*\.yaml" -x cat {} \; | grep -A 5 "extraMounts\|extraPortMappings"

# Check if Kind cluster has the necessary features enabled
kubectl get nodes -o json | jq -r '.items[].spec.podCIDR'

Length of output: 243


Script:

#!/bin/bash
# Let's check the workflow configuration and related files to understand the Kind setup

# Check the workflow file for Kind setup
cat .github/workflows/e2e.yaml

# Look for any Kind configuration files in the repository
fd -t f "kind|k8s|cluster" -x cat {} \;

# Search for any documentation about Kind setup
fd -t f "README|CONTRIBUTING" -x cat {} \;

Length of output: 115445


Script:

#!/bin/bash
# Let's check the setup-e2e action and helm values to understand the Kind configuration

# Check setup-e2e action
cat .github/actions/setup-e2e/action.yaml

# Check helm values for read replica setup
cat .github/helm/values/values-readreplica.yaml

Length of output: 6583

Makefile.d/k8s.mk (1)

Line range hint 114-134: LGTM: Well-structured manifest generation.

The manifest generation target is well-organized and includes all necessary image repository configurations.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/agent/core/ngt/service/ngt.go (1)

890-892: Return a closed channel instead of nil
Returning nil can be risky if any upstream code waits to receive from the channel. Consider returning a closed channel to avoid blocking or upstream panics.

Example refactor:

890     func (n *ngt) Start(ctx context.Context) <-chan error {
891         if n.isReadReplica {
-            return nil
+            c := make(chan error)
+            close(c) // signals no operations needed
+            return c
         }
         ...
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee564f5 and 84c941b.

📒 Files selected for processing (1)
  • pkg/agent/core/ngt/service/ngt.go (1 hunks)

Copy link

codecov bot commented Dec 23, 2024

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 23.93%. Comparing base (d1e7f4c) to head (75d1062).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/agent/core/ngt/service/ngt.go 0.00% 2 Missing and 1 partial ⚠️
hack/docker/gen/main.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2771   +/-   ##
=======================================
  Coverage   23.92%   23.93%           
=======================================
  Files         546      546           
  Lines       54544    54547    +3     
=======================================
+ Hits        13052    13054    +2     
- Misses      40705    40707    +2     
+ Partials      787      786    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kmrmt kmrmt force-pushed the bugfix/e2e/readreplica-minikube-start-failure branch from 3358169 to a3fd605 Compare December 25, 2024 01:07
kmrmt and others added 7 commits December 25, 2024 10:11
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Kosuke Morimoto <kou.morimoto@gmail.com>
@kmrmt kmrmt force-pushed the bugfix/e2e/readreplica-minikube-start-failure branch from a3fd605 to 75d1062 Compare December 25, 2024 01:12
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
Makefile.d/kind.mk (3)

78-78: 🛠️ Refactor suggestion

Improve reliability of e2e.yaml modification

The current sed command might fail silently if:

  • host.docker.internal is not in /etc/hosts
  • The file pattern isn't found in e2e.yaml
-sed -e 's/apiServerAddress: "127.0.0.1"/apiServerAddress: "$(shell grep host.docker.internal /etc/hosts | cut -f1)"/' $(ROOTDIR)/k8s/debug/kind/e2e.yaml | kind create cluster --name $(NAME)-vs --config - 
+HOST_IP := $(shell grep host.docker.internal /etc/hosts | cut -f1)
+ifeq ($(HOST_IP),)
+	$(error host.docker.internal not found in /etc/hosts)
+endif
+sed -e 's/apiServerAddress: "127.0.0.1"/apiServerAddress: "$(HOST_IP)"/' $(ROOTDIR)/k8s/debug/kind/e2e.yaml > /tmp/e2e_modified.yaml
+test -s /tmp/e2e_modified.yaml || (echo "Failed to modify e2e.yaml" && exit 1)
+kind create cluster --name $(NAME)-vs --config /tmp/e2e_modified.yaml

81-85: 🛠️ Refactor suggestion

Add verification for CRD and controller installation

The installation of CRDs and controller should be verified to ensure successful deployment.

+define verify_resource
+	@echo "Verifying $$1..."
+	@kubectl wait --for=condition=established --timeout=60s crd/$$1 || (echo "Failed to install $$1" && exit 1)
+endef

 kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/$(SNAPSHOTTER_VERSION)/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
+$(call verify_resource,volumesnapshotclasses.snapshot.storage.k8s.io)
 kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/$(SNAPSHOTTER_VERSION)/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
+$(call verify_resource,volumesnapshotcontents.snapshot.storage.k8s.io)
 kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/$(SNAPSHOTTER_VERSION)/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
+$(call verify_resource,volumesnapshots.snapshot.storage.k8s.io)
 kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/$(SNAPSHOTTER_VERSION)/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
 kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/$(SNAPSHOTTER_VERSION)/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
+kubectl -n kube-system wait --for=condition=ready pod -l app=snapshot-controller --timeout=60s || (echo "Snapshot controller not ready" && exit 1)

95-97: 🛠️ Refactor suggestion

Add readiness check for metrics-server

Replace the sleep command with a proper readiness check.

 @make k8s/metrics/metrics-server/deploy
 helm upgrade --install --set args={--kubelet-insecure-tls} metrics-server metrics-server/metrics-server -n kube-system
-sleep $(K8S_SLEEP_DURATION_FOR_WAIT_COMMAND)
+kubectl wait --for=condition=available deployment/metrics-server -n kube-system --timeout=60s || (echo "metrics-server not ready" && exit 1)
🧹 Nitpick comments (3)
Makefile.d/k8s.mk (2)

Line range hint 135-156: Improved deployment process with manifest dependency.

The k8s/vald/deploy target now properly depends on manifest generation and includes comprehensive resource deployment with error handling.

Consider adding a comment explaining why errors are permitted:

-	@echo "Permitting error because there's some cases nothing to apply"
+	@echo "Note: Errors are permitted as some resources might already exist or be unavailable in certain configurations"

Line range hint 239-270: Enhanced readreplica deployment with proper sequencing.

The k8s/vald-readreplica/deploy target includes several improvements:

  • Proper dependency on base deployment
  • Gateway cleanup and redeployment
  • Pod status verification and readiness checks

However, there's a potential race condition in the pod deletion wait.

Consider adding a verification step before proceeding with gateway deletion:

 	kubectl delete -f $(TEMP_DIR)/vald/templates/gateway || true
 	kubectl delete -f $(TEMP_DIR)/vald/templates/gateway/lb || true
+	# Verify no gateway pods are in terminating state before proceeding
+	while kubectl get pods -l app=vald-lb-gateway 2>/dev/null | grep Terminating; do
+		echo "Waiting for gateway pods to terminate..."
+		sleep 5
+	done
 	kubectl wait --for=delete pod -l app=vald-lb-gateway --timeout=600s
pkg/agent/core/ngt/service/ngt.go (1)

890-892: Add test coverage for read replica functionality.

The read replica handling code is not covered by tests. Consider adding test cases to verify the behavior.

Would you like me to help generate test cases for:

  1. Verifying that write operations are blocked on read replicas
  2. Ensuring read operations work correctly on read replicas
  3. Testing the early return in the Start method
🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 891-892: pkg/agent/core/ngt/service/ngt.go#L891-L892
Added lines #L891 - L892 were not covered by tests

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3358169 and 75d1062.

⛔ Files ignored due to path filters (1)
  • hack/docker/gen/main.go is excluded by !**/gen/**
📒 Files selected for processing (17)
  • .github/actions/e2e-deploy-vald-readreplica/action.yaml (1 hunks)
  • .github/actions/e2e-deploy-vald/action.yaml (2 hunks)
  • .github/actions/setup-e2e/action.yaml (2 hunks)
  • .github/helm/values/values-readreplica.yaml (2 hunks)
  • .github/workflows/_detect-ci-container.yaml (1 hunks)
  • .github/workflows/e2e.yaml (2 hunks)
  • .gitignore (1 hunks)
  • Makefile.d/e2e.mk (2 hunks)
  • Makefile.d/k8s.mk (5 hunks)
  • Makefile.d/kind.mk (2 hunks)
  • Makefile.d/minikube.mk (2 hunks)
  • Makefile.d/tools.mk (1 hunks)
  • charts/vald/templates/_helpers.tpl (1 hunks)
  • dockers/ci/base/Dockerfile (2 hunks)
  • dockers/dev/Dockerfile (1 hunks)
  • k8s/debug/kind/e2e.yaml (1 hunks)
  • pkg/agent/core/ngt/service/ngt.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (12)
  • dockers/ci/base/Dockerfile
  • Makefile.d/minikube.mk
  • k8s/debug/kind/e2e.yaml
  • .github/actions/e2e-deploy-vald/action.yaml
  • Makefile.d/tools.mk
  • Makefile.d/e2e.mk
  • dockers/dev/Dockerfile
  • .github/actions/setup-e2e/action.yaml
  • .github/actions/e2e-deploy-vald-readreplica/action.yaml
  • charts/vald/templates/_helpers.tpl
  • .github/helm/values/values-readreplica.yaml
  • .gitignore
🧰 Additional context used
🪛 GitHub Check: codecov/patch
pkg/agent/core/ngt/service/ngt.go

[warning] 891-892: pkg/agent/core/ngt/service/ngt.go#L891-L892
Added lines #L891 - L892 were not covered by tests

🔇 Additional comments (8)
.github/workflows/_detect-ci-container.yaml (1)

44-44: LGTM: Improved tag extraction by removing double quotes.

The addition of sed -e 's/"//g' ensures clean Docker image tag values by removing any double quotes, which is a good practice for tag handling.

.github/workflows/e2e.yaml (2)

338-338: Extended index creation wait duration for better reliability.

The wait duration for index creation has been increased from 3 minutes to 10 minutes. This change provides better reliability for the test by allowing more time for the index creation process to complete.


318-318: Infrastructure change: Switched from minikube to kind.

The change from require_minikube to require_kind aligns with the PR objective to fix the minikube start failure. Using Kind (Kubernetes in Docker) is a good choice as it's more lightweight and faster than minikube.

Run the following script to verify the infrastructure change is consistent across the codebase:

✅ Verification successful

The switch from minikube to kind is consistent across the codebase

The verification shows that there are no remaining references to minikube in any YAML files, while kind is properly configured in the codebase, particularly in:

  • k8s/debug/kind/e2e.yaml - Contains the kind cluster configuration for e2e tests
  • k8s/debug/kind/config.yaml - Contains the kind cluster configuration
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the switch from minikube to kind is consistent.

# Test 1: Check for any remaining minikube references
echo "Checking for remaining minikube references..."
rg -i "minikube" --type yaml

# Test 2: Verify kind configuration
echo "Verifying kind configuration..."
rg -i "kind" --type yaml

Length of output: 67507

Makefile.d/k8s.mk (2)

Line range hint 114-134: LGTM: Well-structured manifest generation target.

The new k8s/vald/manifests target properly generates Kubernetes manifests using Helm template with appropriate image configurations and CRDs.


273-273: LGTM: Proper dependency for readreplica deletion.

The k8s/vald-readreplica/delete target correctly depends on k8s/vald/delete ensuring proper cleanup order.

pkg/agent/core/ngt/service/ngt.go (1)

890-892: LGTM! Early return for read replicas prevents write operations.

The early return for read replicas is a good addition that prevents unnecessary index creation operations on read-only instances.

Run the following script to verify the read replica handling is consistent across the codebase:

✅ Verification successful

Early return for read replicas is consistently implemented across write operations

The early return for read replicas in the Start method is part of a consistent pattern of preventing write operations on read replicas. The codebase shows that:

  • The RegenerateIndexes method also has the same read replica check and returns ErrWriteOperationToReadReplica
  • Other write operations like Insert, Update, Delete, etc. are protected by the IsFlushing() check which prevents concurrent modifications
  • The read replica check is appropriately placed at the start of methods that modify the index
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify read replica handling across write operations
# Test: Search for write operations in read replica context

# Check for write operation methods that should handle read replicas
rg -A 5 'func.*\b(Insert|Update|Delete|Create|Save|Regenerate).*\('

# Check for isReadReplica checks
rg 'isReadReplica'

Length of output: 84523

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 891-892: pkg/agent/core/ngt/service/ngt.go#L891-L892
Added lines #L891 - L892 were not covered by tests

Makefile.d/kind.mk (2)

87-93: 🛠️ Refactor suggestion

Improve CSI driver installation robustness

The CSI driver installation process needs several improvements:

  1. Version should be a variable
  2. Error handling is missing
  3. Cleanup might fail silently
+CSI_DRIVER_VERSION ?= v1.15.0
+CSI_TEMP_DIR := $(shell mktemp -d)
+
+.PHONY: csi/cleanup
+csi/cleanup:
+	@if [ -d "$(CSI_TEMP_DIR)" ]; then \
+		rm -rf "$(CSI_TEMP_DIR)" || (echo "Failed to cleanup CSI temp directory" >&2; exit 1); \
+	fi

-mkdir -p $(TEMP_DIR)/csi-driver-hostpath \
-	&& curl -fsSL https://github.com/kubernetes-csi/csi-driver-host-path/archive/refs/tags/v1.15.0.tar.gz | tar zxf - -C $(TEMP_DIR)/csi-driver-hostpath --strip-components 1 \
-	&& cd $(TEMP_DIR)/csi-driver-hostpath \
-	&& deploy/kubernetes-latest/deploy.sh \
-	&& kubectl apply -f examples/csi-storageclass.yaml \
-	&& kubectl apply -f examples/csi-pvc.yaml \
-	&& rm -rf $(TEMP_DIR)/csi-driver-hostpath
+mkdir -p "$(CSI_TEMP_DIR)" || (echo "Failed to create temp directory" >&2; exit 1)
+curl -fsSL "https://github.com/kubernetes-csi/csi-driver-host-path/archive/refs/tags/$(CSI_DRIVER_VERSION).tar.gz" | \
+	tar zxf - -C "$(CSI_TEMP_DIR)" --strip-components 1 || (echo "Failed to download/extract CSI driver" >&2; $(MAKE) csi/cleanup; exit 1)
+cd "$(CSI_TEMP_DIR)" && \
+	./deploy/kubernetes-latest/deploy.sh || (echo "Failed to deploy CSI driver" >&2; $(MAKE) csi/cleanup; exit 1)
+kubectl apply -f "$(CSI_TEMP_DIR)/examples/csi-storageclass.yaml" || (echo "Failed to apply storage class" >&2; $(MAKE) csi/cleanup; exit 1)
+kubectl apply -f "$(CSI_TEMP_DIR)/examples/csi-pvc.yaml" || (echo "Failed to apply PVC" >&2; $(MAKE) csi/cleanup; exit 1)
+$(MAKE) csi/cleanup

Likely invalid or redundant comment.


17-17: Verify the snapshotter version compatibility

The external-snapshotter version is pinned to v8.2.0. Let's verify this version's compatibility with the cluster version.

✅ Verification successful

External-snapshotter version v8.2.0 is appropriate for the setup

The external-snapshotter version v8.2.0 is the latest stable version and is appropriate for use with the current setup. The implementation includes:

  • Installation of all necessary CRDs and RBAC configurations
  • Deployment of the snapshot controller
  • Integration with CSI driver host path v1.15.0
  • Proper configuration of storage classes and PVCs
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check external-snapshotter version compatibility
# Get the latest stable version and compare with current version
LATEST=$(curl -s https://api.github.com/repos/kubernetes-csi/external-snapshotter/releases/latest | jq -r '.tag_name')
echo "Current version: v8.2.0"
echo "Latest version: $LATEST"

# Check compatibility matrix in README
curl -s https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/README.md | grep -A 5 "Compatibility Matrix"

Length of output: 370


Script:

#!/bin/bash
# Let's check the Kubernetes version used in the project and related configurations
# First, check if Kind version or K8s version is defined in the Makefile
grep -r "KIND_VERSION\|KUBERNETES_VERSION" .

# Also check for any Kind configuration files that might specify K8s version
fd -e yaml -e yml | xargs grep -l "kind:\|kubernetes:"

# Let's also check the README for any version information
cat README.md

Length of output: 33060


Script:

#!/bin/bash
# Let's check the Kind configuration and Kubernetes version used
fd -e yaml -e yml kind | xargs cat

# Also check if there are any version constraints or compatibility info in kind.mk
cat Makefile.d/kind.mk

Length of output: 4769


Script:

#!/bin/bash
# Let's check the e2e.yaml file to see the Kubernetes version used
cat k8s/debug/kind/e2e.yaml

# Also check the csi-driver-host-path version compatibility
curl -s https://raw.githubusercontent.com/kubernetes-csi/csi-driver-host-path/v1.15.0/README.md | grep -A 5 "Compatibility"

Length of output: 881

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants