Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: fix a flaky test #8440

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

cjc7373
Copy link
Contributor

@cjc7373 cjc7373 commented Nov 11, 2024

Today I've encountered a very weird test fail (also this run).

log

• [PANICKED] [0.874 seconds]
Cluster Controller [AfterEach] cluster status test cluster conditions when cluster definition non-exist
  [AfterEach] /home/runner/work/kubeblocks/kubeblocks/controllers/apps/cluster_controller_test.go:123
  [It] /home/runner/work/kubeblocks/kubeblocks/controllers/apps/cluster_controller_test.go:929

  Timeline >>
  STEP: clean resources @ 11/11/24 06:36:09.526
  STEP: clear resources Cluster @ 11/11/24 06:36:09.526
  STEP: clear resources ClusterDefinition @ 11/11/24 06:36:09.529
  STEP: clear resources ComponentDefinition @ 11/11/24 06:36:09.532
  STEP: clear resources ComponentVersion @ 11/11/24 06:36:09.535
  STEP: clear resources Component @ 11/11/24 06:36:09.54
  STEP: clear resources PersistentVolumeClaim @ 11/11/24 06:36:09.616
  STEP: clear resources Pod @ 11/11/24 06:36:09.626
  STEP: clear resources Backup @ 11/11/24 06:36:09.629
  STEP: clear resources BackupPolicy @ 11/11/24 06:36:09.647
  STEP: clear resources VolumeSnapshot @ 11/11/24 06:36:09.7
  STEP: clear resources Service @ 11/11/24 06:36:09.705
  STEP: clear resources BackupPolicyTemplate @ 11/11/24 06:36:09.712
  STEP: clear resources ActionSet @ 11/11/24 06:36:09.716
  STEP: clear resources StorageClass @ 11/11/24 06:36:09.722
  STEP: Create a componentDefinition obj @ 11/11/24 06:36:09.735
  STEP: Create a componentVersion obj @ 11/11/24 06:36:09.768
  STEP: Create a clusterDefinition obj @ 11/11/24 06:36:09.799
  STEP: Create a bpt obj @ 11/11/24 06:36:09.846
  STEP: create actionSet @ 11/11/24 06:36:09.846
  STEP: Creating a BackupPolicyTemplate @ 11/11/24 06:36:09.9
  STEP: Wait objects available @ 11/11/24 06:36:09.912
  STEP: create a cluster with cluster definition non-exist @ 11/11/24 06:36:09.945
  STEP: check conditions @ 11/11/24 06:36:09.958
  STEP: clean resources @ 11/11/24 06:36:09.989
  STEP: clear resources Cluster @ 11/11/24 06:36:09.989
  STEP: clear resources ClusterDefinition @ 11/11/24 06:36:10.032
  STEP: clear resources ComponentDefinition @ 11/11/24 06:36:10.058
  STEP: clear resources ComponentVersion @ 11/11/24 06:36:10.136
  STEP: clear resources Component @ 11/11/24 06:36:10.172
  STEP: clear resources PersistentVolumeClaim @ 11/11/24 06:36:10.198
  STEP: clear resources Pod @ 11/11/24 06:36:10.201
  STEP: clear resources Backup @ 11/11/24 06:36:10.203
  STEP: clear resources BackupPolicy @ 11/11/24 06:36:10.206
  STEP: clear resources VolumeSnapshot @ 11/11/24 06:36:10.263
  STEP: clear resources Service @ 11/11/24 06:36:10.27
  STEP: clear resources BackupPolicyTemplate @ 11/11/24 06:36:10.277
  STEP: clear resources ActionSet @ 11/11/24 06:36:10.285
  STEP: clear resources StorageClass @ 11/11/24 06:36:10.292
  STEP: clean resources @ 11/11/24 06:36:10.295
  STEP: clear resources Cluster @ 11/11/24 06:36:10.295
  STEP: clear resources ClusterDefinition @ 11/11/24 06:36:10.304
  STEP: clear resources ComponentDefinition @ 11/11/24 06:36:10.308
  STEP: clear resources ComponentVersion @ 11/11/24 06:36:10.312
  STEP: clear resources Component @ 11/11/24 06:36:10.316
  STEP: clear resources PersistentVolumeClaim @ 11/11/24 06:36:10.345
  STEP: clear resources Pod @ 11/11/24 06:36:10.351
  STEP: clear resources Backup @ 11/11/24 06:36:10.357
  STEP: clear resources BackupPolicy @ 11/11/24 06:36:10.364
  STEP: clear resources VolumeSnapshot @ 11/11/24 06:36:10.383
  STEP: clear resources Service @ 11/11/24 06:36:10.389
  [PANICKED] in [AfterEach] - /home/runner/go/pkg/mod/github.com/onsi/gomega@v1.31.0/internal/async_assertion.go:321 @ 11/11/24 06:36:10.4
  << Timeline

  [PANICKED] Test Panicked
  In [AfterEach] at: /home/runner/go/pkg/mod/github.com/onsi/gomega@v1.31.0/internal/async_assertion.go:321 @ 11/11/24 06:36:10.4

  expected DeletionTimestamp is not nil, obj: {"metadata":{"name":"kubernetes","namespace":"default","uid":"fc241288-27cc-4897-bab9-fb2cf0d3e547","resourceVersion":"2397","creationTimestamp":"2024-11-11T06:36:10Z","labels":{"component":"apiserver","provider":"kubernetes"},"managedFields":[{"manager":"kube-apiserver","operation":"Update","apiVersion":"v1","time":"2024-11-11T06:36:10Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:labels":{".":{},"f:component":{},"f:provider":{}}},"f:spec":{"f:clusterIP":{},"f:internalTrafficPolicy":{},"f:ipFamilyPolicy":{},"f:ports":{".":{},"k:{\"port\":443,\"protocol\":\"TCP\"}":{".":{},"f:name":{},"f:port":{},"f:protocol":{},"f:targetPort":{}}},"f:sessionAffinity":{},"f:type":{}}}}]},"spec":{"ports":[{"name":"https","protocol":"TCP","port":443,"targetPort":42583}],"clusterIP":"10.0.0.1","clusterIPs":["10.0.0.1"],"type":"ClusterIP","sessionAffinity":"None","ipFamilies":["IPv4"],"ipFamilyPolicy":"SingleStack","internalTrafficPolicy":"Cluster"},"status":{"loadBalancer":{}}}

  Full Stack Trace
    github.com/onsi/gomega/internal.(*AsyncAssertion).buildActualPoller.func3.1()
    	/home/runner/go/pkg/mod/github.com/onsi/gomega@v1.31.0/internal/async_assertion.go:321 +0x1c5
    panic({0x339c2c0?, 0xc0054e6ca0?})
    	/opt/hostedtoolcache/go/1.21.13/x64/src/runtime/panic.go:914 +0x21f
    github.com/apecloud/kubeblocks/pkg/testutil/apps.ClearResourcesWithRemoveFinalizerOption[...].func1()
    	/home/runner/work/kubeblocks/kubeblocks/pkg/testutil/apps/common_util.go:338 +0x554
    reflect.Value.call({0x33a8f40?, 0xc004af79e0?, 0x2540be400?}, {0x394aed3, 0x4}, {0xc004f5f920, 0x1, 0x76?})
    	/opt/hostedtoolcache/go/1.21.13/x64/src/reflect/value.go:596 +0xce7
    reflect.Value.Call({0x33a8f40?, 0xc004af79e0?, 0xc00006e000?}, {0xc004f5f920?, 0x6cf1db0d3b?, 0x[380](https://github.com/apecloud/kubeblocks/actions/runs/11773368367/job/32790184670#step:6:381)ae40?})
    	/opt/hostedtoolcache/go/1.21.13/x64/src/reflect/value.go:380 +0xb9
    github.com/onsi/gomega/internal.(*AsyncAssertion).buildActualPoller.func3()
    	/home/runner/go/pkg/mod/github.com/onsi/gomega@v1.31.0/internal/async_assertion.go:325 +0x11f
    github.com/onsi/gomega/internal.(*AsyncAssertion).match(0xc0003ae9a0, {0x3d8d1d8?, 0x555c3e0}, 0x1, {0x0, 0x0, 0x0})
    	/home/runner/go/pkg/mod/github.com/onsi/gomega@v1.31.0/internal/async_assertion.go:398 +0x179
    github.com/onsi/gomega/internal.(*AsyncAssertion).Should(0xc0003ae9a0, {0x3d8d1d8, 0x555c3e0}, {0x0, 0x0, 0x0})
    	/home/runner/go/pkg/mod/github.com/onsi/gomega@v1.31.0/internal/async_assertion.go:145 +0x86
    github.com/apecloud/kubeblocks/pkg/testutil/apps.ClearResourcesWithRemoveFinalizerOption[...](0x5527a20, 0x1, 0x1?, {0xc0054e6a10, 0x1, 0x1})
    	/home/runner/work/kubeblocks/kubeblocks/pkg/testutil/apps/common_util.go:352 +0x565
    github.com/apecloud/kubeblocks/controllers/apps.glob..func1.2()
    	/home/runner/work/kubeblocks/kubeblocks/controllers/apps/cluster_controller_test.go:110 +0x55e
    github.com/apecloud/kubeblocks/controllers/apps.glob..func1.4()
    	/home/runner/work/kubeblocks/kubeblocks/controllers/apps/cluster_controller_test.go:124 +0x13

At first I thought it was because of a stale read. The failed code is:

g.Expect(testCtx.Cli.DeleteAllOf(testCtx.Ctx, PT(&obj), opts...)).ShouldNot(gomega.HaveOccurred())
g.Expect(testCtx.Cli.List(testCtx.Ctx, PL(&objList), listOptions...)).Should(gomega.Succeed())
items := reflect.ValueOf(&objList).Elem().FieldByName("Items").Interface().([]T)
for _, obj := range items {
	pobj := PT(&obj)
	if pobj.GetDeletionTimestamp().IsZero() {
		d, _ := json.Marshal(pobj)
		panic("expected DeletionTimestamp is not nil, obj: " + string(d))
	}

Although testCtx.cli here won't use client-go's cache, it still suffers from apiserver's stale cache. But after some investigation, I know that if resoureceVersion parameter is not set in the list request, it is guaranteed that returned data will be at the most recent resource version (i.e. read directly from etcd).

So I suspect it was because we have deleted the kubernetes service, and it would be recreated by apiserver after deleted. This behaviour can be verified by a simple test:

FIt("debug", func() {
  inNS := client.InNamespace(testCtx.DefaultNamespace)
  oldSvc := &corev1.Service{}
  nn := client.ObjectKey{Namespace: "default", Name: "kubernetes"}
  Expect(testCtx.Cli.Get(testCtx.Ctx, nn, oldSvc)).Should(Succeed())
  fmt.Println(oldSvc)
  testapps.ClearResourcesWithRemoveFinalizerOption(&testCtx, generics.ServiceSignature, true, inNS)
  Eventually(testapps.CheckObj(&testCtx, nn, func(g Gomega, svc *corev1.Service) {
	  g.Expect(svc.DeletionTimestamp.IsZero()).Should(BeTrue())
	  g.Expect(svc.UID).ShouldNot(Equal(oldSvc.UID))
	  fmt.Println(svc)
  })).Should(Succeed())
})

So here's the fix.

Copy link

codecov bot commented Nov 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.40%. Comparing base (21fa706) to head (f7dd0e2).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8440      +/-   ##
==========================================
+ Coverage   60.97%   61.40%   +0.43%     
==========================================
  Files         351      351              
  Lines       41764    41855      +91     
==========================================
+ Hits        25465    25702     +237     
+ Misses      13987    13840     -147     
- Partials     2312     2313       +1     
Flag Coverage Δ
unittests 61.40% <ø> (+0.43%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions github-actions bot added the size/XS Denotes a PR that changes 0-9 lines. label Nov 11, 2024
@cjc7373 cjc7373 marked this pull request as ready for review November 11, 2024 11:02
@cjc7373 cjc7373 requested review from leon-inf, Y-Rookie and a team as code owners November 11, 2024 11:02
Copy link
Contributor

@zjx20 zjx20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! These errors typically occur when a controller creates a new resource after the test code calling DeleteAllOf(), causing subsequent checks to fail. Such errors are very difficult to debug. This PR looks good to me, but I would like to ask you to check other places where ClearResources* is called to ensure they all include the ml selector to prevent similar errors.

controllers/apps/cluster_controller_test.go Show resolved Hide resolved
@apecloud-bot apecloud-bot added the approved PR Approved Test label Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved PR Approved Test size/XS Denotes a PR that changes 0-9 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants