Add watch request timeout to prevent watch request hang #5732
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has not been approved by anyone yet. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files; approvers can indicate their approval by writing a comment.
Codecov Report
Attention: Patch coverage is ...

Additional details and impacted files:
@@ Coverage Diff @@
## master #5732 +/- ##
==========================================
+ Coverage 40.90% 41.58% +0.67%
==========================================
Files 650 655 +5
Lines 55182 55773 +591
==========================================
+ Hits 22573 23191 +618
+ Misses 31171 31076 -95
- Partials 1438 1506 +68
cc @XiShanYongYe-Chang @RainbowMango @ikaven1024 PTAL.
```go
		return nil, err
	case <-time.After(30 * time.Second):
		// If the watch request times out, return an error, and the client will retry.
		return nil, fmt.Errorf("timeout waiting for watch for resource %v in cluster %q", gvr.String(), cluster)
```
@xigang Hi, if a watch request is hanging and causes a timeout, will the hanging watch request continue to exist in the subprocess?
@zhzhuang-zju Yes, there is this issue. When a watch request times out, the goroutine needs to be terminated.
Good point! In that case we have to cancel the context passed to cache.Watch().
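For illustration, here is a minimal sketch of that idea. This is not the code from this PR: watcherProvider, watchWithTimeout, and cancelableWatcher are hypothetical names, and the real per-cluster cache in multi_cluster_cache.go has a different interface. The point is only that the child context handed to cache.Watch() gets cancelled when the timeout fires or when the returned watcher is stopped.

```go
package proxy

import (
	"context"
	"fmt"
	"time"

	metainternalversion "k8s.io/apimachinery/pkg/apis/meta/internalversion"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
)

// watcherProvider is a stand-in for the per-cluster cache used by the proxy.
type watcherProvider interface {
	Watch(ctx context.Context, options *metainternalversion.ListOptions) (watch.Interface, error)
}

// cancelableWatcher cancels the watch's child context when the watcher is
// stopped, so the backing goroutine can be released as well.
type cancelableWatcher struct {
	watch.Interface
	cancel context.CancelFunc
}

func (c *cancelableWatcher) Stop() {
	c.cancel()
	c.Interface.Stop()
}

// watchWithTimeout runs cache.Watch under a cancellable child context and gives
// up if no watcher (or error) arrives within 30 seconds, cancelling the child
// context so a hanging Watch call does not linger forever.
func watchWithTimeout(ctx context.Context, cache watcherProvider, cluster string,
	gvr schema.GroupVersionResource, options *metainternalversion.ListOptions) (watch.Interface, error) {
	watchCtx, cancel := context.WithCancel(ctx)

	type result struct {
		w   watch.Interface
		err error
	}
	resultChan := make(chan result, 1) // buffered so the goroutine can always exit

	go func() {
		w, err := cache.Watch(watchCtx, options)
		resultChan <- result{w: w, err: err}
	}()

	select {
	case r := <-resultChan:
		if r.err != nil {
			cancel()
			return nil, r.err
		}
		return &cancelableWatcher{Interface: r.w, cancel: cancel}, nil
	case <-time.After(30 * time.Second):
		// Cancel the child context so the hanging cache.Watch call (and any
		// watcher it produces later) is stopped, assuming the underlying cache
		// honors context cancellation.
		cancel()
		return nil, fmt.Errorf("timeout waiting for watch for resource %v in cluster %q", gvr.String(), cluster)
	}
}
```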
So this patch intends to terminate the hanging watch by returning an error after a period of time. Is that the idea?
Another question: before starting the Watch, we try to get the cache of that cluster. I'm curious why this cache still exists even after the cluster is gone. Do we have a chance to clean the cache?
karmada/pkg/search/proxy/store/multi_cluster_cache.go
Lines 333 to 336 in e7b6513
```go
	cache := c.cacheForClusterResource(cluster, gvr)
	if cache == nil {
		continue
	}
```
@RainbowMango When the member cluster goes offline but its Cluster resource in the control plane is not deleted, the offline cluster is not removed from the ResourceRegistry, so its resource cache is retained for a short time.
> @xigang Hi, if a watch request is hanging and causes a timeout, will the hanging watch request continue to exist in the subprocess?
@RainbowMango @zhzhuang-zju Fixed, please take a look.
/retest
```go
		return nil, err
	case <-time.After(30 * time.Second):
```
It seems we wait 30s for each cluster. Should we wait for all clusters in parallel?
@ikaven1024 There's no issue here: as long as a single cache.Watch times out, the Watch request will return with an error and end. 😄
But if creating the watch for each cluster takes 20s (not a timeout), the total time spent is 20s * N.
@ikaven1024 I understand, but the code in the current release also traverses all the clusters to create watches, which takes N*M time.
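For reference, starting the per-cluster watches concurrently, as suggested above, could look roughly like the sketch below. This is not code from the PR; it reuses the hypothetical watcherProvider interface from the earlier sketch, belongs to the same hypothetical package, and simplifies error handling to stopping everything on the first failure.

```go
import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/sync/errgroup"
	metainternalversion "k8s.io/apimachinery/pkg/apis/meta/internalversion"
	"k8s.io/apimachinery/pkg/watch"
)

// startWatchesInParallel starts cache.Watch for every cluster concurrently, so
// the setup time is bounded by the slowest cluster instead of the sum over all
// clusters (the 20s * N concern raised above).
func startWatchesInParallel(ctx context.Context, caches map[string]watcherProvider,
	options *metainternalversion.ListOptions) (map[string]watch.Interface, error) {
	var (
		mu       sync.Mutex
		watchers = make(map[string]watch.Interface, len(caches))
	)
	g, gctx := errgroup.WithContext(ctx)

	for cluster, cache := range caches {
		cluster, cache := cluster, cache // capture loop variables for the goroutine
		g.Go(func() error {
			w, err := cache.Watch(gctx, options)
			if err != nil {
				return fmt.Errorf("failed to start watch in cluster %q: %v", cluster, err)
			}
			mu.Lock()
			watchers[cluster] = w
			mu.Unlock()
			return nil
		})
	}

	if err := g.Wait(); err != nil {
		// Clean up any watchers that did start before propagating the error.
		for _, w := range watchers {
			w.Stop()
		}
		return nil, err
	}
	return watchers, nil
}
```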
```go
	errChan := make(chan error, 1)

	go func(cluster string) {
		w, err := cache.Watch(ctx, options)
```
If this watcher is created after 30s, then it seems there is no way to stop it. Is that a leak?
If creating the watcher takes longer than 30 seconds, the time.After case fires, an error is returned, and cancel is called to stop the watcher goroutine.
```go
	defer func() { cancel() }()
```
ok
@RainbowMango @ikaven1024 @zhzhuang-zju Our production environment is running without issues; could you take a look to see if the code can be merged?
karmada/vendor/k8s.io/apiserver/pkg/endpoints/handlers/get.go, lines 262 to 268 in 0cc294f
karmada/vendor/k8s.io/apiserver/pkg/endpoints/handlers/get.go, lines 254 to 260 in 0cc294f
karmada/cmd/karmada-search/app/karmada-search.go, lines 210 to 215 in 0cc294f
... search.Config, with a value of 1800 seconds. It might be worth considering setting this to 30 seconds instead. This context represents the sum of execution times of multiple functions, not only ...
```go
		select {
		case errChan <- fmt.Errorf("failed to start watch for resource %v in cluster %q: %v", gvr.String(), cluster, err):
		case <-ctx.Done():
		}
```
Can case <-ctx.Done() be reached? And is it necessary to respond to the ctx.Done event?
The main purpose here is to handle edge cases. There won’t be any issues even if there is no case <-ctx.Done().
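To spell out the edge case: if more goroutines can send on errChan than its buffer has free slots, a plain send could block forever once the receiving side has already returned. The select on ctx.Done() lets such a sender give up. A minimal illustration of the pattern, not the PR code:

```go
// trySend delivers err on errChan but gives up when ctx is cancelled, so a
// sending goroutine can never stay blocked after the receiver has gone away.
func trySend(ctx context.Context, errChan chan<- error, err error) {
	select {
	case errChan <- err:
	case <-ctx.Done():
		// The receiver already returned and cancelled ctx; drop the error.
	}
}
```

If the buffer is sized so that every possible sender is guaranteed a free slot, the ctx.Done() case can indeed never fire and a plain send would also be safe, which is the readability point raised below.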
If removing the case <-ctx.Done() statement does not affect the actual functionality, it is recommended to remove it to improve code readability.
@zhzhuang-zju The timeout for client-go watch requests is around 10 minutes. The default value of minRequestTimeout is 1800 seconds, and if the client specifies a timeout, minRequestTimeout will not be used. What issues might arise if you set this context for the federation watch and all member clusters' Watch?
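For context, a client can bound a single watch request itself via TimeoutSeconds in the list options; the apiserver only falls back to a value derived from minRequestTimeout when the client does not set one. A small client-go sketch (the namespace and the 600-second value are arbitrary, not taken from this PR):

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchPodsWithTimeout opens a watch with an explicit server-side timeout;
// the server closes the watch after roughly this long and the client
// (for example, a reflector) re-establishes it.
func watchPodsWithTimeout(ctx context.Context, client kubernetes.Interface) error {
	timeout := int64(600) // ~10 minutes, similar to what client-go reflectors pick
	w, err := client.CoreV1().Pods("default").Watch(ctx, metav1.ListOptions{
		TimeoutSeconds: &timeout,
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		fmt.Println("event:", event.Type)
	}
	return nil
}
```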
Can someone tell me how to reproduce this issue? I'd like to reproduce and verify the patch on my side.
Sorry, please ignore the comment. Initially, I thought about setting the timeout for the context to 30 seconds to solve the issue of the watch request hanging. And then, I realized that this approach was not appropriate, so I commented it out.
In our scenario, a member cluster is taken offline and all nodes within the cluster are shut down. At this point, the cluster name still exists in the ResourceRegistry, so the Pod Cacher also remains. Additionally, while the cluster is offline, client Watch long connections continue to access the federate-apiserver. As a result, the clients' Watch requests end up hanging.
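A rough way to observe this from the client side, assuming a kubeconfig that routes requests through the karmada search proxy (the kubeconfig path and timings below are placeholders, not part of the PR):

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig; it should point at the endpoint whose pod
	// watches fan out to member clusters.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/proxy.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	start := time.Now()
	// With a member cluster's nodes shut down but its Cluster object (and the
	// ResourceRegistry entry) still present, this call used to hang; with the
	// patch it should return an error within roughly 30s so the client can retry.
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		fmt.Printf("watch failed after %v: %v\n", time.Since(start), err)
		return
	}
	defer w.Stop()
	fmt.Printf("watch established after %v\n", time.Since(start))
}
```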
What type of PR is this?
/kind bug
What this PR does / why we need it:
When the federate-apiserver's watch request to the member cluster gets stuck, it will cause the watch request from the federated client to get stuck as well.
Which issue(s) this PR fixes:
Fixes #5672
Special notes for your reviewer:
Does this PR introduce a user-facing change?: