[Presence] Huge spike in CPU usage/Federation traffic approximately every 25 minutes #15878
Comments
I think this is essentially a dupe of #9478 - I am going to close this in favor of that.
I postulate that this is different and more specific (ignoring that #9478 is actually a meta issue, similar to where this one is housed). I believe that this specific behavior is actually caused by the prefill of presence that is set up when Synapse starts. See synapse/synapse/handlers/presence.py, lines 668 to 691 in b07b14b.
Slightly further down (same function) is where a looping call is set up to run every 5 seconds, after waiting 30 seconds so that it doesn't interfere with the rest of Synapse starting up: synapse/synapse/handlers/presence.py, lines 723 to 735 in b07b14b.
After all of this runs initially, all the timers are based on this one starting point. They are then reset when the timeouts are handled (in this instance, and for now) about 25 minutes later, leading to a perpetual repeat of this situation. I speculate that adjusting the initial value of this timer to spread it over the 25 minute interval (evenly or randomly) would break this load up and prevent the spikes. Subsequent timers would then also be spread out.
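To make the idea of spreading the initial timers concrete, here is a rough standalone sketch. It is not Synapse code: `PING_INTERVAL_MS`, `spread_initial_ping_times` and `next_ping_due` are hypothetical names, and the only figure taken from this thread is the 25 minute interval. It assumes each user carries a "last federation ping" timestamp, and staggers those timestamps backwards over one interval instead of setting them all to "now".

```python
import random

# Assumption for illustration: 25 minutes in milliseconds, matching the
# interval discussed in this issue. The constant name is hypothetical.
PING_INTERVAL_MS = 25 * 60 * 1000


def spread_initial_ping_times(user_ids, now_ms, interval_ms=PING_INTERVAL_MS, rng=None):
    """Return {user_id: last_ping_ms} with timestamps staggered over one interval.

    Users are shuffled so that the ordering of the input does not decide who
    pings first, then spaced evenly so no two users share a due time.
    """
    rng = rng or random.Random()
    users = list(user_ids)
    rng.shuffle(users)
    step = interval_ms / max(len(users), 1)
    # Each user is backdated by a different offset in [0, interval_ms).
    return {user: now_ms - int(i * step) for i, user in enumerate(users)}


def next_ping_due(last_ping_ms, interval_ms=PING_INTERVAL_MS):
    """Time at which the next federation ping for this user becomes due."""
    return last_ping_ms + interval_ms
```

With this kind of prefill, the first round of pings falls due at evenly spaced points across the next 25 minutes rather than all at once, and since each timer is reset from its own due time, later rounds should stay spread out too.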
Fair enough! Thanks for adding an explanation, that makes what you are getting at much clearer - I will re-open.
These images/screenshots of metrics are for a single user (for context). So, what happens when the number of local users climbs to hundreds of thousands? In theory, an exponential increase in traffic. I'm envisioning something like a deduplicating bucket system with an arbitrary timer (say 5 or 10 seconds) that can accumulate presence data for multiple users (or update it if some change comes along within that window) to 'append' to outgoing federation traffic. Then, if another need to send data over federation comes in, it can just check that bucket and send its contents along. If no other request has come in before the timer expires, go ahead and send the presence data by itself. Or some such. I'm open to clever ideas, if anyone would like to suggest something.
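A minimal sketch of what such a bucket could look like, purely as an illustration. `PresenceBucket` and its methods are hypothetical and do not exist in Synapse, and a real version would presumably keep one bucket per destination server. The caller passes in the current time, so the logic stays independent of any particular clock or reactor.

```python
class PresenceBucket:
    """Accumulates the latest presence state per user for a short window."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s   # how long updates may wait before being sent alone
        self._pending = {}         # user_id -> most recent presence state
        self._oldest = None        # time the first pending update arrived

    def add(self, user_id, state, now):
        """Queue an update; a newer state for the same user overwrites the old one."""
        if not self._pending:
            self._oldest = now
        self._pending[user_id] = state

    def drain(self):
        """Take everything pending, e.g. to piggyback on other outgoing traffic."""
        batch, self._pending = self._pending, {}
        self._oldest = None
        return batch

    def flush_if_due(self, now):
        """Return the batch once the oldest update has waited a full window, else {}."""
        if self._pending and now - self._oldest >= self.window_s:
            return self.drain()
        return {}
```

The intended use would be to call `drain()` whenever some other federation request is about to go out, and to call `flush_if_due()` from a periodic timer so that presence is still sent by itself when nothing else comes along.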
There is a timer that sends a ping over federation every 25 minutes, which keeps a user from being marked 'offline' before the 30 minute timeout hits.
This appears to be the replication notifier system ramping up and queueing a bunch of federation sending requests over approximately 1 minute (give or take a few seconds).
There is a database hit during this to get_current_hosts_in_room(). I'm not personally convinced it's contributing to the seriousness of this situation (but it is included here for completeness).
UPDATE: Additional information from the other side of the slash in the title
The large spike in traffic caused by queueing and then sending all those requests looks like this: