This repository has been archived by the owner on Feb 12, 2022. It is now read-only.
Detect and recycle broken connections on non-GCM devices #62
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I apologize in advance for the length of this message, but I've tried to be as thorough as possible in documenting this issue. To make review easier, I've split the documentation into separate sections and have added a summary, so that you can read the details as required, but please note the "Testing" section in particular, as it details the painstaking testing I've done to ensure proper operation of the patch, which might alleviate the concern that this PR will somehow adversly affect Signal's operation.
Summary
Network changes can introduce prolonged periods of connectivity loss on GCM-disabled devices (anywhere from 20 minutes to a day or more). This PR, along with its sibling on the
Signal-Android
side, implement a couple of mechanism that detect network changes in particular and broken connections in general and recycle the connection in response. A related fix is also introduced, to keepSignalServiceMessageSender
up-to-date with the current message pipe at all times.Issue
When GCM is not available and persistent socket connections need to be maintained, network changes (e.g. switching from WiFi to cellular data) can result in prolonged loss of connectivity. To be more precise, the device is still able to send messages, but not receive responses and receipts, or place or receive any calls. While it is hard to be sure about the details of the failure mode, my theory is the following (see "Documentation" below for the observations this is based on):
When the connection changes, typically so does the device's IP address. Any existing sockets will therefore have the wrong source address, effectively left dangling at the source side, unless they're somehow destroyed. On some devices, this seems to be taken care of, by the
Netd
system daemon, so that the websocket is killed (and hence automatically recycled) as soon as the connection changes. On other devices, including mine and seemingly those of many other users too, this does not happen. The socket then remains dangling, until the error is belatedly detected, which in my case can take anywhere from 20 minutes (when I try to send messages, probably because the attempt to use the socket somehow brings about the detection of its state) to more than a day (when Signal is left undisturbed while its connection is broken). Once the error is detected, an exception is thrown and the socket is closed and recycled.While the socket is dangling, the device can still send messages, as it falls back to a separate connection, which seems to be established on the spot.
Patch
The current patch attempts to remedy this problem via a two-pronged approach: When GCM is not available, it equips the
MessageRetrievalService
with a receiver/callback (depending on Android version), that listens for connectivity changes and attempts to detect a change in the active network. When it does so, it signals the message retrieval thread to cycle and re-establish the connection. Since both this thread and the associated keep-alive sender thread are blocked for prolonged periods of time, Java's thread interrupt mechanism is used to ensure smooth operation.In addition to that, a secondary detection mechanism is implemented, based on the method proposed by @rkohrt in #49. The keep-alive sender keeps track of the response to the previous keep-alive. When it is found that there was none, the socket is recycled. This detection method, although perhaps not strictly necessary in the presence of the mechanism laid out above, is nevertheless more universal and robust, as it will respond to loss of connectivity, regardless of its cause. It also fits very nicely into the existing implementation and doesn't introduce any substantial overhead, or risk to Signal's normal operation. On the other hand, it is also much slower to respond, taking anywhere from 55 to 110 seconds, during which the device is unable to receive messages or calls. (While this might not seem like a big deal, one is more likely to need to message or call someone after a network switch, for instance after stepping out of the house, and having to wait for 1-2 minutes can be bothersome.)
@rkohrt, I've include you as an author in the relevant commit; let me know if you have any objections and I'll remove you.
The general idea is that the first mechanism will ensure a quick response most of the time, while the second will step in, if a connectivity change is missed for some reason (which is not entirely unlikely, as properly detecting a change of the active connection through the callbacks is not very straightforward).
Fix to keep
SignalServiceMessageSender
up-to-dateA separate, but related commit attempts to address the following issue: when the message retrieval thread recycles the pipe, for whatever reason it might do so, the
SignalServiceMessageSender
is not updated with the new pipe, until it happens to be re-injected somewhere. This has the result that messages sent to establish or answer a call after a network change for instance, detect a dangling socket and have to fall back to a separate connection, introducing a 10s delay for each message sent. The proposed fix is simple and fits naturally with the existing implementation.Testing
I've tested all three (counting two separate code paths for the listening mechanism) methods separately, making every effort to ensure that the socket is cycled in an orderly fashion, without needless socket recycling and without inadvertent disruption of the normal operation during periods of normal connectivity. I've gone through the logs of several messaging sessions with intersperesed network changes, on two separated devices and have noticed nothing out of the ordinary. In every case the connection was promptly recycled (or after 1-2 minutes, when testing the secondary mechanism on its own) and no alarming messages were logged.
As a stress test, I had a script toggle the WiFi on two separate devices on and off approximately every 20s, for the better part of a day and messages were sent from one device to the other every so often. During that time, I also switched cellular data off for a couple of periods, to cover the case of switching to no network. This resulted in at least a couple of thousand of switches and every time messages and read receipts were sent or received within seconds. In the end Signal hadn't crashed and was functioning normally. Note that one of the two tested devices didn't have the problem, (because of
Netd
) but the manual recycling of the connection didn't cause any discernible problems.I've also tested the proposed fix for keeping
SignalServiceMessageSender
up-to-date with the current pipe and it works fine and while I haven't managed to perform any tests on GCM-enabled devices yet, they should experience no adverse effect whatsoever (especially since most of the functionality in the patch is disabled when GCM is available). If necessary for acceptance of this PR, I will try to find a device to test it on.Documentation
The following logs are from a session, where I freshly started Signal, sent a message over WiFi, then immediately switched WiFi off and fell back to cellular data and then sent another. Some commentary is included.
Initially, at some point after device startup, the
ConnectivityService
switches to WiFi. Note the IPv6 addresses.I then send the message, with everything working smoothly and disable WiFi seconds later:
Soon after that
ConnectivityService
enables cellular data. Note the different addresses (now IPv4):When I try to send another message, a problem is detected and an alternate delivery method is used:
Nevertheless the pipe (persistent connection) is not recycled, so that Signal happily goes on pinging a dead connection (note the lack of response from the server) until, after some time, the failure is detected and the socket recycled. All pending receipts and messages are then immediately received through the new connection:
Notice that
Netd
, only logs an error message at the time the WiFi connection goes down:Other such messages are logged at other times, pointing to a potential misconfiguration of the Android system as the source of the problem (although see also below):
On another device, which uses precisely the same LineageOS distribution as mine (although a different port), but which doesn't display the problem, the
Netd
daemon kills the dangling sockets as soon as the WiFi connection goes down:This prompts Signal to recycle the socket immediately, so that no disruption is experienced. The case is similar when re-enabling WiFi:
Note that
Netd
seems to print similar, although much fewer, error messages in this case as well.