Detect and recycle broken connections on non-GCM devices #62

dpapavas · 2018-09-30T12:16:15Z

I apologize in advance for the length of this message, but I've tried to be as thorough as possible in documenting this issue. To make review easier, I've split the documentation into separate sections and have added a summary, so that you can read the details as required, but please note the "Testing" section in particular, as it details the painstaking testing I've done to ensure proper operation of the patch, which might alleviate the concern that this PR will somehow adversly affect Signal's operation.

Summary

Network changes can introduce prolonged periods of connectivity loss on GCM-disabled devices (anywhere from 20 minutes to a day or more). This PR, along with its sibling on the Signal-Android side, implement a couple of mechanism that detect network changes in particular and broken connections in general and recycle the connection in response. A related fix is also introduced, to keep SignalServiceMessageSender up-to-date with the current message pipe at all times.

Issue

When GCM is not available and persistent socket connections need to be maintained, network changes (e.g. switching from WiFi to cellular data) can result in prolonged loss of connectivity. To be more precise, the device is still able to send messages, but not receive responses and receipts, or place or receive any calls. While it is hard to be sure about the details of the failure mode, my theory is the following (see "Documentation" below for the observations this is based on):

When the connection changes, typically so does the device's IP address. Any existing sockets will therefore have the wrong source address, effectively left dangling at the source side, unless they're somehow destroyed. On some devices, this seems to be taken care of, by the Netd system daemon, so that the websocket is killed (and hence automatically recycled) as soon as the connection changes. On other devices, including mine and seemingly those of many other users too, this does not happen. The socket then remains dangling, until the error is belatedly detected, which in my case can take anywhere from 20 minutes (when I try to send messages, probably because the attempt to use the socket somehow brings about the detection of its state) to more than a day (when Signal is left undisturbed while its connection is broken). Once the error is detected, an exception is thrown and the socket is closed and recycled.

While the socket is dangling, the device can still send messages, as it falls back to a separate connection, which seems to be established on the spot.

Patch

The current patch attempts to remedy this problem via a two-pronged approach: When GCM is not available, it equips the MessageRetrievalService with a receiver/callback (depending on Android version), that listens for connectivity changes and attempts to detect a change in the active network. When it does so, it signals the message retrieval thread to cycle and re-establish the connection. Since both this thread and the associated keep-alive sender thread are blocked for prolonged periods of time, Java's thread interrupt mechanism is used to ensure smooth operation.

In addition to that, a secondary detection mechanism is implemented, based on the method proposed by @rkohrt in #49. The keep-alive sender keeps track of the response to the previous keep-alive. When it is found that there was none, the socket is recycled. This detection method, although perhaps not strictly necessary in the presence of the mechanism laid out above, is nevertheless more universal and robust, as it will respond to loss of connectivity, regardless of its cause. It also fits very nicely into the existing implementation and doesn't introduce any substantial overhead, or risk to Signal's normal operation. On the other hand, it is also much slower to respond, taking anywhere from 55 to 110 seconds, during which the device is unable to receive messages or calls. (While this might not seem like a big deal, one is more likely to need to message or call someone after a network switch, for instance after stepping out of the house, and having to wait for 1-2 minutes can be bothersome.)

@rkohrt, I've include you as an author in the relevant commit; let me know if you have any objections and I'll remove you.

The general idea is that the first mechanism will ensure a quick response most of the time, while the second will step in, if a connectivity change is missed for some reason (which is not entirely unlikely, as properly detecting a change of the active connection through the callbacks is not very straightforward).

Fix to keep `SignalServiceMessageSender` up-to-date

A separate, but related commit attempts to address the following issue: when the message retrieval thread recycles the pipe, for whatever reason it might do so, the SignalServiceMessageSender is not updated with the new pipe, until it happens to be re-injected somewhere. This has the result that messages sent to establish or answer a call after a network change for instance, detect a dangling socket and have to fall back to a separate connection, introducing a 10s delay for each message sent. The proposed fix is simple and fits naturally with the existing implementation.

Testing

I've tested all three (counting two separate code paths for the listening mechanism) methods separately, making every effort to ensure that the socket is cycled in an orderly fashion, without needless socket recycling and without inadvertent disruption of the normal operation during periods of normal connectivity. I've gone through the logs of several messaging sessions with intersperesed network changes, on two separated devices and have noticed nothing out of the ordinary. In every case the connection was promptly recycled (or after 1-2 minutes, when testing the secondary mechanism on its own) and no alarming messages were logged.

As a stress test, I had a script toggle the WiFi on two separate devices on and off approximately every 20s, for the better part of a day and messages were sent from one device to the other every so often. During that time, I also switched cellular data off for a couple of periods, to cover the case of switching to no network. This resulted in at least a couple of thousand of switches and every time messages and read receipts were sent or received within seconds. In the end Signal hadn't crashed and was functioning normally. Note that one of the two tested devices didn't have the problem, (because of Netd) but the manual recycling of the connection didn't cause any discernible problems.

I've also tested the proposed fix for keeping SignalServiceMessageSender up-to-date with the current pipe and it works fine and while I haven't managed to perform any tests on GCM-enabled devices yet, they should experience no adverse effect whatsoever (especially since most of the functionality in the patch is disabled when GCM is available). If necessary for acceptance of this PR, I will try to find a device to test it on.

Documentation

The following logs are from a session, where I freshly started Signal, sent a message over WiFi, then immediately switched WiFi off and fell back to cellular data and then sent another. Some commentary is included.

Initially, at some point after device startup, the ConnectivityService switches to WiFi. Note the IPv6 addresses.

09-19 21:14:41.463   841  1104 D ConnectivityService: Switching to new default network: NetworkAgentInfo{ ni{[type: WIFI[], state: CONNECTED/CONNECTED, reason: (unspecified), extra: "***SSID removed***", failover: false, available: true, roaming: false, metered: false]}  network{100}  nethandle{429513165534}  lp{{InterfaceName: wlan0 LinkAddresses: [fe80::20a:f5ff:fe3a:d934/64,2a02:85f:bf:d500:20a:f5ff:fe3a:d934/64,2a02:85f:bf:d500:9541:b577:fabb:c7fc/64,]  Routes: [fe80::/64 -> :: wlan0,::/0 -> fe80::1 wlan0,2a02:85f:bf:d500::/64 -> :: wlan0,] DnsAddresses: [fe80::1%wlan0,] Domains: null MTU: 0 TcpBufferSizes: 524288,1048576,2097152,262144,524288,1048576}}  nc{[ Transports: WIFI Capabilities: NOT_METERED&INTERNET&NOT_RESTRICTED&TRUSTED&NOT_VPN&FOREGROUND LinkUpBandwidth>=1048576Kbps LinkDnBandwidth>=1048576Kbps SignalStrength: -61]}  Score{16}  everValidated{false}  lastValidated{false}  created{true} lingering{false} explicitlySelected{false} acceptUnvalidated{false} everCaptivePortalDetected{false} lastCaptivePortalDetected{false} }

I then send the message, with everything working smoothly and disable WiFi seconds later:

09-19 21:25:36.085   841  1661 D WifiService: setWifiEnabled: false pid=1155, uid=10033
09-19 21:25:36.089   841  1030 D WifiStateMachine: WifiStateMachine: Leaving Connected state
09-19 21:25:36.091   841  1030 D WifiNative-HAL: stopRssiMonitoring, cmdId 0
09-19 21:25:36.095   841  1032 D DhcpClient: doQuit
09-19 21:25:36.097   841  1104 D ConnectivityService: NetworkAgentInfo [WIFI () - 102] EVENT_NETWORK_INFO_CHANGED, going from CONNECTED to DISCONNECTED
09-19 21:25:36.097   841  1104 D ConnectivityService: NetworkAgentInfo [WIFI () - 102] got DISCONNECTED, was satisfying 6
...
09-19 21:25:36.232   841  1104 D ConnectivityService: Sending DISCONNECTED broadcast for type 1 NetworkAgentInfo [WIFI () - 102] isDefaultNetwork=true
...
09-19 21:25:36.344   841  4142 D wifi    : set interface wlan0 flags (DOWN)

Soon after that ConnectivityService enables cellular data. Note the different addresses (now IPv4):

09-19 21:25:39.662   841  1993 D ConnectivityService: registerNetworkAgent NetworkAgentInfo{ ni{[type: MOBILE[HSPA+], state: CONNECTED/CONNECTED, reason: connected, extra: ***removed***, failover: false, available: true, roaming: false, metered: true]}  network{103}  nethandle{442398067422}  lp{{InterfaceName: rmnet0 LinkAddresses: [100.99.34.21/30,]  Routes: [0.0.0.0/0 -> 100.99.34.22 rmnet0,] DnsAddresses: [213.249.18.248,213.249.18.251,] Domains: null MTU: 1500 TcpBufferSizes: 4094,87380,1220608,4096,16384,1220608}}  nc{[ Transports: CELLULAR Capabilities: SUPL&INTERNET&NOT_RESTRICTED&TRUSTED&NOT_VPN LinkUpBandwidth>=5898Kbps LinkDnBandwidth>=14336Kbps Specifier: <1>]}  Score{10}  everValidated{false}  lastValidated{false}  created{false} lingering{false} explicitlySelected{false} acceptUnvalidated{false} everCaptivePortalDetected{false} lastCaptivePortalDetected{false} }
09-19 21:25:39.662   841  1104 D ConnectivityService: NetworkAgentInfo [MOBILE (HSPA+) - 103] EVENT_NETWORK_INFO_CHANGED, going from null to CONNECTED
09-19 21:25:39.665   841  1104 D ConnectivityService: Adding iface rmnet0 to network 103
09-19 21:25:39.717   841  1104 D ConnectivityService: Setting DNS servers for network 103 to [/213.249.18.248, /213.249.18.251]
09-19 21:25:39.724   841  1104 W ConnectivityExtension: ConnectivityExt jar file not present
09-19 21:25:39.735   841  1104 W ConnectivityExtension: ConnectivityExt jar file not present
09-19 21:25:39.750   841  1104 D ConnectivityService: Switching to new default network: NetworkAgentInfo{ ni{[type: MOBILE[HSPA+], state: CONNECTED/CONNECTED, reason: connected, extra: ***removed***, failover: false, available: true, roaming: false, metered: true]}  network{103}  nethandle{442398067422}  lp{{InterfaceName: rmnet0 LinkAddresses: [100.99.34.21/30,]  Routes: [0.0.0.0/0 -> 100.99.34.22 rmnet0,] DnsAddresses: [213.249.18.248,213.249.18.251,] Domains: null MTU: 1500 TcpBufferSizes: 4094,87380,1220608,4096,16384,1220608}}  nc{[ Transports: CELLULAR Capabilities: SUPL&INTERNET&NOT_RESTRICTED&TRUSTED&NOT_VPN&FOREGROUND LinkUpBandwidth>=5898Kbps LinkDnBandwidth>=14336Kbps Specifier: <1>]}  Score{10}  everValidated{false}  lastValidated{false}  created{true} lingering{false} explicitlySelected{false} acceptUnvalidated{false} everCaptivePortalDetected{false} lastCaptivePortalDetected{false} }
09-19 21:25:39.756   841  1104 D ConnectivityService: Sending CONNECTED broadcast for type 0 NetworkAgentInfo [MOBILE (HSPA+) - 103] isDefaultNetwork=true
09-19 21:25:39.788   841  6472 D NetworkMonitor/NetworkAgentInfo [MOBILE (HSPA+) - 103]: PROBE_DNS OK 36ms, www.google.com=172.217.16.68
09-19 21:25:39.789   841  6473 D NetworkMonitor/NetworkAgentInfo [MOBILE (HSPA+) - 103]: PROBE_DNS OK 39ms, connectivitycheck.gstatic.com=216.58.212.3
09-19 21:25:39.789   841  6472 W System  : ClassLoader referenced unknown path: /system/framework/tcmclient.jar
09-19 21:25:39.852   841  6473 D NetworkMonitor/NetworkAgentInfo [MOBILE (HSPA+) - 103]: PROBE_HTTP http://connectivitycheck.gstatic.com/generate_204 time=60ms ret=204 request={User-Agent=[Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36]} headers={null=[HTTP/1.1 204 No Content], Age=[296], Connection=[keep-alive], Content-Length=[0], Date=[Wed, 19 Sep 2018 18:20:43 GMT], X-Android-Received-Millis=[1537381539851], X-Android-Response-Source=[NETWORK 204], X-Android-Selected-Protocol=[http/1.1], X-Android-Sent-Millis=[1537381539825]}
09-19 21:25:40.155   841  6472 D NetworkMonitor/NetworkAgentInfo [MOBILE (HSPA+) - 103]: PROBE_HTTPS https://www.google.com/generate_204 time=364ms ret=204 request={User-Agent=[Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36]} headers={null=[HTTP/1.1 204 No Content], Alt-Svc=[quic=":443"; ma=2592000; v="44,43,39,35"], Content-Length=[0], Date=[Wed, 19 Sep 2018 18:25:40 GMT], X-Android-Received-Millis=[1537381540153], X-Android-Response-Source=[NETWORK 204], X-Android-Selected-Protocol=[http/1.1], X-Android-Sent-Millis=[1537381539948]}
09-19 21:25:40.160   841  1104 D ConnectivityService: NetworkAgentInfo [MOBILE (HSPA+) - 103] validation passed

When I try to send another message, a problem is detected and an alternate delivery method is used:

09-19 21:25:43.907  6260  6282 W SignalServiceMessageSender: Transmitting over pipe...
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender: java.io.IOException: java.util.concurrent.TimeoutException
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.whispersystems.signalservice.api.SignalServiceMessagePipe.send(SignalServiceMessagePipe.java:130)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.whispersystems.signalservice.api.SignalServiceMessageSender.sendMessage(SignalServiceMessageSender.java:660)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.whispersystems.signalservice.api.SignalServiceMessageSender.sendMessage(SignalServiceMessageSender.java:167)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.thoughtcrime.securesms.jobs.PushTextSendJob.deliver(PushTextSendJob.java:115)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.thoughtcrime.securesms.jobs.PushTextSendJob.onPushSend(PushTextSendJob.java:58)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.thoughtcrime.securesms.jobs.PushSendJob.onSend(PushSendJob.java:74)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.thoughtcrime.securesms.jobs.SendJob.onRun(SendJob.java:43)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.thoughtcrime.securesms.jobs.MasterSecretJob.onRun(MasterSecretJob.java:18)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.thoughtcrime.securesms.jobmanager.JobConsumer.runJob(JobConsumer.java:74)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.thoughtcrime.securesms.jobmanager.JobConsumer.run(JobConsumer.java:47)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender: Caused by: java.util.concurrent.TimeoutException
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.whispersystems.signalservice.internal.util.concurrent.SettableFuture.get(SettableFuture.java:84)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    at org.whispersystems.signalservice.api.SignalServiceMessagePipe.send(SignalServiceMessagePipe.java:119)
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender:    ... 9 more
09-19 21:25:53.913  6260  6282 W SignalServiceMessageSender: 
09-19 21:25:53.919  6260  6282 W SignalServiceMessageSender: Falling back to new connection...
09-19 21:25:53.919  6260  6282 W SignalServiceMessageSender: Not transmitting over pipe...

Nevertheless the pipe (persistent connection) is not recycled, so that Signal happily goes on pinging a dead connection (note the lack of response from the server) until, after some time, the failure is detected and the socket recycled. All pending receipts and messages are then immediately received through the new connection:

09-19 21:41:34.621  6260  6300 W MessageRetrievalService: Application level read timeout...
09-19 21:41:34.626  6260  6300 D MessageRetrievalService: Network requirement: true, active activities: 0, push pending: 0, gcm disabled: true
09-19 21:41:34.629  6260  6300 I MessageRetrievalService: Reading message...
09-19 21:42:01.124  6260  6330 W WebSocketConnection: Sending keep alive...
09-19 21:42:01.149  6260  6330 W RealtimeSleepTimer: Setting alarm to wake up in 55000ms.
09-19 21:42:34.630  6260  6300 W MessageRetrievalService: Application level read timeout...
09-19 21:42:34.635  6260  6300 D MessageRetrievalService: Network requirement: true, active activities: 0, push pending: 0, gcm disabled: true
09-19 21:42:34.638  6260  6300 I MessageRetrievalService: Reading message...
09-19 21:42:56.166  6260  6330 W WebSocketConnection: Sending keep alive...
09-19 21:42:56.170  6260  6330 W RealtimeSleepTimer: Setting alarm to wake up in 55000ms.
09-19 21:43:34.639  6260  6300 W MessageRetrievalService: Application level read timeout...
09-19 21:43:34.646  6260  6300 D MessageRetrievalService: Network requirement: true, active activities: 0, push pending: 0, gcm disabled: true
09-19 21:43:34.649  6260  6300 I MessageRetrievalService: Reading message...
09-19 21:43:51.188  6260  6330 W WebSocketConnection: Sending keep alive...
09-19 21:43:51.194  6260  6330 W RealtimeSleepTimer: Setting alarm to wake up in 55000ms.
09-19 21:44:34.650  6260  6300 W MessageRetrievalService: Application level read timeout...
09-19 21:44:34.656  6260  6300 D MessageRetrievalService: Network requirement: true, active activities: 0, push pending: 0, gcm disabled: true
09-19 21:44:34.659  6260  6300 I MessageRetrievalService: Reading message...
09-19 21:44:46.238  6260  6330 W WebSocketConnection: Sending keep alive...
09-19 21:44:46.242  6260  6330 W RealtimeSleepTimer: Setting alarm to wake up in 55000ms.
09-19 21:45:01.095  6260  6302 W WebSocketConnection: onFailure()
09-19 21:45:01.099  6260  6302 W WebSocketConnection: javax.net.ssl.SSLException: Read error: ssl=0x977d7dc0: I/O error during system call, Connection timed out
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at com.android.org.conscrypt.NativeCrypto.SSL_read(Native Method)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at com.android.org.conscrypt.OpenSSLSocketImpl$SSLInputStream.read(OpenSSLSocketImpl.java:758)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okio.Okio$2.read(Okio.java:139)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okio.RealBufferedSource.request(RealBufferedSource.java:67)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okio.RealBufferedSource.require(RealBufferedSource.java:60)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:265)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:204)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1133)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:607)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 	at java.lang.Thread.run(Thread.java:761)
09-19 21:45:01.099  6260  6302 W WebSocketConnection: 
09-19 21:45:01.100  6260  6302 W WebSocketConnection: onClose()...
09-19 21:45:01.101  6260  6302 W SignalCommunicationModule: onDisconnected()
09-19 21:45:01.303  6260  6302 W WebSocketConnection: WSC connect()...
09-19 21:45:01.335  6260  6302 I SignalCommunicationModule: onConnecting()
09-19 21:45:03.453  6260  7343 W WebSocketConnection: onConnected()
09-19 21:45:03.455  6260  7343 I SignalCommunicationModule: onConnected()
09-19 21:45:03.459  6260  7343 W WebSocketConnection: WSC onMessage()
09-19 21:45:03.461  6260  7343 W WebSocketConnection: Message Type: 1
09-19 21:45:03.481  6260  6300 W SignalServiceEnvelope: Our MAC: ***removed***
09-19 21:45:03.482  6260  6300 W SignalServiceEnvelope: Thr MAC: ***removed***
09-19 21:45:03.485  6260  7343 W WebSocketConnection: WSC onMessage()
09-19 21:45:03.486  6260  7354 W RealtimeSleepTimer: Setting alarm to wake up in 55000ms.
09-19 21:45:03.487  6260  6300 I MessageRetrievalService: Retrieved envelope! ***removed***

Notice that Netd, only logs an error message at the time the WiFi connection goes down:

09-19 21:25:36.306   327   872 E Netd    : netlink response contains error (No such file or directory)

Other such messages are logged at other times, pointing to a potential misconfiguration of the Android system as the source of the problem (although see also below):

09-19 21:20:43.355   327   872 E Netd    : netlink response contains error (No such process)
09-19 21:20:43.355   327   872 E Netd    : Failed to dump IPv4 sockets for permission change: Invalid argument
09-19 21:20:43.356   327   872 E Netd    : Failed to close sockets changing netId 101 to permission 1: Invalid argument
09-19 21:21:39.910   327   872 E Netd    : netlink response contains error (No such file or directory)

On another device, which uses precisely the same LineageOS distribution as mine (although a different port), but which doesn't display the problem, the Netd daemon kills the dangling sockets as soon as the WiFi connection goes down:

09-19 11:23:19.631   826  1715 D ConnectivityService: NetworkAgentInfo [WIFI () - 176] EVENT_NETWORK_INFO_CHANGED, going from CONNECTED to DISCONNECTED
09-19 11:23:19.631   826  1715 D ConnectivityService: NetworkAgentInfo [WIFI () - 176] got DISCONNECTED, was satisfying 7
09-19 11:23:19.635   826 32696 D DhcpClient: Receive thread stopped
09-19 11:23:19.639   328  1273 D CommandListener: Setting iface cfg
09-19 11:23:19.640 26847 26847 D wpa_supplicant: wlan0: Control interface command 'SET_NETWORK [REMOVED]'
09-19 11:23:19.640 26847 26847 D wpa_supplicant: CTRL_IFACE: SET_NETWORK id=2 name='bssid'
09-19 11:23:19.640 26847 26847 D wpa_supplicant: CTRL_IFACE: value - hexdump(len=3): [REMOVED]
09-19 11:23:19.644   328  1265 I Netd    : Destroyed 10 sockets on 192.168.**removed** in 4.5 ms
09-19 11:23:19.646  1059  1115 W WebSocketConnection: onFailure()
09-19 11:23:19.647   826 32694 D DhcpClient: onQuitting
09-19 11:23:19.648  1059  1115 W WebSocketConnection: javax.net.ssl.SSLException: Read error: ssl=0x99d438c0: I/O error during system call, Software caused connection abort
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at com.android.org.conscrypt.NativeCrypto.SSL_read(Native Method)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at com.android.org.conscrypt.OpenSSLSocketImpl$SSLInputStream.read(OpenSSLSocketImpl.java:758)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okio.Okio$2.read(Okio.java:139)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okio.RealBufferedSource.request(RealBufferedSource.java:67)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okio.RealBufferedSource.require(RealBufferedSource.java:60)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:265)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:204)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1133)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:607)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 	at java.lang.Thread.run(Thread.java:761)
09-19 11:23:19.648  1059  1115 W WebSocketConnection: 
09-19 11:23:19.648  1059  1115 W WebSocketConnection: onClose()...
09-19 11:23:19.648  1059  1115 W SignalCommunicationModule: onDisconnected()

This prompts Signal to recycle the socket immediately, so that no disruption is experienced. The case is similar when re-enabling WiFi:

09-19 11:24:03.324   826  1715 D ConnectivityService: handleLingerComplete for NetworkAgentInfo [MOBILE (LTE) - 177]
09-19 11:24:03.325   826  1715 D ConnectivityService: NetworkAgentInfo [MOBILE (LTE) - 177] got DISCONNECTED, was satisfying 6
09-19 11:24:03.326  2018  2192 D DcNetworkAgent: NetworkAgent: NetworkAgent channel lost
09-19 11:24:03.326  2018  2192 D DcNetworkAgent: NetworkAgent: DcNetworkAgent: [unwanted]: disconnect apnContext={mApnType=default mState=CONNECTED mWaitingApns={[[ApnSettingV3] ***removed***, 15, 20205, ***removed***, , , , , , 3, default | supl, IP, IP, true, 0, 0, 0, false, 0, 0, 0, 0, , , false]} mApnSetting={[ApnSettingV3] Vodafone Web, 15, 20205, ***removed***, , , , , , 3, default | supl, IP, IP, true, 0, 0, 0, false, 0, 0, 0, 0, , , false} mReason=dataDisabled mDataEnabled=false mDependencyMet=true}
09-19 11:24:03.333   826  1715 W ConnectivityExtension: ConnectivityExt jar file not present
09-19 11:24:03.377 26847 26847 D wpa_supplicant: RTM_NEWLINK: ifi_index=13 ifname=rmnet_data0 operstate=2 linkmode=0 ifi_family=0 ifi_flags=0x0 ()
09-19 11:24:03.407   328  1265 I Netd    : Destroyed 3 sockets on ***removed*** in 0.5 ms
09-19 11:24:03.408  1059  1370 W WebSocketConnection: onFailure()
09-19 11:24:03.410  1059  1370 W WebSocketConnection: javax.net.ssl.SSLException: Read error: ssl=0x98d0d280: I/O error during system call, Software caused connection abort
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at com.android.org.conscrypt.NativeCrypto.SSL_read(Native Method)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at com.android.org.conscrypt.OpenSSLSocketImpl$SSLInputStream.read(OpenSSLSocketImpl.java:758)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okio.Okio$2.read(Okio.java:139)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okio.RealBufferedSource.request(RealBufferedSource.java:67)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okio.RealBufferedSource.require(RealBufferedSource.java:60)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:265)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:204)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1133)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:607)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 	at java.lang.Thread.run(Thread.java:761)
09-19 11:24:03.410  1059  1370 W WebSocketConnection: 
09-19 11:24:03.410  1059  1370 W WebSocketConnection: onClose()...
09-19 11:24:03.410  1059  1370 W SignalCommunicationModule: onDisconnected()
09-19 11:24:03.449   328  1273 E Netd    : netlink response contains error (No such file or directory)
09-19 11:24:03.610  1059  1370 W WebSocketConnection: WSC connect()...
09-19 11:24:03.641  1059  1370 I SignalCommunicationModule: onConnecting()
09-19 11:24:04.312  1059  1894 W WebSocketConnection: onConnected()

Note that Netd seems to print similar, although much fewer, error messages in this case as well.

Co-authored-by: Dimitris Papavasiliou <dpapavas@gmail.com> Co-authored-by: Ralf Kohrt <rkohrt@phys.ethz.ch>

dpapavas changed the title ~~Netchangefix 2~~ Detect and recycle broken connections on non-GCM devices Sep 30, 2018

This was referenced Sep 30, 2018

Detect and recycle broken connections on non-GCM devices signalapp/Signal-Android#8230

Closed

Network connection seems to get stuck (without Google services) signalapp/Signal-Android#7638

Closed

dpapavas force-pushed the netchangefix-2 branch from 1b9a981 to 22fd1bc Compare December 15, 2018 13:38

theBoatman mentioned this pull request Apr 3, 2019

fixing RealtimeSleepTimer to work in concurrent threads #70

Closed

Dimitris Papavasiliou and others added 3 commits May 4, 2019 20:11

Detect and recycle dangling websockets.

daa5dc9

Co-authored-by: Dimitris Papavasiliou <dpapavas@gmail.com> Co-authored-by: Ralf Kohrt <rkohrt@phys.ethz.ch>

Properly handle keep-alive alarms for multiple pipes.

fc14abc

Share IncomingMessageObserver's pipe by reference.

510a25b

dpapavas force-pushed the netchangefix-2 branch from 22fd1bc to 510a25b Compare May 11, 2019 21:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detect and recycle broken connections on non-GCM devices #62

Detect and recycle broken connections on non-GCM devices #62

dpapavas commented Sep 30, 2018

Detect and recycle broken connections on non-GCM devices #62

Are you sure you want to change the base?

Detect and recycle broken connections on non-GCM devices #62

Conversation

dpapavas commented Sep 30, 2018

Summary

Issue

Patch

Fix to keep SignalServiceMessageSender up-to-date

Testing

Documentation

Fix to keep `SignalServiceMessageSender` up-to-date