Seash upload timeouts #81

Open
choksi81 opened this issue May 29, 2014 · 1 comment

Seash sometimes times out on uploads. We've done a number of packet traces on the host running seash, and here is what we've found out:

The errors fall into three categories:

  1. signedcommunicate failed on session_recvmessage with error 'recv() timed out!' usually means the file was fully uploaded, but the node failed to return "Success" in time.
  2. signedcommunicate failed on session_recvmessage with error 'send() timed out!' means the file was not fully transferred.
  3. signedcommunicate failed on session_sendmessage with error 'Socket closed' is addressed in #1009 -- I haven't seen this one on uploads so far, only when browsing for nodes.
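
To make the three categories easier to pick out of logs, here is a minimal sketch that maps an error string to one of them. It only looks at the quoted error text shown in the list above; the function name `classify_upload_error` and the category labels are made up for illustration and are not part of seash.

```python
def classify_upload_error(message):
    """Map a seash error string to one of the three categories above.

    Hypothetical helper: the labels are illustrative, not seash's own.
    """
    if "recv() timed out!" in message:
        # File was probably fully uploaded; the "Success" reply came too late.
        return "RECV_TIMEOUT"
    if "send() timed out!" in message:
        # File was not fully transferred.
        return "SEND_TIMEOUT"
    if "Socket closed" in message:
        # Tracked separately in #1009.
        return "SOCKET_CLOSED"
    return "UNKNOWN"


if __name__ == "__main__":
    sample = ("signedcommunicate failed on session_recvmessage "
              "with error 'recv() timed out!'")
    print(classify_upload_error(sample))
```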

The recv() issue was tackled in #971 and was thought to be solved by speeding up the crypto parts of the communication between the node manager and seash (parts of which remain to be improved, see #990). The packet traces show a typical sequence of events leading to this error:

  • The file is fully uploaded, the node acknowledges each TCP segment it sees.
  • It fails to produce a "Success" message however.
  • Therefore, seash times out after 17 seconds. This is remarkable in itself as the default timeout should be 10 seconds.
  • The node acknowledges seash's closing of its half of the connection.
  • 3-100 seconds later (empirical values), the node tries to send the "Success" message, but is (correctly) greeted with a RST by seash's TCP.
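
The sequence above can be imitated locally with a rough sketch like the one below. It assumes nothing about seash or the node manager internals: a plain `socket.socketpair()` stands in for the TCP connection, and `slow_node` and `upload_with_timeout` are hypothetical names. The "node" reads the upload but delays its "Success" reply past the client's timeout, so the client gives up and closes first. On a local socket pair the late reply fails with an `OSError` instead of a TCP RST, so this only approximates what the traces show.

```python
import socket
import threading
import time


def slow_node(conn, delay, result):
    """Read the upload, then try to reply 'Success' after `delay` seconds."""
    conn.recv(65536)
    time.sleep(delay)
    try:
        conn.sendall(b"Success")
        result["late_reply"] = "sent"
    except OSError:
        # The client already timed out and closed its end.
        result["late_reply"] = "rejected"
    finally:
        conn.close()


def upload_with_timeout(payload, timeout, node_delay):
    """Send `payload`, then wait at most `timeout` seconds for the reply."""
    client, node = socket.socketpair()
    result = {}
    worker = threading.Thread(target=slow_node, args=(node, node_delay, result))
    worker.start()
    client.settimeout(timeout)
    try:
        client.sendall(payload)
        result["client"] = client.recv(1024).decode()
    except socket.timeout:
        result["client"] = "recv() timed out!"
    finally:
        client.close()
    worker.join()
    return result


if __name__ == "__main__":
    # Node answers later than the client is willing to wait.
    print(upload_with_timeout(b"x" * 1024, timeout=0.2, node_delay=0.6))
```

Raising `timeout` above `node_delay` lets the reply through, which matches the observation that a longer timeout only helps if it exceeds the slowest node's delay.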

I've also seen "show files" fail with a recv() timeout, but this is rare. Surprisingly, the timeout is 10 seconds there.

The send() issue manifests like this:

  • Parts of the file are uploaded, but then TCP flow control kicks in and tells seash's TCP to stop sending data. This means that the node is not emptying its receive buffer. In my experiments with a 92kB file, only 15kB+ had been transferred at this point. Meanwhile, seash is blocked in session_sendmessage, which uses timeout_sockets.
  • Seash's TCP probes the node's TCP for changes in the receive window size. The node's TCP answers that the receive window is still full. This can go on for minutes, so seash times out.
  • Finally, buffer space becomes available at the node, and it receives whatever is left in the send buffer of seash's TCP. (While seash timed out at the application layer, it can't get rid of "old" data it has already buffered for sending.)
  • Thus, 26kB are transmitted overall in my experiment.
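
The stall itself can be reproduced in miniature with the sketch below, which is an illustration under stated assumptions and not seash's code. A local `socket.socketpair()` replaces the TCP connection, the receiving end never reads (like an overloaded node), and `send_until_stalled` is a hypothetical name. Once the kernel buffers are full, `send()` blocks until the socket timeout fires, with only part of the data handed over. How many bytes get through depends on the OS buffer sizes, so the numbers will not match the 15kB and 26kB seen in the traces.

```python
import socket


def send_until_stalled(total_bytes, timeout, chunk_size=4096):
    """Send to a peer that never reads; return (bytes_sent, timed_out)."""
    sender, receiver = socket.socketpair()
    sender.settimeout(timeout)
    chunk = b"x" * chunk_size
    sent = 0
    timed_out = False
    try:
        while sent < total_bytes:
            remaining = min(chunk_size, total_bytes - sent)
            # send() may accept fewer bytes than offered; count what it took.
            sent += sender.send(chunk[:remaining])
    except socket.timeout:
        # Buffers are full and the receiver is not draining them.
        timed_out = True
    finally:
        sender.close()
        receiver.close()
    return sent, timed_out


if __name__ == "__main__":
    total = 64 * 1024 * 1024
    sent, timed_out = send_until_stalled(total, timeout=0.2)
    print("sent %d of %d bytes, timed out: %s" % (sent, total, timed_out))
```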

Both recv() and send() issues happen even if I increase seash's timeout (see #892). I tried 90 seconds, but would have had to use 300 or so for the slowest nodes.

I checked with CoMoN -- all of the nodes that produced errors had a high load average, and those with the highest load had the most persistent errors. This might hint at how to reproduce the problem on a node where we can locally trace node manager packets.
