
SSH and Ctrl-C leave attic serve instances alive and prevent further upload #545

Closed
filcuc opened this issue Jan 12, 2016 · 23 comments

Comments

@filcuc

filcuc commented Jan 12, 2016

Hi, I've just filed this bug report for attic.
I didn't test it with borg.
Do you know if this is still the case?

@ThomasWaldmann
Member

Yes, I've seen your issue.

I am not completely sure about it (and it might also depend on the ssh/sshd configuration a bit).
It might be fixed (there were a ton of fixes compared to what the attic repo has, see #5).

I just tried borg create over ssh to a localhost repo and quickly interrupted it 5 times, and it did not hang for me.

So maybe try it; if it works for you, we can check off one more attic issue from our list.

@filcuc
Author

filcuc commented Jan 12, 2016

Thank you for the reply, I'll try again with borg tomorrow and see if anything changes.
However, before writing the issue I had a quick look at the attic source code and I didn't see any timeout handling or ping-pong protocol in the remote server logic. Usually client-server architectures involve a ping-pong so that each party is aware of the other's state.

@ThomasWaldmann
Member

Well, there are pipes, and if the connection breaks down, they break. That should be dealt with in remote.py.

@dragetd
Contributor

dragetd commented Jan 13, 2016

I have been running an 8-day ssh backup over a slow upload link. I interrupted it multiple times with Ctrl+C, the network dying, or similar issues.

It did happen a few times that the lock was not removed. borg break-lock fixed it.
Once I had a corruption of the repository index. borg check --repair was able to fix it.

No other issues were encountered.
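
For reference, the recovery boils down to something like this (user@backuphost:repo.borg is just a placeholder, and break-lock must only be run when you are sure no other borg is accessing the repository):

# remove the stale lock left behind by the dead connection
borg break-lock user@backuphost:repo.borg
# then verify the repository and repair the index if needed
borg check --repair user@backuphost:repo.borg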

@filcuc
Author

filcuc commented Jan 17, 2016

Tested with borg 0.29.0; after a couple of Ctrl-C presses I get stuck with

Remote: Borg 0.29.0: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/vagrant/borg/borg/borg/locking.py", line 134, in acquire
Remote: FileExistsError: [Errno 17] File exists: '/home/filippo/test.borg/lock.exclusive'
Remote: 
Remote: During handling of the above exception, another exception occurred:
Remote: 
Remote: Traceback (most recent call last):
Remote:   File "/vagrant/borg/borg/borg/remote.py", line 96, in serve
Remote:   File "/vagrant/borg/borg/borg/remote.py", line 121, in open
Remote:   File "/vagrant/borg/borg/borg/repository.py", line 63, in __init__
Remote:   File "/vagrant/borg/borg/borg/repository.py", line 141, in open
Remote:   File "/vagrant/borg/borg/borg/locking.py", line 267, in acquire
Remote:   File "/vagrant/borg/borg/borg/locking.py", line 118, in __enter__
Remote:   File "/vagrant/borg/borg/borg/locking.py", line 140, in acquire
Remote: borg.locking.LockTimeout: /home/filippo/test.borg/lock.exclusive
Remote: Platform: Linux server1.lightwolf.eu 3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:28:38 UTC 2014 x86_64 x86_64
Remote: Linux: debian jessie/sid   LibC: glibc 2.3
Remote: Python: CPython 3.5.1
Remote: 
Remote Exception (see remote log for the traceback)
Platform: Linux as5951g 4.3.3-2-ARCH #1 SMP PREEMPT Wed Dec 23 20:09:18 CET 2015 x86_64 
Linux: arch    LibC: glibc 2.3.4
Python: CPython 3.5.1

@filcuc
Author

filcuc commented Jan 17, 2016

On the server, with ps -aux, I notice two borg processes:

filippo  28325  0.0  0.3  10992  3044 ?        Ss   16:56   0:00 borg serve --umask=077
filippo  28326  0.0  1.7 131400 17372 ?        S    16:56   0:00 borg serve --umask=077
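
If you want to get rid of such stale processes manually, something along these lines should do it (only when you are sure no backup is currently running against that server):

# show leftover serve processes with their full command line, then kill them
pgrep -af 'borg serve'
pkill -f 'borg serve'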

@filcuc
Author

filcuc commented Jan 17, 2016

The command used for uploading the content is

borg create --stats --progress filippo@mytestserver:~/test.borg::Tuesday /home/filippo/ownCloud/

@filcuc
Author

filcuc commented Jan 17, 2016

@ThomasWaldmann regarding the pipes breaking down, I'm not quite sure they break if the connection drops. I work with sockets, and I learned the hard way the need for an explicit ping-pong protocol.

@ThomasWaldmann
Member

@filcuc that is a leftover lock; there is another ticket that deals with that. borg break-lock can be used to remove it manually (when you are sure that there should be no lock).

btw, ssh also has such a "ping" feature that could be used.
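
On the client side that would be the ServerAlive* options of ssh itself (not something borg configures for you, and the host name below is made up - just an illustration):

# ~/.ssh/config on the backup client
# give up after ServerAliveCountMax * ServerAliveInterval seconds without a reply from the server
Host backuphost
    ServerAliveInterval 20
    ServerAliveCountMax 3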

@ThomasWaldmann
Member

maybe related: jborg/attic#130 jborg/attic#323

@dragetd
Contributor

dragetd commented Feb 19, 2016

My offsite backup is behind a firewall. The remote backup host had a script that auto-created a reverse ssh session, which I was pushing the backups through.

Just recently I realized that SSH relies on TCPKeepAlive to kill stale SSH tunnels… but the reverse SSH session kept running if my backup client had network issues or borg was killed some other way.

Just one day ago I added:
ClientAliveInterval 180
ClientAliveCountMax 3
which should kill a dead SSH session after 9 minutes. My first tests show that it might have helped.

It could very well be unrelated to the original issue, but then again, I thought I'd share my findings.

@filcuc
Author

filcuc commented Mar 14, 2016

I solved it the same way by setting
ClientAliveInterval 10
ClientAliveCountMax 3

@filcuc
Author

filcuc commented Mar 14, 2016

maybe we could close this one...

@enkore
Contributor

enkore commented Mar 14, 2016

If it's not in the docs it should probably go into the FAQ?

@ThomasWaldmann
Member

@dragetd do you still think these 2 lines were helpful? If so, can you remove them again to see if the problem comes back?

If we are sure about this being helpful, we need to think about specific values and then add it to the FAQ. There is already a somewhat related ssh question in there.

@ThomasWaldmann
Member

Any news?

@ThomasWaldmann
Member

I'd like to close this issue soon. So if someone thinks something should be done about it (add stuff to the docs, fix code), please speak up (and consider your recent experience with up-to-date borg versions).

@dragetd
Contributor

dragetd commented Jun 21, 2016

Please excuse my lack of response, I lost sight of this one. Indeed, my hanging instance was related only to the SSH timeouts. With longer timeouts, the process would stay up longer. With a long enough lock-wait and short timeouts, my backup runs fine.

From my side this can be closed as not borg-related.
Unless we want to add an entry to the FAQ - should I write something up?

@ThomasWaldmann
Member

Yes, FAQ or some section about remote repos sounds good.

# /etc/ssh/sshd_config on borg repo server
# kill connection to client after ClientAliveCountMax * ClientAliveInterval seconds with no response
ClientAliveInterval 20
ClientAliveCountMax 3

I just want to be careful: is there some scenario where such settings could be counter-productive?

@ThomasWaldmann
Member

@dragetd yes, a PR against 1.0-maint branch would be welcome.

How is this related to lock wait time?

@enkore
Contributor

enkore commented Jun 21, 2016

How is this related to lock wait time?

When the lock wait time is greater than the SSH timeout, a broken connection is closed (BrokenPipeError in borg serve -> lock cleanup) before the other Borg times out waiting for the lock.
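
With the example sshd settings above, that works out roughly as follows (illustration only, repository path is made up):

# the server drops a dead connection after ClientAliveCountMax * ClientAliveInterval = 3 * 20 = 60 seconds;
# a second borg waiting for the lock with a longer lock wait, e.g. 120 seconds, would still be waiting
# when the stale lock gets cleaned up, instead of failing with LockTimeout first:
borg create --lock-wait 120 user@backuphost:repo.borg::archive /home/user/data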

@ThomasWaldmann
Member

I still don't understand.

If client C1 is doing a backup to the repo server, a borg serve process S1 will be created to deal with the connection, holding a lock L1. The connection might run for an undefined time (e.g. 1h) before it runs into a connection issue. With the above settings, the broken-down ssh connection will be terminated server-side after 1 minute and the lock L1 will be released.

A client C2 could have been trying to create another backup since shortly after C1 started. It will have a related borg serve process S2 that has been trying to get a lock L2 for almost 1h. Once S1 terminates and releases the lock L1 after the connection breakdown is handled, S2 will get the lock L2 - if it waits long enough, i.e. longer than this 1h.

So the lock wait time seems to be related to the undefined backup-runtime-until-a-problem-happens rather than to the ssh parameters. If you expect multi-client serialized backups (like all waiting in a queue), the lock wait time needs to be >> the expected backup duration of all other clients.

@ThomasWaldmann ThomasWaldmann self-assigned this Jun 24, 2016
@ThomasWaldmann
Member

Oh, I had a different (fully concurrent) scenario in mind.

It helps for this scenario:

If you have multiple borg create ... ; borg create ... commands run in a serialized way in a single script, you need to give them --lock-wait N (with N a bit more than the time the server needs to terminate broken-down connections and release the lock), as sketched below.
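
A sketch with made-up paths and values (120 s being a bit more than the 60 s the example sshd settings above need to clean up a dead session):

#!/bin/sh
# two serialized backups in one script; if the first connection dies, the second
# borg create waits until the server has dropped the dead session and released the lock
REPO=user@backuphost:repo.borg
borg create --lock-wait 120 "$REPO::home-tuesday" /home
borg create --lock-wait 120 "$REPO::etc-tuesday" /etc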
