
SSH and Ctrl-C leave attic serve instances alive and prevent further upload #545

Closed
filcuc opened this issue Jan 12, 2016 · 23 comments

Comments

@filcuc

filcuc commented Jan 12, 2016

Hi, I've just filed this bug report for attic.
I didn't test it with borg.
Do you know if this is still the case?

@ThomasWaldmann
Member

Yes, I've seen your issue.

I am not completely sure about it (and it might also depend on the ssh/sshd configuration a bit).
It might be fixed (there were a ton of fixes compared to what the attic repo has, see #5).

I just tried borg create over ssh to a localhost repo and quickly interrupted it 5 times, and it did not hang for me.

So maybe try it; if it works for you, we can check off one more attic issue from our list.

@filcuc
Author

filcuc commented Jan 12, 2016

Thank you for the reply, I'll try again with borg tomorrow and see if anything changes.
However, before writing the issue I had a quick look at the attic source code and I didn't see any timeout handling or ping-pong protocol in the remote server logic. Usually client-server architectures involve a ping-pong so that each party is aware of the other's state.

@ThomasWaldmann
Member

Well, there are pipes, and if the connection breaks down, they break. That should be dealt with in remote.py.

@dragetd
Contributor

dragetd commented Jan 13, 2016

I have been running an 8-day ssh backup over a slow upload link. I interrupted it multiple times with Ctrl+C, the network dying, or similar issues.

It did happen a few times that the lock was not removed. borg break-lock fixed it.
Once I had a corruption of the repository index. borg check --repair was able to fix it.

No other issues were encountered.
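
For reference, the recovery boils down to something like this (user@backuphost:repo.borg is just a placeholder, and break-lock must only be run when you are sure no other borg is accessing the repository):

# remove the stale lock left behind by the dead connection
borg break-lock user@backuphost:repo.borg
# then verify the repository and repair the index if needed
borg check --repair user@backuphost:repo.borg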

@filcuc
Author

filcuc commented Jan 17, 2016

Tested with borg 0.29.0; after a couple of Ctrl-C presses I get stuck with

Remote: Borg 0.29.0: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/vagrant/borg/borg/borg/locking.py", line 134, in acquire
Remote: FileExistsError: [Errno 17] File exists: '/home/filippo/test.borg/lock.exclusive'
Remote: 
Remote: During handling of the above exception, another exception occurred:
Remote: 
Remote: Traceback (most recent call last):
Remote:   File "/vagrant/borg/borg/borg/remote.py", line 96, in serve
Remote:   File "/vagrant/borg/borg/borg/remote.py", line 121, in open
Remote:   File "/vagrant/borg/borg/borg/repository.py", line 63, in __init__
Remote:   File "/vagrant/borg/borg/borg/repository.py", line 141, in open
Remote:   File "/vagrant/borg/borg/borg/locking.py", line 267, in acquire
Remote:   File "/vagrant/borg/borg/borg/locking.py", line 118, in __enter__
Remote:   File "/vagrant/borg/borg/borg/locking.py", line 140, in acquire
Remote: borg.locking.LockTimeout: /home/filippo/test.borg/lock.exclusive
Remote: Platform: Linux server1.lightwolf.eu 3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:28:38 UTC 2014 x86_64 x86_64
Remote: Linux: debian jessie/sid   LibC: glibc 2.3
Remote: Python: CPython 3.5.1
Remote: 
Remote Exception (see remote log for the traceback)
Platform: Linux as5951g 4.3.3-2-ARCH #1 SMP PREEMPT Wed Dec 23 20:09:18 CET 2015 x86_64 
Linux: arch    LibC: glibc 2.3.4
Python: CPython 3.5.1

@filcuc
Author

filcuc commented Jan 17, 2016

On the server, with ps -aux, I notice two borg processes:

filippo  28325  0.0  0.3  10992  3044 ?        Ss   16:56   0:00 borg serve --umask=077
filippo  28326  0.0  1.7 131400 17372 ?        S    16:56   0:00 borg serve --umask=077
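
If you want to get rid of such stale processes manually, something along these lines should do it (only when you are sure no backup is currently running against that server):

# show leftover serve processes with their full command line, then kill them
pgrep -af 'borg serve'
pkill -f 'borg serve'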

@filcuc
Author

filcuc commented Jan 17, 2016

The command used for uploading the content is

borg create --stats --progress filippo@mytestserver:~/test.borg::Tuesday /home/filippo/ownCloud/

@filcuc
Author

filcuc commented Jan 17, 2016

@ThomasWaldmann regarding the pipes breaking down, I'm not quite sure they break if the connection drops. I work with sockets, and I learned the hard way the need for an explicit ping-pong protocol.

@ThomasWaldmann
Member

@filcuc that is a leftover lock; there is another ticket that deals with that. borg break-lock can be used to remove it manually (when you are sure that there should be no lock).

btw, ssh also has such a "ping" feature that could be used.
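
On the client side that would be the ServerAlive* options of ssh itself (not something borg configures for you, and the host name below is made up - just an illustration):

# ~/.ssh/config on the backup client
# give up after ServerAliveCountMax * ServerAliveInterval seconds without a reply from the server
Host backuphost
    ServerAliveInterval 20
    ServerAliveCountMax 3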

@ThomasWaldmann
Member

maybe related: jborg/attic#130 jborg/attic#323

@dragetd
Contributor

dragetd commented Feb 19, 2016

My offsite backup is behind a firewall. The remote backup host had a script that auto-created a reverse ssh session, which I was pushing the backups through.

Just recently I realized that SSH relies on TCPKeepAlive to kill stale SSH tunnels… but the reverse SSH session kept running if my backup client had network issues or borg was killed some other way.

Just one day ago I added:
ClientAliveInterval 180
ClientAliveCountMax 3
which should kill a dead SSH session after 9 minutes. My first tests show that it might have helped.

It could very well be unrelated to the original issue, but then again, I thought I'd share my findings.

@filcuc
Author

filcuc commented Mar 14, 2016

I solved it the same way by setting
ClientAliveInterval 10
ClientAliveCountMax 3

@filcuc
Author

filcuc commented Mar 14, 2016

maybe we could close this one...

@enkore
Contributor

enkore commented Mar 14, 2016

If it's not in the docs it should probably go into the FAQ?

@ThomasWaldmann
Member

@dragetd do you still think these 2 lines were helpful? If so, can you remove them again to see if the problem comes back?

If we are sure about this being helpful, we need to think about specific values and then add it to the FAQ. There is already a somewhat related ssh question in there.

@ThomasWaldmann
Member

Any news?

@ThomasWaldmann
Member

I'd like to close this issue soon. So if someone thinks something should be done about it (add stuff to the docs, fix code), please speak up (and consider your recent experience with up-to-date borg versions).

@dragetd
Contributor

dragetd commented Jun 21, 2016

Please excuse my lack of response, I lost sight of this one. Indeed, my hanging instance was related only to the SSH timeouts. With longer timeouts, the process would stay up longer. With a long enough lock-wait and short timeouts, my backup runs fine.

From my side this can be closed as not borg-related.
Unless we want to add an entry to the FAQ - should I write something up?

@ThomasWaldmann
Member

Yes, FAQ or some section about remote repos sounds good.

# /etc/ssh/sshd_config on borg repo server
# kill connection to client after ClientAliveCountMax * ClientAliveInterval seconds with no response
ClientAliveInterval 20
ClientAliveCountMax 3

I just want to be careful: is there some scenario where such settings could be counter-productive?

@ThomasWaldmann
Member

@dragetd yes, a PR against 1.0-maint branch would be welcome.

How is this related to lock wait time?

@enkore
Contributor

enkore commented Jun 21, 2016

How is this related to lock wait time?

When the lock wait time is greater than the SSH timeout, a broken connection is closed (BrokenPipeError in borg serve -> lock cleanup) before the other Borg times out waiting for the lock.
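
With the example sshd settings above, that works out roughly as follows (illustration only, repository path is made up):

# the server drops a dead connection after ClientAliveCountMax * ClientAliveInterval = 3 * 20 = 60 seconds;
# a second borg waiting for the lock with a longer lock wait, e.g. 120 seconds, would still be waiting
# when the stale lock gets cleaned up, instead of failing with LockTimeout first:
borg create --lock-wait 120 user@backuphost:repo.borg::archive /home/user/data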

@ThomasWaldmann
Member

I still don't understand.

If client C1 is doing a backup to the repo server, a borg serve process S1 will be created to deal with the connection, holding a lock L1. The connection might run for an undefined time (e.g. 1h) before it runs into a connection issue. With the above settings, the broken-down ssh connection will be terminated server-side after 1 minute and the lock L1 will be released.

A client C2 could have been trying to create another backup since shortly after C1 started. It will have a related borg serve process S2 that has been trying to get a lock L2 for almost 1h. Once S1 terminates and releases the lock L1 after the connection breakdown is handled, S2 will get the lock L2 - if it waits long enough, i.e. longer than this 1h.

So the lock wait time seems to be related to the undefined backup-runtime-until-a-problem-happens rather than to the ssh parameters. If you expect multi-client serialized backups (like all waiting in a queue), the lock wait time needs to be >> the expected backup duration of all other clients.

@ThomasWaldmann ThomasWaldmann self-assigned this Jun 24, 2016
@ThomasWaldmann
Member

Oh, I had a different (fully concurrent) scenario in mind.

It helps for this scenario:

If you have multiple borg create ... ; borg create ... commands run in a serialized way in a single script, you need to give them --lock-wait N (with N a bit more than the time the server needs to terminate broken-down connections and release the lock), as sketched below.
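
A sketch with made-up paths and values (120 s being a bit more than the 60 s the example sshd settings above need to clean up a dead session):

#!/bin/sh
# two serialized backups in one script; if the first connection dies, the second
# borg create waits until the server has dropped the dead session and released the lock
REPO=user@backuphost:repo.borg
borg create --lock-wait 120 "$REPO::home-tuesday" /home
borg create --lock-wait 120 "$REPO::etc-tuesday" /etc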
