
feature request: delay restart before status FATAL #487

Open
guettli opened this issue Sep 5, 2014 · 97 comments

@guettli

guettli commented Sep 5, 2014

Use case:

During a software update the server can't restart, since some needed Python modules are not available yet. Supervisor retries N times (N comes from the config; AFAIK it defaults to 3).

After N failures the FATAL state is entered.

In my use case the program entered the FATAL state, even though the restart would have succeeded just a few seconds later.

Can you understand this use case?

A possible solution would be a delay in seconds: after N failed retries, wait M seconds and then try again. In my case this should repeat indefinitely (an endless loop).
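
A hypothetical sketch of what such options could look like in a program section (note: delayafterretries and retryforever below are not real supervisor options; they only illustrate the request):

[program:myapp]
command=/usr/bin/myapp
startretries=3           ; existing option: attempts before FATAL
; hypothetical, not implemented:
; delayafterretries=30   ; after startretries failures, wait 30 s and start over
; retryforever=true      ; never enter FATAL, keep cycling forever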

@mgalgs

mgalgs commented Dec 5, 2014

👍

A delaybetweenretries parameter would be useful (bonus points for exponential backoff)

@aztlan2k

👍 This would be really useful.

Allowing us to configure how long to wait between retries would be great. Allowing for a backoff would also help.

I think it would be great if we could also specify an infinite retry count (never give up) in combination with a sane value for delay_between_retries.

@mmh

mmh commented Jan 27, 2015

+1, would be useful.

@willybarro

👍 This would be great. There's already a PR with that feature, #509, but it's not auto-mergeable and there's no discussion on it.

@mrook

mrook commented Mar 10, 2015

+1

2 similar comments
@nicwest

nicwest commented Mar 18, 2015

+1

@arinto

arinto commented Mar 30, 2015

+1

@nereusz

nereusz commented Apr 22, 2015

+1 This would be very useful for me

@CheatCodes

+1

2 similar comments
@ghost

ghost commented May 21, 2015

+1

@jsmirnov

+1

@guettli
Author

guettli commented May 23, 2015

Does someone have enough knowledge to create a patch for this?

@dbpolito

I would love to see this too, so 👍

@detailyang

+1

1 similar comment
@oryband

oryband commented Jun 8, 2015

👍

@vincent-io

Would love to have this feature. 👍

@siavashs

👍

@guettli
Author

guettli commented Jul 13, 2015

Why was this ticket closed? Was the feature request implemented? I can't see a code change on this issue or on #561.

@tonicospinelli

+1

@oryband

oryband commented Jul 14, 2015

dear sir plz implement this alrdy k thx bai

@darkone23

👍 would like this feature, thanks!

@0x20h

0x20h commented Aug 4, 2015

+1, would be useful for me too!

@caorong

caorong commented Aug 18, 2015

+1, hoping for this!

@conanfanli

+1

1 similar comment
@dieend

dieend commented Aug 31, 2015

+1

@miso-belica

+1 for delay_between_retries and also exponential backoff would be great.

@pfuender

+1

1 similar comment
@toastbrotch

+1

@cbj4074

cbj4074 commented Nov 14, 2017

Several others have mentioned a "backoff" implementation and the BACKOFF state, but only 0x20h mentioned this vital clue (though I wish he'd been more explicit):

The mechanism that implements startretries does employ a backoff strategy that increases the delay by 1 second with each attempt.

In other words, you can set startretries to a value that is large enough to cover any "expected downtime" in your workflow without causing a self-inflicted DoS scenario.

While it might be nice to have control over the backoff computation, I concur that this has been addressed in as much as a self-inflicted DoS will not result from setting startretries too generously.
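
For example, assuming the 1-second-per-attempt backoff described above, a generously sized startretries keeps a process retrying for several minutes before FATAL (a sketch; startsecs, startretries and autorestart are real options, while the timing figures rely on the behaviour reported in this thread):

[program:myapp]
command=/usr/bin/myapp
startsecs=5         ; must stay up 5 s to count as successfully started
startretries=30     ; with 1 s, 2 s, ... 30 s delays this gives roughly
                    ; 1+2+...+30 = 465 s of accumulated backoff before FATAL
autorestart=true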

@nvictor

nvictor commented Dec 4, 2017

Can anyone point to where in the docs the behavior 0x20h describes (the increasing delay for each retry) is documented? Thanks.

@cbj4074

cbj4074 commented Dec 4, 2017

@nvictor If it were documented, I doubt any of us would be here. ;) I discovered that the delay is increased by one second with each retry by examining supervisord's log entries.

@rotorsolutions

@nvictor @vlsd has already pointed to this documentation:

http://supervisord.org/subprocess.html#process-states

Each start retry will take progressively more time.

I was also looking for a solution for my problem. I connect to an IMAP server and get a connection timeout. Starting the script after a little delay works great, but supervisor is too quick to restart (for this job).

Therefore, a delay as a configuration option would be great, to avoid hacks like the 'sleep' option mentioned above (which I will try now, as there is no configurable alternative).

@vlsd

vlsd commented Jan 3, 2018

So it seems that only linear backoff is implemented (exponential backoff is the other common kind), and only at one hard-coded rate (1 second per retry, per @cbj4074, but not actually documented). I no longer have a horse in the game. It seems to me that letting users both switch between linear and exponential backoff and set the rate at which the backoff happens would be the ideal solution here. Failing that, documenting that the backoff is linear and fixed at 1 second per retry is also a solution. Simply mentioning that backoff happens, with no other details, is confusing.
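
For reference, a small Python sketch of the two strategies being discussed, with a configurable rate (purely illustrative; supervisor itself only implements the fixed linear variant described above):

def linear_delay(attempt, rate=1.0):
    # attempt 1 -> 1 s, attempt 2 -> 2 s, ... (reportedly what supervisor does, with rate=1)
    return rate * attempt

def exponential_delay(attempt, base=1.0, factor=2.0, cap=300.0):
    # attempt 1 -> 1 s, attempt 2 -> 2 s, attempt 3 -> 4 s, ... capped at 5 minutes
    return min(base * factor ** (attempt - 1), cap)

if __name__ == "__main__":
    for n in range(1, 6):
        print(n, linear_delay(n), exponential_delay(n))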

@BibhuGlobussoft

I am having the FATAL state issue, due to which my workers stop processing any jobs even though supervisord itself is still running.
Does anybody have a solution for this?

@virusdefender

virusdefender commented Jun 6, 2018

Thanks @jderusse

import datetime
import hashlib
import os
import sys
import time

max_backoff = 5          # maximum sleep (in seconds) before exiting after a failure
timeout = 60 * 60 * 6    # reset the backoff if the last failure was more than 6 hours ago


def log(msg):
    print("[proc_wrapper %s] %s" % (str(datetime.datetime.now()), msg))


command = " ".join(sys.argv[1:])
# The backoff state is persisted in a small file (in the current working
# directory) named after the hash of the wrapped command.
backoff_file = hashlib.md5(command.encode("utf-8")).hexdigest()
log("Running '%s', backoff file name: %s" % (command, backoff_file))

status = os.system(command)
if status != 0:
    # Read the previous backoff value; -1 means "no recent failure".
    seconds = -1
    if os.path.exists(backoff_file):
        with open(backoff_file) as f:
            content = f.read().split(":")
            seconds, last_timestamp = int(content[0]), int(content[1])
        if time.time() - last_timestamp > timeout:
            seconds = -1
    # Increase the delay by one second per failure, capped at max_backoff.
    seconds = min(seconds + 1, max_backoff)
    with open(backoff_file, "w") as f:
        f.write("%d:%d" % (seconds, int(time.time())))
    log("Command '%s' exited with status %d, sleeping %ds" % (command, status, seconds))
    time.sleep(seconds)
    sys.exit(1)
else:
    # On success, drop the backoff state so the next failure starts from zero.
    try:
        os.remove(backoff_file)
    except OSError:
        pass

I wrote a new script to implement this behaviour.

Then add

killasgroup = true
stopasgroup = true

to the [program:x] section.
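
For completeness, a sketch of how such a wrapper could be wired into supervisor, assuming the script above is saved as proc_wrapper.py (paths and the program name are placeholders):

[program:x]
command=python /opt/scripts/proc_wrapper.py /usr/bin/myapp --some-flag
directory=/var/lib/proc_wrapper    ; the backoff file is written to the working directory
autorestart=true
killasgroup=true
stopasgroup=true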

@yongzhang

+1

batmat added a commit to batmat/evergreen that referenced this issue Jul 18, 2018
As explained in Supervisor/supervisor#487 (comment), `supervisord` behaves as follows:

> The mechanism that implements startretries does employ a backoff
> strategy that increases the delay by 1 second with each attempt.
> In other words, you can set startretries to a value that is large
>  enough to cover any "expected downtime" in your workflow without
>  causing a self-inflicted DoS scenario.

I have tested this by putting a `sleep 20` in the wait-for-postgres.sh wrapper
on the backend to delay its startup. With this, the client stays available
and finally registers when the backend comes up.
(default value for `startretries` is 3).
@estshy

estshy commented Sep 13, 2018

+1

1 similar comment
@bramstroker

+1

@zhaodanwjk

Has this problem been solved? How?

@Chupaka

Chupaka commented Feb 23, 2019

@zhaodanwjk The current behaviour is the following: on the first failure, restart after 1 second; on the second failure, restart after 2 seconds; then 3 seconds, 4 seconds, etc., until the FATAL state is reached.
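
Assuming that progression, a quick way to estimate how much grace time a given startretries buys before FATAL (illustrative only):

def total_backoff(startretries):
    # cumulative delay across the retry sequence: 1 + 2 + ... + startretries seconds
    return startretries * (startretries + 1) // 2

print(total_backoff(3))    # 6   -> the default gives only about 6 s of grace
print(total_backoff(20))   # 210 -> about 3.5 minutes before FATAL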

@zhaodanwjk

@zhaodanwjk The current behaviour is the following: on the first failure, restart after 1 second; on the second failure, restart after 2 seconds; then 3 seconds, 4 seconds, etc., until the FATAL state is reached.

Thank you very much. We can only wait until the new feature is developed, or adopt another method to solve this.

@Chupaka

Chupaka commented Feb 25, 2019

@zhaodanwjk To solve what? What do you expect? Isn't command=bash -c "runme.sh; sleep 3" (probably with startsecs=0) enough for you?
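
Spelled out as a full program section, that suggestion would look roughly like this (a sketch; runme.sh and the 3-second sleep are placeholders taken from the comment above):

[program:runme]
command=bash -c "runme.sh; sleep 3"
startsecs=0         ; treat even a quick exit as a completed start
autorestart=true    ; always restart; the trailing sleep acts as the retry delay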

@zhaodanwjk

@Chupaka Thank you very much, but I tried this method and could not solve my problem. I need to wait 100 seconds before restarting automatically when the service quits abnormally.

@jderusse

@zhaodanwjk look at my previous comment #487 (comment)

@adrian-vlad

A better workaround is to use a script that restarts fatal processes by configuring an eventlistener that receives PROCESS_STATE_FATAL events.
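
A minimal sketch of such a listener, assuming supervisor's childutils helper and the standard event listener protocol (file names and the listener name are placeholders; a real version would want error handling around startProcess):

#!/usr/bin/env python
# fatal_restarter.py: restart any process that enters the FATAL state.
import os
import sys

from supervisor import childutils

def main():
    while True:
        # Announce READY and block until supervisord sends an event.
        headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
        if headers['eventname'] == 'PROCESS_STATE_FATAL':
            pheaders = childutils.get_headers(payload)
            rpc = childutils.getRPCInterface(os.environ)
            name = '%s:%s' % (pheaders['groupname'], pheaders['processname'])
            rpc.supervisor.startProcess(name)
        # Acknowledge the event so supervisord sends the next one.
        childutils.listener.ok(sys.stdout)

if __name__ == '__main__':
    main()

Registered with something like:

[eventlistener:fatal_restarter]
command=python /opt/scripts/fatal_restarter.py
events=PROCESS_STATE_FATAL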

@icanhazstring

icanhazstring commented Nov 4, 2021

I will link the PR for symfony/messenger here, as they have addressed this using a wrapper script.
The difference with this wrapper script is that it passes the SIGTERM signal along to the child process, which is better than adding a `&& sleep X` that gets executed every time.

https://github.com/symfony/symfony-docs/pull/13597/files#diff-cefce3d23e1437ef5885ad7d486ec8d57efd7a3f6ad55bd5997bb202b537ff59R650
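
The general shape of such a wrapper, in a simplified sketch (not the exact symfony script): it forwards SIGTERM to the child and only sleeps when the child actually failed, so a normal stop is not delayed.

#!/usr/bin/env bash
# wrapper.sh <delay-seconds> <command...>
delay="$1"; shift

"$@" &                  # start the real program in the background
child=$!

got_term=0
trap 'got_term=1; kill -TERM "$child" 2>/dev/null' TERM INT

wait "$child"
status=$?
if [ "$got_term" -eq 1 ]; then
    # The first wait was interrupted by the trap; collect the real exit status.
    wait "$child"
    status=$?
fi

if [ "$got_term" -eq 0 ] && [ "$status" -ne 0 ]; then
    sleep "$delay"      # delay supervisord's restart only after an unexpected failure
fi
exit "$status"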
