
feature request: delay restart before status FATAL #487

Open
guettli opened this issue Sep 5, 2014 · 97 comments

@guettli

guettli commented Sep 5, 2014

Use case:

During a software update the server can't restart, since some needed Python modules are not available yet. Supervisor retries N times (N comes from the config; AFAIK it defaults to 3).

After N failures the FATAL state is entered.

In my use case the program entered the FATAL state, even though the restart would have succeeded just a few seconds later.

Can you understand this use case?

A possible solution would be a delay in seconds: after N failed retries, wait M seconds and then try again. In my case this should repeat indefinitely (an endless loop).
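
A hypothetical sketch of what such options could look like in a program section (note: delayafterretries and retryforever below are not real supervisor options; they only illustrate the request):

[program:myapp]
command=/usr/bin/myapp
startretries=3           ; existing option: attempts before FATAL
; hypothetical, not implemented:
; delayafterretries=30   ; after startretries failures, wait 30 s and start over
; retryforever=true      ; never enter FATAL, keep cycling forever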

@mgalgs

mgalgs commented Dec 5, 2014

👍

A delaybetweenretries parameter would be useful (bonus points for exponential backoff)

@aztlan2k

👍 This would be really useful.

Allowing us to configure how long to wait between retries would be great. Allowing for a backoff would also help.

I think it would be great if we could also specify an infinite retry count (never give up) in combination with a sane value for delay_between_retries.

@mmh

mmh commented Jan 27, 2015

+1, would be useful.

@willybarro

👍 This would be great. There's already a PR with that feature, #509, but it's not auto-mergeable and there's no discussion on it.

@mrook

mrook commented Mar 10, 2015

+1

2 similar comments
@nicwest

nicwest commented Mar 18, 2015

+1

@arinto

arinto commented Mar 30, 2015

+1

@nereusz

nereusz commented Apr 22, 2015

+1 This would be very useful for me

@CheatCodes

+1

2 similar comments
@ghost

ghost commented May 21, 2015

+1

@jsmirnov

+1

@guettli
Author

guettli commented May 23, 2015

Does someone have enough knowledge to create a patch for this?

@dbpolito

I would love to see this too, so 👍

@detailyang

+1

1 similar comment
@oryband

oryband commented Jun 8, 2015

👍

@vincent-io

Would love to have this feature. 👍

@siavashs

👍

@guettli
Author

guettli commented Jul 13, 2015

Why was this ticket closed? Was the feature request implemented? I can't see a code change on this issue or on #561.

@tonicospinelli

+1

@oryband

oryband commented Jul 14, 2015

dear sir plz implement this alrdy k thx bai

@darkone23

👍 would like this feature, thanks!

@0x20h

0x20h commented Aug 4, 2015

+1, would be useful for me too!

@caorong

caorong commented Aug 18, 2015

+1, hoping for this!

@conanfanli

+1

1 similar comment
@dieend

dieend commented Aug 31, 2015

+1

@miso-belica

+1 for delay_between_retries and also exponential backoff would be great.

@pfuender

+1

1 similar comment
@toastbrotch

+1

@cbj4074

cbj4074 commented Nov 14, 2017

Several others have mentioned a "backoff" implementation and the BACKOFF state, but only 0x20h mentioned this vital clue (though I wish he'd been more explicit):

The mechanism that implements startretries does employ a backoff strategy that increases the delay by 1 second with each attempt.

In other words, you can set startretries to a value that is large enough to cover any "expected downtime" in your workflow without causing a self-inflicted DoS scenario.

While it might be nice to have control over the backoff computation, I concur that this has been addressed in as much as a self-inflicted DoS will not result from setting startretries too generously.
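
For example, assuming the 1-second-per-attempt backoff described above, a generously sized startretries keeps a process retrying for several minutes before FATAL (a sketch; startsecs, startretries and autorestart are real options, while the timing figures rely on the behaviour reported in this thread):

[program:myapp]
command=/usr/bin/myapp
startsecs=5         ; must stay up 5 s to count as successfully started
startretries=30     ; with 1 s, 2 s, ... 30 s delays this gives roughly
                    ; 1+2+...+30 = 465 s of accumulated backoff before FATAL
autorestart=true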

@nvictor

nvictor commented Dec 4, 2017

Can anyone point to where in the docs the behavior 0x20h describes (the increasing delay for each retry) is documented? Thanks.

@cbj4074

cbj4074 commented Dec 4, 2017

@nvictor If it were documented, I doubt any of us would be here. ;) I discovered that the delay is increased by one second with each retry by examining supervisord's log entries.

@rotorsolutions

@nvictor @vlsd has already pointed to this documentation:

http://supervisord.org/subprocess.html#process-states

Each start retry will take progressively more time.

I was also looking for a solution for my problem. I connect to an IMAP server and get a connection timeout. Starting the script after a little delay works great, but supervisor is too quick to restart (for this job).

Therefore, a delay as a configuration option would be great, to avoid hacks like the 'sleep' option mentioned above (which I will try now, as there is no configurable alternative).

@vlsd

vlsd commented Jan 3, 2018

So it seems that only linear backoff is implemented (exponential backoff is the other common kind), and only at one hard-coded rate (1 second per retry, per @cbj4074, but not actually documented). I no longer have a horse in the game. It seems to me that letting users both switch between linear and exponential backoff and set the rate at which the backoff happens would be the ideal solution here. Failing that, documenting that the backoff is linear and fixed at 1 second per retry is also a solution. Simply mentioning that backoff happens, with no other details, is confusing.
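
For reference, a small Python sketch of the two strategies being discussed, with a configurable rate (purely illustrative; supervisor itself only implements the fixed linear variant described above):

def linear_delay(attempt, rate=1.0):
    # attempt 1 -> 1 s, attempt 2 -> 2 s, ... (reportedly what supervisor does, with rate=1)
    return rate * attempt

def exponential_delay(attempt, base=1.0, factor=2.0, cap=300.0):
    # attempt 1 -> 1 s, attempt 2 -> 2 s, attempt 3 -> 4 s, ... capped at 5 minutes
    return min(base * factor ** (attempt - 1), cap)

if __name__ == "__main__":
    for n in range(1, 6):
        print(n, linear_delay(n), exponential_delay(n))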

@BibhuGlobussoft

I am having the FATAL state issue, due to which my workers stop processing any jobs even though supervisord itself is still running.
Does anybody have a solution for this?

@virusdefender

virusdefender commented Jun 6, 2018

Thanks @jderusse

import datetime
import hashlib
import os
import sys
import time

max_backoff = 5          # maximum sleep (in seconds) before exiting after a failure
timeout = 60 * 60 * 6    # reset the backoff if the last failure was more than 6 hours ago


def log(msg):
    print("[proc_wrapper %s] %s" % (str(datetime.datetime.now()), msg))


command = " ".join(sys.argv[1:])
# The backoff state is persisted in a small file (in the current working
# directory) named after the hash of the wrapped command.
backoff_file = hashlib.md5(command.encode("utf-8")).hexdigest()
log("Running '%s', backoff file name: %s" % (command, backoff_file))

status = os.system(command)
if status != 0:
    # Read the previous backoff value; -1 means "no recent failure".
    seconds = -1
    if os.path.exists(backoff_file):
        with open(backoff_file) as f:
            content = f.read().split(":")
            seconds, last_timestamp = int(content[0]), int(content[1])
        if time.time() - last_timestamp > timeout:
            seconds = -1
    # Increase the delay by one second per failure, capped at max_backoff.
    seconds = min(seconds + 1, max_backoff)
    with open(backoff_file, "w") as f:
        f.write("%d:%d" % (seconds, int(time.time())))
    log("Command '%s' exited with status %d, sleeping %ds" % (command, status, seconds))
    time.sleep(seconds)
    sys.exit(1)
else:
    # On success, drop the backoff state so the next failure starts from zero.
    try:
        os.remove(backoff_file)
    except OSError:
        pass

I wrote a new script to implement this behaviour.

Then add

killasgroup = true
stopasgroup = true

to the [program:x] section.
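
For completeness, a sketch of how such a wrapper could be wired into supervisor, assuming the script above is saved as proc_wrapper.py (paths and the program name are placeholders):

[program:x]
command=python /opt/scripts/proc_wrapper.py /usr/bin/myapp --some-flag
directory=/var/lib/proc_wrapper    ; the backoff file is written to the working directory
autorestart=true
killasgroup=true
stopasgroup=true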

@yongzhang

+1

batmat added a commit to batmat/evergreen that referenced this issue Jul 18, 2018
As explained in Supervisor/supervisor#487 (comment), `supervisord` behaves as follows:

> The mechanism that implements startretries does employ a backoff
> strategy that increases the delay by 1 second with each attempt.
> In other words, you can set startretries to a value that is large
>  enough to cover any "expected downtime" in your workflow without
>  causing a self-inflicted DoS scenario.

I have tested this by putting a `sleep 20` in the wait-for-postgres.sh wrapper
on the backend to delay its startup. With this, the client stays available
and finally registers when the backend comes up.
(default value for `startretries` is 3).
@estshy

estshy commented Sep 13, 2018

+1

1 similar comment
@bramstroker

+1

@zhaodanwjk

Has this problem been solved? How?

@Chupaka

Chupaka commented Feb 23, 2019

@zhaodanwjk The current behaviour is the following: on the first failure, restart after 1 second; on the second failure, restart after 2 seconds; then 3 seconds, 4 seconds, etc., until the FATAL state is reached.
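
Assuming that progression, a quick way to estimate how much grace time a given startretries buys before FATAL (illustrative only):

def total_backoff(startretries):
    # cumulative delay across the retry sequence: 1 + 2 + ... + startretries seconds
    return startretries * (startretries + 1) // 2

print(total_backoff(3))    # 6   -> the default gives only about 6 s of grace
print(total_backoff(20))   # 210 -> about 3.5 minutes before FATAL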

@zhaodanwjk

@zhaodanwjk The current behaviour is the following: on the first failure, restart after 1 second; on the second failure, restart after 2 seconds; then 3 seconds, 4 seconds, etc., until the FATAL state is reached.

Thank you very much. We can only wait until the new feature is developed, or adopt another method to solve this.

@Chupaka

Chupaka commented Feb 25, 2019

@zhaodanwjk To solve what? What do you expect? Isn't command=bash -c "runme.sh; sleep 3" (probably with startsecs=0) enough for you?
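
Spelled out as a full program section, that suggestion would look roughly like this (a sketch; runme.sh and the 3-second sleep are placeholders taken from the comment above):

[program:runme]
command=bash -c "runme.sh; sleep 3"
startsecs=0         ; treat even a quick exit as a completed start
autorestart=true    ; always restart; the trailing sleep acts as the retry delay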

@zhaodanwjk

@Chupaka Thank you very much, but I tried this method and could not solve my problem. I need to wait 100 seconds before restarting automatically when the service quits abnormally.

@jderusse

@zhaodanwjk look at my previous comment #487 (comment)

@adrian-vlad

A better workaround is to use a script that restarts fatal processes by configuring an eventlistener that receives PROCESS_STATE_FATAL events.
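
A minimal sketch of such a listener, assuming supervisor's childutils helper and the standard event listener protocol (file names and the listener name are placeholders; a real version would want error handling around startProcess):

#!/usr/bin/env python
# fatal_restarter.py: restart any process that enters the FATAL state.
import os
import sys

from supervisor import childutils

def main():
    while True:
        # Announce READY and block until supervisord sends an event.
        headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
        if headers['eventname'] == 'PROCESS_STATE_FATAL':
            pheaders = childutils.get_headers(payload)
            rpc = childutils.getRPCInterface(os.environ)
            name = '%s:%s' % (pheaders['groupname'], pheaders['processname'])
            rpc.supervisor.startProcess(name)
        # Acknowledge the event so supervisord sends the next one.
        childutils.listener.ok(sys.stdout)

if __name__ == '__main__':
    main()

Registered with something like:

[eventlistener:fatal_restarter]
command=python /opt/scripts/fatal_restarter.py
events=PROCESS_STATE_FATAL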

@icanhazstring

icanhazstring commented Nov 4, 2021

I will link the PR for symfony/messenger here, as they have addressed this using a wrapper script.
The difference with this wrapper script is that it passes the SIGTERM signal along to the child process, which is better than adding a `&& sleep X` that gets executed every time.

https://github.com/symfony/symfony-docs/pull/13597/files#diff-cefce3d23e1437ef5885ad7d486ec8d57efd7a3f6ad55bd5997bb202b537ff59R650
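
The general shape of such a wrapper, in a simplified sketch (not the exact symfony script): it forwards SIGTERM to the child and only sleeps when the child actually failed, so a normal stop is not delayed.

#!/usr/bin/env bash
# wrapper.sh <delay-seconds> <command...>
delay="$1"; shift

"$@" &                  # start the real program in the background
child=$!

got_term=0
trap 'got_term=1; kill -TERM "$child" 2>/dev/null' TERM INT

wait "$child"
status=$?
if [ "$got_term" -eq 1 ]; then
    # The first wait was interrupted by the trap; collect the real exit status.
    wait "$child"
    status=$?
fi

if [ "$got_term" -eq 0 ] && [ "$status" -ne 0 ]; then
    sleep "$delay"      # delay supervisord's restart only after an unexpected failure
fi
exit "$status"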
