fix: Warn when signal is handled #5186
Conversation
Force-pushed from 249968d to 76d926f
Is there a way of reproducing the issue, so that testing the difference becomes easy?
@CalvoM , I think you can do something like
One major question inline
cloudinit/cmd/main.py (Outdated)
@@ -764,6 +765,24 @@ def status_wrapper(name, args):
v1 = status["v1"]
v1["stage"] = mode
start_time = v1[mode]["start"]
I think we should drop most of this section. It gives us a bunch of recoverable errors, and it winds up giving a messy status (that's also shown twice?)
root@me:~# cloud-init status --long
status: error
extended_status: error - done
boot_status_code: enabled-by-generator
last_update: Thu, 25 Apr 2024 20:00:25 +0000
detail: DataSourceLXD
errors:
- sys.exit(1) called
recoverable_errors:
WARNING:
- [modules-final] Unexpected value found in status.json. Was this stage restarted?
- unexpected start_time: 1714075207.5654504
- unexpected errors: ['sys.exit(1) called']
- unexpected recoverable_errors: {'ERROR': ['Cloud-init 24.1.3 received SIGTERM, exiting...\n Filename: /usr/lib/python3.12/subprocess.py\n Function: _try_wait\n Line number: 2011\n Filename: /usr/lib/python3.12/subprocess.py\n Function: _wait\n Line number: 2053\n Filename: /usr/lib/python3.12/subprocess.py\n Function: wait\n Line number: 1264', 'failed stage modules-final']}
- [modules-final] Unexpected value found in status.json. Was this stage restarted?
- unexpected start_time: 1714075207.5654504
- unexpected errors: ['sys.exit(1) called']
- unexpected recoverable_errors: {'ERROR': ['Cloud-init 24.1.3 received SIGTERM, exiting...\n Filename: /usr/lib/python3.12/subprocess.py\n Function: _try_wait\n Line number: 2011\n Filename: /usr/lib/python3.12/subprocess.py\n Function: _wait\n Line number: 2053\n Filename: /usr/lib/python3.12/subprocess.py\n Function: wait\n Line number: 1264', 'failed stage modules-final']}
Way further down there's a section like this:
# merge new recoverable errors into existing recoverable error list
old_recoverable_errors = v1[mode]["recoverable_errors"]
new_recoverable_errors = next(
filter(
lambda h: isinstance(h, log.LogExporter), root_logger.handlers
)
).export_logs()
for key in new_recoverable_errors.keys():
if key in old_recoverable_errors:
old_recoverable_errors[key].extend(new_recoverable_errors[key])
else:
old_recoverable_errors[key] = new_recoverable_errors[key]
Could we turn this section into its own function, and then rather than
v1[mode]["recoverable_errors"] = next(
filter(lambda h: isinstance(h, log.LogExporter), root_logger.handlers)
).export_logs()
we call the new function to add the new recoverable errors to the old ones? The first run of cloud-final will log that it got a sigterm. I would think that should give enough signal as to what happened rather than introducing new warnings that are less precise.
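For reference, a minimal sketch of what such a helper might look like (the name _append_recoverable_errors is hypothetical; the log.LogExporter lookup and export_logs() call mirror the snippet above):
def _append_recoverable_errors(v1, mode, root_logger):
    # Hypothetical helper: merge newly exported recoverable errors into the
    # existing per-stage dict instead of overwriting it.
    exporter = next(
        h for h in root_logger.handlers if isinstance(h, log.LogExporter)
    )
    new_errors = exporter.export_logs()
    existing = v1[mode].setdefault("recoverable_errors", {})
    for level, messages in new_errors.items():
        existing.setdefault(level, []).extend(messages)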
Oops, those duplicates definitely aren't desired.
The first run of cloud-final will log that it got a sigterm
That covers systemd's default behavior, yes, but what if a SIGSTOP or SIGKILL happens?
If cloud-init receives a SIGSTOP or SIGKILL, our logging and cloud-init status tooling will not know that this happened. The purpose of this new section was to answer that problem.
That makes sense. So here, can we check that start time exists but finished time does not (indicating a run that started but never finished), and then combine the errors/recoverable errors into the output of the next run?
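Roughly (a sketch only; the helper name is illustrative, the warning text matches the status output quoted later, and v1/mode/LOG come from the surrounding status_wrapper context):
def previous_run_interrupted(stage_status: dict) -> bool:
    # A prior run started but never recorded a finished time, e.g. after SIGKILL.
    return bool(stage_status.get("start")) and not stage_status.get("finished")

if previous_run_interrupted(v1[mode]):
    LOG.warning(
        "Unexpected start time for stage %s. Was this stage restarted?", mode
    )
    # Keep the old errors/recoverable_errors so the next run appends to them
    # instead of overwriting them.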
@TheRealFalcon I just pushed an update. Is that along the lines of what you had in mind? Unittests are failing, but if this is what you were thinking I'll fix up the tests prior to merge.
Yes, I think the general idea makes sense.
If cloud-final.service is restarted:
root@me:~# systemctl restart cloud-final.service
Job for cloud-final.service failed because the control process exited with error code.
See "systemctl status cloud-final.service" and "journalctl -xeu cloud-final.service" for details.
root@me:~# cloud-init status --long
status: error
extended_status: error - done
boot_status_code: enabled-by-generator
last_update: Thu, 02 May 2024 04:37:23 +0000
detail: DataSourceLXD
errors:
- sys.exit(1) called
recoverable_errors:
ERROR:
- Cloud-init 24.1.3 received SIGTERM, exiting... Filename: /usr/lib/python3.12/subprocess.py Function: _try_wait Line number: 2011 Filename: /usr/lib/python3.12/subprocess.py Function: _wait Line number: 2053 Filename: /usr/lib/python3.12/subprocess.py Function: wait Line number: 1264
- failed stage modules-final
vs being killed:
root@me:~# ps aux | grep cloud-init
root 623 0.0 0.0 50444 39840 ? Ss 04:38 0:00 /usr/bin/python3 /usr/bin/cloud-init modules --mode=final
root 627 0.0 0.0 2800 1600 ? S 04:38 0:00 /bin/sh -c tee -a /var/log/cloud-init-output.log
root 628 0.0 0.0 6120 1600 ? S 04:38 0:00 tee -a /var/log/cloud-init-output.log
root 645 0.0 0.0 7080 1760 pts/1 S+ 04:38 0:00 grep --color=auto cloud-init
root@me:~# kill -9 623
root@me:~# systemctl restart cloud-final.service
root@me:~# cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 02 May 2024 04:38:43 +0000
detail: DataSourceLXD
errors: []
recoverable_errors:
WARNING:
- Unexpected start time for stage modules-final. Was this stage restarted?
cloudinit/cmd/main.py (Outdated)
status = None
nullstatus = {
    "errors": [],
    "recoverable_errors": [],
"recoverable_errors": [], | |
"recoverable_errors": {}, |
Problems:
- When systemd stops cloud-init, no warning is logged.
- When systemd restarts cloud-init, status info and return code from the current stage is lost.

The net effect of this is that when cloud-init is restarted, unless one explicitly checks the logs for the signal message, no warning is logged and cloud-init status indicates no problem.

Communicate unexpected state by:
- Log an error when handling signals.
- When sys.exit() is called from a signal, write status.json before exiting.
- Append errors and recoverable errors rather than overwriting them.
- Warn when status.json contains a time when the stage is starting[1]

This should make cloud-init properly warn in status output and logs that something is amiss.

[1] This is a bit cautious, however when cloud-init doesn't have the opportunity to handle a signal (SIGKILL), the other mitigations will fail.
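For illustration, a simplified sketch of the signal-handling behavior the commit message describes (not the actual cloud-init implementation): log an error when the signal arrives, then exit through SystemExit so the status wrapper can still record the failure and write status.json.
import logging
import signal
import sys

LOG = logging.getLogger(__name__)

def _exit_on_signal(signum, frame):
    # Logging at ERROR level means the message gets exported into
    # recoverable_errors; sys.exit() raises SystemExit, which the status
    # wrapper can catch to append an error and write status.json on the way out.
    LOG.error("Received signal %s, exiting...", signal.Signals(signum).name)
    sys.exit(1)

for sig in (signal.SIGTERM, signal.SIGINT):
    signal.signal(sig, _exit_on_signal)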
Force-pushed from 147dfe2 to 5564cdf
A few more minor issues inline
cloudinit/cmd/main.py (Outdated)
LOG.exception("failed stage %s", mode)
print_exc("failed run of stage %s" % mode)
v1[mode]["errors"].append(f"sys.exit({str(e.code)}) called")
raise
Do we still want to raise if exit code is 0?
Good question - and I'm not sure myself.
Raising will cause cloud-init to exit more quickly, but there's not really much else that cloud-init does after this point in the code.
If we don't raise, I only see three more things that cloud-init will do before exiting:
- write result.json (if in Final stage)
- log completion of the stage and the time it took to complete ("cloud-init mode %s")
- flush reporting events
I lean slightly towards cloud-init doing the same behavior for both zero and non-zero exit codes. Prior to this change, none of these three were done when sys.exit() was called. Is there any harm in trying to write out these logs if they do happen? I assume that any service that systemd tries to restart with SIGTERM will just receive a bigger hammer after a timeout period to ensure that it actually restarts, but I'm not certain of that.
I'm leaning towards not re-raising from the SystemExit handler at all. What do you think @TheRealFalcon?
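For context, a sketch of the two options being weighed (illustrative only, not the actual status_wrapper code; the function name and reraise flag are hypothetical):
def run_stage(functor, mode, v1, reraise=True):
    # Record the sys.exit() as a stage error either way; the open question is
    # whether to re-raise afterwards.
    try:
        return functor()
    except SystemExit as e:
        v1[mode]["errors"].append(f"sys.exit({e.code}) called")
        if reraise:
            # Exit immediately; skips writing result.json, the stage
            # completion log, and flushing reporting events.
            raise
        # Otherwise fall through so those final steps still run before exit.
        return None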
@TheRealFalcon I've addressed your remaining comments. Please let me know what you think of my comments regarding re-raising.
Ship it!
Ship it again!
Fixes #5190