fix: Warn when signal is handled #5186

Merged: 8 commits merged into canonical:main on May 5, 2024

Conversation

holmanb (Member) commented Apr 18, 2024

fix: Warn when a signal is handled

Cloud-init is expected to run to completion; it is not expected to be stopped
or restarted. Therefore, cloud-init should warn loudly when this happens so
that the user is aware that something unexpected has occurred.

Problems:

- When systemd stops cloud-init, no warning is logged.
- When systemd restarts cloud-init, status info and return code from the
  current stage is lost.

The net effect of these problems is that when cloud-init is restarted, no
warning is logged and cloud-init status indicates no problem.

Communicate the unexpected state by:

- Increasing the signal-handling log level to ERROR and also logging to stderr.
- Writing status.json before exiting when sys.exit() is called.
- Warning when status.json already contains data when the stage starts [1].
- Appending to existing errors and recoverable errors in status.json rather
  than overwriting them [2].

This should make cloud-init properly warn in status output and logs that
something is amiss (a rough sketch of the intended handler behavior follows
the footnotes below).

[1] This should never happen and indicates that something restarted the
    service. When cloud-init doesn't have the opportunity to handle a
    signal (SIGKILL), other mitigations will fail.
[2] Previously, existing warnings were silently dropped from
    `cloud-init status` output.
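
For illustration, a minimal sketch of the handler behavior described above (hypothetical names, not cloud-init's actual implementation):

import logging
import signal
import sys

LOG = logging.getLogger(__name__)

def handle_signal(signum, _frame):
    # Hypothetical handler: log at ERROR (not WARNING), echo to stderr, then
    # raise SystemExit so the caller can still write status.json on the way out.
    msg = "Received %s, exiting..." % signal.Signals(signum).name
    LOG.error(msg)
    print(msg, file=sys.stderr)
    sys.exit(1)

signal.signal(signal.SIGTERM, handle_signal)
signal.signal(signal.SIGINT, handle_signal)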

Fixes #5190

holmanb marked this pull request as ready for review April 18, 2024 01:51
holmanb force-pushed the holmanb/check-for-signal-killed branch 3 times, most recently from 249968d to 76d926f on April 18, 2024 19:04
CalvoM (Contributor) commented Apr 25, 2024

Is there a way of reproducing the issue, so that testing the difference becomes easy?

TheRealFalcon (Member) commented:
@CalvoM, I think you can do something like systemctl restart cloud-final.service while the cloud-final service is running. Otherwise, kill -5 the cloud-init process while it is running.

TheRealFalcon self-assigned this Apr 25, 2024
TheRealFalcon (Member) left a review:

One major question inline

@@ -764,6 +765,24 @@ def status_wrapper(name, args):

    v1 = status["v1"]
    v1["stage"] = mode
    start_time = v1[mode]["start"]
TheRealFalcon (Member) commented inline:
I think we should drop most of this section. It gives us a bunch of recoverable errors, and it winds up giving a messy status (that's also shown twice?)

root@me:~# cloud-init status --long
status: error
extended_status: error - done
boot_status_code: enabled-by-generator
last_update: Thu, 25 Apr 2024 20:00:25 +0000
detail: DataSourceLXD
errors:
	- sys.exit(1) called
recoverable_errors:
WARNING:
	- [modules-final] Unexpected value found in status.json. Was this stage restarted?
	- unexpected start_time: 1714075207.5654504
	- unexpected errors: ['sys.exit(1) called']
	- unexpected recoverable_errors: {'ERROR': ['Cloud-init 24.1.3 received SIGTERM, exiting...\n  Filename: /usr/lib/python3.12/subprocess.py\n  Function: _try_wait\n  Line number: 2011\n    Filename: /usr/lib/python3.12/subprocess.py\n    Function: _wait\n    Line number: 2053\n      Filename: /usr/lib/python3.12/subprocess.py\n      Function: wait\n      Line number: 1264', 'failed stage modules-final']}
	- [modules-final] Unexpected value found in status.json. Was this stage restarted?
	- unexpected start_time: 1714075207.5654504
	- unexpected errors: ['sys.exit(1) called']
	- unexpected recoverable_errors: {'ERROR': ['Cloud-init 24.1.3 received SIGTERM, exiting...\n  Filename: /usr/lib/python3.12/subprocess.py\n  Function: _try_wait\n  Line number: 2011\n    Filename: /usr/lib/python3.12/subprocess.py\n    Function: _wait\n    Line number: 2053\n      Filename: /usr/lib/python3.12/subprocess.py\n      Function: wait\n      Line number: 1264', 'failed stage modules-final']}

Way further down there's a section like this:

        # merge new recoverable errors into existing recoverable error list
        old_recoverable_errors = v1[mode]["recoverable_errors"]
        new_recoverable_errors = next(
            filter(
                lambda h: isinstance(h, log.LogExporter), root_logger.handlers
            )
        ).export_logs()
        for key in new_recoverable_errors.keys():
            if key in old_recoverable_errors:
                old_recoverable_errors[key].extend(new_recoverable_errors[key])
            else:
                old_recoverable_errors[key] = new_recoverable_errors[key]

Could we turn this section into its own function, and then rather than

    v1[mode]["recoverable_errors"] = next(
        filter(lambda h: isinstance(h, log.LogExporter), root_logger.handlers)
    ).export_logs()

we call the new function to add the new recoverable errors to the old ones? The first run of cloud-final will log that it got a SIGTERM. I would think that should give enough signal as to what happened, rather than introducing new warnings that are less precise.
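
Something like the following is what that refactor could look like; the helper name merge_recoverable_errors() is hypothetical, used only to illustrate the suggestion:

def merge_recoverable_errors(old: dict, new: dict) -> dict:
    # Extend the per-level lists already present in status.json instead of
    # overwriting them, so errors from an interrupted run are not lost.
    for level, messages in new.items():
        old.setdefault(level, []).extend(messages)
    return old

The call site would then pass the exported logs into this helper instead of assigning them directly to v1[mode]["recoverable_errors"].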

holmanb (Member, Author) commented Apr 25, 2024:
Oops, those duplicates definitely aren't desired.

> The first run of cloud-final will log that it got a sigterm

That covers systemd's default behavior, yes, but what if a SIGSTOP or SIGKILL happens?

If cloud-init receives a SIGSTOP or SIGKILL, our logging and cloud-init status tooling will not know that this happened. The purpose of this new section was to address that problem.

TheRealFalcon (Member):
That makes sense. So here, can we check that the start time exists but the finished time does not (indicating a run that started but never finished), and then combine the errors/recoverable errors into the output of the next run?
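
A rough, hypothetical sketch of that check (the helper name and return shape are illustrative, not from the PR):

import logging

LOG = logging.getLogger(__name__)

def carry_over_interrupted_stage(stage_status: dict, mode: str):
    # A start time with no finished time means the previous run of this stage
    # was interrupted; warn, and return its errors so they can be merged into
    # the next run's output instead of being dropped.
    if stage_status.get("start") and not stage_status.get("finished"):
        LOG.warning(
            "Unexpected start time for stage %s. Was this stage restarted?",
            mode,
        )
        return (
            list(stage_status.get("errors", [])),
            dict(stage_status.get("recoverable_errors", {})),
        )
    return [], {}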

holmanb (Member, Author) commented May 1, 2024

@TheRealFalcon I just pushed an update. Is that along the lines of what you had in mind?

Unit tests are failing, but if this is what you were thinking, I'll fix up the tests prior to merge.

TheRealFalcon (Member) left a review:

Yes, I think the general idea makes sense.

If cloud-final.service is restarted:

root@me:~# systemctl restart cloud-final.service
Job for cloud-final.service failed because the control process exited with error code.
See "systemctl status cloud-final.service" and "journalctl -xeu cloud-final.service" for details.
root@me:~# cloud-init status --long
status: error
extended_status: error - done
boot_status_code: enabled-by-generator
last_update: Thu, 02 May 2024 04:37:23 +0000
detail: DataSourceLXD
errors:
	- sys.exit(1) called
recoverable_errors:
ERROR:
	- Cloud-init 24.1.3 received SIGTERM, exiting...   Filename: /usr/lib/python3.12/subprocess.py   Function: _try_wait   Line number: 2011     Filename: /usr/lib/python3.12/subprocess.py     Function: _wait     Line number: 2053       Filename: /usr/lib/python3.12/subprocess.py       Function: wait       Line number: 1264
	- failed stage modules-final

vs being killed:

root@me:~# ps aux | grep cloud-init
root         623  0.0  0.0  50444 39840 ?        Ss   04:38   0:00 /usr/bin/python3 /usr/bin/cloud-init modules --mode=final
root         627  0.0  0.0   2800  1600 ?        S    04:38   0:00 /bin/sh -c tee -a /var/log/cloud-init-output.log
root         628  0.0  0.0   6120  1600 ?        S    04:38   0:00 tee -a /var/log/cloud-init-output.log
root         645  0.0  0.0   7080  1760 pts/1    S+   04:38   0:00 grep --color=auto cloud-init
root@me:~# kill -9 623
root@me:~# systemctl restart cloud-final.service
root@me:~# cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 02 May 2024 04:38:43 +0000
detail: DataSourceLXD
errors: []
recoverable_errors:
WARNING:
	- Unexpected start time for stage modules-final. Was this stage restarted?

status = None
nullstatus = {
    "errors": [],
    "recoverable_errors": [],
TheRealFalcon (Member) commented inline:

Suggested change:
-    "recoverable_errors": [],
+    "recoverable_errors": {},

Problems:

- When systemd stops cloud-init, no warning is logged.
- When systemd restarts cloud-init, status info and return code from the
  current stage is lost.

The net effect of this is that when cloud-init is restarted, unless one
explicitly checks the logs for the signal message, no warning is logged,
and cloud-init status indicates no problem.

Communicate unexpected state by:

- Log an error when handling signals.
- When sys.exit() is called from a signal, write status.json before exiting.
- Append errors and recoverable errors rather than overwriting them.
- Warn when status.json already contains a start time when the stage is starting [1].

This should make cloud-init properly warn in status output and logs that
something is amiss.

[1] This is a bit cautious; however, when cloud-init doesn't have the
    opportunity to handle a signal (SIGKILL), the other mitigations will fail.
holmanb force-pushed the holmanb/check-for-signal-killed branch from 147dfe2 to 5564cdf on May 3, 2024 01:58
TheRealFalcon (Member) left a review:

A few more minor issues inline

cloudinit/cmd/main.py (thread outdated; resolved)
LOG.exception("failed stage %s", mode)
print_exc("failed run of stage %s" % mode)
v1[mode]["errors"].append(f"sys.exit({str(e.code)}) called")
raise
TheRealFalcon (Member) commented inline:
Do we still want to raise if exit code is 0?

holmanb (Member, Author):

Good question - and I'm not sure myself.

Raising will cause cloud-init to exit more quickly, but there's not really much else that cloud-init does after this point in the code.

If we don't raise, I only see three more things that cloud-init will do before exiting:

  1. write result.json (if in Final stage)
  2. log completion of the stage and the time it took to complete ("cloud-init mode %s")
  3. flush reporting events

I lean slightly towards cloud-init having the same behavior for both zero and non-zero exit codes. Prior to this change, none of these three were done when sys.exit() was called. Is there any harm in trying to write out these logs if they do happen? I assume that any service that systemd tries to restart with SIGTERM will just receive a bigger hammer after a timeout period to ensure that it actually restarts, but I'm not certain of that.
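
As a hypothetical sketch of the trade-off being weighed (the wrapper and its arguments are illustrative, not cloud-init's actual code):

import logging

def run_stage_recording_exit(run_stage, mode: str, stage_errors: list):
    # Record the sys.exit() call so status.json reflects it, then re-raise
    # only for a non-zero code; a clean exit falls through so result.json,
    # the completion log, and reporting events still get flushed.
    try:
        run_stage()
    except SystemExit as e:
        logging.getLogger(__name__).exception("failed stage %s", mode)
        stage_errors.append(f"sys.exit({e.code}) called")
        if e.code:
            raise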

holmanb (Member, Author):
I'm leaning towards not re-raising from the SystemExit handler at all. What do you think @TheRealFalcon?

tests/unittests/test_cli.py (thread outdated; resolved)
holmanb (Member, Author) commented May 3, 2024

@TheRealFalcon I've addressed your remaining comments. Please let me know what you think of my comments regarding re-raising.

holmanb requested a review from TheRealFalcon May 3, 2024 19:44
TheRealFalcon (Member) left a review:

Ship it!

TheRealFalcon (Member) left a review:

Ship it again!

holmanb merged commit f24ecef into canonical:main May 5, 2024
29 checks passed
Successfully merging this pull request may close these issues:

cloud-init does not warn when restarted