fix: Warn when signal is handled #5186

Merged: 8 commits merged into canonical:main on May 5, 2024

Conversation

holmanb (Member) commented Apr 18, 2024

fix: Warn when a signal is handled

Cloud-init is expected to run to completion; it is not expected to be stopped
or restarted. Therefore, cloud-init should warn loudly when this happens so
that the user is aware that something unexpected has occurred.

Problems:

- When systemd stops cloud-init, no warning is logged.
- When systemd restarts cloud-init, status info and return code from the
  current stage is lost.

The net effect of these problems is that when cloud-init is restarted, no
warning is logged and cloud-init status indicates no problem.

Communicate the unexpected state by:

- Increasing the signal-handling log level to ERROR and also logging to stderr.
- Writing status.json before exiting when sys.exit() is called.
- Warning when status.json already contains data when the stage starts [1].
- Appending to existing errors and recoverable errors in status.json rather
  than overwriting them [2].

This should make cloud-init properly warn in status output and logs that
something is amiss (a rough sketch of the intended handler behavior follows
the footnotes below).

[1] This should never happen and indicates that something restarted the
    service. When cloud-init doesn't have the opportunity to handle a
    signal (SIGKILL), other mitigations will fail.
[2] Previously, existing warnings were silently dropped from
    `cloud-init status` output.
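
For illustration, a minimal sketch of the handler behavior described above (hypothetical names, not cloud-init's actual implementation):

import logging
import signal
import sys

LOG = logging.getLogger(__name__)

def handle_signal(signum, _frame):
    # Hypothetical handler: log at ERROR (not WARNING), echo to stderr, then
    # raise SystemExit so the caller can still write status.json on the way out.
    msg = "Received %s, exiting..." % signal.Signals(signum).name
    LOG.error(msg)
    print(msg, file=sys.stderr)
    sys.exit(1)

signal.signal(signal.SIGTERM, handle_signal)
signal.signal(signal.SIGINT, handle_signal)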

Fixes #5190

holmanb marked this pull request as ready for review April 18, 2024 01:51
holmanb force-pushed the holmanb/check-for-signal-killed branch 3 times, most recently from 249968d to 76d926f on April 18, 2024 19:04
CalvoM (Contributor) commented Apr 25, 2024

Is there a way of reproducing the issue, so that testing the difference becomes easy?

TheRealFalcon (Member) commented:
@CalvoM, I think you can do something like systemctl restart cloud-final.service while the cloud-final service is running. Otherwise, kill -5 the cloud-init process while it is running.

TheRealFalcon self-assigned this Apr 25, 2024
TheRealFalcon (Member) left a review:

One major question inline

@@ -764,6 +765,24 @@ def status_wrapper(name, args):

    v1 = status["v1"]
    v1["stage"] = mode
    start_time = v1[mode]["start"]
TheRealFalcon (Member) commented inline:
I think we should drop most of this section. It gives us a bunch of recoverable errors, and it winds up giving a messy status (that's also shown twice?)

root@me:~# cloud-init status --long
status: error
extended_status: error - done
boot_status_code: enabled-by-generator
last_update: Thu, 25 Apr 2024 20:00:25 +0000
detail: DataSourceLXD
errors:
	- sys.exit(1) called
recoverable_errors:
WARNING:
	- [modules-final] Unexpected value found in status.json. Was this stage restarted?
	- unexpected start_time: 1714075207.5654504
	- unexpected errors: ['sys.exit(1) called']
	- unexpected recoverable_errors: {'ERROR': ['Cloud-init 24.1.3 received SIGTERM, exiting...\n  Filename: /usr/lib/python3.12/subprocess.py\n  Function: _try_wait\n  Line number: 2011\n    Filename: /usr/lib/python3.12/subprocess.py\n    Function: _wait\n    Line number: 2053\n      Filename: /usr/lib/python3.12/subprocess.py\n      Function: wait\n      Line number: 1264', 'failed stage modules-final']}
	- [modules-final] Unexpected value found in status.json. Was this stage restarted?
	- unexpected start_time: 1714075207.5654504
	- unexpected errors: ['sys.exit(1) called']
	- unexpected recoverable_errors: {'ERROR': ['Cloud-init 24.1.3 received SIGTERM, exiting...\n  Filename: /usr/lib/python3.12/subprocess.py\n  Function: _try_wait\n  Line number: 2011\n    Filename: /usr/lib/python3.12/subprocess.py\n    Function: _wait\n    Line number: 2053\n      Filename: /usr/lib/python3.12/subprocess.py\n      Function: wait\n      Line number: 1264', 'failed stage modules-final']}

Way further down there's a section like this:

        # merge new recoverable errors into existing recoverable error list
        old_recoverable_errors = v1[mode]["recoverable_errors"]
        new_recoverable_errors = next(
            filter(
                lambda h: isinstance(h, log.LogExporter), root_logger.handlers
            )
        ).export_logs()
        for key in new_recoverable_errors.keys():
            if key in old_recoverable_errors:
                old_recoverable_errors[key].extend(new_recoverable_errors[key])
            else:
                old_recoverable_errors[key] = new_recoverable_errors[key]

Could we turn this section into its own function, and then rather than

    v1[mode]["recoverable_errors"] = next(
        filter(lambda h: isinstance(h, log.LogExporter), root_logger.handlers)
    ).export_logs()

we call the new function to add the new recoverable errors to the old ones? The first run of cloud-final will log that it got a SIGTERM. I would think that should give enough signal as to what happened, rather than introducing new warnings that are less precise.
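
Something like the following is what that refactor could look like; the helper name merge_recoverable_errors() is hypothetical, used only to illustrate the suggestion:

def merge_recoverable_errors(old: dict, new: dict) -> dict:
    # Extend the per-level lists already present in status.json instead of
    # overwriting them, so errors from an interrupted run are not lost.
    for level, messages in new.items():
        old.setdefault(level, []).extend(messages)
    return old

The call site would then pass the exported logs into this helper instead of assigning them directly to v1[mode]["recoverable_errors"].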

holmanb (Member, Author) commented Apr 25, 2024:
Oops, those duplicates definitely aren't desired.

> The first run of cloud-final will log that it got a sigterm

That covers systemd's default behavior, yes, but what if a SIGSTOP or SIGKILL happens?

If cloud-init receives a SIGSTOP or SIGKILL, our logging and cloud-init status tooling will not know that this happened. The purpose of this new section was to address that problem.

TheRealFalcon (Member):
That makes sense. So here, can we check that the start time exists but the finished time does not (indicating a run that started but never finished), and then combine the errors/recoverable errors into the output of the next run?
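
A rough, hypothetical sketch of that check (the helper name and return shape are illustrative, not from the PR):

import logging

LOG = logging.getLogger(__name__)

def carry_over_interrupted_stage(stage_status: dict, mode: str):
    # A start time with no finished time means the previous run of this stage
    # was interrupted; warn, and return its errors so they can be merged into
    # the next run's output instead of being dropped.
    if stage_status.get("start") and not stage_status.get("finished"):
        LOG.warning(
            "Unexpected start time for stage %s. Was this stage restarted?",
            mode,
        )
        return (
            list(stage_status.get("errors", [])),
            dict(stage_status.get("recoverable_errors", {})),
        )
    return [], {}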

holmanb (Member, Author) commented May 1, 2024

@TheRealFalcon I just pushed an update. Is that along the lines of what you had in mind?

Unit tests are failing, but if this is what you were thinking, I'll fix up the tests prior to merge.

TheRealFalcon (Member) left a review:

Yes, I think the general idea makes sense.

If cloud-final.service is restarted:

root@me:~# systemctl restart cloud-final.service
Job for cloud-final.service failed because the control process exited with error code.
See "systemctl status cloud-final.service" and "journalctl -xeu cloud-final.service" for details.
root@me:~# cloud-init status --long
status: error
extended_status: error - done
boot_status_code: enabled-by-generator
last_update: Thu, 02 May 2024 04:37:23 +0000
detail: DataSourceLXD
errors:
	- sys.exit(1) called
recoverable_errors:
ERROR:
	- Cloud-init 24.1.3 received SIGTERM, exiting...   Filename: /usr/lib/python3.12/subprocess.py   Function: _try_wait   Line number: 2011     Filename: /usr/lib/python3.12/subprocess.py     Function: _wait     Line number: 2053       Filename: /usr/lib/python3.12/subprocess.py       Function: wait       Line number: 1264
	- failed stage modules-final

vs being killed:

root@me:~# ps aux | grep cloud-init
root         623  0.0  0.0  50444 39840 ?        Ss   04:38   0:00 /usr/bin/python3 /usr/bin/cloud-init modules --mode=final
root         627  0.0  0.0   2800  1600 ?        S    04:38   0:00 /bin/sh -c tee -a /var/log/cloud-init-output.log
root         628  0.0  0.0   6120  1600 ?        S    04:38   0:00 tee -a /var/log/cloud-init-output.log
root         645  0.0  0.0   7080  1760 pts/1    S+   04:38   0:00 grep --color=auto cloud-init
root@me:~# kill -9 623
root@me:~# systemctl restart cloud-final.service
root@me:~# cloud-init status --long
status: done
extended_status: degraded done
boot_status_code: enabled-by-generator
last_update: Thu, 02 May 2024 04:38:43 +0000
detail: DataSourceLXD
errors: []
recoverable_errors:
WARNING:
	- Unexpected start time for stage modules-final. Was this stage restarted?

status = None
nullstatus = {
    "errors": [],
    "recoverable_errors": [],
TheRealFalcon (Member) commented inline:

Suggested change:
-    "recoverable_errors": [],
+    "recoverable_errors": {},

Problems:

- When systemd stops cloud-init, no warning is logged.
- When systemd restarts cloud-init, status info and return code from the
  current stage is lost.

The net effect of this is that when cloud-init is restarted, unless one
explicitly checks the logs for the signal message, no warning is logged,
and cloud-init status indicates no problem.

Communicate unexpected state by:

- Log an error when handling signals.
- When sys.exit() is called from a signal, write status.json before exiting.
- Append errors and recoverable errors rather than overwriting them.
- Warn when status.json already contains a start time when the stage is starting [1].

This should make cloud-init properly warn in status output and logs that
something is amiss.

[1] This is a bit cautious; however, when cloud-init doesn't have the
    opportunity to handle a signal (SIGKILL), the other mitigations will fail.
holmanb force-pushed the holmanb/check-for-signal-killed branch from 147dfe2 to 5564cdf on May 3, 2024 01:58
TheRealFalcon (Member) left a review:

A few more minor issues inline

cloudinit/cmd/main.py (thread outdated; resolved)
LOG.exception("failed stage %s", mode)
print_exc("failed run of stage %s" % mode)
v1[mode]["errors"].append(f"sys.exit({str(e.code)}) called")
raise
TheRealFalcon (Member) commented inline:
Do we still want to raise if exit code is 0?

holmanb (Member, Author):

Good question - and I'm not sure myself.

Raising will cause cloud-init to exit more quickly, but there's not really much else that cloud-init does after this point in the code.

If we don't raise, I only see three more things that cloud-init will do before exiting:

  1. write result.json (if in Final stage)
  2. log completion of the stage and the time it took to complete ("cloud-init mode %s")
  3. flush reporting events

I lean slightly towards cloud-init having the same behavior for both zero and non-zero exit codes. Prior to this change, none of these three were done when sys.exit() was called. Is there any harm in trying to write out these logs if they do happen? I assume that any service that systemd tries to restart with SIGTERM will just receive a bigger hammer after a timeout period to ensure that it actually restarts, but I'm not certain of that.
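
As a hypothetical sketch of the trade-off being weighed (the wrapper and its arguments are illustrative, not cloud-init's actual code):

import logging

def run_stage_recording_exit(run_stage, mode: str, stage_errors: list):
    # Record the sys.exit() call so status.json reflects it, then re-raise
    # only for a non-zero code; a clean exit falls through so result.json,
    # the completion log, and reporting events still get flushed.
    try:
        run_stage()
    except SystemExit as e:
        logging.getLogger(__name__).exception("failed stage %s", mode)
        stage_errors.append(f"sys.exit({e.code}) called")
        if e.code:
            raise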

holmanb (Member, Author):
I'm leaning towards not re-raising from the SystemExit handler at all. What do you think @TheRealFalcon?

tests/unittests/test_cli.py (thread outdated; resolved)
holmanb (Member, Author) commented May 3, 2024

@TheRealFalcon I've addressed your remaining comments. Please let me know what you think of my comments regarding re-raising.

holmanb requested a review from TheRealFalcon May 3, 2024 19:44
TheRealFalcon (Member) left a review:

Ship it!

TheRealFalcon (Member) left a review:

Ship it again!

holmanb merged commit f24ecef into canonical:main May 5, 2024
29 checks passed
Successfully merging this pull request may close these issues:

cloud-init does not warn when restarted