There is dead-lock between cloud-init and sshd in cloud-init-24.4 which causes the cloud-init stuck #5930
Comments
Yes I believe the upstream commit breaks RHEL/CentOS because of the cyclic dependency between cloud-init.service and sshd.service. One cannot restart sshd while starting cloud-init.service. It needs to be fixed in whichever way upstream finds appropriate. This fix should also be backported to all stable upstream cloud-init versions 24.2 and above. |
@TheRealFalcon @holmanb Please consider this as high priority. I think this affects most/all systemd-driven distros. If no proper fix can be found, I think it's OK to revert the change, as that will have the least collateral damage. |
I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.
Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually ( I don't think that this is actually a high priority for a number of reasons:
No, reverting has downsides too. A code fix appears simple, I'll propose something.
Upstream cloud-init doesn't backport fixes to old upstream releases. We occasionally cherry-pick patches from main into the latest release branch. Once 24.4 is released there is no benefit for upstream to support 24.3 since any fixes for 24.3 exist in a newer release. |
@ani-sinha @zhaohuijuan Please test #5935 |
Why only for manual invocation ?
Why?
So there is no concept of LTS I guess ... |
hmm, OK I see your comment
Yes, it can happen when sshd is already started as a part of the boot up sequence but we manually again restart cloud-init after the fact later. But it can also happen when users install cloud-init rpm and start cloud-init on a system that did not have cloud-init rpm installed (which is how I reproduced this issue). This case is I think quite common and should not be ignored. We did see this issue in 24.4 so seems you are wrong when you say:
|
Thanks @ani-sinha and @holmanb for looking at this.
Actually it is easy to reproduce it manually:
Although this is a corner case, cloud-init gets stuck when it is hit and the impact is bad, so I think it is worthwhile to fix. We actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests, and we hit this issue after upgrading to 24.4. I think it is a regression and worthwhile to fix. |
Yes this is the situation where the image did not come with cloud-init and the user installs the rpms and starts cloud-init later.
I would not call it a corner case.
Yes, this breaks in a bad way and needs to be fixed.
|
Oh we did revert single process optimization patch and |
We tested this and I reviewed your PR. The quote around
or add an additional argument to
|
Yes, and regression tests pass on OpenStack/Azure |
Correct. Cloud-init upstream has no LTS guarantees. We communicate breaking changes in the docs to assist downstreams that want to provide long term stability.
Common for who? This certainly isn't (and shouldn't be) common for users. Cloud-init is used for runtime customization. Anybody doing image build should install it into a pristine image and never start the service before first launch. Starting it manually before launching the image would leave behind artifacts and produce a dirty image.
Again, if this were a common end user use-case then I would have expected user bug reports on Ubuntu closer to 4 months ago when 24.2 was released in Ubuntu. This has been well-exercised by users since then - and no user reports.
See my comment about dirty images above. Testing an image that previously started cloud-init services risks leaving behind artifacts which changes cloud-init's first boot behavior. What signal do you expect to receive from manually running cloud-init services before launch? Wouldn't you get the same signal by launching the image without the manual service runs? I would recommend removing this from your CI going forward. Manually starting cloud-init services via systemd doesn't add significant value over manually running the commands directly and it also isn't supported after the single process change.
Agreed. As you noted, reverting the single process change would cause this deadlock to happen on newer versions. |
@zhaohuijuan @ani-sinha Thank you for testing. I applied the requested changes to the PR and fixed up tests accordingly. |
Does Ubuntu use systemd? |
Yes |
A test: Cloud-init is fundamentally different. It is not a daemon that provides external services. It is a one-shot application that configures the system on startup. I don't think that this kind of test provides useful signal of a realistic user workflow. As for the single process change: Users can still manually run cloud-init if they want to (using the CLI). A different set of commands may be required for anybody using It seems to me like the primary reason to make this change is to satisfy a CI test that doesn't represent a real user requirement and will need to be changed in the future anyways. Please correct me if I'm wrong. |
Agreed, and we already updated our CI when we hit this issue.
Yes, but as for the primary reason for the fix, I think cloud-init should not get stuck anyway, even if this is an unusual case or corner case. So I think the fix is worthwhile. Thanks @holmanb for the fix and quick response. |
Great, thanks.
It looks like @TheRealFalcon beat me to it.
Welcome! |
Fixed in #5935 |
Hi @holmanb, @TheRealFalcon Will there be a 24.4.1 release with this fix? |
@sshedi , we are not planning on releasing a 24.4.1. It doesn't make sense from an upstream perspective as this bug isn't currently possible unless you're already patching the upstream code, and it isn't a use case we can continue to support. |
Thanks @TheRealFalcon, makes sense. |
Bug report
There is a deadlock between cloud-init.service and sshd.service in cloud-init-24.4, which causes cloud-init to get stuck when cloud-init.service is started directly.
No such issue in cloud-init-24.1.4
This issue was caused by this commit, which moved the set_passwords module to the cloud-init stage.
The set_passwords module restarts sshd if the sshd service is running, and cloud-init.service is ordered with "Before=sshd.service". So after set_passwords moved to the cloud-init stage, starting cloud-init.service directly while sshd is running makes cloud-init restart sshd. Once sshd is down, its start job has to wait for cloud-init.service to finish because of "Before=sshd.service", while cloud-init is itself waiting for the restart to complete, so cloud-init is stuck.
$ cat /etc/cloud/cloud.cfg
......
cloud_init_modules:
......
$ cat ./usr/lib/systemd/system/cloud-init.service
[Unit]
Description=Cloud-init: Network Stage
Wants=cloud-init-local.service
Wants=sshd-keygen.service
Wants=sshd.service
After=cloud-init-local.service
After=systemd-networkd-wait-online.service
After=NetworkManager.service
After=NetworkManager-wait-online.service
Before=network-online.target
Before=sshd-keygen.service
Before=sshd.service
......
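The ordering directives in the unit file above are what make the restart block. As a rough illustration only (not cloud-init or systemd code), the condition can be modeled in a few lines of Python. The helper names `parse_ordering` and `restart_would_block` are hypothetical, the embedded unit text is an abbreviated copy of the excerpt above, and the model is deliberately simplified: it ignores details such as systemd's list-reset semantics for empty assignments.

```python
from typing import Dict, List


def parse_ordering(unit_text: str) -> Dict[str, List[str]]:
    """Collect Before=/After= targets from systemd unit file text (simplified)."""
    ordering: Dict[str, List[str]] = {"Before": [], "After": []}
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and section headers such as [Unit].
        if not line or line.startswith(("#", ";", "[")):
            continue
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in ordering:
            # One directive may list several space-separated units.
            ordering[key].extend(value.split())
    return ordering


def restart_would_block(
    ordering: Dict[str, List[str]], service: str, service_active: bool
) -> bool:
    """Simplified model of the report: a running service that the starting
    unit is ordered Before= cannot come back up until that unit finishes,
    so a blocking restart issued from inside the unit never returns."""
    return service_active and service in ordering["Before"]


# Abbreviated copy of the cloud-init.service excerpt shown above.
UNIT = """\
[Unit]
Description=Cloud-init: Network Stage
Wants=sshd.service
After=cloud-init-local.service
Before=network-online.target
Before=sshd-keygen.service
Before=sshd.service
"""

ordering = parse_ordering(UNIT)
print(ordering["Before"])
# → ['network-online.target', 'sshd-keygen.service', 'sshd.service']
print(restart_would_block(ordering, "sshd.service", service_active=True))
# → True
```

With `service_active=False` the function returns False, which matches the normal boot sequence where sshd has not started yet when cloud-init.service runs.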
Steps to reproduce the problem
$ systemctl status sshd
● sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: enabled)
Active: active (running) since Fri 2024-12-13 22:45:10 CST; 1min 25s ago
...
$ systemctl start cloud-init
-------------------
821 2024-12-12 02:06:46,008 - modules.py[DEBUG]: Running module set_passwords (<module 'cloudinit.config.cc_set_passwords' from '/usr/lib/python3.9/site-packages/cloudinit/config/cc_set_passwords.py'>) with frequency once-per-instance
822 2024-12-12 02:06:46,008 - handlers.py[DEBUG]: start: init-network/config-set_passwords: running config-set_passwords with frequency once-per-instance
823 2024-12-12 02:06:46,008 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords - wb: [644] 25 bytes
824 2024-12-12 02:06:46,009 - util.py[DEBUG]: Restoring selinux mode for /var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords (recursive=False)
825 2024-12-12 02:06:46,009 - util.py[DEBUG]: Restoring selinux mode for /var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords (recursive=False)
826 2024-12-12 02:06:46,009 - helpers.py[DEBUG]: Running config-set_passwords using lock (<FileLock using file '/var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords'>)
827 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
828 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading 3654 bytes from /etc/ssh/sshd_config
829 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
830 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading 3654 bytes from /etc/ssh/sshd_config
831 2024-12-12 02:06:46,010 - ssh_util.py[DEBUG]: line 131: option PasswordAuthentication added with no
832 2024-12-12 02:06:46,010 - util.py[DEBUG]: Writing to /etc/ssh/sshd_config - wb: [600] 3680 bytes
833 2024-12-12 02:06:46,010 - util.py[DEBUG]: Restoring selinux mode for /etc/ssh/sshd_config (recursive=False)
834 2024-12-12 02:06:46,011 - util.py[DEBUG]: Restoring selinux mode for /etc/ssh/sshd_config (recursive=False)
835 2024-12-12 02:06:46,011 - subp.py[DEBUG]: Running command ['systemctl', 'show', '--property', 'ActiveState', '--value', 'sshd'] with allowed return codes [0] (shell=False, capture=True)
836 2024-12-12 02:06:46,022 - performance.py[DEBUG]: Running ['systemctl', 'show', '--property', 'ActiveState', '--value', 'sshd'] took 0.011 seconds
-------------------
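The last two log lines show cloud-init querying sshd's state right before the hang. A minimal sketch of that kind of check is below, assuming the decision is "restart only if the service is running" as described in this report. The function names and the `RESTART_STATES` set are illustrative assumptions, not cloud-init's actual code; only the `systemctl show` command itself is taken from the log. The runner is injectable so the logic can be tried without touching a real system.

```python
import subprocess
from typing import Callable, List

# Command copied from the log lines above.
ACTIVE_STATE_CMD = [
    "systemctl", "show", "--property", "ActiveState", "--value", "sshd",
]

# Assumption for illustration: states treated as "running".
RESTART_STATES = {"active", "activating", "reloading"}


def run_capture(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises if it exits non-zero."""
    return subprocess.run(
        cmd, check=True, capture_output=True, text=True
    ).stdout


def sshd_needs_restart(runner: Callable[[List[str]], str] = run_capture) -> bool:
    """Return True when sshd reports a state in which a config change
    would be followed by a restart (per the assumption above)."""
    state = runner(ACTIVE_STATE_CMD).strip()
    return state in RESTART_STATES


# Usage with a fake runner, so no systemd is required:
print(sshd_needs_restart(lambda cmd: "active\n"))    # → True
print(sshd_needs_restart(lambda cmd: "inactive\n"))  # → False
```

In the scenario from this report the check returns True because sshd is already running, and the restart that follows is the step that never completes.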
Environment details
cloud-init logs
Additional info