
There is a deadlock between cloud-init and sshd in cloud-init 24.4 which causes cloud-init to get stuck #5930

Closed
zhaohuijuan opened this issue Dec 14, 2024 · 22 comments
Labels
bug Something isn't working correctly new An issue that still needs triage

Comments

@zhaohuijuan

zhaohuijuan commented Dec 14, 2024

Bug report

There is a deadlock between cloud-init.service and sshd.service in cloud-init 24.4, which causes cloud-init to get stuck when cloud-init.service is started directly.
There is no such issue in cloud-init 24.1.4.

This issue was caused by this commit, which moved the set_passwords module to the cloud-init (network) stage.

In the set_passwords module, sshd is restarted if the sshd service is running. cloud-init.service is ordered with "Before=sshd.service", so after moving set_passwords into the cloud-init stage, starting cloud-init.service directly while sshd is running makes cloud-init restart sshd; once sshd is down, cloud-init gets stuck because of the "Before=sshd.service" ordering.
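The cycle described above can be modeled as a tiny ordering graph, where an edge a → b means "a must finish before b". This is only an illustrative sketch (the graph and function name are made up, not cloud-init or systemd code):

```python
def in_ordering_cycle(edges, start):
    """Return True if `start` can reach itself by following ordering edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt == start:
                return True  # we got back to where we began: deadlock
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# cloud-init.service is ordered Before=sshd.service, but while it runs it
# issues "systemctl restart sshd"; the queued restart job in turn waits for
# cloud-init.service to finish, closing the loop.
edges = {
    "cloud-init.service": {"sshd.service"},   # Before=sshd.service
    "sshd.service": {"cloud-init.service"},   # restart job waits on cloud-init
}
print(in_ordering_cycle(edges, "cloud-init.service"))  # True -> deadlock
```

Dropping the second edge (i.e. removing "Before=sshd.service") breaks the cycle, which matches the observation in "Additional info" below.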

$ cat /etc/cloud/cloud.cfg
......
cloud_init_modules:
  - seed_random
  - bootcmd
  - write_files
  - growpart
  - resizefs
  - disk_setup
  - mounts
  - set_hostname
  - update_hostname
  - update_etc_hosts
  - ca_certs
  - rsyslog
  - users_groups
  - ssh
  - set_passwords
......

$ cat ./usr/lib/systemd/system/cloud-init.service
[Unit]
Description=Cloud-init: Network Stage
Wants=cloud-init-local.service
Wants=sshd-keygen.service
Wants=sshd.service
After=cloud-init-local.service
After=systemd-networkd-wait-online.service

After=NetworkManager.service
After=NetworkManager-wait-online.service
Before=network-online.target
Before=sshd-keygen.service
Before=sshd.service
......

Steps to reproduce the problem

  1. Create VM, check the sshd.service is running
    $ systemctl status sshd
    ● sshd.service - OpenSSH server daemon
    Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: enabled)
    Active: active (running) since Fri 2024-12-13 22:45:10 CST; 1min 25s ago
    ...
  2. Install cloud-init in the VM
  3. Start cloud-init.service directly
    $ systemctl start cloud-init
  4. After step 3, cloud-init is stuck and sshd is down (inactive); it is stuck at "systemctl restart sshd" in cloud-init.log
    -------------------
    821 2024-12-12 02:06:46,008 - modules.py[DEBUG]: Running module set_passwords (<module 'cloudinit.config.cc_set_passwords' from '/usr/lib/python3.9/site-packages/cloudinit/config/cc_set_passwords.py'>) with f requency once-per-instance
    822 2024-12-12 02:06:46,008 - handlers.py[DEBUG]: start: init-network/config-set_passwords: running config-set_passwords with frequency once-per-instance
    823 2024-12-12 02:06:46,008 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords - wb: [644] 25 bytes
    824 2024-12-12 02:06:46,009 - util.py[DEBUG]: Restoring selinux mode for /var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords (recursive=False)
    825 2024-12-12 02:06:46,009 - util.py[DEBUG]: Restoring selinux mode for /var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords (recursive=False)
    826 2024-12-12 02:06:46,009 - helpers.py[DEBUG]: Running config-set_passwords using lock (<FileLock using file '/var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords'>)
    827 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
    828 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading 3654 bytes from /etc/ssh/sshd_config
    829 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
    830 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading 3654 bytes from /etc/ssh/sshd_config
    831 2024-12-12 02:06:46,010 - ssh_util.py[DEBUG]: line 131: option PasswordAuthentication added with no
    832 2024-12-12 02:06:46,010 - util.py[DEBUG]: Writing to /etc/ssh/sshd_config - wb: [600] 3680 bytes
    833 2024-12-12 02:06:46,010 - util.py[DEBUG]: Restoring selinux mode for /etc/ssh/sshd_config (recursive=False)
    834 2024-12-12 02:06:46,011 - util.py[DEBUG]: Restoring selinux mode for /etc/ssh/sshd_config (recursive=False)
    835 2024-12-12 02:06:46,011 - subp.py[DEBUG]: Running command ['systemctl', 'show', '--property', 'ActiveState', '--value', 'sshd'] with allowed return codes [0] (shell=False, capture=True)
    836 2024-12-12 02:06:46,022 - performance.py[DEBUG]: Running ['systemctl', 'show', '--property', 'ActiveState', '--value', 'sshd'] took 0.011 seconds
    837 2024-12-12 02:06:46,023 - subp.py[DEBUG]: Running command ['systemctl', 'restart', 'sshd'] with allowed return codes [0] (shell=False, capture=True)
    Stuck here ...
    -------------------

Environment details

  • Cloud-init version: 24.4
  • Operating System Distribution: RHEL
  • Cloud provider, platform or installer type: VMware ESXi

cloud-init logs

(Same log excerpt as in step 4 above.)

Additional info

  1. When this commit is reverted, the issue is gone.
  2. When the above commit is included but the "Before=sshd.service" line is deleted from cloud-init.service, the issue is also gone.
@zhaohuijuan zhaohuijuan added bug Something isn't working correctly new An issue that still needs triage labels Dec 14, 2024
@zhaohuijuan zhaohuijuan changed the title There is dead-lock between cloud-init and sshd in cloud-init-24.4 There is dead-lock between cloud-init and sshd in cloud-init-24.4 which causes the cloud-init stuck Dec 14, 2024
@ani-sinha
Contributor

Yes I believe the upstream commit breaks RHEL/CentOS because of the cyclic dependency between cloud-init.service and sshd.service. One cannot restart sshd while starting cloud-init.service. It needs to be fixed in whichever way upstream finds appropriate.

This fix should also be backported to all stable upstream cloud-init versions 24.2 and above.

@ani-sinha
Contributor

ani-sinha commented Dec 14, 2024

@TheRealFalcon @holmanb Please consider this as high priority. I think this affects most/all systemd-driven distros. If no proper fix can be found, I think it's OK to revert the change as that will have the least collateral damage.

@holmanb
Member

holmanb commented Dec 16, 2024

I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.

set_passwords module, when run earlier from cloud-init network stage (started by cloud-init.service) would make sshd service restart block
indefinitely. This in turn blocks the cloud-init network stage from starting.
The restart of sshd would be needed if there is a change in sshd config -
the restart would make sure that the config is effective. cloud-init unit file
is configured to start before sshd service (there is a Before=sshd.service in
the unit file). This cyclic dependency causes a deadlock and systemd waits
indefinitely for cloud-init service to start while cloud-init waits for sshd
to start.

Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually (systemctl start cloud-init.service).

I don't think that this is actually a high priority for a number of reasons:

  1. If this were an important use case then I would have expected a bug report sooner than this. Releases 24.2 and 24.3 have been in use on public clouds for quite a while - yet this is the first bug report. If this were truly important I would have expected a user report months ago.
  2. There is a trivial workaround: invoking the command manually should produce the same results without the systemd deadlock.
  3. This isn't really an expected use case of cloud-init. Interacting with cloud-init in this way is neither expected nor required.
  4. Running this stage via systemctl start cloud-init.service will not behave the same way in the future.

At the bare minimum, e7d8328 ("perf(set_passwords): Run
module in Network stage (#5395)") needs to be reverted

No, reverting has downsides too. A code fix appears simple, I'll propose something.

all releases that have this change (24.2, 24.3 and 24.4).

Upstream cloud-init doesn't backport fixes to old upstream releases. We occasionally cherry-pick patches from main into the latest release branch. Once 24.4 is released there is no benefit for upstream to support 24.3 since any fixes for 24.3 exist in a newer release.

@holmanb
Member

holmanb commented Dec 16, 2024

@ani-sinha @zhaohuijuan Please test #5935

@ani-sinha
Contributor

I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.

set_passwords module, when run earlier from cloud-init network stage (started by cloud-init.service) would make sshd service restart block
indefinitely. This in turn blocks the cloud-init network stage from starting.
The restart of sshd would be needed if there is a change in sshd config -
the restart would make sure that the config is effective. cloud-init unit file
is configured to start before sshd service (there is a Before=sshd.service in
the unit file). This cyclic dependency causes a deadlock and systemd waits
indefinitely for cloud-init service to start while cloud-init waits for sshd
to start.

Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually (systemctl start cloud-init.service).

Why only for manual invocation?

I don't think that this is actually a high priority for a number of reasons:

  1. If this were an important use case then I would have expected a bug report sooner than this. Releases 24.2 and 24.3 have been in use on public clouds for quite a while - yet this is the first bug report. If this were truly important I would have expected a user report months ago.

  2. There is a trivial workaround: invoking the command manually should produce the same results without the systemd deadlock.

  3. This isn't really an expected use case of cloud-init. Interacting with cloud-init in this way is neither expected nor required.

  4. Running this stage via systemctl start cloud-init.service will not behave the same way in the future.

Why?

At the bare minimum, e7d8328 ("perf(set_passwords): Run

module in Network stage (#5395)") needs to be reverted

No, reverting has downsides too. A code fix appears simple, I'll propose something.

all releases that have this change (24.2, 24.3 and 24.4).

Upstream cloud-init doesn't backport fixes to old upstream releases. We occasionally cherry-pick patches from main into the latest release branch. Once 24.4 is released there is no benefit for upstream to support 24.3 since any fixes for 24.3 exist in a newer release.

So there is no concept of LTS I guess ...

@ani-sinha
Contributor

I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.

set_passwords module, when run earlier from cloud-init network stage (started by cloud-init.service) would make sshd service restart block
indefinitely. This in turn blocks the cloud-init network stage from starting.
The restart of sshd would be needed if there is a change in sshd config -
the restart would make sure that the config is effective. cloud-init unit file
is configured to start before sshd service (there is a Before=sshd.service in
the unit file). This cyclic dependency causes a deadlock and systemd waits
indefinitely for cloud-init service to start while cloud-init waits for sshd
to start.

Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually (systemctl start cloud-init.service).

Why only for manual invocation?

hmm, OK I see your comment

+            # This module runs Before=sshd.service. What that means is that
+            # the code can only get to this point if a user manually starts the
+            # network stage. While this isn't a well-supported use-case, this
+            # does cause a deadlock if started via systemd directly:
+            # "systemctl start cloud-init.service". Prevent users from causing
+            # this deadlock by forcing systemd to ignore dependencies when
+            # restarting. Note that this deadlock is not possible in newer
+            # versions of cloud-init, since starting the second service doesn't
+            # run the second stage in 24.3+. This code therefore exists solely
+            # for backwards compatibility so that users who think that they
+            # need to manually start cloud-init (why?) with systemd (again,
+            # why?) can do so.

Yes, it can happen when sshd has already been started as part of the boot-up sequence and we manually start cloud-init after the fact. But it can also happen when users install the cloud-init rpm and start cloud-init on a system that did not have cloud-init installed (which is how I reproduced this issue). This case is, I think, quite common and should not be ignored.

We did see this issue in 24.4, so it seems you are wrong when you say:

+            # Note that this deadlock is not possible in newer
+            # versions of cloud-init, since starting the second service doesn't
+            # run the second stage in 24.3+

@zhaohuijuan
Author

Thanks @ani-sinha and @holmanb for looking at this.

I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.

Actually, it is easy to reproduce manually:

  1. Create an instance without cloud-init
  2. Then install cloud-init and start cloud-init.service

set_passwords module, when run earlier from cloud-init network stage (started by cloud-init.service) would make sshd service restart block
indefinitely. This in turn blocks the cloud-init network stage from starting.
The restart of sshd would be needed if there is a change in sshd config -
the restart would make sure that the config is effective. cloud-init unit file
is configured to start before sshd service (there is a Before=sshd.service in
the unit file). This cyclic dependency causes a deadlock and systemd waits
indefinitely for cloud-init service to start while cloud-init waits for sshd
to start.

Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually (systemctl start cloud-init.service).

Although this is a corner case, cloud-init gets stuck when we hit it and the impact is bad, so I think it is worthwhile to fix.
It should not get stuck in any case.

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests. We hit this issue after upgrading to 24.4; I think it is a regression and worth fixing.

@ani-sinha
Contributor

Thanks @ani-sinha and @holmanb for looking at this.

I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.

Actually, it is easy to reproduce manually:

  1. Create an instance without cloud-init
  2. Then install cloud-init and start cloud-init.service

Yes this is the situation where the image did not come with cloud-init and the user installs the rpms and starts cloud-init later.

set_passwords module, when run earlier from cloud-init network stage (started by cloud-init.service) would make sshd service restart block
indefinitely. This in turn blocks the cloud-init network stage from starting.
The restart of sshd would be needed if there is a change in sshd config -
the restart would make sure that the config is effective. cloud-init unit file
is configured to start before sshd service (there is a Before=sshd.service in
the unit file). This cyclic dependency causes a deadlock and systemd waits
indefinitely for cloud-init service to start while cloud-init waits for sshd
to start.

Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually (systemctl start cloud-init.service).

Although this is a corner case,

I would not call it a corner case.

but cloud-init gets stuck when we hit it and the impact is bad, so I think it is worthwhile to fix. It should not get stuck in any case.

Yes, this breaks in a bad way and needs to be fixed.

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests. We hit this issue after upgrading to 24.4; I think it is a regression and worth fixing.

@ani-sinha
Contributor

We did see this issue in 24.4, so it seems you are wrong when you say:

Oh, we did revert the single process optimization patch and 0680d03304c34fc4c3081f29d99f140d507dd923 ("chore: eliminate redundant ordering dependencies (#5819)"). So maybe with those reverted, it is possible to run into this.

@ani-sinha
Contributor

ani-sinha commented Dec 17, 2024

@ani-sinha @zhaohuijuan Please test #5935

We tested this and I reviewed your PR. The quotes around ignore-dependencies are not needed and break your patch. I removed them, re-tested yours, and the issue seems fixed.
However, --job-mode=ignore-dependencies only works on distros that use systemd, and _restart_ssh_daemon() is called for both systemd and non-systemd paths. So you might want to wrap it within

from cloudinit.distros import uses_systemd
if uses_systemd():
...

or add an additional argument to _restart_ssh_daemon() and pass it from the code block that is called for distros that use systemd. Something like

diff --git a/cloudinit/config/cc_set_passwords.py b/cloudinit/config/cc_set_passwords.py
index f58a1dba2..16074330f 100644
--- a/cloudinit/config/cc_set_passwords.py
+++ b/cloudinit/config/cc_set_passwords.py
@@ -45,10 +45,10 @@ def get_users_by_type(users_list: list, pw_type: str) -> list:
     )
 
 
-def _restart_ssh_daemon(distro: Distro, service: str):
+def _restart_ssh_daemon(distro: Distro, service: str, *extra_args: str):
     try:
         distro.manage_service(
-            "restart", service, "--job-mode=ignore-dependencies"
+            "restart", service, extra_args
         )
         LOG.debug("Restarted the SSH daemon.")
     except subp.ProcessExecutionError as e:
@@ -118,7 +118,7 @@ def handle_ssh_pwauth(pw_auth, distro: Distro):
             # for backwards compatibility so that users who think that they
             # need to manually start cloud-init (why?) with systemd (again,
             # why?) can do so.
-            _restart_ssh_daemon(distro, service)
+            _restart_ssh_daemon(distro, service, "--job-mode=ignore-dependencies")
     else:
         _restart_ssh_daemon(distro, service)
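For what it's worth, the proposed calling convention can be sanity-checked with a stub in place of cloudinit.distros.Distro (StubDistro below is purely illustrative, not cloud-init's API): the extra flag should flow through only on the systemd path.

```python
class StubDistro:
    """Records what manage_service would run, instead of calling systemctl."""

    def __init__(self):
        self.calls = []

    def manage_service(self, action, service, *extra_args):
        # a systemd-backed distro would execute roughly:
        #   systemctl <action> <service> <extra_args...>
        self.calls.append(["systemctl", action, service, *extra_args])


def _restart_ssh_daemon(distro, service, *extra_args):
    # mirrors the diff above: forward any extra flags unchanged
    distro.manage_service("restart", service, *extra_args)


distro = StubDistro()
# systemd path: break the ordering cycle by ignoring dependencies
_restart_ssh_daemon(distro, "sshd", "--job-mode=ignore-dependencies")
# non-systemd path: plain restart, no extra flags
_restart_ssh_daemon(distro, "sshd")
print(distro.calls)
```

The first recorded call carries the extra flag and the second does not, which is exactly the asymmetry the diff is after.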

@zhaohuijuan
Author

@ani-sinha @zhaohuijuan Please test #5935

We tested this and I reviewed your PR. The quotes around ignore-dependencies are not needed and break your patch. I removed them, re-tested yours, and the issue seems fixed.

Yes, and regression tests pass on OpenStack/Azure.

@holmanb
Member

holmanb commented Dec 17, 2024

So there is no concept of LTS I guess ...

Correct. Cloud-init upstream has no LTS guarantees. We communicate breaking changes in the docs to assist downstreams that want to provide long term stability.

But it can also happen when users install cloud-init rpm and start cloud-init on a system that did not have cloud-init rpm installed (which is how I reproduced this issue). This case is I think quite common and should not be ignored.

Common for whom? This certainly isn't (and shouldn't be) common for users. Cloud-init is used for runtime customization. Anybody doing an image build should install it into a pristine image and never start the service before first launch. Starting it manually before launching the image would leave behind artifacts and produce a dirty image.

Although this is a corner case,

I would not call it a corner case.

Again, if this were a common end user use-case then I would have expected user bug reports on Ubuntu closer to 4 months ago when 24.2 was released in Ubuntu. This has been well-exercised by users since then - and no user reports.

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests

See my comment about dirty images above. Testing an image that previously started cloud-init services risks leaving behind artifacts which changes cloud-init's first boot behavior.

What signal do you expect to receive from manually running cloud-init services before launch? Wouldn't you get the same signal by launching the image without the manual service runs?

I would recommend removing this from your CI going forward. Manually starting cloud-init services via systemd doesn't add significant value over manually running the commands directly and it also isn't supported after the single process change.

We did see this issue in 24.4, so it seems you are wrong when you say:

Oh, we did revert the single process optimization patch and 0680d03 ("chore: eliminate redundant ordering dependencies (#5819)"). So maybe with those reverted, it is possible to run into this.

Agreed. As you noted, reverting the single process change would cause this deadlock to happen on newer versions.

@holmanb
Member

holmanb commented Dec 17, 2024

@zhaohuijuan @ani-sinha Thank you for testing. I applied the requested changes to the PR and fixed up tests accordingly.

@ani-sinha
Contributor

Again, if this were a common end user use-case then I would have expected user bug reports on Ubuntu closer to 4 months ago when 24.2 was released in Ubuntu. This has been well-exercised by users since then - and no user reports.

Does Ubuntu use systemd?

@holmanb
Member

holmanb commented Dec 17, 2024

Again, if this were a common end user use-case then I would have expected user bug reports on Ubuntu closer to 4 months ago when 24.2 was released in Ubuntu. This has been well-exercised by users since then - and no user reports.

Does Ubuntu use systemd?

Yes

@holmanb
Member

holmanb commented Dec 17, 2024

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests

See my comment about dirty images above. Testing an image that previously started cloud-init services risks leaving behind artifacts which changes cloud-init's first boot behavior.

What signal do you expect to receive from manually running cloud-init services before launch? Wouldn't you get the same signal by launching the image without the manual service runs?

I would recommend removing this from your CI going forward. Manually starting cloud-init services via systemd doesn't add significant value over manually running the commands directly and it also isn't supported after the single process change.

A test: Install -> Start service -> Shutdown -> Reboot is sensible for a daemon service, such as sshd, gpg-agent, etc. In this case you want to be sure that the daemon works when first installed and after boot.

Cloud-init is fundamentally different. It is not a daemon that provides external services. It is a one-shot application that configures the system on startup. I don't think that this kind of test provides useful signal of a realistic user workflow.

As for the single process change: Users can still manually run cloud-init if they want to (using the CLI). A different set of commands may be required for anybody using systemctl today, but the same features are otherwise still available. No real features have been lost - users can still manually run cloud-init stages.

It seems to me like the primary reason to make this change is to satisfy a CI test that doesn't represent a real user requirement and will need to be changed in the future anyways. Please correct me if I'm wrong.

@zhaohuijuan
Author

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests

See my comment about dirty images above. Testing an image that previously started cloud-init services risks leaving behind artifacts which changes cloud-init's first boot behavior.
What signal do you expect to receive from manually running cloud-init services before launch? Wouldn't you get the same signal by launching the image without the manual service runs?
I would recommend removing this from your CI going forward. Manually starting cloud-init services via systemd doesn't add significant value over manually running the commands directly and it also isn't supported after the single process change.

Agreed, and we already updated our CI when we hit this issue.

A test: Install -> Start service -> Shutdown -> Reboot is sensible for a daemon service, such as sshd, gpg-agent, etc. In this case you want to be sure that the daemon works when first installed and after boot.

Cloud-init is fundamentally different. It is not a daemon that provides external services. It is a one-shot application that configures the system on startup. I don't think that this kind of test provides useful signal of a realistic user workflow.

As for the single process change: Users can still manually run cloud-init if they want to (using the CLI). A different set of commands may be required for anybody using systemctl today, but the same features are otherwise still available. No real features have been lost - users can still manually run cloud-init stages.

It seems to me like the primary reason to make this change is to satisfy a CI test that doesn't represent a real user requirement and will need to be changed in the future anyways. Please correct me if I'm wrong.

Yes, but as for the primary reason for the fix: I think it should not get stuck in any case, even if this is an unusual or corner case. So I think the fix is worthwhile.
Could you please help push the patch to merge? We would like to backport this fix to the RHEL rebase build ASAP to catch up with the RHEL release schedule.

Thanks @holmanb for the fix and quick response.

@holmanb
Member

holmanb commented Dec 18, 2024

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests

See my comment about dirty images above. Testing an image that previously started cloud-init services risks leaving behind artifacts which changes cloud-init's first boot behavior.
What signal do you expect to receive from manually running cloud-init services before launch? Wouldn't you get the same signal by launching the image without the manual service runs?
I would recommend removing this from your CI going forward. Manually starting cloud-init services via systemd doesn't add significant value over manually running the commands directly and it also isn't supported after the single process change.

Agreed, and we already updated our CI when we hit this issue.

Great, thanks.

A test: Install -> Start service -> Shutdown -> Reboot is sensible for a daemon service, such as sshd, gpg-agent, etc. In this case you want to be sure that the daemon works when first installed and after boot.
Cloud-init is fundamentally different. It is not a daemon that provides external services. It is a one-shot application that configures the system on startup. I don't think that this kind of test provides useful signal of a realistic user workflow.
As for the single process change: Users can still manually run cloud-init if they want to (using the CLI). A different set of commands may be required for anybody using systemctl today, but the same features are otherwise still available. No real features have been lost - users can still manually run cloud-init stages.
It seems to me like the primary reason to make this change is to satisfy a CI test that doesn't represent a real user requirement and will need to be changed in the future anyways. Please correct me if I'm wrong.

Yes, but as for the primary reason for the fix: I think it should not get stuck in any case, even if this is an unusual or corner case. So I think the fix is worthwhile. Could you please help push the patch to merge? We would like to backport this fix to the RHEL rebase build ASAP to catch up with the RHEL release schedule.

It looks like @TheRealFalcon beat me to it.

Thanks @holmanb for the fix and quick response.

Welcome!

@holmanb
Member

holmanb commented Dec 18, 2024

Fixed in #5935

@holmanb holmanb closed this as completed Dec 18, 2024
@sshedi
Contributor

sshedi commented Dec 18, 2024

Hi @holmanb, @TheRealFalcon, will there be a 24.4.1 release with this fix?

@TheRealFalcon
Member

@sshedi , we are not planning on releasing a 24.4.1. It doesn't make sense from an upstream perspective as this bug isn't currently possible unless you're already patching the upstream code, and it isn't a use case we can continue to support.

@sshedi
Contributor

sshedi commented Dec 19, 2024

Thanks @TheRealFalcon, makes sense.
