
There is a deadlock between cloud-init and sshd in cloud-init 24.4 which causes cloud-init to get stuck #5930

Closed
zhaohuijuan opened this issue Dec 14, 2024 · 22 comments
Labels
bug Something isn't working correctly new An issue that still needs triage

Comments

@zhaohuijuan

zhaohuijuan commented Dec 14, 2024

Bug report

There is a deadlock between cloud-init.service and sshd.service in cloud-init 24.4, which causes cloud-init to get stuck when cloud-init.service is started directly.
There is no such issue in cloud-init 24.1.4.

This issue was caused by this commit, which moved the set_passwords module to the cloud-init (network) stage.

In the set_passwords module, sshd is restarted if the sshd service is running. cloud-init.service is ordered with "Before=sshd.service", so after moving set_passwords into the cloud-init stage, starting cloud-init.service directly while sshd is running makes cloud-init restart sshd; once sshd is down, cloud-init gets stuck because of the "Before=sshd.service" ordering.
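The cycle described above can be modeled as a tiny ordering graph, where an edge a → b means "a must finish before b". This is only an illustrative sketch (the graph and function name are made up, not cloud-init or systemd code):

```python
def in_ordering_cycle(edges, start):
    """Return True if `start` can reach itself by following ordering edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt == start:
                return True  # we got back to where we began: deadlock
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# cloud-init.service is ordered Before=sshd.service, but while it runs it
# issues "systemctl restart sshd"; the queued restart job in turn waits for
# cloud-init.service to finish, closing the loop.
edges = {
    "cloud-init.service": {"sshd.service"},   # Before=sshd.service
    "sshd.service": {"cloud-init.service"},   # restart job waits on cloud-init
}
print(in_ordering_cycle(edges, "cloud-init.service"))  # True -> deadlock
```

Dropping the second edge (i.e. removing "Before=sshd.service") breaks the cycle, which matches the observation in "Additional info" below.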

$ cat /etc/cloud/cloud.cfg
......
cloud_init_modules:
  - seed_random
  - bootcmd
  - write_files
  - growpart
  - resizefs
  - disk_setup
  - mounts
  - set_hostname
  - update_hostname
  - update_etc_hosts
  - ca_certs
  - rsyslog
  - users_groups
  - ssh
  - set_passwords
......

$ cat ./usr/lib/systemd/system/cloud-init.service
[Unit]
Description=Cloud-init: Network Stage
Wants=cloud-init-local.service
Wants=sshd-keygen.service
Wants=sshd.service
After=cloud-init-local.service
After=systemd-networkd-wait-online.service

After=NetworkManager.service
After=NetworkManager-wait-online.service
Before=network-online.target
Before=sshd-keygen.service
Before=sshd.service
......

Steps to reproduce the problem

  1. Create VM, check the sshd.service is running
    $ systemctl status sshd
    ● sshd.service - OpenSSH server daemon
    Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: enabled)
    Active: active (running) since Fri 2024-12-13 22:45:10 CST; 1min 25s ago
    ...
  2. Install cloud-init in the VM
  3. Start cloud-init.service directly
    $ systemctl start cloud-init
  4. After step 3, cloud-init is stuck and sshd is down (inactive); it is stuck at "systemctl restart sshd" in cloud-init.log
    -------------------
    821 2024-12-12 02:06:46,008 - modules.py[DEBUG]: Running module set_passwords (<module 'cloudinit.config.cc_set_passwords' from '/usr/lib/python3.9/site-packages/cloudinit/config/cc_set_passwords.py'>) with f requency once-per-instance
    822 2024-12-12 02:06:46,008 - handlers.py[DEBUG]: start: init-network/config-set_passwords: running config-set_passwords with frequency once-per-instance
    823 2024-12-12 02:06:46,008 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords - wb: [644] 25 bytes
    824 2024-12-12 02:06:46,009 - util.py[DEBUG]: Restoring selinux mode for /var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords (recursive=False)
    825 2024-12-12 02:06:46,009 - util.py[DEBUG]: Restoring selinux mode for /var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords (recursive=False)
    826 2024-12-12 02:06:46,009 - helpers.py[DEBUG]: Running config-set_passwords using lock (<FileLock using file '/var/lib/cloud/instances/iid-datasource-none/sem/config_set_passwords'>)
    827 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
    828 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading 3654 bytes from /etc/ssh/sshd_config
    829 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
    830 2024-12-12 02:06:46,010 - util.py[DEBUG]: Reading 3654 bytes from /etc/ssh/sshd_config
    831 2024-12-12 02:06:46,010 - ssh_util.py[DEBUG]: line 131: option PasswordAuthentication added with no
    832 2024-12-12 02:06:46,010 - util.py[DEBUG]: Writing to /etc/ssh/sshd_config - wb: [600] 3680 bytes
    833 2024-12-12 02:06:46,010 - util.py[DEBUG]: Restoring selinux mode for /etc/ssh/sshd_config (recursive=False)
    834 2024-12-12 02:06:46,011 - util.py[DEBUG]: Restoring selinux mode for /etc/ssh/sshd_config (recursive=False)
    835 2024-12-12 02:06:46,011 - subp.py[DEBUG]: Running command ['systemctl', 'show', '--property', 'ActiveState', '--value', 'sshd'] with allowed return codes [0] (shell=False, capture=True)
    836 2024-12-12 02:06:46,022 - performance.py[DEBUG]: Running ['systemctl', 'show', '--property', 'ActiveState', '--value', 'sshd'] took 0.011 seconds
    837 2024-12-12 02:06:46,023 - subp.py[DEBUG]: Running command ['systemctl', 'restart', 'sshd'] with allowed return codes [0] (shell=False, capture=True)
    Stuck here ...
    -------------------

Environment details

  • Cloud-init version: 24.4
  • Operating System Distribution: RHEL
  • Cloud provider, platform or installer type: VMware ESXi

cloud-init logs

(Same log excerpt as in step 4 above.)

Additional info

  1. When this commit is reverted, the issue is gone.
  2. When the above commit is included but the "Before=sshd.service" line is deleted from cloud-init.service, the issue is also gone.
@zhaohuijuan zhaohuijuan added bug Something isn't working correctly new An issue that still needs triage labels Dec 14, 2024
@zhaohuijuan zhaohuijuan changed the title There is dead-lock between cloud-init and sshd in cloud-init-24.4 There is dead-lock between cloud-init and sshd in cloud-init-24.4 which causes the cloud-init stuck Dec 14, 2024
@ani-sinha
Contributor

Yes I believe the upstream commit breaks RHEL/CentOS because of the cyclic dependency between cloud-init.service and sshd.service. One cannot restart sshd while starting cloud-init.service. It needs to be fixed in whichever way upstream finds appropriate.

This fix should also be backported to all stable upstream cloud-init versions 24.2 and above.

@ani-sinha
Contributor

ani-sinha commented Dec 14, 2024

@TheRealFalcon @holmanb Please consider this as high priority. I think this affects most/all systemd-driven distros. If no proper fix can be found, I think it's OK to revert the change as that will have the least collateral damage.

@holmanb
Member

holmanb commented Dec 16, 2024

I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.

set_passwords module, when run earlier from cloud-init network stage (started by cloud-init.service) would make sshd service restart block
indefinitely. This in turn blocks the cloud-init network stage from starting.
The restart of sshd would be needed if there is a change in sshd config -
the restart would make sure that the config is effective. cloud-init unit file
is configured to start before sshd service (there is a Before=sshd.service in
the unit file). This cyclic dependency causes a deadlock and systemd waits
indefinitely for cloud-init service to start while cloud-init waits for sshd
to start.

Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually (systemctl start cloud-init.service).

I don't think that this is actually a high priority for a number of reasons:

  1. If this were an important use case then I would have expected a bug report sooner than this. Releases 24.2 and 24.3 have been in use on public clouds for quite a while - yet this is the first bug report. If this were truly important I would have expected a user report months ago.
  2. There is a trivial workaround: invoking the command manually should produce the same results without the systemd deadlock.
  3. This isn't really an expected use case of cloud-init. Interacting with cloud-init in this way is neither expected nor required.
  4. Running this stage via systemctl start cloud-init.service will not behave the same way in the future.

At the bare minimum, e7d8328 ("perf(set_passwords): Run
module in Network stage (#5395)") needs to be reverted

No, reverting has downsides too. A code fix appears simple, I'll propose something.

all releases that have this change (24.2, 24.3 and 24.4).

Upstream cloud-init doesn't backport fixes to old upstream releases. We occasionally cherry-pick patches from main into the latest release branch. Once 24.4 is released there is no benefit for upstream to support 24.3 since any fixes for 24.3 exist in a newer release.

@holmanb
Member

holmanb commented Dec 16, 2024

@ani-sinha @zhaohuijuan Please test #5935

@ani-sinha
Contributor

I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.

set_passwords module, when run earlier from cloud-init network stage (started by cloud-init.service) would make sshd service restart block
indefinitely. This in turn blocks the cloud-init network stage from starting.
The restart of sshd would be needed if there is a change in sshd config -
the restart would make sure that the config is effective. cloud-init unit file
is configured to start before sshd service (there is a Before=sshd.service in
the unit file). This cyclic dependency causes a deadlock and systemd waits
indefinitely for cloud-init service to start while cloud-init waits for sshd
to start.

Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually (systemctl start cloud-init.service).

Why only for manual invocation?

I don't think that this is actually a high priority for a number of reasons:

  1. If this were an important use case then I would have expected a bug report sooner than this. Releases 24.2 and 24.3 have been in use on public clouds for quite a while - yet this is the first bug report. If this were truly important I would have expected a user report months ago.

  2. There is a trivial workaround: invoking the command manually should produce the same results without the systemd deadlock.

  3. This isn't really an expected use case of cloud-init. Interacting with cloud-init in this way is neither expected nor required.

  4. Running this stage via systemctl start cloud-init.service will not behave the same way in the future.

Why?

At the bare minimum, e7d8328 ("perf(set_passwords): Run

module in Network stage (#5395)") needs to be reverted

No, reverting has downsides too. A code fix appears simple, I'll propose something.

all releases that have this change (24.2, 24.3 and 24.4).

Upstream cloud-init doesn't backport fixes to old upstream releases. We occasionally cherry-pick patches from main into the latest release branch. Once 24.4 is released there is no benefit for upstream to support 24.3 since any fixes for 24.3 exist in a newer release.

So there is no concept of LTS I guess ...

@ani-sinha
Contributor

I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.

set_passwords module, when run earlier from cloud-init network stage (started by cloud-init.service) would make sshd service restart block
indefinitely. This in turn blocks the cloud-init network stage from starting.
The restart of sshd would be needed if there is a change in sshd config -
the restart would make sure that the config is effective. cloud-init unit file
is configured to start before sshd service (there is a Before=sshd.service in
the unit file). This cyclic dependency causes a deadlock and systemd waits
indefinitely for cloud-init service to start while cloud-init waits for sshd
to start.

Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually (systemctl start cloud-init.service).

Why only for manual invocation?

hmm, OK I see your comment

+            # This module runs Before=sshd.service. What that means is that
+            # the code can only get to this point if a user manually starts the
+            # network stage. While this isn't a well-supported use-case, this
+            # does cause a deadlock if started via systemd directly:
+            # "systemctl start cloud-init.service". Prevent users from causing
+            # this deadlock by forcing systemd to ignore dependencies when
+            # restarting. Note that this deadlock is not possible in newer
+            # versions of cloud-init, since starting the second service doesn't
+            # run the second stage in 24.3+. This code therefore exists solely
+            # for backwards compatibility so that users who think that they
+            # need to manually start cloud-init (why?) with systemd (again,
+            # why?) can do so.

Yes, it can happen when sshd has already been started as part of the boot-up sequence and we manually start cloud-init after the fact. But it can also happen when users install the cloud-init rpm and start cloud-init on a system that did not have cloud-init installed (which is how I reproduced this issue). This case is, I think, quite common and should not be ignored.

We did see this issue in 24.4, so it seems you are wrong when you say:

+            # Note that this deadlock is not possible in newer
+            # versions of cloud-init, since starting the second service doesn't
+            # run the second stage in 24.3+

@zhaohuijuan
Author

Thanks @ani-sinha and @holmanb for looking at this.

I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.

Actually, it is easy to reproduce manually:

  1. Create an instance without cloud-init
  2. Then install cloud-init and start cloud-init.service

set_passwords module, when run earlier from cloud-init network stage (started by cloud-init.service) would make sshd service restart block
indefinitely. This in turn blocks the cloud-init network stage from starting.
The restart of sshd would be needed if there is a change in sshd config -
the restart would make sure that the config is effective. cloud-init unit file
is configured to start before sshd service (there is a Before=sshd.service in
the unit file). This cyclic dependency causes a deadlock and systemd waits
indefinitely for cloud-init service to start while cloud-init waits for sshd
to start.

Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually (systemctl start cloud-init.service).

Although this is a corner case, cloud-init gets stuck when we hit it and the impact is bad, so I think it is worthwhile to fix.
It should not get stuck in any case.

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests. We hit this issue after upgrading to 24.4; I think it is a regression and worth fixing.

@ani-sinha
Contributor

Thanks @ani-sinha and @holmanb for looking at this.

I could not reproduce this issue using the steps above. It seems that a configuration that actually requires a restart (a cloud-config that modifies the sshd config) is also required to reproduce this issue.

Actually, it is easy to reproduce manually:

  1. Create an instance without cloud-init
  2. Then install cloud-init and start cloud-init.service

Yes this is the situation where the image did not come with cloud-init and the user installs the rpms and starts cloud-init later.

set_passwords module, when run earlier from cloud-init network stage (started by cloud-init.service) would make sshd service restart block
indefinitely. This in turn blocks the cloud-init network stage from starting.
The restart of sshd would be needed if there is a change in sshd config -
the restart would make sure that the config is effective. cloud-init unit file
is configured to start before sshd service (there is a Before=sshd.service in
the unit file). This cyclic dependency causes a deadlock and systemd waits
indefinitely for cloud-init service to start while cloud-init waits for sshd
to start.

Such a deadlock is possible after that change, yes. However this will only happen if cloud-init is invoked manually (systemctl start cloud-init.service).

Although this is a corner case,

I would not call it a corner case.

but cloud-init gets stuck when we hit it and the impact is bad, so I think it is worthwhile to fix. It should not get stuck in any case.

Yes, this breaks in a bad way and needs to be fixed.

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests. We hit this issue after upgrading to 24.4; I think it is a regression and worth fixing.

@ani-sinha
Contributor

We did see this issue in 24.4, so it seems you are wrong when you say:

Oh, we did revert the single process optimization patch and 0680d03304c34fc4c3081f29d99f140d507dd923 ("chore: eliminate redundant ordering dependencies (#5819)"). So maybe with those reverted, it is possible to run into this.

@ani-sinha
Contributor

ani-sinha commented Dec 17, 2024

@ani-sinha @zhaohuijuan Please test #5935

We tested this and I reviewed your PR. The quotes around ignore-dependencies are not needed and break your patch. I removed them, re-tested yours, and the issue seems fixed.
However, --job-mode=ignore-dependencies only works on distros that use systemd, and _restart_ssh_daemon() is called for both systemd and non-systemd paths. So you might want to wrap it within

from cloudinit.distros import uses_systemd
if uses_systemd():
...

or add an additional argument to _restart_ssh_daemon() and pass it from the code block that is called for distros that use systemd. Something like

diff --git a/cloudinit/config/cc_set_passwords.py b/cloudinit/config/cc_set_passwords.py
index f58a1dba2..16074330f 100644
--- a/cloudinit/config/cc_set_passwords.py
+++ b/cloudinit/config/cc_set_passwords.py
@@ -45,10 +45,10 @@ def get_users_by_type(users_list: list, pw_type: str) -> list:
     )
 
 
-def _restart_ssh_daemon(distro: Distro, service: str):
+def _restart_ssh_daemon(distro: Distro, service: str, *extra_args: str):
     try:
         distro.manage_service(
-            "restart", service, "--job-mode=ignore-dependencies"
+            "restart", service, extra_args
         )
         LOG.debug("Restarted the SSH daemon.")
     except subp.ProcessExecutionError as e:
@@ -118,7 +118,7 @@ def handle_ssh_pwauth(pw_auth, distro: Distro):
             # for backwards compatibility so that users who think that they
             # need to manually start cloud-init (why?) with systemd (again,
             # why?) can do so.
-            _restart_ssh_daemon(distro, service)
+            _restart_ssh_daemon(distro, service, "--job-mode=ignore-dependencies")
     else:
         _restart_ssh_daemon(distro, service)
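For what it's worth, the proposed calling convention can be sanity-checked with a stub in place of cloudinit.distros.Distro (StubDistro below is purely illustrative, not cloud-init's API): the extra flag should flow through only on the systemd path.

```python
class StubDistro:
    """Records what manage_service would run, instead of calling systemctl."""

    def __init__(self):
        self.calls = []

    def manage_service(self, action, service, *extra_args):
        # a systemd-backed distro would execute roughly:
        #   systemctl <action> <service> <extra_args...>
        self.calls.append(["systemctl", action, service, *extra_args])


def _restart_ssh_daemon(distro, service, *extra_args):
    # mirrors the diff above: forward any extra flags unchanged
    distro.manage_service("restart", service, *extra_args)


distro = StubDistro()
# systemd path: break the ordering cycle by ignoring dependencies
_restart_ssh_daemon(distro, "sshd", "--job-mode=ignore-dependencies")
# non-systemd path: plain restart, no extra flags
_restart_ssh_daemon(distro, "sshd")
print(distro.calls)
```

The first recorded call carries the extra flag and the second does not, which is exactly the asymmetry the diff is after.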

@zhaohuijuan
Author

@ani-sinha @zhaohuijuan Please test #5935

We tested this and I reviewed your PR. The quotes around ignore-dependencies are not needed and break your patch. I removed them, re-tested yours, and the issue seems fixed.

Yes, and regression tests pass on OpenStack/Azure.

@holmanb
Member

holmanb commented Dec 17, 2024

So there is no concept of LTS I guess ...

Correct. Cloud-init upstream has no LTS guarantees. We communicate breaking changes in the docs to assist downstreams that want to provide long term stability.

But it can also happen when users install cloud-init rpm and start cloud-init on a system that did not have cloud-init rpm installed (which is how I reproduced this issue). This case is I think quite common and should not be ignored.

Common for whom? This certainly isn't (and shouldn't be) common for users. Cloud-init is used for runtime customization. Anybody doing an image build should install it into a pristine image and never start the service before first launch. Starting it manually before launching the image would leave behind artifacts and produce a dirty image.

Although this is a corner case,

I would not call it a corner case.

Again, if this were a common end user use-case then I would have expected user bug reports on Ubuntu closer to 4 months ago when 24.2 was released in Ubuntu. This has been well-exercised by users since then - and no user reports.

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests

See my comment about dirty images above. Testing an image that previously started cloud-init services risks leaving behind artifacts which changes cloud-init's first boot behavior.

What signal do you expect to receive from manually running cloud-init services before launch? Wouldn't you get the same signal by launching the image without the manual service runs?

I would recommend removing this from your CI going forward. Manually starting cloud-init services via systemd doesn't add significant value over manually running the commands directly and it also isn't supported after the single process change.

We did see this issue in 24.4, so it seems you are wrong when you say:

Oh, we did revert the single process optimization patch and 0680d03 ("chore: eliminate redundant ordering dependencies (#5819)"). So maybe with those reverted, it is possible to run into this.

Agreed. As you noted, reverting the single process change would cause this deadlock to happen on newer versions.

@holmanb
Member

holmanb commented Dec 17, 2024

@zhaohuijuan @ani-sinha Thank you for testing. I applied the requested changes to the PR and fixed up tests accordingly.

@ani-sinha
Contributor

Again, if this were a common end user use-case then I would have expected user bug reports on Ubuntu closer to 4 months ago when 24.2 was released in Ubuntu. This has been well-exercised by users since then - and no user reports.

Does Ubuntu use systemd?

@holmanb
Member

holmanb commented Dec 17, 2024

Again, if this were a common end user use-case then I would have expected user bug reports on Ubuntu closer to 4 months ago when 24.2 was released in Ubuntu. This has been well-exercised by users since then - and no user reports.

Does Ubuntu use systemd?

Yes

@holmanb
Member

holmanb commented Dec 17, 2024

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests

See my comment about dirty images above. Testing an image that previously started cloud-init services risks leaving behind artifacts which changes cloud-init's first boot behavior.

What signal do you expect to receive from manually running cloud-init services before launch? Wouldn't you get the same signal by launching the image without the manual service runs?

I would recommend removing this from your CI going forward. Manually starting cloud-init services via systemd doesn't add significant value over manually running the commands directly and it also isn't supported after the single process change.

A test: Install -> Start service -> Shutdown -> Reboot is sensible for a daemon service, such as sshd, gpg-agent, etc. In this case you want to be sure that the daemon works when first installed and after boot.

Cloud-init is fundamentally different. It is not a daemon that provides external services. It is a one-shot application that configures the system on startup. I don't think that this kind of test provides useful signal of a realistic user workflow.

As for the single process change: Users can still manually run cloud-init if they want to (using the CLI). A different set of commands may be required for anybody using systemctl today, but the same features are otherwise still available. No real features have been lost - users can still manually run cloud-init stages.

It seems to me like the primary reason to make this change is to satisfy a CI test that doesn't represent a real user requirement and will need to be changed in the future anyways. Please correct me if I'm wrong.

@zhaohuijuan
Author

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests

See my comment about dirty images above. Testing an image that previously started cloud-init services risks leaving behind artifacts which changes cloud-init's first boot behavior.
What signal do you expect to receive from manually running cloud-init services before launch? Wouldn't you get the same signal by launching the image without the manual service runs?
I would recommend removing this from your CI going forward. Manually starting cloud-init services via systemd doesn't add significant value over manually running the commands directly and it also isn't supported after the single process change.

Agreed, and we already updated our CI when we hit this issue.

A test: Install -> Start service -> Shutdown -> Reboot is sensible for a daemon service, such as sshd, gpg-agent, etc. In this case you want to be sure that the daemon works when first installed and after boot.

Cloud-init is fundamentally different. It is not a daemon that provides external services. It is a one-shot application that configures the system on startup. I don't think that this kind of test provides useful signal of a realistic user workflow.

As for the single process change: Users can still manually run cloud-init if they want to (using the CLI). A different set of commands may be required for anybody using systemctl today, but the same features are otherwise still available. No real features have been lost - users can still manually run cloud-init stages.

It seems to me like the primary reason to make this change is to satisfy a CI test that doesn't represent a real user requirement and will need to be changed in the future anyways. Please correct me if I'm wrong.

Yes, but as for the primary reason for the fix: I think it should not get stuck in any case, even if this is an unusual or corner case. So I think the fix is worthwhile.
Could you please help push the patch to merge? We would like to backport this fix to the RHEL rebase build ASAP to catch up with the RHEL release schedule.

Thanks @holmanb for the fix and quick response.

@holmanb
Member

holmanb commented Dec 18, 2024

And we actually have such a case in our VMware CI job: we would like to ensure some services can start successfully before running tests

See my comment about dirty images above. Testing an image that previously started cloud-init services risks leaving behind artifacts which changes cloud-init's first boot behavior.
What signal do you expect to receive from manually running cloud-init services before launch? Wouldn't you get the same signal by launching the image without the manual service runs?
I would recommend removing this from your CI going forward. Manually starting cloud-init services via systemd doesn't add significant value over manually running the commands directly and it also isn't supported after the single process change.

Agreed, and we already updated our CI when we hit this issue.

Great, thanks.

A test: Install -> Start service -> Shutdown -> Reboot is sensible for a daemon service, such as sshd, gpg-agent, etc. In this case you want to be sure that the daemon works when first installed and after boot.
Cloud-init is fundamentally different. It is not a daemon that provides external services. It is a one-shot application that configures the system on startup. I don't think that this kind of test provides useful signal of a realistic user workflow.
As for the single process change: Users can still manually run cloud-init if they want to (using the CLI). A different set of commands may be required for anybody using systemctl today, but the same features are otherwise still available. No real features have been lost - users can still manually run cloud-init stages.
It seems to me like the primary reason to make this change is to satisfy a CI test that doesn't represent a real user requirement and will need to be changed in the future anyways. Please correct me if I'm wrong.

Yes, but as for the primary reason for the fix: I think it should not get stuck in any case, even if this is an unusual or corner case. So I think the fix is worthwhile. Could you please help push the patch to merge? We would like to backport this fix to the RHEL rebase build ASAP to catch up with the RHEL release schedule.

It looks like @TheRealFalcon beat me to it.

Thanks @holmanb for the fix and quick response.

Welcome!

@holmanb
Member

holmanb commented Dec 18, 2024

Fixed in #5935

@holmanb holmanb closed this as completed Dec 18, 2024
@sshedi
Contributor

sshedi commented Dec 18, 2024

Hi @holmanb, @TheRealFalcon, will there be a 24.4.1 release with this fix?

@TheRealFalcon
Member

@sshedi , we are not planning on releasing a 24.4.1. It doesn't make sense from an upstream perspective as this bug isn't currently possible unless you're already patching the upstream code, and it isn't a use case we can continue to support.

@sshedi
Contributor

sshedi commented Dec 19, 2024

Thanks @TheRealFalcon, makes sense.
