Quick recovery after cluster member failure, how? #272

Open
webdock-io opened this issue Mar 23, 2024 · 1 comment
Labels
Incomplete Waiting on more information from reporter

Comments

@webdock-io

I'm testing MicroCloud further here and wanted to simulate a catastrophic cluster member failure. That's easy enough to do in my sandbox setup, where I have MicroCloud set up on 3 VMs on the same physical host.

I simply ran "lxc stop --force" on a VM which had a running LXD container on it, in order to simulate a crash.
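For reference, a minimal sketch of the simulation step, run from the physical host (assuming the member VM that hosted the container is called "lxdvm2", as in the output below):

```shell
# On the physical host: hard-stop the member VM to simulate a crash.
# --force kills the VM immediately instead of a clean shutdown, which
# is what makes this behave like a real node failure.
lxc stop --force lxdvm2
```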

I then - maybe naively - assumed that the amazing thing about clustered LXD and Ceph would be that I could just spin up the container immediately on another cluster member. Right? However, I had a hard time finding information on the recommended steps online, so I just tried the following:

# lxc exec lxdvm1 bash
root@lxdvm1:~# lxc ls
+----------+-------+------+------+-----------+-----------+----------+
|   NAME   | STATE | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+-------+------+------+-----------+-----------+----------+
| lxdtest1 | ERROR |      |      | CONTAINER | 0         | lxdvm2   |
+----------+-------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc start lxdtest1
Error: Get "https://10.1.255.88:8443/1.0/instances/lxdtest1": Unable to connect to: 10.1.255.88:8443 ([dial tcp 10.1.255.88:8443: connect: no route to host])
root@lxdvm1:~# lxc ls
+----------+-------+------+------+-----------+-----------+----------+
|   NAME   | STATE | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+-------+------+------+-----------+-----------+----------+
| lxdtest1 | ERROR |      |      | CONTAINER | 0         | lxdvm2   |
+----------+-------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc move lxdtest1 --target lxdvm1
root@lxdvm1:~# lxc ls
+----------+---------+------+------+-----------+-----------+----------+
|   NAME   |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+---------+------+------+-----------+-----------+----------+
| lxdtest1 | STOPPED |      |      | CONTAINER | 1         | lxdvm1   |
+----------+---------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc start lxdtest1


Error: User signaled us three times, exiting. The remote operation will keep running
Try `lxc info --show-log lxdtest1` for more info

As you can see, the lxc start just hangs. I tried a few times but it just sits there. lxc info --show-log reveals nothing useful.

Is this not how it's supposed to work? Surely being able to recover quickly from a node going down is one of the core points of all this clustering/Ceph goodness, or am I just thinking about this wrong? :)
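For completeness, a sketch of how I'd expect to confirm the member failure before relocating the instance (member and instance names taken from the session above):

```shell
# Confirm which cluster member is down; the failed member should
# show its STATUS as Offline in the output.
lxc cluster list

# Relocate the instance from the dead member to a healthy one.
# While the source member is offline, the instance shows ERROR state
# and the move only updates database records (the root disk is on Ceph).
lxc move lxdtest1 --target lxdvm1
lxc start lxdtest1
```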

Thank you for any insights you can provide here

@webdock-io
Author

After some more testing today, on a second test run this just worked. I also found an old forum post from Stéphane Graber confirming that this is indeed the way to do it. I'm happy it works, but I'm at a loss as to why it didn't work the first time.

Yesterday I tried running lxc monitor to see what was happening, and it seemed the start operation was processed but left in a "pending" state. What LXD was waiting for, and why, I can't tell. I suspect it was my environment and related to networking, as I've been having some problems with that on the lxdvm1 instance - so this is probably just my messy test environment to blame here.
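In case it helps anyone reproducing this, a sketch of the monitoring setup I used (run in a second terminal while retrying the start):

```shell
# Stream only operation events from the LXD event API, pretty-printed,
# to watch the start operation's state transitions in real time.
lxc monitor --type=operation --pretty
```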

Anyway, this works - for now. I'll update here if I hit this particular issue again, as I'll be doing a lot of testing of various scenarios in the coming week or so. If I see nothing further, I'll make sure to close this.

@roosterfish roosterfish added the Incomplete Waiting on more information from reporter label Mar 25, 2024