Testing MicroCloud further here, I wanted to simulate a catastrophic cluster member failure. That's easy enough in my sandbox setup, where I have MicroCloud running on 3 VMs on the same physical host.
I simply ran "lxc stop --force" on a VM that had a running LXD container on it, in order to simulate a crash.
I then - maybe naively - assumed that the amazing thing about clustered LXD and Ceph is that I should be able to just spin the container up immediately on another cluster member. Right? However, I had a hard time finding recommended steps online, so I just tried the following:
# lxc exec lxdvm1 bash
root@lxdvm1:~# lxc ls
+----------+-------+------+------+-----------+-----------+----------+
|   NAME   | STATE | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+-------+------+------+-----------+-----------+----------+
| lxdtest1 | ERROR |      |      | CONTAINER | 0         | lxdvm2   |
+----------+-------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc start lxdtest1
Error: Get "https://10.1.255.88:8443/1.0/instances/lxdtest1": Unable to connect to: 10.1.255.88:8443 ([dial tcp 10.1.255.88:8443: connect: no route to host])
root@lxdvm1:~# lxc ls
+----------+-------+------+------+-----------+-----------+----------+
|   NAME   | STATE | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+-------+------+------+-----------+-----------+----------+
| lxdtest1 | ERROR |      |      | CONTAINER | 0         | lxdvm2   |
+----------+-------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc move lxdtest1 --target lxdvm1
root@lxdvm1:~# lxc ls
+----------+---------+------+------+-----------+-----------+----------+
|   NAME   |  STATE  | IPV4 | IPV6 |   TYPE    | SNAPSHOTS | LOCATION |
+----------+---------+------+------+-----------+-----------+----------+
| lxdtest1 | STOPPED |      |      | CONTAINER | 1         | lxdvm1   |
+----------+---------+------+------+-----------+-----------+----------+
root@lxdvm1:~# lxc start lxdtest1
Error: User signaled us three times, exiting. The remote operation will keep running
Try `lxc info --show-log lxdtest1` for more info
As you can see, "lxc start" just hangs. I tried a few times, but it just sits there, and "lxc info --show-log" reveals nothing useful.
Is this not how it's supposed to work? Surely being able to recover quickly from a node going down is one of the core points of all this clustering/Ceph goodness, or am I just thinking about this wrong? :)
Thank you for any insights you can provide here.
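For reference, here's a rough shell sketch of the recovery flow I was attempting. The list_error_instances helper is something I threw together for this post (it just parses the default "lxc ls" table layout shown above), and the target member name is from my lab, so treat it as a sketch rather than a recommended procedure.

```shell
# Hypothetical helper: print "NAME LOCATION" for every instance
# that `lxc ls` shows in the ERROR state (i.e. stranded on a
# cluster member that is down). Assumes the default table layout
# from the output above: NAME is column 2, LOCATION is column 8.
list_error_instances() {
  awk -F'|' '/\| ERROR \|/ {
    gsub(/ /, "", $2)   # NAME column
    gsub(/ /, "", $8)   # LOCATION column
    print $2, $8
  }'
}

# Sketch of the recovery itself (member names are from my setup):
# lxc ls | list_error_instances | while read -r name member; do
#   echo "relocating $name (was on $member)"
#   lxc move "$name" --target lxdvm1   # move off the dead member
#   lxc start "$name"
# done
```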
After some more testing today, on a second test run this just worked. I found an old forum post from Stephane Graber confirming that this is indeed the way to do it. I'm happy it works, but I'm at a loss as to why it didn't work the first time.
Yesterday I tried running "lxc monitor" to see what was happening: the start operation was processed but then apparently left in a "pending" state. What LXD was waiting for, and why, I can't tell. I suspect it was my environment and related to networking, as I've been having some problems with that on the lxdvm1 instance - so my messy test environment is probably to blame here.
Anyway, this works - for now. I'll update here if I hit this particular issue again, as I'll be doing a lot of testing over the coming week or so covering various scenarios. If I see nothing further, I'll make sure to close this.