
Failed to start instances on new cluster members #247

Open
roosterfish opened this issue Dec 12, 2023 · 13 comments
Labels: Bug (Confirmed to be a bug)
roosterfish (Contributor) commented Dec 12, 2023:

Version

Same versions of the snaps on all cluster members.

root@m1:~# snap list
Name        Version                 Rev    Tracking       Publisher   Notes
core20      20230801                2015   latest/stable  canonical✓  base
core22      20231123                1033   latest/stable  canonical✓  base
lxd         5.19-8635f82            26200  latest/stable  canonical✓  -
microceph   0+git.7b5672b           707    quincy/stable  canonical✓  -
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  -
microovn    22.03.3+snap1d18f95c73  349    22.03/stable   canonical✓  -
snapd       2.60.4                  20290  latest/stable  canonical✓  snapd

Description

After adding a new member to the MicroCloud cluster using microcloud add, existing instances can be moved to the new cluster member but fail to start:

root@m3:~# lxc mv v1 --target m4
root@m3:~# lxc start v1
Error: Failed pre-start check for device "eth0": Network "default" unavailable on this server
Try `lxc info --show-log v1` for more info

The network's status on the new member is also marked as Unavailable:

root@m1:~# lxc network show default --target m4
config:
  bridge.mtu: "1442"
  ipv4.address: 10.85.238.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:a345:26de:b041::1/64
  ipv6.nat: "true"
  network: UPLINK
  volatile.network.ipv4.address: 10.247.231.100
description: ""
name: default
type: ovn
used_by:
- /1.0/instances/v1
- /1.0/profiles/default
managed: true
status: Unavailable
locations:
- m4
- m1
- m2
- m3

The logs on m4 show the following message every minute:

Dec 12 14:44:31 m4 lxd.daemon[5657]: time="2023-12-12T14:44:31Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"default\" setup: Failed to run: ovn-nbctl --timeout=10 --db unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock --wait=sb --format=csv --no-headings --data=bare --colum=_uuid,name,acl find port_group name=lxd_net2: exit status 1 (ovn-nbctl: unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory))" network=default project=default
roosterfish added the Bug label on Dec 12, 2023
tomponline (Member) commented:
@roosterfish whenever you're reporting a (potential) cross-snap issue (or really any time you're reporting a MicroCloud issue), it would be useful to see the output of snap list on each server so we can see precisely which snap revisions of microcloud, lxd, microceph, and microovn are installed. Thanks

roosterfish (Contributor, Author) commented:
For now, a workaround is to reload the LXD daemon on the affected cluster member using systemctl reload snap.lxd.daemon. Afterwards the network reports the status Created and can be used as expected.
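A minimal sketch of the workaround and a follow-up check (assuming the affected member is m4, as in the description above):

root@m4:~# systemctl reload snap.lxd.daemon
root@m1:~# lxc network show default --target m4 | grep ^status
status: Created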

tomponline (Member) commented:
@roosterfish what do the LXD logs show for the error/reason for the network not being startable?

roosterfish (Contributor, Author) commented:
@roosterfish what do the LXD logs show for the error/reason for the network not being startable?

I have updated the description.

tomponline (Member) commented:
@roosterfish @masnax the unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock reference suggests LXD was started before microovn was installed, as it doesn't seem to be using the microovn socket location. Is that right @masnax?
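One quick check on the affected member (a sketch; the first path is MicroOVN's socket, the second is the one LXD is trying per the log above, and the server-level network.ovn.northbound_connection key shows which address LXD is configured to use):

root@m4:~# ls -l /var/snap/microovn/common/run/ovn/ovnnb_db.sock
root@m4:~# ls -l /var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock
root@m4:~# lxc config get network.ovn.northbound_connection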

gabrielmougard self-assigned this on Jan 3, 2024
gabrielmougard (Contributor) commented Jan 3, 2024:

I managed to reproduce this too and I'm having a look at it. One question though: in a 4-node configuration we have 3 ovn-central services to guarantee OVN HA (on m1, m2, m3, each with a /var/snap/microovn/common/run/ovn/ovnnb_db.sock file), so the fourth node is not supposed to have an ovnnb_db.sock at all, right? (m4 only runs an ovn-chassis and an ovn-switch.) Is that right @tomponline?
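For reference, the service placement can be checked from any member (a sketch; microovn status lists the services running on each node):

root@m1:~# microovn status

If the placement above is right, m1, m2, and m3 should list the central service, while m4 should only list chassis and switch.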

masnax (Contributor) commented Jan 11, 2024:

Looks like this is an issue with LXD cluster joins. It seems that joining a cluster after the fact using MemberConfig sets up OVN differently than the initial creation of the cluster does.

I'm able to reproduce this only when adding nodes to an existing cluster, whereas using the same nodes and initializing the whole cluster at that size results in the network working fine.

I'm still trying to figure out what LXD's doing exactly, but what I've gathered from the request payloads so far is that when creating the network on init, the payloads look like:

{NetworkPut:{Config:map[parent:enp6s0] Description:} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[] Description:Uplink for OVN networks} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[parent:enp6s0 volatile.last_state.created:false] Description:} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[network:UPLINK] Description:Default OVN network} Name:default Type:ovn}
{NetworkPut:{Config:map[bridge.mtu:1442 ipv4.address:10.18.8.1/24 ipv4.nat:true ipv6.address:fd42:cbc4:cc49:8d30::1/64 ipv6.nat:true network:UPLINK parent:enp6s0] Description:} Name:default Type:ovn}

and when adding a node, they look like:

{NetworkPut:{Config:map[parent:enp6s0 volatile.last_state.created:false] Description:} Name:UPLINK Type:physical}
{NetworkPut:{Config:map[bridge.mtu:1442 ipv4.address:10.18.8.1/24 ipv4.nat:true ipv6.address:fd42:cbc4:cc49:8d30::1/64 ipv6.nat:true network:UPLINK] Description:} Name:default Type:ovn}

The main difference is that the parent config field is set for the default network when initializing the cluster, but that's not a valid key for an ovn network anyway.
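For comparison, the stored member-specific uplink config can be queried per target with standard lxc commands (a sketch using the network and member names from this thread):

root@m1:~# lxc network get UPLINK parent --target m1
root@m1:~# lxc network get UPLINK parent --target m4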

masnax (Contributor) commented Jan 11, 2024:

Hm, that was indeed it. If the final payload has parent=enp6s0 set, the network forms properly.

masnax (Contributor) commented Sep 3, 2024:

From my initial testing, this appears to be fixed in LXD now. The default network is still created on the new cluster members without parent set (which is valid, since parent is a member-specific config and an ovn-type network has no member-specific configuration), but this no longer seems to affect the functionality of the network on that cluster member.

@roosterfish If you remember the initial setup you used to replicate this, could you please give it a shot to ensure I'm not missing an edge case?

roosterfish (Contributor, Author) commented:
It looks like 5.21/stable is still affected by this, as I see the same error when starting the instance on the new member.

I suspect LXD 5.21/stable is the one we will recommend installing when we release the MicroCloud LTS?

I have deployed the following set of snaps:

lxd         5.21.2-2f4ba6b          30131  5.21/stable    canonical✓  in-cohort
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort

The same error occurs with MicroCloud latest/edge (I have used the latest Ceph in order not to get any errors with edge MicroCloud):

lxd         5.21.2-2f4ba6b             30131  5.21/stable    canonical✓  in-cohort
microceph   19.2.0~git+snap36f71d7700  1148   latest/edge    canonical✓  in-cohort
microcloud  git-ebaa9ba                955    latest/edge    canonical✓  in-cohort
microovn    22.03.3+snap0e23a0e4f5     395    22.03/stable   canonical✓  in-cohort

And when using LXD latest/stable the same error still appears:

lxd         6.1-78a3d8f                30130  latest/stable  canonical✓  in-cohort
microceph   19.2.0~git+snap36f71d7700  1148   latest/edge    canonical✓  in-cohort
microcloud  git-ebaa9ba                955    latest/edge    canonical✓  in-cohort
microovn    22.03.3+snap0e23a0e4f5     395    22.03/stable   canonical✓  in-cohort

The reproducer steps (a command sketch follows the list):

  1. Bootstrap cluster with members m1, m2, m3
  2. Start a new instance v1
  3. Stop the instance
  4. Add member m4 to the cluster
  5. Move the stopped instance v1 to m4
  6. Start the instance v1
  7. Error on m4 (snap logs lxd): 2024-09-04T08:39:53Z lxd.daemon[2881]: time="2024-09-04T08:39:53Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"default\" setup: Failed to run: ovn-nbctl --timeout=10 --db unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock --wait=sb --format=csv --no-headings --data=bare --colum=_uuid,name,acl find port_group name=lxd_net2: exit status 1 (ovn-nbctl: unix:/var/lib/snapd/hostfs/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory))" network=default project=default
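A rough command transcript of those steps (the image alias and --vm flag are assumptions, and the interactive microcloud init/add prompts are omitted):

root@m1:~# microcloud init                 # bootstrap with m1, m2, m3
root@m1:~# lxc launch ubuntu:22.04 v1 --vm # image and instance type assumed
root@m1:~# lxc stop v1
root@m1:~# microcloud add                  # join m4
root@m1:~# lxc mv v1 --target m4
root@m1:~# lxc start v1                    # fails with the error from step 7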

tomponline (Member) commented:
@roosterfish is 5.21/edge affected?
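One way to test is to switch the LXD snap channel on each member (a standard snapd refresh; MicroCloud's cohort handling is ignored here):

root@m1:~# snap refresh lxd --channel=5.21/edge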

roosterfish (Contributor, Author) commented Sep 4, 2024:

@roosterfish is 5.21/edge affected?

Mh, 5.21/edge seems not to be affected. No error when starting the instance on m4.

Using LXD from 5.21/edge with the other snaps from stable:

lxd         git-75a87af             30149  5.21/edge      canonical✓  in-cohort
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort

tomponline (Member) commented Sep 4, 2024:

@roosterfish is 5.21/edge affected?

Mh, 5.21/edge seems not to be affected. No error when starting the instance on m4.

Using LXD from 5.21/edge with the other snaps from stable:

lxd         git-75a87af             30149  5.21/edge      canonical✓  in-cohort
microceph   0+git.4a608fc           793    quincy/stable  canonical✓  in-cohort
microcloud  1.1-04a1c49             734    latest/stable  canonical✓  in-cohort
microovn    22.03.3+snap0e23a0e4f5  395    22.03/stable   canonical✓  in-cohort

Great, so it's been fixed in a backport and will be in 5.21.3.
