-
Notifications
You must be signed in to change notification settings - Fork 883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(azure): Avoid non-primary nics from having routes to DNS CPC-4224 #5180
Conversation
d1e179d
to
57ce7ba
Compare
57ce7ba
to
b81625e
Compare
b81625e
to
e41c2bc
Compare
Hey, @cjp256, could you also take a look at this one? Thanks. |
cloudinit/sources/DataSourceAzure.py
Outdated
"match": {"driver": "hv_netvsc", "name": "!eth0"}, | ||
} | ||
|
||
netconfig["ethernets"]["hotpluggedeth0"] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about naming the ethernets in the configuration primary
and secondary
instead of hotpluggedeth0
and ephemeral
? I believe that the former better expresses the relationship between the two. That said, I am open to other suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. Agreed.
"""Integration test for https://warthogs.atlassian.net/browse/CPC-3999 | ||
|
||
Azure should have the primary NIC only route to DNS. | ||
Ensure other NICs do not have route to DNS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not believe this test exercises the behaviour of concern. As described in https://warthogs.atlassian.net/browse/CPC-3999 , there is no route to the DHCP/DNS server through the secondary NIC on Mantic hosts (per ip route
), but the DNS resolver has still associated the DNS server with the secondary NIC and thus transmits DNS queries through that NIC. As such, I believe that this test should instead verify that there is no longer a DNS server associated with the secondary NIC in the output of resolvectl dns
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Let me see how parse resolvectl dns
. Thanks for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bionic does not have resolvectl
, so that will be interesting....will figure out how to update the test.
@CalvoM Please update the commit message. As described in https://warthogs.atlassian.net/browse/CPC-3999 , the association of the DNS server with the secondary NIC is a problem as of at least Focal. I would also suggest that you launch Bionic and Xenial VMs to see how far back the fix needs to be ported. |
I think I've only noticed RoutesToDns=True default in 22.04+. Will test this out. Overall, I think maybe if this is an Azure-specific change, would it work to just conditionally add the use-dns flag here for all but the first nic (that is, idx != 0)? Thanks for finding a solution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionally, the approach looks good to me, thanks.
SF: 00380708 says that the networkd change of defauts starts in jammy and onwards, do we need to patch this new config out for focal, or is this a noop change for focal and back? @CalvoM, @danpdraper.
Use the primary nic ip address when creating an VM with multi nics. Needed for canonical/cloud-init#5180
@cjp256 Our investigation revealed that the association of the DNS server with the second NIC by
Assuming the foregoing is consistent with the implementation, DNS requests should be transmitted through both NICs on both Focal and Jammy hosts ( In any case, @CalvoM 's solution will remove the secondary NIC from the DNS resolution path and thus will address any associated resolution latency. |
Good question. Even though resolution latency does not appear to be affected on hosts running pre-Jammy releases, we know that DNS resolution through the second NIC is not supported, so a suggestion in the |
903fd44
to
f93dc8e
Compare
cloudinit/sources/DataSourceAzure.py
Outdated
@@ -1945,6 +1948,28 @@ def load_azure_ds_dir(source_dir): | |||
return (md, ud, cfg, {"ovf-env.xml": contents}) | |||
|
|||
|
|||
def configure_ephemeral_and_hotplug_device(netconfig: dict) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably rename this function to something like add_dhcp_config_for_primary_and_secondary_nics_to_network_config
. The doc string should be updated accordingly.
f93dc8e
to
872674b
Compare
872674b
to
0e9748d
Compare
We had a discussion with @danpdraper and we see Xenial is not affected by this issue, so we will not backport it to xenial. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @CalvoM, for fixing this. I have added some inline comments/requests.
else: | ||
ret = snapshot_instance.execute("resolvectl dns") | ||
assert ret.ok | ||
routes = yaml.safe_load(ret.stdout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per RESOLVECTL(1) the output of this command is not yaml, the --json
flag does nothing in the dns
subcommand case and the routes can contain :
possible invalidating the use of yaml.safe_load
.
Not sure if the tests are going to hit this problem, but if we would like to be 100% safe here we should split by the first occurrence of :
to correctly parse this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I would try coming up with a logic to parse the output. I think I saw the output was something similar to
Global:
Link 4 (enP13718s1):
Link 3 (eth1): 168.63.129.16
Link 2 (eth0): 168.63.129.16
Thus I thought yaml parsing will be easier.
@@ -1945,6 +1950,30 @@ def load_azure_ds_dir(source_dir): | |||
return (md, ud, cfg, {"ovf-env.xml": contents}) | |||
|
|||
|
|||
def add_dhcp_config_for_primary_and_secondary_nics_to_network_config( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this config only applied to NICs with hv_netvsc
as driver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a virtual machine (VM) is created in Azure, a synthetic network interface is created for each virtual NIC in its configuration.
The synthetic interface is a VMbus device and uses the netvsc driver.
Network packets that use this synthetic interface flow through the virtual switch in the Azure host and onto the datacenter's physical network.
The synthetic interface always has a name in the form eth\<n\>.
Depending on the Linux distribution, the VF interface might have a name in the form eth\<n\>.
Or it might have a different name in the form of enP\<n\> because of a udev rule that does renaming
Source: https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-how-it-works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that:
- Synthetic interfaces are identifiable by the
hv_netvsc
driver. - Synthetic interfaces are the only affected interface kinds affected by the dns problem?
Could please add a breadcrumb pointing to this, as a comment and/or in the commit message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The synthetic interface always has a name in the form eth<n>.
Depending on the Linux distribution, the VF interface might have a name in the form eth<n>.
Or it might have a different name in the form of enP<n> because of a udev rule that does renaming
Your proposed solution only matches against eth*
names, should we also match against enP\<n\>
ones? Or is there a reason I do not see to restrict this to eth*
?
…hem. As from Bionic, systemd has associated DNS server with secondary NICs, this causes delay in the DNS resolution from Jammy and forwards. We see it evidently in the cloud-init logs where resolving the url takes way more seconds.
0e9748d
to
0a06fb5
Compare
That would simplify the fix, but per:
It looks like cloud-init cannot rely on positions in the IMDS response to determine which NIC is the primary one. Or is that ensured? |
The primary nic will always be first in the response. The ordering of secondary NICs are not guaranteed today (but someday...). That why I think all we need to do is simplify the change to just add the The driver/synthetic management is already handled in that method too. |
Closing this as #5314 superseded. |
Proposed Commit Message
Additional Context
From jammy, RouteToDns=true is the default for systemd-networkd.
This causes slow resolution of urls and failures in journalctl.
We need to limit the route to DHCP to the primary NIC.
Test Steps
Checklist
Merge type