Use host networking #29

Merged (2 commits), Aug 17, 2023

sjpb (Collaborator) commented Aug 17, 2023

NB: based on feat/hostport as that is required for host networking.

Uses host networking to avoid MPI errors and give better network performance.

With host networking the pod's hostname is the K8s node name, e.g. sbtest-worker-a438ab51-c69dx. Such names are not compatible with Slurm hostlist expressions, so e.g. sinfo can't contract the node names. This PR uses the downward API to inject the pod name into the container's environment variables (pod names are hostlist-expression compatible, e.g. slurmd-0, because a StatefulSet is used) and explicitly sets the slurmd nodename on startup using slurmd -N <nodename>.
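As a rough illustration of the mechanism described above (not the chart's actual template), the relevant parts of the pod spec can be sketched as a Python dict and dumped to JSON, which kubectl also accepts. The StatefulSet name, labels, image and the POD_NAME variable name are assumptions made for this sketch; only hostNetwork, the downward API fieldRef on metadata.name and slurmd -N come from the description.

```python
import json

# Hypothetical env var name; the PR does not say what the chart calls it.
POD_NAME_ENV = "POD_NAME"

statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "slurmd"},  # pods are then named slurmd-0, slurmd-1, ...
    "spec": {
        "serviceName": "slurmd",  # illustrative value
        "replicas": 2,
        "selector": {"matchLabels": {"app": "slurmd"}},
        "template": {
            "metadata": {"labels": {"app": "slurmd"}},
            "spec": {
                # Share the node's network namespace instead of using the CNI.
                "hostNetwork": True,
                "containers": [
                    {
                        "name": "slurmd",
                        "image": "example/slurm:latest",  # placeholder image
                        "env": [
                            {
                                # Downward API: expose the pod's own name.
                                "name": POD_NAME_ENV,
                                "valueFrom": {
                                    "fieldRef": {"fieldPath": "metadata.name"}
                                },
                            }
                        ],
                        # Kubernetes expands $(VAR) in command/args, so slurmd
                        # registers under the pod name, not the node hostname.
                        "command": ["slurmd", "-D", "-N", f"$({POD_NAME_ENV})"],
                    }
                ],
            },
        },
    },
}

print(json.dumps(statefulset, indent=2))
```

With this shape, the node name slurmd sees is slurmd-0, slurmd-1 and so on, regardless of which K8s node the pod is scheduled on.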

Example of the performance change on arcus using the portal-internal network (i.e. not RoCE). The rows shown are the 0-byte, maximum-bandwidth and maximum-message-size results from an srun-launched IMB-MPI1 pingpong, using the OpenMPI in the image:

Default CNI:

       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        43.03         0.00
...
      1048576           40      5046.30       207.79
...
      4194304           10     24296.99       172.63

Host networking:

       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        34.75         0.00
...
      2097152           20      5929.08       353.71
      4194304           10     12546.33       334.31

(Note that in both cases the K8s node VMs are on the same hypervisor host.)
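To put the two tables side by side, the relative change can be worked out from the quoted rows. This is only arithmetic on the numbers above; the dict keys are names chosen for this sketch, and the peak bandwidths occur at different message sizes (1048576 bytes for the default CNI, 2097152 bytes for host networking).

```python
# Values copied from the IMB-MPI1 pingpong rows quoted above.
default_cni = {"latency_usec": 43.03, "peak_mbytes_sec": 207.79, "max_msg_mbytes_sec": 172.63}
host_net = {"latency_usec": 34.75, "peak_mbytes_sec": 353.71, "max_msg_mbytes_sec": 334.31}

# 0-byte latency: percentage reduction going from default CNI to host networking.
latency_reduction_pct = 100 * (1 - host_net["latency_usec"] / default_cni["latency_usec"])

# Bandwidth: ratio of host networking to default CNI.
peak_bw_ratio = host_net["peak_mbytes_sec"] / default_cni["peak_mbytes_sec"]
max_msg_bw_ratio = host_net["max_msg_mbytes_sec"] / default_cni["max_msg_mbytes_sec"]

print(f"0-byte latency reduction: {latency_reduction_pct:.1f}%")
print(f"peak bandwidth ratio: {peak_bw_ratio:.2f}x")
print(f"bandwidth ratio at 4194304 bytes: {max_msg_bw_ratio:.2f}x")
```

On these figures, host networking gives roughly 19% lower 0-byte latency, about 1.7x the peak bandwidth and about 1.9x the bandwidth at the largest message size.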

@sjpb sjpb marked this pull request as ready for review August 17, 2023 09:02
Base automatically changed from feat/hostport to main August 17, 2023 09:21
sjpb (Collaborator, Author) commented Aug 17, 2023

@sd109 I checked the OOD dashboard, jobs etc. and they still work.

@sd109 sd109 self-requested a review August 17, 2023 09:55
sd109 (Collaborator) left a comment

Nice one, lgtm

@sjpb sjpb merged commit def4a77 into main Aug 17, 2023
2 checks passed
@sjpb sjpb deleted the feat/hostnetwork branch August 17, 2023 10:09