This repo includes working code examples for our blog about Distributed Deep Learning Training with Kubernetes.
Both directories, nccl-tests/ and torchrun/, contain a Dockerfile and one or two Kubernetes manifests. Build the image from the Dockerfile and reference that image in the manifests.
In nccl-tests/, training.yaml should be applied first, and after all pods are ready, launcher.yaml can be applied to trigger the tests.
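The two-step order for nccl-tests/ can be sketched as a small Python wrapper around kubectl. This is only an illustration under assumptions: the helper names nccl_test_commands and run_nccl_tests are hypothetical, the pods are assumed to run in the current kubectl namespace with nothing else in it, and the 600s timeout is an arbitrary choice.

```python
import subprocess


# Assumptions: manifests live under nccl-tests/ and the pods are created in
# the current kubectl namespace; adjust the path and timeout to your setup.
def nccl_test_commands(directory="nccl-tests", timeout="600s"):
    """Return the kubectl commands for the two-step flow, in order."""
    return [
        ["kubectl", "apply", "-f", f"{directory}/training.yaml"],
        # Block until every pod in the namespace reports Ready.
        ["kubectl", "wait", "--for=condition=Ready", "pods", "--all",
         f"--timeout={timeout}"],
        ["kubectl", "apply", "-f", f"{directory}/launcher.yaml"],
    ]


def run_nccl_tests(dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    for cmd in nccl_test_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure
```

Calling run_nccl_tests() prints the commands without touching the cluster; run_nccl_tests(dry_run=False) executes them. If the pods have not been created yet when kubectl wait starts, it may fail right away because nothing matches, so a short retry may be needed.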
You can read more in the blog!
In torchrun/, there is only one manifest since there is no launcher.
Note that the Dockerfile here expects some training script.
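As a rough idea of what such a training script starts with, here is a minimal sketch that reads the environment variables torchrun sets for each worker. The function name distributed_env is hypothetical, the fallback values are assumptions for running outside torchrun, and the file name the Dockerfile expects is not stated here, so check the Dockerfile itself.

```python
import os


def distributed_env(environ=os.environ):
    """Collect the per-worker settings that torchrun exports."""
    return {
        "rank": int(environ.get("RANK", 0)),              # global worker index
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # index on this node
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total worker count
        # Fallbacks below are assumptions for a single-process local run.
        "master_addr": environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(environ.get("MASTER_PORT", 29500)),
    }


if __name__ == "__main__":
    env = distributed_env()
    print(f"worker {env['rank']} of {env['world_size']}")
```

A real script would go on to initialize the process group and run the training loop; this sketch stops at reading the settings so it stays free of a PyTorch dependency.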