DeepShare presents (1) a method to mitigate network contention by training a contention-aware scheduling policy with RL and (2) a framework to deploy that policy for efficient management of distributed DL jobs on GPU clusters.
TLDR: Distributed DL training on shared GPU clusters is prone to network contention between training jobs. This is because existing schedulers mainly focus on allocating dedicated computation resources (e.g., GPUs) but are often agnostic to shared network resources (e.g., PCIe, NVLink, and InfiniBand). This can be addressed with a contention-aware scheduler that dynamically schedules and migrates jobs according to cluster-wide network contention. DeepShare provides an end-to-end system that covers both training such scheduling policies with RL and deploying them on GPU clusters. Scheduling policies trained with DeepShare (RL-base and RL-hybrid in the figure above) reduce training latency by up to 20.7% compared to state-of-the-art schedulers.
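To make the idea concrete, below is a minimal illustrative sketch of contention-aware placement and migration. All names (`Node`, `place_job`, `migration_candidates`) are hypothetical and greatly simplified; DeepShare's actual policy is learned with RL rather than hand-coded, and this heuristic only shows what "scheduling according to cluster-wide network contention" means.

```python
# Hypothetical sketch -- NOT DeepShare's actual API or policy.
# A hand-coded stand-in for what the learned RL policy decides:
# where to place a job, and which jobs to consider migrating.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_gpus: int
    link_util: float          # 0.0-1.0 utilization of shared links (PCIe/NVLink/InfiniBand)
    jobs: list = field(default_factory=list)

def place_job(nodes, job_name, gpus_needed):
    """Place a job on the feasible node with the least shared-link contention."""
    feasible = [n for n in nodes if n.free_gpus >= gpus_needed]
    if not feasible:
        return None               # no node can host the job right now
    target = min(feasible, key=lambda n: n.link_util)
    target.free_gpus -= gpus_needed
    target.jobs.append(job_name)
    return target.name

def migration_candidates(nodes, threshold=0.8):
    """Jobs on nodes whose shared-link utilization exceeds a threshold
    are candidates for migration to less-contended nodes."""
    return [(n.name, j) for n in nodes if n.link_util > threshold for j in n.jobs]
```

For example, with a congested node `n0` (90% link utilization) and an idle node `n1` (30%), `place_job` sends a new 2-GPU job to `n1`; an RL policy would instead learn such decisions from cluster-wide contention signals.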
- Refer to Installation for complete instructions on environment setup and installation.
- Refer to Quickstart for training scheduling policies with RL and deploying them on a GPU cluster.
- Refer to Examples for writing custom job scripts.
@inproceedings{ryu2023network,
title={Network contention-aware cluster scheduling with reinforcement learning},
author={Ryu, Junyeol and Eo, Jeongyoon},
booktitle={2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS)},
pages={2742--2745},
year={2023},
organization={IEEE}
}
Please note that the citation will be updated once the paper is officially published in the IEEE ICPADS 2023 proceedings.
Junyeol Ryu (junyeol@aces.snu.ac.kr)