Training System

Systems for deep learning training.

Training (Parallelism)

  • Class materials for a distributed systems lecture series [GitHub]
  • bytedance/byteps: A high-performance and general parameter server (PS) framework for distributed training [GitHub]
  • PipeDream: Generalized Pipeline Parallelism for DNN Training (SOSP 2019) [Paper] [GitHub]
  • Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks. [Paper] [GitHub]
    • Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken. (ICML 2018)
  • Mesh-TensorFlow: Deep Learning for Supercomputers [Paper] [GitHub]
    • Shazeer, Noam, Youlong Cheng, Niki Parmar, Dustin Tran, et al. (NIPS 2018)
    • Summary: Generalizes data parallelism by splitting arbitrary tensor dimensions across a mesh of processors; used to train large language models.
  • PyTorch-BigGraph: A Large-scale Graph Embedding System [Paper] [GitHub]
    • Lerer, Adam, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. (SysML 2019)
  • Beyond data and model parallelism for deep neural networks [Paper] [GitHub]
    • Jia, Zhihao, Matei Zaharia, and Alex Aiken. (SysML 2019)
    • Summary: SOAP (sample, operation, attribute, and parameter) parallelism. Builds an operator graph, a device topology, and an execution optimizer; searches the parallelization space with an MCMC algorithm guided by an execution simulator.
  • Device placement optimization with reinforcement learning [Paper]
    • Mirhoseini, Azalia, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. (ICML 2017)
    • Summary: Uses REINFORCE to learn a device placement policy; groups operations before placement to shrink the search space. Requires many GPUs.
  • Spotlight: Optimizing device placement for training deep neural networks [Paper]
    • Gao, Yuanxiang, Li Chen, and Baochun Li (ICML 2018)
  • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [Paper] [GitHub] [News]
    • Huang, Yanping, et al. (arXiv preprint arXiv:1811.06965, 2018)
  • Horovod: Distributed training framework for TensorFlow, Keras, and PyTorch. [GitHub] (see the minimal data-parallel sketch after this list)
  • Distributed machine learning infrastructure for large-scale robotics research [GitHub] [Blog]
  • A Generic Communication Scheduler for Distributed DNN Training Acceleration [Paper] [BytePS]
    • Peng, Yanghua, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. (SOSP 2019)
    • Summary: A generic communication scheduler (ByteScheduler) that partitions and reorders tensor transfers to better overlap communication with computation.
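
To make the data-parallel pattern behind frameworks such as Horovod concrete, here is a minimal PyTorch sketch built on Horovod's public API, assuming Horovod is installed with PyTorch and GPU support. The model, learning rate, and random tensors are placeholders rather than anything from the papers above; the Horovod repository has the authoritative examples.

```python
# Minimal data-parallel training sketch with Horovod's PyTorch API.
# Launch with one process per GPU, e.g.: horovodrun -np 4 python train.py
import torch
import horovod.torch as hvd

hvd.init()                                   # one Horovod process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin this process to its local GPU

model = torch.nn.Linear(1024, 10).cuda()     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start all workers from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

criterion = torch.nn.CrossEntropyLoss()
for step in range(100):
    # Each worker would normally read its own shard of the dataset;
    # random tensors stand in for real data here.
    x = torch.randn(32, 1024).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()                         # allreduce runs inside the wrapped optimizer
```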

Training (Multi-job scheduling on clusters)

  • Gandiva: Introspective cluster scheduling for deep learning. [Paper]
    • Xiao, Wencong, et al. (OSDI 2018)
    • Summary: Improves the efficiency of hyper-parameter exploration on GPU clusters; the scheduler is aware of hardware utilization.
  • Optimus: an efficient dynamic resource scheduler for deep learning clusters [Paper]
    • Peng, Yanghua, et al. (EuroSys 2018)
    • Summary: Job scheduling for deep learning clusters, using total job completion time as the optimization metric (see the toy allocation sketch after this list).
  • Multi-tenant GPU clusters for deep learning workloads: Analysis and implications. [Paper] [dataset]
    • Jeon, Myeongjae, Shivaram Venkataraman, Junjie Qian, Amar Phanishayee, Wencong Xiao, and Fan Yang
  • Slurm: A Highly Scalable Workload Manager [GitHub]
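
The Optimus entry above frames scheduling around job completion time. The toy Python sketch below illustrates that idea with a greedy GPU allocation driven by each job's estimated remaining time; it is not Optimus's actual algorithm (whose performance model and allocation policy are more elaborate), and every job name and number here is invented.

```python
# Toy sketch: greedy GPU allocation that favors the largest reduction in
# estimated completion time. Illustrative only; not the Optimus algorithm.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    remaining_epochs: float      # estimated epochs left until convergence
    epochs_per_gpu_hour: float   # measured single-GPU throughput for this job

    def remaining_hours(self, gpus: int) -> float:
        # Simplistic linear-scaling model; real schedulers fit a speedup curve.
        return self.remaining_epochs / (self.epochs_per_gpu_hour * max(gpus, 1))


def allocate(jobs, total_gpus):
    """Give every job one GPU, then hand each spare GPU to the job whose
    estimated completion time shrinks the most (greedy marginal gain)."""
    alloc = {job.name: 1 for job in jobs}
    for _ in range(max(total_gpus - len(jobs), 0)):
        def gain(job):
            g = alloc[job.name]
            return job.remaining_hours(g) - job.remaining_hours(g + 1)
        best = max(jobs, key=gain)
        alloc[best.name] += 1
    return alloc


if __name__ == "__main__":
    jobs = [Job("resnet", 30.0, 2.0), Job("bert", 100.0, 0.5), Job("gan", 10.0, 1.0)]
    print(allocate(jobs, total_gpus=8))      # -> {'resnet': 2, 'bert': 5, 'gan': 1}
```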