This repository contains the code base for Rui Pan's final project report, "Cautiously Aggressive GPU Space Sharing to Improve Resource Utilization and Job Efficiency".
Prerequisites for replicating the results include:
- An NVIDIA GPU with Volta architecture
- Python 3.8 nightly build
- CUDA-compatible PyTorch & TorchVision
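
The prerequisites above can be checked with a short script before running anything. This is a minimal sketch and not part of the repo: the helper names `is_volta` and `check_environment` are hypothetical, and it assumes that "Volta" means CUDA compute capability 7.0 or 7.2 and that the GPU of interest is device 0.

```python
import sys

# Assumption: Volta GPUs report compute capability 7.0 (e.g. V100, Titan V)
# or 7.2 (Xavier). Turing (7.5) and newer parts are not Volta.
VOLTA_CAPABILITIES = {(7, 0), (7, 2)}


def is_volta(capability):
    """Return True if a (major, minor) compute capability is Volta."""
    return tuple(capability) in VOLTA_CAPABILITIES


def check_environment():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] != (3, 8):
        found = "{}.{}".format(*sys.version_info[:2])
        problems.append("expected Python 3.8, found " + found)
    try:
        import torch
        import torchvision  # noqa: F401 (imported only to confirm it exists)
    except ImportError as exc:
        problems.append("missing package: {}".format(exc.name))
        return problems
    if not torch.cuda.is_available():
        problems.append("PyTorch cannot see a CUDA device")
    elif not is_volta(torch.cuda.get_device_capability(0)):
        problems.append("GPU 0 is not a Volta-architecture device")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

An empty result only means the basic requirements look satisfied; it does not verify that nvprof or MPS are usable on the machine.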
This repo contains:
- /data: Source data for running the workloads. It should be set up as follows:
  - /imagenet: ImageNet Dataset for resnet50 workloads
  - /ml-20m: MovieLens 20M Dataset for recommendation/Recoder workloads
  - wikitext2: WikiText-2 Dataset for language modeling workloads
- /latex: LaTeX files for editing the report on Overleaf
- /output: Core-specific utilizations of workloads produced using an earlier version of the profiler
- /tables: Shell scripts for replicating the profiling results in various tables
- /workloads: Common DL/HPC workloads used in the evaluations. A lot of these are copied from Gavel.
- plotting.ipynb: Jupyter Notebook that produces all figures in the report
- profiler.py: Profiler parser wrapped around nvprof
- pymps.py: Provides Python access to NVIDIA CUDA Multi-Process Service (MPS)
- README.md: Well, of course I know him. He's me.
- report.pdf: PDF version of the final report
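
Since the workloads expect the data directory to follow the layout listed above, it can help to confirm the layout before launching a run. The following is a minimal sketch, not code from this repo: `missing_data_dirs` is a hypothetical helper, and it assumes the three dataset folders sit directly under the data directory with exactly the names `imagenet`, `ml-20m`, and `wikitext2`.

```python
from pathlib import Path

# Assumption: these subdirectory names mirror the layout described above.
EXPECTED_DATA_DIRS = ("imagenet", "ml-20m", "wikitext2")


def missing_data_dirs(data_root):
    """Return the expected dataset subdirectories absent from data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_DATA_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs("data")
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("All expected dataset directories are present.")
```

The check only looks for the directories themselves; it does not validate the contents of each dataset.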