Our PyTorch code uses custom CUDA extensions to speed up some of the network layers. Getting these to run can sometimes be a hassle.
This page aims to give guidance on how to diagnose and fix run-time problems related to these extensions.
- Try Docker first! Ensure you can successfully run our models using the recommended Docker image. Follow the instructions in README.md to get it running.
- Can't use Docker? Read on.
Make sure you've installed everything listed in the requirements section of README.md. The key components for the custom extensions are:
- CUDA toolkit 11.1 or later (note that this is not the same as `cudatoolkit` from Conda).
  - PyTorch invokes `nvcc` to compile our CUDA kernels.
- `ninja`
  - PyTorch uses Ninja as its build system.
- GCC (Linux) or Visual Studio (Windows)
  - GCC 7.x or later is required. Earlier versions such as GCC 6.3 are known not to work.
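As a quick sanity check, you can query the presence and versions of these tools from Python. The helper below is a sketch using only the standard library; it assumes the tools are invoked as `nvcc`, `ninja`, and `gcc` on your `PATH` (adjust the names for your setup, e.g. `cl.exe` on Windows):

```python
import shutil
import subprocess


def tool_version(cmd, flag="--version"):
    """Return the first line of `cmd`'s version output, or None if unavailable."""
    if shutil.which(cmd) is None:
        return None
    try:
        out = subprocess.run([cmd, flag], capture_output=True, text=True, timeout=30)
    except OSError:
        return None
    text = out.stdout or out.stderr
    return text.splitlines()[0] if text else None


def check_toolchain():
    """Report the build tools PyTorch needs to compile custom extensions."""
    return {name: tool_version(name) for name in ("nvcc", "ninja", "gcc")}


if __name__ == "__main__":
    for name, version in check_toolchain().items():
        print(f"{name}: {version or 'NOT FOUND'}")
```

If any of these prints `NOT FOUND`, fix the installation before digging into PyTorch-side errors.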
The PyTorch package contains the CUDA toolkit libraries needed to run PyTorch, so why is a separate CUDA toolkit installation required? Our models use custom CUDA kernels to implement operations such as efficient resampling of 2D images. PyTorch invokes the CUDA compiler at run-time to compile these kernels on first use. The tools and libraries required for this compilation are not bundled with PyTorch, so a host CUDA toolkit installation is required.
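To see this run-time compilation path in isolation, you can JIT-build a trivial extension with `torch.utils.cpp_extension.load_inline`. The toy `twice` function below is a hypothetical example, not one of the repository's actual kernels; it exercises the same Ninja-based build pipeline (the repository's real kernels additionally require `nvcc`):

```python
def try_build_demo_extension():
    """Attempt a JIT extension build; return a status string for the outcome."""
    try:
        import torch
        from torch.utils.cpp_extension import load_inline
    except ImportError:
        return "torch-not-installed"
    # A toy C++ op; load_inline generates the Python bindings automatically.
    cpp_src = "torch::Tensor twice(torch::Tensor x) { return x * 2; }"
    try:
        ext = load_inline(
            name="demo_ext",
            cpp_sources=cpp_src,
            functions=["twice"],
            verbose=True,  # prints the compiler/ninja commands being run
        )
    except Exception as exc:  # build failures (missing compiler etc.) land here
        return f"build-failed: {exc}"
    ok = bool((ext.twice(torch.ones(2)) == 2).all())
    return "ok" if ok else "wrong-result"


if __name__ == "__main__":
    print(try_build_demo_extension())
```

With `verbose=True` you can watch exactly which compiler commands PyTorch issues, which is often the fastest way to spot a broken toolchain.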
If the custom extensions fail to build, try the following:

- Completely remove `$HOME/.cache/torch_extensions` (Linux) or `C:\Users\<username>\AppData\Local\torch_extensions\torch_extensions\Cache` (Windows) and re-run the CREPS Python code.
- Run `ninja` in `$HOME/.cache/torch_extensions` to see that it builds.
- Inspect the `build.ninja` files in the build directories under `$HOME/.cache/torch_extensions` and check that the CUDA tools and versions are consistent with what you intended to use.
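The steps above can be partly automated. The sketch below guesses the cache location (mirroring the default paths mentioned above, and honoring PyTorch's `TORCH_EXTENSIONS_DIR` override) and lists every `build.ninja` it finds, so you know which files to inspect:

```python
import os
import sys
from pathlib import Path


def extension_cache_dir():
    """Best-effort guess at the torch_extensions cache directory."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    if sys.platform == "win32":
        base = Path(os.environ.get("LOCALAPPDATA", Path.home() / "AppData" / "Local"))
        return base / "torch_extensions" / "torch_extensions" / "Cache"
    return Path.home() / ".cache" / "torch_extensions"


def list_build_files(cache=None):
    """Return every build.ninja under the cache, or [] if the cache is absent."""
    cache = cache or extension_cache_dir()
    if not cache.is_dir():
        return []
    return sorted(cache.rglob("build.ninja"))


if __name__ == "__main__":
    print(f"cache dir: {extension_cache_dir()}")
    for build_file in list_build_files():
        print(build_file)
```

Open each listed `build.ninja` and check the `nvcc` and compiler paths it records; a stale file pointing at an old toolkit is a common cause of build failures, and deleting the whole cache directory forces a clean rebuild.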