The core library of the Tenzing project. tenzing-core provides facilities for interacting with CUDA + MPI programs as sequential decision problems. This facilitates optimizing CUDA + MPI programs using sequential decision strategies.
Two solvers are available:
- tenzing-mcts: Uses Monte-Carlo tree search
- tenzing-dfs: Uses depth-first search
On a supported platform:
source load-env.sh
In any case:
mkdir build && cd build
cmake .. -DCMAKE_CUDA_ARCHITECTURES=70
make
Tests are split into two locations:
- unit tests may be defined in source files
- tests with a more "integration" flavor are in test/
To run tests, you can do any of
- make test
- ctest
- tenzing-all, with optional flags
  - -ltc : list test cases
  - -tc="a,b" : only run test cases named a and b
Defining unit tests in source files creates some CMake complexity: test functions that live in static libraries will not be linked into the resulting test binary, since nothing references them directly. Therefore, we use a CMake object library to generate the test binary, and then generate a static library from the object library. Object library properties do not get propagated properly (or at all), so we have to redefine what needs to be linked, included, etc.
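The linking problem is easier to see with a small model of self-registering tests. The sketch below assumes the unit tests register themselves through static objects, as many C++ test frameworks do; `TestCase`, `Registrar`, `registry` and `run_all` are hypothetical names for illustration and are not part of tenzing-core or of its test framework.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal registry; not the framework tenzing-core actually uses.
struct TestCase {
  std::string name;
  std::function<void()> body;
};

// A function-local static avoids static-initialization-order problems.
inline std::vector<TestCase> &registry() {
  static std::vector<TestCase> tests;
  return tests;
}

// Constructing a Registrar at namespace scope adds a test before main() runs.
struct Registrar {
  Registrar(std::string name, std::function<void()> body) {
    registry().push_back({std::move(name), std::move(body)});
  }
};

// Nothing else in the program refers to these two objects by name. If this
// translation unit were an archive member of a static library, the linker
// would see no undefined symbol that requires it and would leave it out,
// so both tests would silently disappear from the test binary.
static Registrar reg_add("add", [] { assert(1 + 1 == 2); });
static Registrar reg_concat("concat",
                            [] { assert(std::string("a") + "b" == "ab"); });

// Runs every registered test and returns how many were run.
inline int run_all() {
  int count = 0;
  for (const TestCase &t : registry()) {
    t.body();
    ++count;
  }
  return count;
}
```

Calling `run_all()` from `main` runs both tests when this file is compiled directly into the executable. Linking the object files directly, which is what an object library does, keeps the registrations because object files given to the linker are always included whole.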
tenzing-core has been tested on the following platforms:
- NERSC perlmutter: g++ 10.3 / nvcc 11.4 / Cray MPICH 8.1.13
- Sandia vortex (similar to ORNL Lassen and OLCF Summit): g++ 7.5.0 / nvcc 10.1 / IBM Spectrum MPI
- Sandia ascicgpu
- Visit the API documentation in docs/api.md
- ascicgpu system documentation in docs/ascicgpu.md
- vortex system documentation in docs/vortex.md
- perlmutter system documentation in docs/perlmutter.md
- python bindings (with pybind11)
See CONTRIBUTING.md for contribution guidelines.
- enable / disable CUDA / MPI
- isolate Ser/Des
- isolate platform assignments
- a BoundOp cannot produce the std::shared_ptr<OpBase> of its unbound self, only OpBase
  - can't ask a std::shared_ptr<BoundOp> for std::shared_ptr<OpBase>
    - maybe std::enable_shared_from_this?
- special status of Start and End is a bit clumsy.
  - maybe there should be a StartEnd : BoundOp that they both are instead of separate classes
    - in the algs they're probably treated the same (always synced, etc)
- Platform is a clumsy abstraction, since it also tracks resources that are only valid for a single order
  - e.g., each order requires a certain number of events, which can be reused for the next order
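As a rough illustration of the shared-pointer idea in the BoundOp item above, the sketch below shows how `std::enable_shared_from_this` lets an object hand out a shared handle to itself. The `OpBase` and `BoundOp` here are hypothetical stand-ins that only borrow the names; the real tenzing-core classes are not shown in this document and will differ, and `as_op_base`, `as_bound_op` and `shares_ownership` are invented for this example.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the tenzing-core class of the same name.
struct OpBase : std::enable_shared_from_this<OpBase> {
  virtual ~OpBase() = default;
  virtual std::string name() const { return "OpBase"; }

  // Shared handle to this object, typed as the base class.
  // Requires that the object is already owned by a std::shared_ptr.
  std::shared_ptr<OpBase> as_op_base() { return shared_from_this(); }
};

// Hypothetical stand-in for the tenzing-core class of the same name.
struct BoundOp : OpBase {
  std::string name() const override { return "BoundOp"; }

  // Shared handle to this object, typed as the derived class.
  std::shared_ptr<BoundOp> as_bound_op() {
    return std::static_pointer_cast<BoundOp>(shared_from_this());
  }
};

// Checks that the handle obtained from inside the object shares ownership
// with the original pointer instead of creating a second control block.
inline bool shares_ownership() {
  std::shared_ptr<BoundOp> bound = std::make_shared<BoundOp>();
  std::shared_ptr<OpBase> base = bound->as_op_base();
  return base.get() == bound.get() && bound.use_count() == 2;
}
```

One caveat of this approach: `shared_from_this()` only works on objects that are already managed by a `std::shared_ptr`. Since C++17, calling it on a stack-allocated or otherwise unmanaged object throws `std::bad_weak_ptr`.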
Please see NOTICE.md for copyright and license information.