Watatsumi
Pre-release
Dragon 0.8 Release Summary
This package introduces new features that enhance portability, further optimize performance at scale, and increase usability with packages that rely on Python multiprocessing derivatives. Highlighted new features are:
- Ability for the high-speed transport agent to use multiple NICs (Cray EX packages only)
- Use of libfabric for high-speed transport RDMA operations (Cray EX packages only)
- Improved launcher startup time for allocations of more than ~100 nodes
- Enhanced testing pipeline for Python 3.10 and 3.11
- Added documentation for Overlay Network and a cookbook entry for using the PyTorch native Dataloader over a Distributed Dictionary
- Fixed PMI patching for PBS/PALS, the Overlay Network port conflict and exit signaling, and detach/destroy of memory pools
- Fixed the numpy scaling test so that it scales efficiently to 64 or more nodes
Two sets of packages are listed below. Packages with "CRAYEX" in the name include the RDMA-based transport feature and are for Cray EX systems only. The other packages use the TCP-based transport and work on generic clusters as well as single nodes such as laptops. Note that the TCP-based transport packages may not scale beyond 16 nodes for some use cases.
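For readers new to the multiprocessing integration mentioned in the summary, here is a minimal sketch of how a standard `multiprocessing` program can opt in to Dragon. It assumes the documented pattern of importing `dragon` and selecting the `"dragon"` start method. The helper names `pick_start_method` and `square` are hypothetical and exist only for this illustration, and the fallback to the platform default is an addition so the script still runs where Dragon is not installed.

```python
import multiprocessing as mp


def pick_start_method():
    """Return "dragon" if the dragon package can be imported, else None.

    Hypothetical helper for this sketch; None means "keep the
    platform's default multiprocessing start method".
    """
    try:
        import dragon  # noqa: F401  (importing makes the "dragon" start method available)
    except ImportError:
        return None
    return "dragon"


def square(x):
    """Toy workload executed in the worker processes."""
    return x * x


if __name__ == "__main__":
    method = pick_start_method()
    if method is not None:
        # Assumption: the script was started with the dragon launcher.
        mp.set_start_method(method)

    # Unchanged multiprocessing code from here on.
    with mp.Pool(processes=2) as pool:
        print(pool.map(square, range(8)))
```

With either package set installed, a script like this would typically be started through the `dragon` launcher rather than `python` directly. Everything after the start method is selected is ordinary `multiprocessing` code.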