Releases: DragonHPC/dragon
Nidhogg
Dragon 0.10 Release Summary
This release adds support for InfiniBand networks, adds checkpointing to the distributed dictionary, and provides initial support for telemetry monitoring via a Grafana frontend. Other features include:
- Complete overhaul of ProcessGroup to improve reliability and user debugging
- Performance improvements in HSTA
- Better exception handling and logging
- Better support for specifying placement of processes
- Addition of a dragon-activate script to make it easier to set up the runtime environment
- Improved stability for the Distributed Dictionary and runtime overall
Two sets of packages are below. The ones with "HSN" in the name include the RDMA-based transport feature for both Slingshot and InfiniBand networks. The other packages use the TCP-based transport and will work on generic clusters and single node/laptops/etc. Note that the TCP-based transport package may not scale for some use cases above 16 nodes.
Raiju
Dragon 0.9 Release Summary
This release improves scalability and performance for launching 10k or more processes and greatly improves distributed dictionary performance. Other highlighted features are:
- Improvements to ProcessGroup to provide better user experience and performance
- Improve launch time for large numbers of processes by enabling batch launch
- New implementation for distributed dictionary that improves performance and scalability
- Support for placement of processes via Policy API
- Bug fix for launching a Pool of pools
Two sets of packages are below. The ones with "CRAYEX" in the name include the RDMA-based transport feature and are for Cray EX systems only. The other packages use the TCP-based transport and will work on generic clusters and single node/laptops/etc. Note that the TCP-based transport package may not scale for some use cases above 16 nodes.
Watatsumi
Dragon 0.8 Release Summary
This package introduces new features that enhance portability, further optimize performance at scale, and increase usability with packages that rely on Python multiprocessing derivatives. Highlighted new features are:
- Ability for high speed transport agent to use multiple NICs (Cray EX packages only)
- Use of libfabric for high speed transport RDMA operations (Cray EX packages only)
- Improved launcher startup time for allocations of more than ~100 nodes
- Enhanced testing pipeline for Python 3.10 and 3.11
- Added documentation for Overlay Network and a cookbook entry for using the PyTorch native Dataloader over a Distributed Dictionary
- Fixed PMI patching for PBS/PALS, Overlay Network port conflict and exit signaling, and detach/destroy of memory pools
- Fixed numpy scaling test to be able to efficiently scale to 64+ nodes
Two sets of packages are below. The ones with "CRAYEX" in the name include the RDMA-based transport feature and are for Cray EX systems only. The other packages use the TCP-based transport and will work on generic clusters and single node/laptops/etc. Note that the TCP-based transport package may not scale for some use cases above 16 nodes.
Zennyo
Dragon 0.61 Release Summary
This package is the first to extend Dragon beyond support for Python multiprocessing. The key new feature is support for running collections of executables, including executables that require support for PMI (e.g., MPI). PMI support is currently limited to executables using Cray PMI, such as those linked with Cray MPICH. The process group feature is also utilized for scalable multiprocessing Pool, which can now scale to thousands of workers. Highlighted new features are:
- ProcessGroup API for scalable management of processes
- Initial support for managing processes requiring PMI (e.g., MPI)
- Rewrite of mp.Pool utilizing ProcessGroup
- mp.Array
- HPC workflow cookbook entry orchestrating MPI and Python processes
- Processing pipeline cookbook example
- LICENSE is now MIT
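As a rough illustration of the multiprocessing Pool support mentioned above, the following minimal sketch runs a small Pool under the "dragon" start method when the dragon package is importable and falls back to the standard "spawn" method otherwise. The helper names square and choose_start_method, the worker count, and the fallback behavior are assumptions made for this example and are not part of Dragon; consult the Dragon documentation for the authoritative usage.

```python
import multiprocessing as mp

try:
    # Assumption: importing dragon makes the "dragon" start method available.
    import dragon  # noqa: F401
    HAVE_DRAGON = True
except ImportError:
    HAVE_DRAGON = False


def square(x):
    """Toy work function executed by each Pool worker."""
    return x * x


def choose_start_method(have_dragon):
    """Hypothetical helper: prefer Dragon, else fall back to plain spawn."""
    return "dragon" if have_dragon else "spawn"


if __name__ == "__main__":
    mp.set_start_method(choose_start_method(HAVE_DRAGON))
    # Four workers is an arbitrary choice for this sketch.
    with mp.Pool(4) as pool:
        print(pool.map(square, range(8)))
```

With Dragon installed, a script like this would typically be started through the dragon launcher (for example, dragon pool_example.py, where the file name is hypothetical) rather than directly with python3.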
To extract the package, run:
tar -zxvf dragon-0.61-100f20e0.tar.gz
Then you can change to the dragon-0.61 directory and follow directions given in the INSTALL.md file to install it on your system.
NOTE: Installing this package requires a Linux environment with a compatible GLIBC library. Additional requirements are listed in our documentation at dragonhpc.org under the heading "Getting Started".