A collection of simple load balancers for MPI, designed in the manner of OpenMP loop scheduling.
All load balancers are available via the `SLB4MPI.hpp` header, which provides `<Load-Balancer-Type>LoadBalancer` types. Each `<Load-Balancer-Type>LoadBalancer` implements the abstract interface defined in `abstract_load_balancer.hpp`. See the possible values of `Load-Balancer-Type` in the next section.
Initialization can be done in the following manner, for example (here, `StaticLoadBalancer` is used):

```cpp
#include <SLB4MPI.hpp>
...
StaticLoadBalancer slb(MPI_COMM_WORLD, 1, 100, 2, 4);
```
This initializes the load balancer `slb` on the given communicator (here, `MPI_COMM_WORLD`), with the range from the lower bound (`1`) to the upper bound (`100`) and with the optional min (`2`) and max (`4`) chunk sizes.
A simpler initialization is also possible, with default min and max chunk sizes:

```cpp
StaticLoadBalancer slb(MPI_COMM_WORLD, 1, 100);
```

Constructing a load balancer is a barrier for the whole communicator.
Then, one can ask for a range to compute with the `get_range` method, which returns the range's start and end via its arguments and returns a boolean value that signals whether there is anything left to compute. So, the usage pattern of this routine is:
```cpp
if (slb.get_range(range_start, range_end)) { // returns [range_start, range_end]
  ...
}
```
A single call may return only a part of the whole range, so the proper usage is to call it inside a `while` loop. After finishing the loop, the load balancer `slb` must be destroyed. Internally, it uses RAII, so in most cases you do not need to think about destroying it.
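For illustration, a minimal sketch of the complete pattern is shown below. It assumes the `StaticLoadBalancer` constructor signature from the example above and that `get_range` takes its bounds by reference as 64-bit integers (mirroring the Fortran API); the loop body is a placeholder for real work.

```cpp
#include <SLB4MPI.hpp>
#include <mpi.h>
#include <cstdint>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  {
    // Distribute iterations 1..100 over all ranks of MPI_COMM_WORLD,
    // with min/max chunk sizes of 2 and 4.
    StaticLoadBalancer slb(MPI_COMM_WORLD, 1, 100, 2, 4);

    int64_t range_start = 0, range_end = 0, local_sum = 0;
    // A single get_range call may hand out only a part of the range,
    // so keep asking until nothing is left to compute.
    while (slb.get_range(range_start, range_end)) {
      for (int64_t i = range_start; i <= range_end; ++i) {
        local_sum += i; // placeholder for real work on iteration i
      }
    }
    std::printf("local sum: %lld\n", static_cast<long long>(local_sum));
  } // slb is destroyed here (RAII)
  MPI_Finalize();
  return 0;
}
```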
For the `Runtime` load balancer, the desired load balancer must be specified via the first argument during initialization:

```cpp
std::string rlbtype = std::string("dynamic");
RuntimeLoadBalancer rlb(rlbtype, MPI_COMM_WORLD, 1, 100, 2, 4);
```
See possible values in the next section.
All load balancers are available in the `SLB4MPI` module, which provides `<load-balancer-type>_load_balancer_t` types. Each `<load-balancer-type>_load_balancer_t` implements the abstract interface defined in `abstract_load_balancer.f90`. See the possible values of `load-balancer-type` in the next section.
A load balancer can be declared in the following manner, for example:

```fortran
use SLB4MPI
...
type(static_load_balancer_t) :: lb
```
Here, the static load balancer is used. Before using the load balancer `lb`, one must initialize it on the communicator `comm`, with the range from a lower bound to an upper bound and with optional min and max chunk sizes.
For a range from 1 to 1000, with a min chunk size of 6 and a max chunk size of 12, the initialization looks as follows:

```fortran
call lb%initialize(comm, 1_8, 1000_8, 6_8, 12_8)
```
A simpler call is also possible, where the min and max chunk sizes are determined automatically:

```fortran
call lb%initialize(comm, 1_8, 1000_8)
```
This call is a barrier for the whole communicator.
Then, one can ask for a range to compute with the `get_range` call, which returns the range's start and end via its arguments and returns a logical value that signals whether there is anything left to compute. So, the usage pattern of this routine is:
```fortran
if (lb%get_range(range_start, range_end)) then
  ...
endif
```
A single call may return only a part of the whole range, so the proper usage is to call it inside a `do` loop. After finishing the loop, the load balancer `lb` must be destroyed before its next usage via the `clean` call:

```fortran
call lb%clean()
```
For the `runtime` load balancer, the default load balancer can be changed before its initialization with the `SLB4MPI_set_schedule` call in the following way:

```fortran
call SLB4MPI_set_schedule("guided")
```
The following load balancer types are available:

- `static`
- `local_static`
- `dynamic`
- `guided`
- `work_stealing`
- `runtime`
Typenames of the different balancers in different languages:
| Load balancer | Typename in Fortran | Typename in C++ |
|---|---|---|
| static | `static_load_balancer_t` | `StaticLoadBalancer` |
| local_static | `local_static_load_balancer_t` | `LocalStaticLoadBalancer` |
| dynamic | `dynamic_load_balancer_t` | `DynamicLoadBalancer` |
| guided | `guided_load_balancer_t` | `GuidedLoadBalancer` |
| work_stealing | `work_stealing_load_balancer_t` | `WorkStealingLoadBalancer` |
| runtime | `runtime_load_balancer_t` | `RuntimeLoadBalancer` |
`static`: Iterations are divided into chunks of size `max_chunk_size`, and the chunks are assigned to the MPI ranks in the communicator in a round-robin fashion in the order of the MPI rank. Each chunk contains `max_chunk_size` iterations, except for the chunk that contains the sequentially last iteration, which may contain fewer. If `min_chunk_size` is not specified, it defaults to `1`. If `max_chunk_size` is not specified, the iteration range is divided into chunks that are approximately equal in size, but no smaller than `min_chunk_size`.
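As an illustration of this round-robin rule (a standalone sketch of the assignment arithmetic, not the library's internals), the rank owning a given iteration could be computed as:

```cpp
#include <cstdint>
#include <cstdio>

// Which rank owns iteration i under the round-robin (static) scheme?
// Chunks of max_chunk_size iterations are dealt out in rank order.
int owner_of(int64_t i, int64_t lower, int64_t max_chunk_size, int nranks) {
  int64_t chunk_index = (i - lower) / max_chunk_size;
  return static_cast<int>(chunk_index % nranks);
}

int main() {
  // Range 1..10, max_chunk_size = 4, 2 ranks:
  // [1,4] -> rank 0, [5,8] -> rank 1, [9,10] -> rank 0 (shorter last chunk).
  for (int64_t i = 1; i <= 10; ++i)
    std::printf("iteration %lld -> rank %d\n",
                static_cast<long long>(i), owner_of(i, 1, 4, 2));
  return 0;
}
```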
`local_static`: Iterations are divided into chunks of size `max_chunk_size`, and the chunks are assigned to the MPI ranks in the communicator in a contiguous fashion in the order of the MPI rank. Each chunk contains `max_chunk_size` iterations, except for the chunk that contains the sequentially last iteration, which may contain fewer. If `min_chunk_size` is not specified, it defaults to `1`. If `max_chunk_size` is not specified, the iteration range is divided into chunks that are approximately equal in size, but no smaller than `min_chunk_size`.
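A standalone sketch of this contiguous partitioning is shown below; how leftover chunks are split between ranks is an assumption made purely for the illustration (here, the leading ranks receive the extra chunks).

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Contiguous sub-range for one rank: chunks of max_chunk_size iterations are
// laid out in order, and each rank takes a consecutive block of chunks.
// If start > upper, the rank has no iterations.
void local_block(int64_t lower, int64_t upper, int64_t max_chunk_size,
                 int rank, int nranks, int64_t& start, int64_t& end) {
  int64_t nchunks  = (upper - lower + max_chunk_size) / max_chunk_size; // ceil
  int64_t per_rank = (nchunks + nranks - 1) / nranks;                   // ceil
  start = lower + rank * per_rank * max_chunk_size;
  end   = std::min(upper, start + per_rank * max_chunk_size - 1);
}

int main() {
  // Range 1..10, max_chunk_size = 4, 2 ranks: rank 0 gets [1,8], rank 1 gets [9,10].
  for (int rank = 0; rank < 2; ++rank) {
    int64_t s, e;
    local_block(1, 10, 4, rank, 2, s, e);
    std::printf("rank %d -> [%lld, %lld]\n", rank,
                static_cast<long long>(s), static_cast<long long>(e));
  }
  return 0;
}
```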
`dynamic`: Each MPI rank executes a chunk, then requests another chunk, until no chunks remain to be assigned. Each chunk contains `min_chunk_size` iterations, except for the chunk that contains the sequentially last iteration, which may contain fewer. If `min_chunk_size` is not specified, it defaults to `1`.
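Conceptually, the dynamic scheme behaves like workers pulling fixed-size chunks from a shared progress counter. The following is a shared-memory analogy using threads, purely for illustration; the library itself coordinates this over MPI (see the notes on `MPI_Accumulate` below).

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int64_t lower = 1, upper = 20, min_chunk_size = 3;
  std::atomic<int64_t> next{lower}; // stands in for the shared progress counter

  auto worker = [&](int id) {
    int64_t start;
    // Grab min_chunk_size iterations at a time until the range is exhausted;
    // the last chunk may be shorter.
    while ((start = next.fetch_add(min_chunk_size)) <= upper) {
      int64_t end = std::min(start + min_chunk_size - 1, upper);
      std::printf("worker %d got [%lld, %lld]\n", id,
                  static_cast<long long>(start), static_cast<long long>(end));
    }
  };

  std::vector<std::thread> team;
  for (int t = 0; t < 4; ++t) team.emplace_back(worker, t);
  for (auto& th : team) th.join();
  return 0;
}
```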
`guided`: Each MPI rank executes a chunk, then requests another chunk, until no chunks remain to be assigned. For a `min_chunk_size` of `1`, the size of each chunk is proportional to the number of unassigned iterations divided by the number of MPI ranks in the communicator, decreasing to `1`, but each chunk contains no more than `max_chunk_size` iterations. For a `min_chunk_size` with value `k > 1`, the size of each chunk is determined in the same way, with the restriction that the chunks contain no fewer than `k` and no more than `max_chunk_size` iterations (except for the chunk that contains the sequentially last iteration, which may contain fewer than `k` iterations). If `min_chunk_size` is not specified, it defaults to `1`. If `max_chunk_size` is not specified, it is determined as the average number of iterations per MPI rank, but no smaller than `min_chunk_size`.
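The chunk-size rule above can be sketched as a standalone formula (an illustration of the description, not the library's exact arithmetic; the rounding choice is an assumption):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Guided chunk size: roughly the remaining iterations per rank, clamped to
// [min_chunk_size, max_chunk_size]; the final chunk may still be shorter
// if fewer than min_chunk_size iterations remain.
int64_t guided_chunk(int64_t remaining, int nranks,
                     int64_t min_chunk_size, int64_t max_chunk_size) {
  int64_t proportional = (remaining + nranks - 1) / nranks; // ceil(remaining / nranks)
  int64_t size = std::max(min_chunk_size, std::min(proportional, max_chunk_size));
  return std::min(size, remaining); // never hand out more than is left
}

int main() {
  // Range of 1000 iterations, 4 ranks, min chunk 6, max chunk 120:
  // chunk sizes start at max_chunk_size and shrink as the range drains.
  int64_t remaining = 1000;
  while (remaining > 0) {
    int64_t c = guided_chunk(remaining, 4, 6, 120);
    std::printf("%lld ", static_cast<long long>(c));
    remaining -= c;
  }
  std::printf("\n");
  return 0;
}
```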
`work_stealing`: Iterations are divided into chunks of size `max_chunk_size`, and the chunks are assigned to the MPI ranks in the communicator in a contiguous fashion in the order of the MPI rank. Each chunk contains `max_chunk_size` iterations, except for the chunk that contains the sequentially last iteration, which may contain fewer. If an MPI rank has finished its own chunks, it starts to steal work from other MPI ranks in a round-robin fashion in the order of the MPI ranks. These stolen chunks contain `min_chunk_size` iterations, except for the chunk that contains the victim rank's sequentially last iteration, which may contain fewer. If `min_chunk_size` is not specified, it defaults to `1`. If `max_chunk_size` is not specified, the iteration range is divided into chunks that are approximately equal in size, but no smaller than `min_chunk_size`.
`runtime`: The actual load balancer used is determined by the `SLB4MPI_set_schedule` procedure. The possible values, passed as a string, are:

- `env` (default): selects the load balancer via the `SLB4MPI_LOAD_BALANCER` environment variable;
- `static`: selects the `static` load balancer;
- `local_static`: selects the `local_static` load balancer;
- `dynamic`: selects the `dynamic` load balancer;
- `guided`: selects the `guided` load balancer;
- `work_stealing`: selects the `work_stealing` load balancer.
The `SLB4MPI_LOAD_BALANCER` environment variable accepts the following values:

- not set: the runtime load balancer will use the `static` load balancer;
- empty string: the runtime load balancer will use the `static` load balancer;
- `static`: the runtime load balancer will use the `static` load balancer;
- `local_static`: the runtime load balancer will use the `local_static` load balancer;
- `dynamic`: the runtime load balancer will use the `dynamic` load balancer;
- `guided`: the runtime load balancer will use the `guided` load balancer;
- `work_stealing`: the runtime load balancer will use the `work_stealing` load balancer;
- other values: the runtime load balancer will use the `static` load balancer.
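The selection rule above, sketched in code form (illustrative only; the library performs the equivalent selection internally, and the helper name is made up for the example):

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Map SLB4MPI_LOAD_BALANCER to a load balancer name, falling back to
// "static" when the variable is not set, empty, or has any other value.
std::string runtime_balancer_from_env() {
  const char* value = std::getenv("SLB4MPI_LOAD_BALANCER");
  if (value == nullptr) return "static"; // not set
  const std::string v(value);
  if (v == "static" || v == "local_static" || v == "dynamic" ||
      v == "guided" || v == "work_stealing") {
    return v; // recognized value
  }
  return "static"; // empty string or other values
}

int main() {
  std::printf("runtime load balancer: %s\n", runtime_balancer_from_env().c_str());
  return 0;
}
```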
The collection is assumed to be compiled in the source tree of a parent project, with all compiler flags passed down from it. Additional flags required for compiling the collection are kept inside. Currently, only the CMake build system is supported.
To use SLB4MPI from your project, add the following lines to your `CMakeLists.txt` to fetch the library:
```cmake
include(FetchContent)
FetchContent_Declare(SLB4MPI
  GIT_REPOSITORY https://github.com/foxtran/SLB4MPI.git
  GIT_TAG 0.0.2
)
FetchContent_MakeAvailable(SLB4MPI)
FetchContent_GetProperties(SLB4MPI SOURCE_DIR SLB4MPI_SOURCE_DIR)
set(SLB4MPI_ENABLE_CXX ON) # or OFF
set(SLB4MPI_ENABLE_Fortran ON) # or OFF
set(SLB4MPI_WITH_MPI ON) # or OFF
add_subdirectory(${SLB4MPI_SOURCE_DIR} ${SLB4MPI_SOURCE_DIR}-binary)
```
After this, you can link the library to your Fortran application or library (`SLB4MPI_ENABLE_Fortran` must be `ON`):

```cmake
target_link_libraries(<TARGET> PUBLIC SLB4MPI::SLB4MPI_Fortran)
```

or to your C++ application or library (`SLB4MPI_ENABLE_CXX` must be `ON`):

```cmake
target_link_libraries(<TARGET> PUBLIC SLB4MPI::SLB4MPI_CXX)
```
Useful flags that change the behaviour of the library:

- `SLB4MPI_ENABLE_CXX` enables/disables the MPI load balancers for the C++ language (requires C++14);
- `SLB4MPI_ENABLE_Fortran` enables/disables the MPI load balancers for the Fortran language (requires Fortran 2003);
- `SLB4MPI_WITH_MPI` enables/disables MPI support for MPI/non-MPI builds.
See the examples for Fortran and C++.
NOTE: The library does not provide ILP64 support.
- In Fortran, the `clean` call is a synchronization point. In C++, it is the end of the scope containing the load balancer(s).
- The load balancers `dynamic`, `guided`, and `work_stealing`, as well as `runtime`, may have significant performance issues. Report on Intel Forum
- `MPI_Accumulate` is used instead of `MPI_Put`. Report on Intel Forum
- `I_MPI_ASYNC_PROGRESS=1` leads to runtime failures. Report on Intel Forum
- NOTE: no ILP64 support.
Not tested yet.