HPC_BENCHMARK


To provide an adjustable set of inputs for performance testing, some
flexible input generation scripts are provided in the examples
directory.  These were written for the Trinity and CORAL procurement
benchmarks, but can be used on any platform to assess relative
performance.

TRINITY BENCHMARK:  gold

To generate inputs for the Trinity procurement benchmark, use the
files in examples/gold_benchmark.  Pass the number of atoms and the
target number of MPI tasks as arguments to the Perl script, e.g.

qbox_gold_makeinputs.pl 1000 16384

CORAL BENCHMARK:  magnesium oxide

To generate inputs for the CORAL procurement benchmark, use the
files in examples/mgo_benchmark.  Pass the number of atoms and the
target number of MPI tasks as arguments to the Perl script, e.g.

qbox_mgo_makeinputs.pl 4000 65536

RUNNING QBOX

These scripts will generate .i and .sys files for the closest number
of atoms that form a uniform rectangular crystal, which should be
copied along with the provided .xml pseudopotential file(s) to the run
directory.  Build the code as described in the INSTALLING file in this
directory and run as described in the RUNNING file.

Timings are printed to standard output at the end of the run.  The
total iteration time defines the overall performance, while the
scfloop timer excludes time spent in setup and post-processing (allowing
one to estimate the average time per iteration from short runs more
accurately):

<timing where="run"       name=" iteration "    min="1372.748 " max="1373.275 "/>
...
<timing where="run"       name=" scfloop   "    min="1311.581 " max="1312.094 "/>

with timings for individual routines being listed as well.  Most of
the simulation time is typically spent in the timing blocks labeled
charge (FFT, Alltoallv), diag (pzheevd), gram (pzherk, pzpotrf,
pztrsm) and psda_residual (pzgemm).
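
As an illustration, the timing lines can be pulled out of a saved
copy of standard output with a short script.  This is a minimal
sketch, not part of Qbox: the function name parse_timings is
hypothetical, and the regular expression assumes every timing line
has the same attribute order as the two lines shown above.

```python
import re

# Assumed line format, taken from the sample output above:
# <timing where="run" name=" iteration " min="1372.748 " max="1373.275 "/>
TIMING_RE = re.compile(
    r'<timing\s+where="(?P<where>[^"]*)"\s+name="(?P<name>[^"]*)"\s+'
    r'min="(?P<min>[^"]*)"\s+max="(?P<max>[^"]*)"\s*/>'
)

def parse_timings(text):
    """Return {name: (min_seconds, max_seconds)} for each timing line.

    If a name occurs more than once, the last occurrence wins.
    """
    timings = {}
    for m in TIMING_RE.finditer(text):
        name = m.group("name").strip()
        # float() tolerates the trailing space inside the quotes
        timings[name] = (float(m.group("min")), float(m.group("max")))
    return timings

sample = '''
<timing where="run"       name=" iteration "    min="1372.748 " max="1373.275 "/>
<timing where="run"       name=" scfloop   "    min="1311.581 " max="1312.094 "/>
'''

timings = parse_timings(sample)
print("max iteration time:", timings["iteration"][1])
print("max scfloop time:  ", timings["scfloop"][1])
```

The max values are the ones used for the figures of merit.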

FIGURES OF MERIT

Note that the run time of Qbox scales as O(N^3), where N is the number
of atoms, so larger systems will take significantly longer to run on
the same number of tasks.  On current machines, 1000-2000 gold atoms or
4000-8000 MgO atoms have been found to be tractable on 8k-128k tasks.
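
To get a rough feel for the cubic scaling, a measured time can be
extrapolated to another system size.  The sketch below assumes pure
O(N^3) behavior on a fixed number of tasks; real runs will deviate
because of lower-order terms and communication costs, and the helper
name scaled_time is hypothetical.

```python
def scaled_time(t_ref, n_ref, n_new):
    """Estimate the run time at n_new atoms from a reference time t_ref
    measured at n_ref atoms, assuming pure O(N^3) scaling."""
    return t_ref * (n_new / n_ref) ** 3

# Doubling the atom count multiplies the estimated time by 2^3 = 8.
print(scaled_time(100.0, 1000, 2000))  # 800.0
```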

The Figure of Merit for the Trinity benchmark is the total iteration
time for a "run 0 1 3" command, computed from the max "iteration" time
shown above.

The Figure of Merit for the CORAL benchmark is the number of scf
iterations that can be run per second for a given magnesium oxide
system.  The input files generated by the benchmark Perl script will
create a .i file with a "run 0 5" command, which will run five scf
iterations.  The FOM can be computed by dividing the max "scfloop"
time by 5 to obtain the time per scf iteration and then taking the
inverse.  For example, if the max scfloop time is 100 seconds, the
time per scf iteration will be 20 seconds and the FOM will be
1/20 = 0.05 iterations per second.
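
Both figures of merit reduce to simple arithmetic on the max timer
values.  The following sketch restates the definitions above; the
function names are hypothetical, and n_scf defaults to 5 only because
the generated .i file uses "run 0 5".

```python
def trinity_fom(max_iteration_time):
    """Trinity FOM: the max "iteration" time in seconds (lower is better)."""
    return max_iteration_time

def coral_fom(max_scfloop_time, n_scf=5):
    """CORAL FOM: scf iterations per second (higher is better)."""
    time_per_iteration = max_scfloop_time / n_scf
    return 1.0 / time_per_iteration

# Worked example from the text: 100 s for 5 iterations -> 20 s each.
print(coral_fom(100.0))  # 0.05
```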

TUNING PARALLEL PERFORMANCE

Qbox uses a rectangular MPI process grid defined in the .i file by the
variable nrowmax, which sets the number of process rows.  The script
will attempt to choose a reasonable value, but better performance may
be achieved by increasing or decreasing it.  The number of process
rows should be greater than the number of process columns, typically
much greater.  
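
When experimenting with nrowmax, it can help to list the grid shapes
that are available for a given task count.  The sketch below is not
taken from Qbox: it assumes the grid uses every task, so that rows
times columns equals the number of tasks, and it keeps only shapes
with more rows than columns, as recommended above.  The function name
candidate_grids is hypothetical.

```python
import math

def candidate_grids(ntasks):
    """List (nrows, ncols) pairs with nrows * ncols == ntasks and
    nrows > ncols, ordered from tallest to most nearly square.

    Assumption: the process grid uses all tasks exactly.
    """
    grids = []
    for ncols in range(1, math.isqrt(ntasks) + 1):
        if ntasks % ncols == 0:
            nrows = ntasks // ncols
            if nrows > ncols:
                grids.append((nrows, ncols))
    return grids

# Candidate nrowmax values for the 16384-task gold example.
for nrows, ncols in candidate_grids(16384):
    print(f"nrowmax = {nrows:6d}  ->  {nrows} x {ncols}")
```

Each candidate can then be timed with a short run to see which value
of nrowmax performs best on the target machine.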

Some ScaLAPACK routines do not scale well beyond roughly 100k MPI
tasks, so threading can be used to run on more cores.  The parallel
FFT and a few other loops are threaded in Qbox with OpenMP, but most
of the threading speedup comes from the use of threaded linear algebra
kernels (BLAS, LAPACK, ESSL).