GitHub - UK-MAC/CloverLeaf_CUDAFortran: CUDA Fortran port of CloverLeaf for 1 GPU

Makefile
PdV.cuf
PdV_kernel.cuf
README
accelerate.cuf
accelerate_kernel.cuf
advec_cell_driver.cuf
advec_cell_kernel.cuf
advec_mom_driver.cuf
advec_mom_kernel.cuf
advection.f90
build_field.f90
calc_dt.cuf
calc_dt_kernel.cuf
clover.f90
clover_bm.in
clover_bm16.in
clover_bm16_short.in
clover_bm_short.in
clover_leaf.f90
data.f90
definitions.cuf
field_summary.f90
field_summary_kernel.cuf
flux_calc.f90
flux_calc_kernel.cuf
generate_chunk.f90
generate_chunk_kernel.f90
hydro.cuf
ideal_gas.cuf
ideal_gas_kernel.cuf
initialise.f90
initialise_chunk.f90
initialise_chunk_kernel.f90
pack_kernel.f90
parse.f90
read_input.f90
report.f90
reset_field.f90
reset_field_kernel.cuf
revert.f90
revert_kernel.cuf
start.f90
timer.f90
timestep.f90
update_halo.f90
update_halo_kernel.cuf
viscosity.cuf
viscosity_kernel.cuf
visit.f90


This directory contains a CUDA Fortran port of the serial CloverLeaf benchmark for a single
GPU.

Managed Memory
--------------

This code utilizes the Managed Memory feature introduced in the CUDA 6.0 Toolkit
and in the PGI 14.7 compilers.  Managed Memory is a pool of memory accessible 
to both CPU and GPU using a single variable, which eliminates the need for separate host 
and device versions of data along with coding explicit data movement between host and 
device.  In CUDA Fortran, allocation into this pool of memory is achieved using the 
"managed" variable attribute in declarations (see definitions.cuf).  Managed Memory 
requires use of the 6.0 CUDA libraries and is available on devices with compute capability 
of 3.0 or higher.

One caveat in using Managed Memory is that on multi-GPU systems where any two GPUs are NOT 
peer-to-peer capable, the system falls back to using zero-copy memory which will typically
result in a large performance degradation.  Since this code utilizes a single GPU, the 
zero-copy fallback can be avoided by setting the environment variable CUDA_VISIBLE_DEVICES
to the appropriate GPU, or by setting the environment variable CUDA_MANAGED_FORCE_DEVICE_ALLOC
to a non-zero value.  (To determine whether all pairs of GPUs on your system are
peer-to-peer capable, compile and run the example code p2pAccess found in:

/opt/pgi/*/2015/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter4/P2P/p2pAccess/)
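
The CUDA_VISIBLE_DEVICES approach described above can also be applied from a
small launcher script rather than the interactive shell.  The following is a
minimal Python sketch, not part of this repository: the helper names
single_gpu_env and run_on_single_gpu are hypothetical, and the executable
name in the usage note is an assumption.

```python
import os
import subprocess


def single_gpu_env(device_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # Only the chosen device is visible to the CUDA runtime, which is the
    # README's first suggestion for avoiding the zero-copy fallback.
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


def run_on_single_gpu(command, device_index=0):
    """Run `command` (a list of arguments) with only `device_index` visible."""
    return subprocess.run(command, env=single_gpu_env(device_index), check=True)
```

For example, run_on_single_gpu(["./clover_leaf"], device_index=0) would start
the benchmark on GPU 0, assuming the Makefile produces an executable of that
name.  The caller's own environment is left unchanged because the helper works
on a copy.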


Texture Cache
-------------

On devices of compute capability 3.5 and higher, kernel arguments declared as 
"intent(in)" will be routed through the read-only texture cache via the LDG 
instruction.  As a result of this feature, no kernel uses explicit textures
or shared memory.  On devices of compute capability 3.0, the code will run
slower because the texture cache is not used.


Test Cases
----------

The four input files included with the code are those recommended for testing
correctness.  CloverLeaf expects the input file to be named clover.in, so copy one of
these files to clover.in before running a test.
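
When several test cases are run in sequence, the copy step can be scripted.
The sketch below is illustrative only: stage_input is a hypothetical helper,
and it assumes the input files sit in the directory the benchmark is run from.

```python
import shutil
from pathlib import Path

# The four input files shipped with the code.
INPUT_FILES = (
    "clover_bm_short.in",
    "clover_bm.in",
    "clover_bm16_short.in",
    "clover_bm16.in",
)


def stage_input(name, run_dir="."):
    """Copy one of the supplied input files to clover.in inside run_dir."""
    if name not in INPUT_FILES:
        raise ValueError(f"unknown input file: {name}")
    run_dir = Path(run_dir)
    target = run_dir / "clover.in"
    # copyfile overwrites any clover.in left over from a previous test.
    shutil.copyfile(run_dir / name, target)
    return target
```

Calling stage_input("clover_bm_short.in") before launching the executable
selects the shortest test case.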

The kinetic energy, as displayed in the output file clover.out every 10 timesteps and
at the end of the run, is usually the most sensitive state variable and can be used as
a correctness test.  The reference values for the kinetic energy at the last time step
of these runs, along with approximate wall-clock times on a system with a K20, are listed
below:

Input                   Kinetic Energy    Wall-clock time (s)
clover_bm_short.in      0.1193E+01        1.13
clover_bm.in            0.2590E+01        39
clover_bm16_short.in    0.3075E+00        17
clover_bm16.in          0.4854E+01        580
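
The comparison against these reference values can be automated.  The sketch
below is a suggestion rather than part of the benchmark: the function name
kinetic_energy_matches is hypothetical, and the relative tolerance of 5e-4 is
an assumption chosen because the references are quoted to four significant
digits.  The final kinetic energy is read from clover.out by hand and passed
in as a number, since the layout of that file is not described here.

```python
import math

# Final-step kinetic energies from the table above.
REFERENCE_KE = {
    "clover_bm_short.in": 0.1193e+01,
    "clover_bm.in": 0.2590e+01,
    "clover_bm16_short.in": 0.3075e+00,
    "clover_bm16.in": 0.4854e+01,
}


def kinetic_energy_matches(input_name, value, rel_tol=5e-4):
    """Return True if `value` agrees with the reference for `input_name`.

    rel_tol is an assumed tolerance, not one specified by the benchmark.
    """
    reference = REFERENCE_KE[input_name]
    return math.isclose(value, reference, rel_tol=rel_tol)
```

For instance, kinetic_energy_matches("clover_bm.in", 2.5901) accepts a value
that rounds to the reference, while a value of 2.60 is rejected.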