Commit

version 1
LiamBindle committed Aug 11, 2021
1 parent 403159b commit c8dd469
Showing 5 changed files with 33 additions and 37 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -1,7 +1,7 @@
all: simple_xy_wr simple_xy_rd simple_xy_rd_retry

clean:
- rm -f simple_xy_wr simple_xy_rd simple_xy.nc *.mod
+ rm -f simple_xy_wr simple_xy_rd simple_xy.nc *.mod simple_xy_rd_retry

FFLAGS = $(shell nf-config --cflags) $(shell nf-config --fflags) -g
LDFLAGS = $(shell nf-config --flibs)
57 changes: 24 additions & 33 deletions README.md
@@ -1,43 +1,34 @@
## nf_retry

## Where it goes wrong

[Here](https://github.com/GEOS-ESM/MAPL/blob/d5009302d5ebac669cc7ff93db5134154a2c7a88/pfio/NetCDF4_FileFormatter.F90#L256)

```fortran
if (this%parallel) then
!$omp critical
status = nf90_open(file, IOR(omode, NF90_MPIIO), comm=this%comm, info=this%info, ncid=this%ncid)
!$omp end critical
_VERIFY(status)
else
!$omp critical
status = nf90_open(file, IOR(omode, NF90_SHARE), this%ncid) ! this line can fail
!$omp end critical
_VERIFY(status)
```

## NetCDF-Fortran's Documentation of Error Handling
This is a module that provides `nf90_open_with_retries`. This function wraps `nf90_open`
and adds the following functionality:

### 1.6 Error Handling
1. NetCDF errors are printed to the console
2. Automatically retry opening the file

The netCDF library provides the facilities needed to handle errors in a flexible way. Each netCDF function returns an integer status value. If the returned status value indicates an error, you may handle it in any way desired, from printing an associated error message and exiting to ignoring the error indication and proceeding (not recommended!). For simplicity, the examples in this guide check the error status and call a separate function to handle any errors.
This is useful for working around intermittent, spurious I/O errors on HPC cluster filesystems such as GPFS.

The NF90_STRERROR function is available to convert a returned integer error status into an error message string.
`nf_retry` is disabled by default, in which case `nf90_open_with_retries` behaves exactly like `nf90_open` (except that errors
are automatically printed). `nf_retry` can be enabled and configured by creating `nf_retry.nml`:

Occasionally, low-level I/O errors may occur in a layer below the netCDF library. For example, if a write operation causes you to exceed disk quotas or to attempt to write to a device that is no longer available, you may get an error from a layer below the netCDF library, but the resulting write error will still be reflected in the returned status value.
```
&nf_retry
nf_retry_wait=1, ! Seconds to wait before retry
nf_retry_max_tries=3, ! Max number of retries before failure
nf_retry_catch=0,1,3,2, ! NetCDF error codes to catch
nf_retry_catch_all=F, ! Catch all NetCDF errors?
/
```
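
The namelist above maps onto ordinary Fortran namelist I/O. The following is a minimal sketch of how such a file might be read, not the module's actual `nf_retry_init`. The module name `nf_retry_nml_sketch`, the `retry_settings` type, the `load_settings` function, the length of 4 for `nf_retry_catch`, and the zero defaults are all assumptions made for illustration. It shows that a missing file simply leaves the defaults in place, which is one way to obtain "disabled by default" behavior.

```fortran
module nf_retry_nml_sketch
   implicit none
   private
   public :: retry_settings, load_settings

   ! Hypothetical container for the four settings in nf_retry.nml.
   type :: retry_settings
      integer :: wait = 0             ! seconds between attempts
      integer :: max_tries = 0        ! 0 is assumed to mean "never retry"
      integer :: catch(4) = 0         ! assumed length; codes to retry on
      logical :: catch_all = .false.
   end type retry_settings

contains

   ! Reads the &nf_retry group from filepath; keeps defaults if the file is absent.
   function load_settings(filepath) result(cfg)
      character(len=*), intent(in) :: filepath
      type(retry_settings) :: cfg

      ! Namelist variable names must match the keys used in the file.
      integer :: nf_retry_wait, nf_retry_max_tries, nf_retry_catch(4)
      logical :: nf_retry_catch_all
      logical :: file_exists
      integer :: fh
      namelist /nf_retry/ nf_retry_wait, nf_retry_max_tries, &
                          nf_retry_catch, nf_retry_catch_all

      ! Assumed defaults: keys missing from the file keep these values.
      nf_retry_wait = 0
      nf_retry_max_tries = 0
      nf_retry_catch = 0
      nf_retry_catch_all = .false.

      inquire (file=filepath, exist=file_exists)
      if (file_exists) then
         open (newunit=fh, file=filepath, status='old', action='read')
         read (unit=fh, nml=nf_retry)
         close (fh)
      end if

      cfg = retry_settings(nf_retry_wait, nf_retry_max_tries, &
                           nf_retry_catch, nf_retry_catch_all)
   end function load_settings

end module nf_retry_nml_sketch
```

Trailing `!` comments and a trailing comma after each value, as used in the file above, are accepted by standard namelist input.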

### Summary
### Implementing nf_retry

Do
Use the `nf_retry` module and replace `nf90_open` calls with `nf90_open_with_retries`.

```fortran
status = nf90_open(...)
if(status /= nf90_noerr) then
print *, trim(nf90_strerror(status))
endif
NF90_STRERROR
```diff
+ use nf_retry
...
- status = nf90_open(path, mode, ncid)
+ status = nf90_open_with_retries(path, mode, ncid)
```

## References
1. https://www.unidata.ucar.edu/software/netcdf/docs-fortran/nc_f77_interface_guide.html#f77_NF_OPEN_
2. https://www.unidata.ucar.edu/software/netcdf/docs-fortran/f90_datasets.html#f90-nf90_strerror
Make sure to add `nf_retry.f90` to your build.
7 changes: 7 additions & 0 deletions nf_retry.f90
@@ -5,6 +5,8 @@ module nf_retry
integer :: nf_retry_wait
integer :: nf_retry_max_tries
logical :: nf_retry_catch_all
+
+ logical :: nf_retry_is_initialized = .false.
contains
function nf90_open_with_retries(path, mode, ncid) result(status)
implicit none
@@ -18,6 +20,9 @@ function nf90_open_with_retries(path, mode, ncid) result(status)

status = nf90_open(path, mode, ncid)
if(status /= nf90_noerr) then
+ if (.not. nf_retry_is_initialized) then
+ call nf_retry_init()
+ end if
print '("nf90_open: Error code (", I0,") opening ", A, " [", A,"]")',status,trim(path),trim(nf90_strerror(status))
do while ( ( status /= nf90_noerr ) .and. &
( any(nf_retry_catch==status) .or. nf_retry_catch_all ) .and. &
@@ -52,6 +57,8 @@ subroutine nf_retry_init()
open (unit=fh, file=filepath)
read (nml=nf_retry, unit=fh)
end if
+
+ nf_retry_is_initialized = .true.
end subroutine

end module nf_retry
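
The change in this file makes initialization lazy: settings are loaded the first time an open fails, so callers no longer need to call `nf_retry_init()` themselves. Below is a self-contained sketch of that pattern, assuming a stubbed opener in place of `nf90_open`. The module name `retry_wrapper_sketch`, `open_with_retries`, `opener_iface`, `init_settings`, `init_calls`, the hard-coded settings, and the `tries < max_tries` bound are hypothetical, since the hunks above do not show the full loop. The wait between attempts is omitted because standard Fortran has no portable sleep.

```fortran
module retry_wrapper_sketch
   implicit none
   private
   public :: open_with_retries, init_calls

   ! Hypothetical stand-ins for the nf_retry module variables.
   integer :: max_tries = 0              ! 0 means retries are disabled
   integer :: catch(4) = [0, 1, 2, 3]    ! status codes that trigger a retry
   logical :: catch_all = .false.
   logical :: is_initialized = .false.
   integer :: init_calls = 0             ! counts how often settings were loaded

   abstract interface
      integer function opener_iface(path)
         character(len=*), intent(in) :: path
      end function opener_iface
   end interface

contains

   ! Stand-in for nf_retry_init(); the real routine reads nf_retry.nml.
   subroutine init_settings()
      max_tries = 3
      init_calls = init_calls + 1
      is_initialized = .true.
   end subroutine init_settings

   ! Calls opener(path) and repeats the call while the status is retryable.
   function open_with_retries(opener, path) result(status)
      procedure(opener_iface) :: opener
      character(len=*), intent(in) :: path
      integer :: status
      integer :: tries

      status = opener(path)
      if (status /= 0) then
         ! Load settings lazily, only when the first failure is seen.
         if (.not. is_initialized) call init_settings()
         tries = 0
         do while (status /= 0 .and. (any(catch == status) .or. catch_all) &
                   .and. tries < max_tries)
            tries = tries + 1
            status = opener(path)
         end do
      end if
   end function open_with_retries

end module retry_wrapper_sketch
```

Because a successful open returns before the guard is reached, a run with no failures never touches the settings file in this sketch.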
2 changes: 1 addition & 1 deletion nf_retry.nml
@@ -1,6 +1,6 @@
&nf_retry
nf_retry_wait=1, ! Seconds to wait before retry
nf_retry_max_tries=3, ! Max number of retries before failure
- nf_retry_catch=0,1,3,2, ! NetCDF error codes to catch
+ nf_retry_catch=0,1,2,3, ! NetCDF error codes to catch
nf_retry_catch_all=F, ! Catch all NetCDF errors?
/
2 changes: 0 additions & 2 deletions simple_xy_rd_retry.f90
@@ -37,8 +37,6 @@ program simple_xy_rd
! print '("nf_retry: Caught netcdf error code(", I0,"): ",A)',NX,nf90_strerror(NX)
! print '("nf_retry: Retrying in ", I0,"s.")',11

-
- call nf_retry_init()
! stop "Done"

! Open the file. NF90_NOWRITE tells netCDF we want read-only access to
