
Comparison reproducibility #2

Open
ivan-pi opened this issue Feb 5, 2024 · 8 comments
ivan-pi commented Feb 5, 2024

I ran your comparison on my Mac, using gfortran v13:

$ make
gfortran-13 -o index_array -Wall -O3 index_array.f90
gfortran-13 -o whole_array -Wall -O3 whole_array.f90
whole_array.f90:34:68:

   34 |   real ( kind = 8 ), dimension ( Nx, Ny ) :: r, con, dfdcon, lap_con, dummy_con
      |                                                                    1
Warning: Unused variable 'lap_con' declared at (1) [-Wunused-variable]
$ ./index_array 
 ------------------------------------------------------
  Indexed array compute time =      8.723 seconds.
$ ./whole_array 
 ------------------------------------------------------
  Whole array computed time =     13.959 seconds.

The indexed version looks faster.

@Shahid718 (Owner)

It is faster with gfortran v13! I did that comparison a few months back with version 13 as well, and reached a similar conclusion.

My assumption is that, because of do concurrent, significant effort has gone into improving the performance of the do loop.

Here is my output for gfortran 12 on Windows, run just now.

[image: comparison]


ivan-pi commented Feb 5, 2024

The indexed version contains some errors.

The following two lines,

        ! adjust concentration in range
        if ( con(i,j) >= 0.99999 )  con(i,j) = 0.99999
        if ( con(i,j) < 0.00001 )   con(i,j) = 0.00001

should be inside of the loop.

If I'm not mistaken, you also need to calculate the entire Laplacian of the concentration before using it to update the phi values. This applies to the indexed version.
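As a sketch of what I mean (assuming the names con, lap_con, dt, mobility, Nx, Ny from the repository, with integer helpers i, j, ip, im, jp, jm assumed declared; this is not the exact fix), the corrected structure would be two separate passes:

```fortran
! Pass 1: clamp each point inside the loop, then store the Laplacian
! of con so the whole field is available before any update happens
do j = 1, Ny
   do i = 1, Nx
      if ( con(i,j) >= 0.99999d0 ) con(i,j) = 0.99999d0   ! per-point clamp
      if ( con(i,j) <  0.00001d0 ) con(i,j) = 0.00001d0
      ip = merge(1, i+1, i == Nx); im = merge(Nx, i-1, i == 1)  ! periodic wrap
      jp = merge(1, j+1, j == Ny); jm = merge(Ny, j-1, j == 1)
      lap_con(i,j) = con(ip,j) + con(im,j) + con(i,jp) + con(i,jm) &
                     - 4.0d0*con(i,j)
   end do
end do

! Pass 2: only once lap_con is complete may con itself be overwritten
do j = 1, Ny
   do i = 1, Nx
      con(i,j) = con(i,j) + dt*mobility*lap_con(i,j)
   end do
end do
```

The key point is the separation: mixing the stencil read and the in-place write in one pass makes later points see already-updated neighbours.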


ivan-pi commented Feb 5, 2024

Here is a slightly less legible version, but it runs roughly 4x faster for me. I'm not familiar with phase-field simulations, but the output looks as expected.

$ make FFLAGS="-Wall -O2 -march=native" index_array     
gfortran-13 -o index_array -Wall -O2 -march=native index_array.f90
$ time ./index_array 
 ------------------------------------------------------
  Indexed array compute time =      1.908 seconds.

real	0m2.062s
user	0m1.965s
sys	0m0.009s

program fd_ch_test

  implicit none

  ! ===========================================================================
  !                                parameters
  ! ===========================================================================

  integer, parameter :: ik = kind(0)
  integer, parameter :: wp = kind(1.0d0)

  integer(kind=ik), parameter :: Nx = 260 , Ny = 260, dx = 1, dy = 1
  
  real(kind=wp), dimension (0:Nx+1, 0:Ny+1, 2) :: con
  real(kind=wp), dimension (0:Nx+1, 0:Ny+1) :: r, lap_con

  integer(kind=ik) :: nsteps, istep, flip
  real(kind=wp) :: dt, c0, mobility, grad_coef, noise, A
  real(kind=wp) :: start , finish

  flip = 1
  nsteps = 30000
  dt = 0.01_wp
  c0 = 0.4_wp
  mobility = 1.0_wp
  grad_coef = 0.5_wp
  noise = 0.02_wp
  A = 1.0_wp

  ! ===========================================================================
  !                            initial microstructure
  ! ===========================================================================

  call random_number( r )
  con(:,:,flip) = c0 + noise * (0.5_wp - r)

  call periodic_halo(nx,ny,con(:,:,flip))

  ! ===========================================================================
  !                         evolution of microstructure 
  ! ===========================================================================

  call cpu_time ( start )

  time_loop: do istep = 1, nsteps

      select case(flip)
      case(1)
         call update(nx,ny,cold=con(:,:,1),cnew=con(:,:,2),wrk=lap_con, &
            dt=dt,mobility=mobility,A=A,grad_coef=grad_coef)
      case(2)
         call update(nx,ny,cold=con(:,:,2),cnew=con(:,:,1),wrk=lap_con, &
            dt=dt,mobility=mobility,A=A,grad_coef=grad_coef)
      end select
      flip = 3 - flip

  end do time_loop

  call cpu_time ( finish )

  ! ===========================================================================
  !                     print computed time on the screen 
  ! ===========================================================================


  print *,'------------------------------------------------------'
  print '("  Indexed array compute time = ", f10.3," seconds." )', finish - start

   block
      integer :: unit, i, j
      open(newunit=unit,file="results.out",action="write")
      do i = 1, nx
         do j = 1, ny
            write(unit, *) (i-1), (j-1), con(i,j,flip)
         end do
         write(unit,*) ""
      end do
      close(unit)
   end block

contains

   subroutine periodic_halo(nx,ny,con)
      integer, intent(in) :: nx, ny
      real(wp), intent(inout) :: con(0:nx+1,0:ny+1)

      ! Fix corners
      con(0,0) = con(nx,ny)
      con(Nx+1,0) = con(1,ny)
      con(0,Ny+1) = con(nx,1)
      con(Nx+1,Ny+1) = con(1,1)

      ! Fix east and west sides
      con(0,1:ny) = con(nx,1:ny)
      con(Nx+1,1:ny) = con(1,1:ny)

      ! Fix north and south sides
      con(1:nx,0) = con(1:nx,Ny)
      con(1:nx,ny+1) = con(1:nx,1)

   end subroutine

   subroutine update(nx,ny,cold,cnew,wrk,dt,A,mobility,grad_coef)
      implicit none
      integer, intent(in) :: nx, ny
      real(wp), intent(in) :: cold(0:nx+1,0:ny+1)
      real(wp), intent(out) :: cnew(0:nx+1,0:ny+1)
      real(wp), intent(inout) :: wrk(0:nx+1,0:ny+1)
      real(wp), intent(in) :: dt, A, mobility, grad_coef

      real(wp) :: lap, dfd
      integer :: i, j
      real(wp) :: alpha

      alpha = 1.0_wp / (real(dx,wp) * real(dy,wp))

      do j = 1, Ny
         do i = 1, Nx
        
            dfd = A * ( 2.0_wp*cold(i,j)*( 1.0_wp - cold(i,j) )**2 &
               - 2.0_wp*cold(i,j)**2*( 1.0_wp - cold(i,j) ) )

            lap   = alpha * ( cold(i+1,j) + cold(i-1,j) + cold(i,j-1) + cold(i,j+1) - &
                4.0_wp*cold(i,j) )

            wrk(i,j) = dfd - grad_coef*lap

         end do
      end do

      call periodic_halo(nx,ny,wrk)

      do j = 1, Ny
         do i = 1, Nx
           
           lap = alpha * ( wrk(i+1,j) + wrk(i-1,j) + wrk(i,j-1) &
                + wrk(i,j+1) - 4.0_wp*wrk(i,j) )

           cnew(i,j) =  cold(i,j) + dt*mobility*lap
           !cnew(i,j) = max(0.00001_wp,min(0.99999_wp, cnew(i,j)))
         end do
      end do

      call periodic_halo( nx, ny, cnew)

   end subroutine

end program fd_ch_test

[image: simulation output]

@Shahid718 (Owner)

Yeah, you caught it right! I have corrected the code now.

Your new code is very performance-oriented. I am not familiar with that coding style, so I will go through the details and learn from your code. Thanks for it.

@ivan-pi
Copy link
Author

ivan-pi commented Feb 5, 2024

The changes are essentially:

  • Separate old and new arrays, so that the update can be safely executed in parallel. Depending on the parity of the time step (i.e. whether it is odd or even), we switch the roles of the in and out arrays. This is also known as the "flip-flop" or "ping-pong" approach. The same could be accomplished with two named arrays (i.e. con1 and con2) or with pointers.
  • I added a layer of halo cells that serve as a buffer for the periodic boundaries. This makes the update more straightforward, and hence easier for the compiler to vectorize. It does have some downsides however, namely that we need to remember to update the halo elements, and that it's likely not compatible with black-box time-stepping routines.
  • I replaced the do concurrent loops with regular do loops. Strictly speaking this is not needed, and do concurrent will work fine. I only did this so I can experiment with OpenMP directives.
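The pointer variant mentioned in the first bullet could look like this (a minimal self-contained sketch, not code from the repository; the real update kernel is replaced by a dummy assignment):

```fortran
program flip_flop_demo
   implicit none
   integer, parameter :: wp = kind(1.0d0), n = 4
   real(wp), allocatable, target :: con1(:,:), con2(:,:)
   real(wp), pointer :: cold(:,:), cnew(:,:), ctmp(:,:)
   integer :: istep

   allocate(con1(n,n), con2(n,n))
   con1 = 0.4_wp
   cold => con1        ! cold starts out pointing at the initialized field
   cnew => con2

   do istep = 1, 5
      cnew = cold + 0.1_wp   ! stand-in for the real update(cold -> cnew)
      ctmp => cold           ! swap the roles of old and new: no data is
      cold => cnew           ! copied, only the pointers are reseated
      cnew => ctmp
   end do

   print *, cold(1,1)        ! the latest state is always behind cold
end program flip_flop_demo
```

No array data moves during the swap, which is the whole point of the ping-pong scheme.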

@Shahid718 (Owner)

How does this flip-flop technique work if we use do concurrent with an OpenMP compiler flag? Or is it just for the usual do loop?


ivan-pi commented Feb 22, 2024

You can use whatever you want inside of the update subroutine. do concurrent or !$omp parallel do collapse(2) are both valid options.
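For instance, the second loop of update could be written in either form (a sketch reusing the names cold, cnew, dt, mobility from the code above; lap here is assumed to be a precomputed Laplacian array rather than the scalar in the listing):

```fortran
! Variant 1: do concurrent (Fortran 2008)
do concurrent (j = 1:ny, i = 1:nx)
   cnew(i,j) = cold(i,j) + dt*mobility*lap(i,j)
end do

! Variant 2: regular loops with an OpenMP directive
!$omp parallel do collapse(2)
do j = 1, ny
   do i = 1, nx
      cnew(i,j) = cold(i,j) + dt*mobility*lap(i,j)
   end do
end do
!$omp end parallel do
```

The flip-flop scheme is unchanged in both cases, since the swap happens outside the kernel.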

Using allocatable arrays and swapping them with move_alloc (as shown here https://fortran-lang.discourse.group/t/swapping-arrays-during-time-stepping/7354/11?u=ivanpribec) is also a good option.

The time-stepping loop would look like this:

  time_loop: do istep = 1, nsteps

      call update(nx,ny,cold=cold,cnew=cnew,wrk=lap_con, &
            dt=dt,mobility=mobility,A=A,grad_coef=grad_coef)
      
      call swap(cold,cnew)
      t = t + dt
      
  end do time_loop
  
contains

   subroutine swap(cold,cnew)
      real(wp), intent(inout), allocatable :: cold(:,:), cnew(:,:)
      real(wp), allocatable :: ctmp(:,:)
      
      call move_alloc(from=cold,to=ctmp)
      call move_alloc(from=cnew,to=cold)
      call move_alloc(from=ctmp,to=cnew)
   end subroutine
   

@Shahid718 (Owner)

The purpose of this repository is to show the difference between the whole-array and loop versions. Parallel programming and OpenMP were not part of this work; I want to do that at a later stage (see the Future work section). But your style of coding has shown lots of ways to do it, and it is surely worth trying. I will try it and be in touch with you. Thanks for sharing your code here.
