Compiling Fortran CUF Kernel for matrix-matrix multiplication

Hi all,

I am experimenting with kernel loop directives, the so-called CUF kernels, in Fortran for a naive matrix-matrix multiplication of arbitrary sizes. I use pgfortran from the PGI/18.4 Community Edition. The code, the compilation command and the compiler output are copied below.
Compilation fails with an error that I have a hard time understanding. It seems the compiler manages to do part of the job (it reports that a CUDA kernel was generated), but not the whole! Do you have any thoughts here?

Thanks
Ehsan

Code:

program main

!  use cudafor
  implicit none

  integer, parameter :: sp = selected_real_kind(6)  
  integer, parameter :: dp = selected_real_kind(15)
  integer, parameter :: n = 5500, m = 3400, p = 4000
  real(dp) :: a(n, m), b(m, p), c(n, p), builtin(n, p)
  real(dp), device :: a_dev(n, m), b_dev(m, p), c_dev(n, p), val_dev
  real(dp) :: val, err
  real(sp) :: tic, toc, dt
 
  integer :: i, j, k

  call random_number(a)
  call random_number(b)

  call cpu_time(tic)
  a_dev = a; b_dev = b   ! Host-to-Device transfer 
  !$cuf kernel do (2) <<<(*,*) , (*,*)>>>
  do j = 1, p
     do i = 1, n
        val_dev = 0d0
        do k = 1, m
           val_dev = val_dev+a_dev(i,k)*b_dev(k,j)
        enddo
        c_dev(i, j) = val_dev
     enddo
  enddo
  c = c_dev          ! Device to Host transfer
  call cpu_time(toc)
  dt = toc - tic

  err = maxval(abs(matmul(a, b) - c))
  write(*, '(a, e23.16, a, f8.4)') 'max error occurred = ', err, &
        '; dt [sec] = ', dt

end program main

The compilation step:

export CUFFLAGS='-Mcuda=cc6.0,cuda8.0 -Minfo=all'
pgfortran -g -O2 $CUFFLAGS -Minfo -c matmul_cuf.f90 -o matmul_cuf.o

Compiler output:

main:
     24, CUDA kernel generated
         24, !$cuf kernel do <<< (*,*), (32,4) >>>
     36, maxval reduction inlined
         Generated vector simd code for the loop containing reductions
         Generated 2 prefetch instructions for the loop
nvvmCompileProgram error: 9.
Error: /node_scratch/20825499.moab.tier2.leuven.vsc/pgcudafor8aie0ZbnCBfK.gpu (115, 10): parse stored value and pointer type do not match
PGF90-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (matmul_cuf.f90: 1)
PGF90/x86-64 Linux 18.4-0: compilation aborted