NVBLAS with the Intel Fortran compilers

I seem to be missing something when attempting to use NVBLAS with the Intel Fortran compilers.

I appear to be linking and using nvblas.conf correctly, as I see feedback from the NVBLAS initialization at runtime. However, NVBLAS does not seem to intercept the calls to DGEMM: only the CPU implementation is executed. This is despite using:

NVBLAS_CPU_RATIO_CGEMM 0.0

in nvblas.conf (or removing it entirely).
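
Since the ratio keys are set per routine (NVBLAS_CPU_RATIO_<ROUTINE>), a quick way to see which routines the config file actually names is a small shell helper. This is only a sketch: the function name nvblas_ratio_routines is made up for illustration and is not part of NVBLAS, and it assumes the config is either passed as an argument, pointed to by NVBLAS_CONFIG_FILE, or called nvblas.conf in the current directory.

```shell
# Hypothetical helper (not part of NVBLAS): print the routine names that have
# an explicit NVBLAS_CPU_RATIO_<ROUTINE> entry in an NVBLAS config file.
# Lines commented out with '#' are not matched.
nvblas_ratio_routines() {
    # Config file: first argument, else $NVBLAS_CONFIG_FILE, else ./nvblas.conf
    conf="${1:-${NVBLAS_CONFIG_FILE:-nvblas.conf}}"
    grep -E '^[[:space:]]*NVBLAS_CPU_RATIO_[A-Z0-9]+' "$conf" \
        | sed -E 's/^[[:space:]]*NVBLAS_CPU_RATIO_([A-Z0-9]+).*/\1/'
}
```

Running nvblas_ratio_routines nvblas.conf prints one routine name per line, or nothing if no ratio is set, which makes it easy to compare the names in the config against the routine the code calls.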

If I disable access to the CPU BLAS implementation by removing:

NVBLAS_CPU_BLAS_LIB  /ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs/libmkl_rt.so

the program crashes at runtime, as I would expect.

The compiler options I am currently using are shown below. I have also tried linking MKL manually, with the same results.

# Compiler options
FFLAGS=-O3 -axAVX,SSE4.2 -msse3 -align array32byte -fpe1 -fno-alias -openmp -mkl=parallel -heap-arrays 32

# Linker options
LDFLAGS= -L/ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs -lnvblas

# List of libraries used
LIBS= -L/ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs -lnvblas

An example of a call to DGEMM is as follows:

call dgemm('N','T',nCols2,nCols1,nOcc(s),2.0d0/dble(nSpins),C2,nRowsP,C(:,:,s),nRowsP,0.0d0,P(i21,i11,s),nOrbsP)

Whilst I am currently limited to using the Intel compilers, this restriction will be lifted shortly (at which point I will use CUDA Fortran to optimize data movement).

Thanks in advance,

Karl

@Karl, could you provide sample code (along with build and run instructions) that reproduces your issue exactly?