Running Fermi-HPL benchmark (not using GPUs)

Hello there,

I am running HPL to test a desktop computer with two Tesla C2050 cards, using hpl-2.0_FERMI_v13.tgz from NVIDIA's developer zone. I launched the benchmark with "mpirun -np 2 run_linpack &" and immediately ran "nvidia-smi -q -d MEMORY,UTILIZATION", which gave the following output:

==============NVSMI LOG==============

Timestamp                       : Fri Sep 30 23:55:06 2011

Driver Version                  : 275.09.07

Attached GPUs                   : 3

GPU 0:A:0   ### TESLA C2050

    Memory Usage

        Total                   : 2687 Mb

        Used                    : 2321 Mb

        Free                    : 365 Mb

    Utilization

        Gpu                     : 0 %

        Memory                  : 0 %

GPU 0:8:0  ### TESLA C2050

    Memory Usage

        Total                   : 2687 Mb

        Used                    : 2321 Mb

        Free                    : 366 Mb

    Utilization

        Gpu                     : 0 %

        Memory                  : 0 %

GPU 0:81:0  ### QUADRO 5000

    Memory Usage

        Total                   : 2559 Mb

        Used                    : 16 Mb

        Free                    : 2542 Mb

    Utilization

        Gpu                     : 0 %

        Memory                  : 3 %

The full nvidia-smi -q command output is attached as nvidia-smi.txt

As you can see, the Tesla cards are using almost all of their memory, but both show 0% GPU and memory utilization (I don't know why).
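In case it helps with debugging, here is a minimal sketch (not part of the HPL package) that pulls the Used-memory and GPU-utilization numbers out of that nvidia-smi -q text, so it can be polled in a loop instead of read by eye. The SAMPLE string is just one GPU block copied from the log above; newer drivers print "MiB" instead of "Mb", so the regexes may need adjusting.

```python
import re

# One GPU block in the format shown in the log above.
SAMPLE = """\
GPU 0:A:0
    Memory Usage
        Total                   : 2687 Mb
        Used                    : 2321 Mb
        Free                    : 365 Mb
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
"""

def parse_smi(text):
    # "Used : NNNN Mb" lines -> used memory per GPU, in MB.
    used = [int(m) for m in re.findall(r"Used\s+:\s+(\d+)", text)]
    # "Gpu : NN %" lines -> GPU utilization per GPU, in percent.
    # (Case-sensitive, so the "GPU 0:A:0" header lines do not match.)
    util = [int(m) for m in re.findall(r"Gpu\s+:\s+(\d+)\s*%", text)]
    return used, util

used, util = parse_smi(SAMPLE)
print(used, util)  # prints: [2321] [0]
```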

The benchmark takes a long time and the performance in Gflops is very low (as if it were using only the CPUs). This is my HPL.out file:

...

The following parameter values will be used:

N      :   51712 

NB     :     512 

PMAP   : Row-major process mapping

P      :       1 

Q      :       2 

PFACT  :    Left 

NBMIN  :       4 

NDIV   :       2 

RFACT  :    Left 

BCAST  :   1ring 

DEPTH  :       0 

SWAP   : Mix (threshold = 128)

L1     : no-transposed form

U      : no-transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be               1.110223e-16

- Computational tests pass if scaled residuals are less than                16.0

================================================================================

T/V                N    NB     P     Q               Time                 Gflops

--------------------------------------------------------------------------------

WR00L2L4       51712   512     1     2            1373.70              6.711e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0040267 ...... PASSED

================================================================================

Finished      1 tests with the following results:

              1 tests completed and passed residual checks,

              0 tests completed and failed residual checks,

              0 tests skipped because of illegal input values.
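As a sanity check on the numbers above (not a fix), HPL derives its Gflops column from the operation count of the LU solve, (2/3·N³ + 3/2·N²)/Time. A couple of lines of Python reproduce the reported figure:

```python
# Values from the T/V result line above.
N = 51712
time_s = 1373.70

# HPL counts 2/3*N^3 + 3/2*N^2 floating-point operations for the solve.
flops = (2.0 / 3.0) * N**3 + (3.0 / 2.0) * N**2
gflops = flops / time_s / 1e9
print(round(gflops, 2))  # ~67.11, matching the 6.711e+01 in HPL.out
```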

Does anyone know why my HPL runs look like they aren't using the GPUs?

Thanks for the help! :)

Enable the verbose print in the src/cuda Makefile.
You should see the DGEMM/DTRSM calls sent to the GPUs.

This is the output with verbose printing enabled:

rank 0 Assigning device 0  to process on node kamuk  

rank 1 Assigning device 1  to process on node kamuk  

rank 1 Allocating main buffer: 2048 MB 

rank 0 Allocating main buffer: 2048 MB

Did you ever figure out what your problem was? When I run that same version of HPL and poll nvidia-smi -q every 5 seconds or so, sometimes all 4 GPUs show 99% usage, and at other times only one shows that while the rest show 0%. Using 4 GPUs I get an HPL score of 240 GFLOPS, but using the regular HPL with only 24 CPUs I get a score of 160 GFLOPS.

I want to use the first and third GPUs. Can you help me?

Add this variable to the run_linpack script:

export CUDA_VISIBLE_DEVICES="0,2"
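For illustration, a hypothetical sketch of how the top of run_linpack might look (the actual HPL launch line is a placeholder). With CUDA_VISIBLE_DEVICES="0,2", only the first and third physical GPUs are visible, and CUDA renumbers them as devices 0 and 1 inside the process:

```shell
#!/bin/sh
# Expose only the first and third physical GPUs to CUDA.
# Inside this process they appear renumbered as devices 0 and 1.
export CUDA_VISIBLE_DEVICES="0,2"
echo "Visible GPUs: $CUDA_VISIBLE_DEVICES"
# ./xhpl "$@"   # placeholder for the actual HPL launch in run_linpack
```

Note the straight double quotes: curly quotes pasted from a web page will make the shell set the variable to a literal string the CUDA runtime cannot parse.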