Running Fermi-HPL (not using GPUs)
Hello there,
I am running HPL to test a desktop computer, now with two Tesla [b]C2050[/b] cards, using the [b]hpl-2.0_FERMI_v13.tgz[/b] package available on NVIDIA's developer zone. I started the benchmark ([b]mpirun -np 2 run_linpack &[/b]) and immediately ran [b]nvidia-smi -q -d MEMORY,UTILIZATION[/b], which produced the following output:

[code]
==============NVSMI LOG==============

Timestamp : Fri Sep 30 23:55:06 2011

Driver Version : 275.09.07

Attached GPUs : 3

GPU 0:A:0  ### TESLA C2050
    Memory Usage
        Total       : 2687 Mb
        Used        : 2321 Mb
        Free        : 365 Mb
    Utilization
        Gpu         : 0 %
        Memory      : 0 %
GPU 0:8:0  ### TESLA C2050
    Memory Usage
        Total       : 2687 Mb
        Used        : 2321 Mb
        Free        : 366 Mb
    Utilization
        Gpu         : 0 %
        Memory      : 0 %
GPU 0:81:0 ### QUADRO 5000
    Memory Usage
        Total       : 2559 Mb
        Used        : 16 Mb
        Free        : 2542 Mb
    Utilization
        Gpu         : 0 %
        Memory      : 3 %
[/code]

The full [b]nvidia-smi -q[/b] output is attached as [b][i]nvidia-smi.txt[/i][/b].

As you can see, the Tesla cards have allocated almost all of their memory, yet both show 0% GPU and memory utilization (I don't know why).
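In case a single snapshot was just catching an idle moment between kernels, utilization can also be polled in a loop while the run is active; a rough sketch in bash, using the same run_linpack script as above:

[code]
# Poll GPU utilization every 5 s for as long as the benchmark runs
mpirun -np 2 run_linpack &
HPL_PID=$!
while kill -0 $HPL_PID 2>/dev/null; do
    nvidia-smi -q -d UTILIZATION | grep -E "Gpu|Memory"
    sleep 5
done
[/code]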
The benchmark takes a long time and the performance in Gflops is very low, as if it were using only the CPUs (for reference, two C2050s have a combined double-precision peak of roughly 1 Tflop/s). This is my HPL.out file:

[code]
...
The following parameter values will be used:

N : 51712
NB : 512
PMAP : Row-major process mapping
P : 1
Q : 2
PFACT : Left
NBMIN : 4
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 128)
L1 : no-transposed form
U : no-transposed form
EQUIL : yes
ALIGN : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L4       51712   512     1     2            1373.70              6.711e+01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0040267 ...... PASSED
================================================================================

Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
[/code]

Does anyone know why my HPL runs look like they aren't using the GPUs?

Thanks for the help! :)
Attachments

nvidia-smi.txt

#1
Posted 10/01/2011 06:00 AM   
Enable the verbose print in the src/cuda Makefile.
You should see the DGEMM/DTRSM sent to the GPUs.
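If the define is not already in place, it is a compile-time switch; a sketch of the Makefile edit (the [b]VERBOSE_PRINT[/b] macro name here is an assumption, check the flags in your copy of the package):

[code]
# src/cuda/Makefile (sketch; the exact macro name may differ in your copy)
CFLAGS += -DVERBOSE_PRINT
[/code]

Rebuild afterwards (make clean && make) so the CUDA code is recompiled.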
#2
Posted 10/05/2011 02:29 PM   
[quote name='mfatica' date='05 October 2011 - 08:29 AM' timestamp='1317824988' post='1303679']
Enable the verbose print in the src/cuda Makefile.
You should see the DGEMM/DTRSM sent to the GPUs.
[/quote]


This is the output with verbose printing enabled:

[code]rank 0 Assigning device 0 to process on node kamuk
rank 1 Assigning device 1 to process on node kamuk
rank 1 Allocating main buffer: 2048 MB
rank 0 Allocating main buffer: 2048 MB[/code]
#3
Posted 10/13/2011 06:19 PM   
Did you ever figure out what your problem was? When I run that same version of HPL and check nvidia-smi -q every 5 seconds or so, sometimes all 4 GPUs show 99% utilization, and at other times only one shows that while the rest show 0%. Using 4 GPUs I get an HPL score of 240 GFLOPS, while the regular CPU-only HPL on 24 CPUs gives me 160 GFLOPS.
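A simple way to take those 5-second samples is with watch:

[code]
# Refresh the utilization readout every 5 seconds
watch -n 5 "nvidia-smi -q -d UTILIZATION"
[/code]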
#4
Posted 02/17/2012 05:33 PM   
[quote name='mfatica' date='05 October 2011 - 02:29 PM' timestamp='1317824988' post='1303679']
Enable the verbose print in the src/cuda Makefile.
You should see the DGEMM/DTRSM sent to the GPUs.
[/quote]


I want to use the first and third GPUs; can you help me?
#5
Posted 04/23/2012 10:16 AM   
Add this variable to the run_linpack script:

export CUDA_VISIBLE_DEVICES="0,2"
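The CUDA runtime renumbers whatever it is allowed to see, so inside each MPI process the two remaining cards show up as devices 0 and 1. For example:

[code]
# Expose only physical GPUs 0 and 2 to the CUDA runtime; the
# application sees them renumbered as devices 0 and 1.
export CUDA_VISIBLE_DEVICES="0,2"
mpirun -np 2 ./run_linpack
[/code]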
#6
Posted 04/23/2012 03:15 PM   