CUDA accelerated Linpack seemingly not using any GPU

Hello everyone,

I’m trying to benchmark a cluster of 7 GPU nodes using NVIDIA’s CUDA-accelerated Linpack. Each node contains:

  • 2x Intel Xeon E5-2640 v4
  • 64 GB Memory
  • 4x Tesla P100 16 GB HBM2.

I have tried many different tuning variables, but my results are always very poor. The best result was around 3.5 TFlops with all 28 GPUs, which is an efficiency of ~3%. Even considering https://devtalk.nvidia.com/default/topic/991058/poor-results-from-cuda-linpack-on-k80/ that result is far too low, right?

While the benchmark is running, nvidia-smi shows barely any activity: ~45 W of 300 W power draw, 0% GPU utilization, ~2400 of ~16000 MiB memory used. Could it be that the benchmark isn’t using the GPUs at all? Or is nvidia-smi not the right way to check that?

Since I’m not getting any warnings or error messages (every result also says PASSED), I don’t know what to try next or where to change settings (HPL.dat? run_linpack?).

If you need more information to assist me, I will gladly provide it.

What is the maximum score you can get on just a single node (with its 4 GPUs)?
Do you have an InfiniBand connection between the nodes?

With one node (PxQ = 2x2) I achieved 0.437 TFlops.

The cluster description has the following information about the interconnect, so I suppose the answer to your question is yes:

Interconnect
eth0 onboard 1GB Ethernet (service) 
eth1 onboard 1GB Ethernet (not being used) 
ipmi dedicated
ib0 Intel Omni-Path HFI Adapter 100 Series (58 GBit/s Infiniband)

You would need to make sure your MPI is configured to use the fast interconnect (in your case Omni-Path) rather than Ethernet.
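
For example, assuming your stack is Open MPI (an assumption; adjust to whatever MPI you actually use), you could verify that the Omni-Path transport (PSM2) is built in and request it explicitly:

# check whether the PSM2 (Omni-Path) transport is available in Open MPI
ompi_info | grep -i psm2

# launch over Omni-Path explicitly instead of falling back to TCP on eth0
mpirun --mca pml cm --mca mtl psm2 -np 4 ./run_linpack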

Also, 0.4 TF is low for the single-node score. Honestly, I can’t figure out how you got 3.5 TF across 7 nodes if you only got 0.4 TF on a single node.

You’ll need a much larger problem size to get full performance out of 4 P100 GPUs in a single node, and it’s very likely that the 64 GB of main memory per node will be a limiting factor here.

I would start by focusing on getting the most performance from a single node. What is the largest problem size you can run with a 2x2 grid?

Can you get higher performance (than 0.4 TF) if you run on just a single GPU (P = Q = 1 in that case)?
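
The process grid is set in HPL.dat; for a single-GPU run the relevant lines of the standard HPL.dat layout would look roughly like this (leaving everything else as you have it):

1            # of process grids (P x Q)
1            Ps
1            Qs

and you would launch with a single MPI rank (mpirun -np 1 ./run_linpack).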

HPL (High-Performance Linpack) is outside my area of expertise.

However, the measured 0.4 TF per node in Linpack combined with the fact that nvidia-smi reports 0% GPU utilization and 45 W power consumption for each GPU strongly suggests that the GPUs are not being used.

While it is not clear what system memory requirements your day-to-day workloads have, the system memory seems undersized for a general-purpose HPC node, as txbob says. You would want 4-8 GB per CPU core, and your system has 20 CPU cores per node, which suggests 80-160 GB. Also, the ratio of system memory to GPU memory should be 2:1 to 4:1, and you have 64 GB of GPU memory per node, which suggests 128-256 GB. Your 64 GB falls short on both counts.

I used

N × N × 8 bytes = required memory size

as a guide for choosing the problem size N. So 85000-89000 should be feasible with a 2x2 grid. The 0.4 TF result came from a smaller problem size, N=40000.
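
As a quick sanity check on that rule of thumb (leaving some headroom for the OS, MPI buffers, and HPL workspace, say 80% of the 64 GB as an assumed usable fraction):

# largest N whose N x N matrix of doubles fits in ~80% of 64 GiB
awk 'BEGIN { printf "N_max ~ %d\n", sqrt(0.80 * 64 * 2^30 / 8) }'
# prints: N_max ~ 82897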

I’m currently running a 2x2 grid with a problem size of 85000. I will try P=Q=1 once it finishes in a few minutes.

Okay, I finished those two runs:

2x2    N=85000    0.4598 TF
1x1    N=85000    0.2584 TF

For the 1x1 run I did slightly change the values of CPU_CORES_PER_GPU, CUDA_DGEMM_SPLIT and CUDA_DTRSM_SPLIT in run_linpack, though.
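
For reference, those are environment variables exported by the run_linpack wrapper script; the values below are only illustrative, not recommendations (my understanding is that the SPLIT values set the fraction of the work handed to the GPU versus the host BLAS):

export CPU_CORES_PER_GPU=5     # e.g. 20 cores / 4 GPUs
export CUDA_DGEMM_SPLIT=0.80   # fraction of DGEMM work offloaded to the GPU
export CUDA_DTRSM_SPLIT=0.70   # fraction of DTRSM work offloaded to the GPU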

Using the method linked below to measure utilization etc., I always see the same symptoms when I start the benchmark: the power draw increases only a little (33 W idle to 45 W), GPU utilization stays at 0% apart from a small peak of <10% in the first second or two, and memory usage increases slightly (0 MiB to ~2500 MiB).

https://stackoverflow.com/questions/8223811/top-command-for-gpus-using-cuda
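
Concretely, the kind of monitoring loop meant there is something along these lines:

# sample power draw, utilization and memory use once per second
nvidia-smi --query-gpu=timestamp,power.draw,utilization.gpu,memory.used --format=csv -l 1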

@njuffa:
my university purchased that cluster last year; I’m only now using it. Except for 8 so-called fat nodes with 256 GB of memory each (unfortunately without GPUs), all ~300+ nodes have only 64 GB of memory.

To get higher scores, you need to push N as high as it will go. If you search around, you can find rules of thumb for computing the maximum N for a given machine (a given system memory size), but for the 1x1 and 2x2 cases you can just use trial and error.

If N=85000 is the highest you can go, then that would indicate that system memory is the limiting factor here.

I am sorry to read that. Maybe the sudden increase in DRAM prices caught them by surprise and they had to cut the system memory size because of it. IMHO, 256 GB per node would be optimal given the other system specs.

I just tried a bunch of different problem sizes for a 1x1 run. The highest I could go was N=88000. Anything larger gave

Failed to cudaHostRegister pinned memory

So the lack of any visible GPU utilization in nvidia-smi is just a consequence of the limited memory?

Yes, with a small problem size (N) you are doing very little work overall, so the GPU isn’t doing much. This should be evident from the low score. For a decent run, you want an N well over 100000.

64 GB of system memory is just too small to be interesting for GPU-accelerated HPL. The sizable memory allocation (gigabytes) shows that the GPUs are being used during this test, just not to their capacity/capability.

And it makes sense that 88000 would top out a 64GB config.
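
The arithmetic backs that up: an 88000 x 88000 matrix of doubles occupies essentially all of the 64 GB once the OS and MPI take their share:

awk 'BEGIN { printf "%.1f GB\n", 88000^2 * 8 / 1e9 }'
# prints: 62.0 GB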

Ok, I will talk to my adviser about this, but I don’t have high hopes of getting more memory.

I have actually never done any CUDA programming before, so I would like to ask: is that 64 GB of memory at least viable for “normal” scientific programs?

You can get plenty of CUDA work done with 64 GB of system memory. However, if your plan is to run at maximum performance using all four Tesla P100s at the same time, you may find the small system memory to be a limiting factor more often than you care for. I learned the hard way that skimping on system memory is not the way to go.

What kind of GPU-accelerated workloads do you anticipate running? For most well-known HPC applications there are hardware recommendations, including system memory size, so check the documentation of whatever apps you plan to run.

For guidance on well-balanced GPU-accelerated HPC nodes, one could look at NVIDIA’s DGX-1 (2x E5-2698 with 40 CPU cores total, 8x P100 with 128 GB of GPU memory, 512 GB of system memory) or the nodes of the upcoming Summit supercomputer (2x POWER9 with 44 CPU cores, 4x V100 with 64 GB of GPU memory, 512 GB of system memory).

Just to be clear, the comment I made in the thread you linked is still applicable here:

https://devtalk.nvidia.com/default/topic/991058/cuda-programming-and-performance/poor-results-from-cuda-linpack-on-k80/post/5074677/#5074677

I did not bother to repeat that statement since you had already linked that thread. However, you cannot and will not get “full” performance out of any newer GPU using this particular HPL distribution. I assume that you are simply questioning the results you are getting and what the limiting factors may be, and I believe one possible limiting factor is the (system) memory size.

I did some more testing, including using the nvprof command to see what the GPUs are doing. I have to admit I don’t really understand its output (or whether it’s feasible to use in my case at all), but maybe someone here can help me understand it.

I’m especially curious about lines 9, 19, 29 and 39 in the second code block. Shouldn’t there be values for “Grid Size” and “Block Size”?

[(...)@gpu08 CUDA]$ nvprof --profile-child-processes  mpirun -np 4 ./run_linpack
==3888== NVPROF is profiling process 3888, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==3890== NVPROF is profiling process 3890, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==3889== NVPROF is profiling process 3889, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==3886== NVPROF is profiling process 3886, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==3886== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==3886== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.5680us         1  1.5680us  1.5680us  1.5680us  [CUDA memcpy HtoD]
      API calls:   58.00%  2.50489s         3  834.96ms  2.8980us  2.50489s  cudaFree
                   25.88%  1.11748s         5  223.50ms  1.6387ms  1.02260s  cudaHostRegister
                   14.27%  616.25ms         5  123.25ms  635.84us  566.33ms  cudaHostUnregister
                    1.08%  46.686ms       740  63.089us     223ns  5.5438ms  cuDeviceGetAttribute
                    0.37%  15.856ms         5  3.1712ms  859.39us  5.3338ms  cudaGetDeviceProperties
                    0.21%  8.9846ms         8  1.1231ms  486.94us  2.0516ms  cuDeviceTotalMem
                    0.11%  4.7308ms         4  1.1827ms  16.284us  3.3020ms  cudaMalloc
                    0.07%  2.9980ms         8  374.75us  75.375us  722.88us  cuDeviceGetName
                    0.01%  472.88us         1  472.88us  472.88us  472.88us  cudaMemGetInfo
                    0.00%  111.91us         4  27.978us  17.915us  52.822us  cuStreamCreate
                    0.00%  42.309us         1  42.309us  42.309us  42.309us  cudaMemcpy
                    0.00%  22.970us        16  1.4350us     928ns  5.7780us  cudaEventCreateWithFlags
                    0.00%  13.865us         2  6.9320us  2.0180us  11.847us  cudaSetDevice
                    0.00%  12.614us        11  1.1460us     492ns  6.0460us  cudaDeviceGetAttribute
                    0.00%  5.8950us        12     491ns     303ns     969ns  cuDeviceGet
                    0.00%  5.6740us         2  2.8370us  2.0680us  3.6060us  cudaGetDevice
                    0.00%  4.6540us         4  1.1630us     413ns  2.9560us  cuDeviceGetCount
                    0.00%  2.4320us         1  2.4320us  2.4320us  2.4320us  cudaGetDeviceCount
                    0.00%     661ns         1     661ns     661ns     661ns  cuInit
                    0.00%     574ns         1     574ns     574ns     574ns  cuDriverGetVersion
                    0.00%     484ns         1     484ns     484ns     484ns  cuCtxGetCurrent
==3889== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==3889== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.5680us         1  1.5680us  1.5680us  1.5680us  [CUDA memcpy HtoD]
      API calls:   60.90%  2.53778s         3  845.93ms  2.6570us  2.53777s  cudaFree
                   26.74%  1.11434s         5  222.87ms  1.2181ms  1.02326s  cudaHostRegister
                   10.78%  449.01ms         5  89.801ms  531.97us  391.79ms  cudaHostUnregister
                    1.03%  42.933ms       740  58.017us     126ns  5.5891ms  cuDeviceGetAttribute
                    0.17%  6.9107ms         5  1.3821ms  1.0119ms  1.5356ms  cudaGetDeviceProperties
                    0.15%  6.2556ms         4  1.5639ms  21.983us  4.3851ms  cudaMalloc
                    0.13%  5.3658ms         8  670.73us  416.18us  1.1429ms  cuDeviceTotalMem
                    0.08%  3.5057ms         8  438.22us  206.01us  733.79us  cuDeviceGetName
                    0.02%  675.32us         1  675.32us  675.32us  675.32us  cudaMemGetInfo
                    0.00%  168.79us         4  42.197us  31.637us  71.838us  cuStreamCreate
                    0.00%  51.997us         1  51.997us  51.997us  51.997us  cudaMemcpy
                    0.00%  30.805us        16  1.9250us  1.3140us  6.9970us  cudaEventCreateWithFlags
                    0.00%  16.144us        11  1.4670us     733ns  7.9140us  cudaDeviceGetAttribute
                    0.00%  11.660us         2  5.8300us  2.8530us  8.8070us  cudaSetDevice
                    0.00%  11.071us         2  5.5350us  4.5070us  6.5640us  cudaGetDevice
                    0.00%  4.0600us         4  1.0150us     225ns  2.9820us  cuDeviceGetCount
                    0.00%  3.9600us        12     330ns     210ns     903ns  cuDeviceGet
                    0.00%  2.0560us         1  2.0560us  2.0560us  2.0560us  cudaGetDeviceCount
                    0.00%     616ns         1     616ns     616ns     616ns  cuCtxGetCurrent
                    0.00%     478ns         1     478ns     478ns     478ns  cuInit
                    0.00%     333ns         1     333ns     333ns     333ns  cuDriverGetVersion
==3890== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==3890== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.5680us         1  1.5680us  1.5680us  1.5680us  [CUDA memcpy HtoD]
      API calls:   64.45%  2.52776s         3  842.59ms  2.5670us  2.52775s  cudaFree
                   27.68%  1.08557s         4  271.39ms  605.60us  1.02350s  cudaHostRegister
                    6.22%  243.81ms         4  60.951ms  639.14us  229.41ms  cudaHostUnregister
                    1.09%  42.611ms       740  57.582us     127ns  5.5746ms  cuDeviceGetAttribute
                    0.17%  6.8539ms         5  1.3708ms  884.16us  1.5246ms  cudaGetDeviceProperties
                    0.17%  6.5563ms         8  819.53us  325.12us  1.5320ms  cuDeviceTotalMem
                    0.13%  5.1450ms         4  1.2863ms  18.305us  3.5738ms  cudaMalloc
                    0.07%  2.8505ms         8  356.31us  94.361us  735.36us  cuDeviceGetName
                    0.01%  553.11us         1  553.11us  553.11us  553.11us  cudaMemGetInfo
                    0.00%  131.73us         4  32.933us  25.012us  51.296us  cuStreamCreate
                    0.00%  42.473us         1  42.473us  42.473us  42.473us  cudaMemcpy
                    0.00%  23.605us        16  1.4750us     958ns  5.7620us  cudaEventCreateWithFlags
                    0.00%  13.153us        11  1.1950us     627ns  6.3480us  cudaDeviceGetAttribute
                    0.00%  11.792us         2  5.8960us  2.7250us  9.0670us  cudaSetDevice
                    0.00%  8.6550us         2  4.3270us  3.7280us  4.9270us  cudaGetDevice
                    0.00%  3.0740us        12     256ns     155ns     539ns  cuDeviceGet
                    0.00%  2.5350us         4     633ns     181ns  1.7140us  cuDeviceGetCount
                    0.00%  1.3750us         1  1.3750us  1.3750us  1.3750us  cudaGetDeviceCount
                    0.00%     590ns         1     590ns     590ns     590ns  cuCtxGetCurrent
                    0.00%     434ns         1     434ns     434ns     434ns  cuInit
                    0.00%     250ns         1     250ns     250ns     250ns  cuDriverGetVersion
==3888== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==3888== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.6320us         1  1.6320us  1.6320us  1.6320us  [CUDA memcpy HtoD]
      API calls:   65.72%  2.48343s         3  827.81ms  2.1750us  2.48343s  cudaFree
                   28.69%  1.08394s         4  270.99ms  792.79us  1.02351s  cudaHostRegister
                    4.09%  154.59ms         4  38.648ms  766.66us  127.83ms  cudaHostUnregister
                    1.04%  39.287ms       740  53.090us     144ns  5.5743ms  cuDeviceGetAttribute
                    0.16%  5.9758ms         8  746.97us  381.88us  1.3386ms  cuDeviceTotalMem
                    0.11%  4.2517ms         4  1.0629ms  15.242us  2.8708ms  cudaMalloc
                    0.11%  3.9923ms         5  798.47us  752.17us  884.59us  cudaGetDeviceProperties
                    0.07%  2.5289ms         8  316.11us  66.691us  709.12us  cuDeviceGetName
                    0.01%  456.80us         1  456.80us  456.80us  456.80us  cudaMemGetInfo
                    0.00%  77.571us         4  19.392us  12.079us  38.828us  cuStreamCreate
                    0.00%  36.156us         1  36.156us  36.156us  36.156us  cudaMemcpy
                    0.00%  20.446us        16  1.2770us     777ns  5.4670us  cudaEventCreateWithFlags
                    0.00%  11.225us        11  1.0200us     466ns  5.6380us  cudaDeviceGetAttribute
                    0.00%  10.441us         2  5.2200us  1.9960us  8.4450us  cudaSetDevice
                    0.00%  7.9130us         2  3.9560us  3.1370us  4.7760us  cudaGetDevice
                    0.00%  4.7090us        12     392ns     231ns  1.0130us  cuDeviceGet
                    0.00%  4.6270us         4  1.1560us     234ns  3.4350us  cuDeviceGetCount
                    0.00%  3.3870us         1  3.3870us  3.3870us  3.3870us  cudaGetDeviceCount
                    0.00%     560ns         1     560ns     560ns     560ns  cuInit
                    0.00%     334ns         1     334ns     334ns     334ns  cuDriverGetVersion
                    0.00%     278ns         1     278ns     278ns     278ns  cuCtxGetCurrent
[(...)@gpu08 CUDA]$ nvprof --profile-child-processes --print-gpu-trace  mpirun -np 4 ./run_linpack
==7381== NVPROF is profiling process 7381, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7383== NVPROF is profiling process 7383, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7379== NVPROF is profiling process 7379, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7382== NVPROF is profiling process 7382, command: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7383== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7383== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
2.91844s  1.6000us                    -               -         -         -         -      112B  66.757MB/s    Pageable      Device  Tesla P100-SXM2         1         7  [CUDA memcpy HtoD]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
==7379== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7379== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
2.95164s  1.5040us                    -               -         -         -         -      112B  71.018MB/s    Pageable      Device  Tesla P100-SXM2         1         7  [CUDA memcpy HtoD]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
==7382== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7382== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
2.94910s  1.5680us                    -               -         -         -         -      112B  68.120MB/s    Pageable      Device  Tesla P100-SXM2         1         7  [CUDA memcpy HtoD]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
==7381== Profiling application: /home/(...)/hpl-2.0_FERMI_v15/bin/CUDA/xhpl
==7381== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
2.98803s  1.6960us                    -               -         -         -         -      112B  62.978MB/s    Pageable      Device  Tesla P100-SXM2         1         7  [CUDA memcpy HtoD]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

I just don’t want to overlook any possibility of figuring out what the limiting factor is.

There should not be any grid or block size in those lines, because the GPU profile contains only a single host-to-device memory copy.

It appears that no kernel is launched at all during the profiled run, which is worrisome.
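
As a quick check, you could search the profile for any kernel activity at all; on a healthy CUDA HPL run I would expect the GPU activity list to be dominated by cuBLAS DGEMM/DTRSM kernels rather than a lone memcpy:

# a lone "[CUDA memcpy HtoD]" and no matching lines here means no kernels ran
nvprof --profile-child-processes mpirun -np 4 ./run_linpack 2>&1 | grep -iE 'gemm|trsm'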

Christian

@txbob, does NVIDIA offer any newer version of HPL CUDA implementation than hpl-2.0_FERMI_v15 ?

They do, but not to the public. We got a newer version from the vendor of our cluster.