K80 bandwidth test

Hello

Tesla k80 bandwidth is expected to be around 280gb/s per gpu, however, for me bandwidth test from cuda samples shows only 160gb/s. What can be the problem ? How much bandwidth does it show for you (plz post if you are k80 owner).

Thanks and Regards,

Sergey.

The spec for K80 is 240GB/s per GPU:

http://www.nvidia.com/object/tesla-servers.html

That is a theoretical maximum number computed by multiplying the memory bus width by the max clock rate:

2505MHz * 2 (DDR) * 384 bits / 8bits per byte = 240GB/s

To get the full available bandwidth measurement from K80, boost clocks must be enabled, and ECC should be turned off.

boost clocks:

http://international.download.nvidia.com/tesla/pdf/gpu-boost-tesla-k40-app-note.pdf

ECC state (and clocks) can be observed and modified with the nvidia-smi tool

And after all that, the best you will observe is typically about 80-90% of theoretical maximum, due to various other overheads. (Other codes may give a slightly higher measurement. Jimmy Pettersson’s reduction code gets mentioned frequently on this forum as a peak memory bandwidth test. - you can google for that if interested.)

On our linux cluster, when I run bandwidthTest in the default mode, I get about 140GB/s. When I enable the higher applications clocks, I get around 170GB/s:

$ nvidia-smi -ac 2505,875
Applications clocks set to "(MEM 2505, SM 875)" for GPU 0000:84:00.0
Applications clocks set to "(MEM 2505, SM 875)" for GPU 0000:85:00.0
Applications clocks set to "(MEM 2505, SM 875)" for GPU 0000:8A:00.0
Applications clocks set to "(MEM 2505, SM 875)" for GPU 0000:8B:00.0
All done.
$ /shared/apps/cuda/CUDA-v7.0.28/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla K80
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10292.6

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10547.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     170149.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

To get closer to 200GB/s, you need to disable ECC (I can’t do that on our cluster).

You may want to set persistence mode on the GPUs to get the best numbers as well.

As a rule of thumb, with ECC enabled available effective bandwidth is about 75% of the theoretical maximum, and with ECC disabled you can probably get close to 85%. The reason for the lower throughput with ECC enabled is that the ECC information needs to be sent in the same channels as the regular data (“in-band”, so to speak), while with CPUs there is usually dedicated channel for the ECC data.

In order to maximize the memory throughput will need to use the widest loads and stores available, which are 128-bit accesses, e.g. double2, float4, uint4, although the throughput using 64-bit accesses is typically only a few percent lower. Memory throughput really starts to suffer dramatically for accesses whose width is < 32 bit.

As with most memory throughput scenarios, the size of the transfer also has some influence on the results, as larger transfers can achieve throughput closer to peak. The block size used by the bandwidthTest app seems to be 32 MB, which should be sufficient for full throughput.

There is usually a few percent difference between various grid and block configurations. Lastly, as txbob indirectly points out, on some Kepler-class GPUs the memory throughput can be limited by the speed of the internal “plumbing”, therefore requiring higher dynamic autoboost clocks or manually set application clocks to achieve the full throughput.

Independent of the memory throughput, I would encourage experiments with the application clocks. Many real-life applications have low enough power consumption that they are able to run permanently at the highest or second highest application clock. You can display the supported application clocks and set them with nvidia-smi.

Thank you for your answers guys. Sorry about not mentioning that 160gb/s (158 to be precise) is with DISABLED ECC.
That’s 65% of the maxium, which seems to be way off how it should be.

It runs on CentOs 6.6, Driver version is 346.46, card’s bios 80.21.1B.00.01
Anyone can maybe spot anything wrong with this setup ?

Here’s a bit of nvidia-smi output about clocks:

Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 24 C
        GPU Shutdown Temp           : 93 C
        GPU Slowdown Temp           : 88 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 29.03 W
        Power Limit                 : 149.00 W
        Default Power Limit         : 149.00 W
        Enforced Power Limit        : 149.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 175.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
    Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Default Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 2505 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

Thanks,
Sergey.

I have persistence mode enabled and I failed to find Jimmy Petterson’s code on google.

Try setting the highest supported application clock manually. This part appears to have auto boost enabled, but it may not kick in fast enough for a short-running benchmark. I am not sure whether manually setting the application clock on auto-boosting GPUs works as it does on GPUs without autoboost. Try (adjust device number for -i switch appropriately):

nvidia-smi -i 0 -ac 2505,875

then run the bandwidth test again. You should be able to get around 190-200 GB/sec with ECC disabled (note I have not tried on a K80 for lack of access to such a beast, but am extrapolating from other Kepler-based Tesla GPUs). For a second data point, you can also try the DCOPY program below. Note that it uses 64-bit loads/stores, performance may be a bit higher with 128-bit loads/stores.

More info on boost clocks on K80 here: http://devblogs.nvidia.com/parallelforall/increase-performance-gpu-boost-k80-autoboost/ Note this comment in the discussion: “The Tesla K80 doesn’t need to run at the max application clocks to achieve its full memory bandwidth. In my experiments, a GPU clock of 705 is sufficient for memory bound applications.”

#include <stdlib.h>
#include <stdio.h>

#define DCOPY_THREADS  128
#define DCOPY_DEFLEN   20000000
#define DCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void dcopy (const double * __restrict__ src, 
                       double * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}    

struct dcopyOpts {
    int len;
};

static int processArgs (int argc, char *argv[], struct dcopyOpts *opts)
{
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                break;
            default:
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                error++;
                break;
            }
        }
        argc--;
        argv++;
    }
    return error;
}

int main (int argc, char *argv[])
{
    double start, stop, elapsed, mintime;
    double *d_a, *d_b;
    int errors;
    struct dcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    }
    opts.len = (opts.len) ? opts.len : DCOPY_DEFLEN;

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(384);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;
    dim3 dimGrid(threadBlocks);
    
    printf ("dcopy: operating on vectors of %d doubles (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("dcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));
    for (int k = 0; k < DCOPY_ITER; k++) {
        start = second();
        dcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("dcopy: mintime = %.3f msec  throughput = %.2f GB/sec\n",
            1.0e3 * mintime, (2.0e-9 * sizeof(d_a[0]) * opts.len) / mintime);

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}

njuffa,
doing
nvidia-smi -i 0 -ac 2505,875

boosts up bandwidth test to 199gb/ps. Your program gives about the same results as bandwidthtest from the samples.
Thus it looks like application runs underclocked all the time. Is it because autoboost doesn’t have enough time to kick in ?

Is there an ui tool similar to gpu-z where I could track the clock frequencies throughout application’s runtime ?

txbob, sorry, I overlooked your suggestion to do the nvidia-smi clock fix thingie. According to the pdf you posted, the auto-boost feature of K80 kicks in after about 4th second. Bandwidth test runs in less than that. I’ll try to extend it’s running time and see if anything gets changed.
Noob question then - why not run on boost clock all the time ? The lower base clock exists as an attempt to reduce overall power consumption only, is that correct ?

I have always preferred the following bandwidth test to the one in the SDK:

http://pastebin.com/sZCwbHVH

Have seen up to 88% of theoretical maximum for ideal powers of 2 large input sets. Here is the output for a reference Titan;

GeForce GTX TITAN X @ 336.480 GB/s

 N               [GB/s]          [perc]          [usec]          test
 1048576         156.66                  46.56   26.8             Pass
 2097152         198.70                  59.05   42.2             Pass
 4194304         234.13                  69.58   71.7             Pass
 8388608         256.29                  76.17   130.9            Pass
 16777216        269.80                  80.18   248.7            Pass
 33554432        277.68                  82.52   483.4            Pass
 67108864        281.75                  83.73   952.8            Pass
 134217728       283.80                  84.34   1891.7

 Non-base 2 tests!

 N               [GB/s]          [perc]          [usec]          test
 14680102        269.08                  79.97   218.2            Pass
 14680119        268.75                  79.87   218.5            Pass
 18875600        266.30                  79.14   283.5            Pass
 7434886         155.41                  46.19   191.4            Pass
 13324075        240.95                  71.61   221.2            Pass
 15764213        253.10                  75.22   249.1            Pass
 1850154         59.96           17.82   123.4            Pass
 4991241         139.44                  41.44   143.2            Pass

That code won’t compile on linux there are various issues, see lines 70-71 for example.
After you fix the issues, the GB/s and perc columns are all zero because the get_clock() function is not implemented correctly on linux.

Here’s a modified version of that code that seems to compile and run correctly for me on linux:

[url]http://pastebin.com/e7SjhkXd[/url]

hope I didn’t break anything on windows…

Thnx CUdaaduc,

It doesn’t build under linux though, and my quick fixes showed max 106gb/sec (on non-base sizes). I probably did a mistake somewhere…

txbob, I get exactly same numbers as in when I ported it myself, and those are twice lower than in bandwidthtest sample. It seems like there is something wrong with the test.

These are the numbers I get on K80:

Tesla K80 @ 240.480 GB/s

 N               [GB/s]          [perc]          [usec]          test
 1048576          39.60          16.47 %          105.9           Pass
 2097152          46.97          19.53 %          178.6           Pass
 4194304          63.01          26.20 %          266.2           Pass
 8388608          74.39          30.93 %          451.1           Pass
 16777216         77.75          32.33 %          863.2           Pass
 33554432         80.21          33.35 %         1673.3           Pass
 67108864         93.96          39.07 %         2856.8           Pass
 134217728        96.46          40.11 %         5565.5           Pass

 Non-base 2 tests!

 N               [GB/s]          [perc]          [usec]          test
 14680102        104.75          43.56 %          560.6           Pass
 14680119        104.76          43.56 %          560.5           Pass
 18875600        103.09          42.87 %          732.4           Pass
 7434886          67.11          27.91 %          443.2           Pass
 5392180          80.40          33.43 %          268.3           Pass
 3449291          50.44          20.97 %          273.6           Pass
 11351121         78.67          32.71 %          577.2           Pass
 2455799          34.93          14.53 %          281.2           Pass

Finished.

The version I posted seems to give reasonable numbers for me. I ran it on a Quadro5000 for which bandwidthTest reports 101GB/s, and I got:

$ ./t829
 Quadro 5000 @ 120.000 GB/s

 N               [GB/s]          [perc]          [usec]          test
 1048576          85.16          70.97 %           49.3           Pass
 2097152          93.05          77.54 %           90.1           Pass
 4194304          99.69          83.07 %          168.3           Pass
 8388608         103.56          86.30 %          324.0           Pass
 16777216        105.87          88.22 %          633.9           Pass
 33554432        107.52          89.60 %         1248.3           Pass
 67108864        108.29          90.24 %         2478.9           Pass
 134217728       108.69          90.58 %         4939.4           Pass

 Non-base 2 tests!

 N               [GB/s]          [perc]          [usec]          test
 14680102         24.88          20.74 %         2359.8           Pass
 14680119        105.12          87.60 %          558.6           Pass
 18875600        104.51          87.09 %          722.5           Pass
 7434886          73.05          60.88 %          407.1           Pass
 5392180          94.18          78.48 %          229.0           Pass
 3449291          61.55          51.29 %          224.2           Pass
 11351121         81.22          67.68 %          559.0           Pass
 2455799          40.67          33.89 %          241.5           Pass

Finished.
$ /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro 5000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5858.6

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6337.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     101541.5

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$

I’ll try it on a K80.

Some K80 testing:

$ nvcc -arch=sm_37 -o t829 t829.cu
$ ./t829
 Tesla K80 @ 240.480 GB/s

 N               [GB/s]          [perc]          [usec]          test
 1048576          81.63          33.95 %           51.4           Pass
 2097152         102.44          42.60 %           81.9           Pass
 4194304         121.24          50.42 %          138.4           Pass
 8388608         132.95          55.28 %          252.4           Pass
 16777216        136.66          56.83 %          491.1           Pass
 33554432        137.25          57.07 %          977.9           Pass
 67108864        146.60          60.96 %         1831.1           Pass
 134217728       152.94          63.60 %         3510.2           Pass

 Non-base 2 tests!

 N               [GB/s]          [perc]          [usec]          test
 14680102        162.86          67.72 %          360.6           Pass
 14680119        161.94          67.34 %          362.6           Pass
 18875600        160.01          66.54 %          471.9           Pass
 7434886         101.38          42.16 %          293.4           Pass
 5392180         137.95          57.37 %          156.4           Pass
 3449291          82.35          34.24 %          167.5           Pass
 11351121        117.04          48.67 %          387.9           Pass
 2455799          50.66          21.07 %          193.9           Pass

Finished.
$ nvidia-smi -ac 2505,875
Applications clocks set to "(MEM 2505, SM 875)" for GPU 0000:84:00.0
Applications clocks set to "(MEM 2505, SM 875)" for GPU 0000:85:00.0
Applications clocks set to "(MEM 2505, SM 875)" for GPU 0000:8A:00.0
Applications clocks set to "(MEM 2505, SM 875)" for GPU 0000:8B:00.0
Applications clocks set to "(MEM 2505, SM 875)" for GPU 0000:8E:00.0
Applications clocks set to "(MEM 2505, SM 875)" for GPU 0000:8F:00.0
All done.
$ ./t829
 Tesla K80 @ 240.480 GB/s

 N               [GB/s]          [perc]          [usec]          test
 1048576         103.05          42.85 %           40.7           Pass
 2097152         128.03          53.24 %           65.5           Pass
 4194304         146.93          61.10 %          114.2           Pass
 8388608         161.81          67.29 %          207.4           Pass
 16777216        165.35          68.76 %          405.9           Pass
 33554432        166.89          69.40 %          804.2           Pass
 67108864        166.68          69.31 %         1610.5           Pass
 134217728       167.27          69.56 %         3209.6           Pass

 Non-base 2 tests!

 N               [GB/s]          [perc]          [usec]          test
 14680102        163.22          67.87 %          359.8           Pass
 14680119        163.35          67.93 %          359.5           Pass
 18875600        161.72          67.25 %          466.9           Pass
 7434886         103.90          43.21 %          286.2           Pass
 5392180         139.26          57.91 %          154.9           Pass
 3449291          86.40          35.93 %          159.7           Pass
 11351121        121.00          50.32 %          375.2           Pass
 2455799          55.42          23.05 %          177.2           Pass

Finished.
$ /shared/apps/cuda/CUDA-v7.0.28/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla K80
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5468.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5866.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     169247.8

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$

seems to be in the same ballpark as bandwidthTest and my previously reported numbers. It might be that Jimmy Petterson’s test could be further optimized for K80/sm_37 I haven’t taken a close look at the code.

Did not know that the code build for Windows only, sorry about that.

That code is a few years old and was optimized for the original GTX Titan AFAIK.

Thanks for this alternative bandwidth test, i have not seen this code before now! For my tiny at home GT630 Kepler CUDA setup i get between 83.97% and 88.01% of max bandwidth.