Stream Benchmark

I recently discovered the Stream Benchmark (MEMORY BANDWIDTH: STREAM BENCHMARK PERFORMANCE RESULTS). Is there a CUDA port for this benchmark?

Would it make sense to write a CUDA port and compare the results? If so, questions concerning coalesced memory access and shared memory arise. Any thoughts?

I did a quick port.
These are the initial results on a C870.

STREAM Benchmark implementation in CUDA
Array size (single precision)=2000000
using 128 threads per block, 15625 blocks
Function    Rate (MB/s)    Avg time    Min time    Max time
Copy:        64021.5140      0.0003      0.0002      0.0003
Scale:       64273.1863      0.0003      0.0002      0.0003
Add:         65413.3757      0.0004      0.0004      0.0004
Triad:       65411.1342      0.0004      0.0004      0.0004

If you want to increase the array size, you can increase the number of threads per block or change the mapping from the local thread index to the global index (as sketched below).
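For illustration, here is a minimal sketch of that straightforward one-element-per-thread mapping (my own example, not necessarily identical to the attached kernels); the bounds check lets the array size be chosen independently of the exact launch shape:

[codebox]
// Hedged sketch of a STREAM-style copy kernel: one element per thread,
// global index derived from the block and thread indices.
__global__ void copy_sketch(const float *a, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // local -> global mapping
    if (idx < n)                                      // guard for a partial last block
        c[idx] = a[idx];
}

// Example launch: pick the thread count, then derive the block count from n.
// int threads = 128;
// int blocks  = (n + threads - 1) / threads;
// copy_sketch<<<blocks, threads>>>(d_a, d_c, n);
[/codebox]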

The source code is attached (rename it to stream.cu and compile it with nvcc -O3 -o stream stream.cu). It uses a Unix timer.
stream.cu.txt (4.2 KB)
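For reference, the Unix timer is just a wall-clock helper; below is a minimal sketch of a gettimeofday-based mysecond(), assuming the attached file does something along these lines (it may differ in detail):

[codebox]
#include <sys/time.h>

/* Hedged sketch of a gettimeofday-based wall-clock timer returning seconds
   as a double; the attached stream.cu may differ in detail. */
double mysecond(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}
[/codebox]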

Regarding the CUDA port of the STREAM benchmark, could someone please explain the purpose of skipping the first iteration when calculating the average, min, and max times?

Here is the relevant section of code from mfatica’s port:

[codebox]
/* --- MAIN LOOP --- repeat test cases NTIMES times --- */

scalar = 3.0f;

for (k = 0; k < NTIMES; k++)
{
    times[0][k] = mysecond();
    STREAM_Copy<<<dimGrid,dimBlock>>>(d_a, d_c, N);
    cudaThreadSynchronize();
    times[0][k] = mysecond() - times[0][k];

    times[1][k] = mysecond();
    STREAM_Scale<<<dimGrid,dimBlock>>>(d_b, d_c, scalar, N);
    cudaThreadSynchronize();
    times[1][k] = mysecond() - times[1][k];

    times[2][k] = mysecond();
    STREAM_Add<<<dimGrid,dimBlock>>>(d_a, d_b, d_c, N);
    cudaThreadSynchronize();
    times[2][k] = mysecond() - times[2][k];

    times[3][k] = mysecond();
    STREAM_Triad<<<dimGrid,dimBlock>>>(d_b, d_c, d_a, scalar, N);
    cudaThreadSynchronize();
    times[3][k] = mysecond() - times[3][k];
}

/* --- SUMMARY --- */

for (k = 1; k < NTIMES; k++) /* note -- skip first iteration */
{
    for (j = 0; j < 4; j++)
    {
        avgtime[j] = avgtime[j] + times[j][k];
        mintime[j] = MIN(mintime[j], times[j][k]);
        maxtime[j] = MAX(maxtime[j], times[j][k]);
    }
}

printf("Function Rate (MB/s) Avg time Min time Max time\n");

for (j = 0; j < 4; j++) {
    avgtime[j] = avgtime[j]/(double)(NTIMES-1);
    printf("%s%11.4f  %11.4f  %11.4f  %11.4f\n", label[j],
           1.0E-06 * bytes[j]/mintime[j],
           avgtime[j],
           mintime[j],
           maxtime[j]);
}
[/codebox]

It is the same in the original CPU code; you may want to ask the author of the CPU code.

The first call to CUDA in a program is usually slow, as it needs to load the kernels onto the GPU, initialize the driver, etc. That’s why it makes sense to skip it.
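As an alternative to discarding the first iteration, one could also issue an untimed warm-up launch before the timed loop; a minimal sketch (not part of the posted code):

[codebox]
// Hedged sketch: warm up the context before the timed loop so that driver
// initialization and module loading do not pollute the first measurement.
STREAM_Copy<<<dimGrid, dimBlock>>>(d_a, d_c, N);  // untimed warm-up launch
cudaThreadSynchronize();                          // wait for it to finish
// ... then run the timed NTIMES loop as in the code above
[/codebox]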

I have gone ahead and fixed some errors in the original poster’s implementation; namely, timing device-side operations with host-side timers is tricky given the two concurrent execution contexts. See Section 6.3, “Measuring Performance with Events”, in “CUDA by Example” by Sanders and Kandrot.
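For concreteness, here is a minimal sketch of the event-based timing pattern described there (illustrative only; the attached streamsp.cu may structure this differently):

[codebox]
// Hedged sketch of timing a kernel with CUDA events, which are recorded
// on the device and therefore avoid host/device timing skew.
cudaEvent_t start, stop;
float elapsed_ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                        // enqueue start marker
STREAM_Copy<<<dimGrid, dimBlock>>>(d_a, d_c, N);  // kernel being timed
cudaEventRecord(stop, 0);                         // enqueue stop marker
cudaEventSynchronize(stop);                       // wait until stop has occurred

cudaEventElapsedTime(&elapsed_ms, start, stop);   // milliseconds between events

cudaEventDestroy(start);
cudaEventDestroy(stop);
[/codebox]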

The implementation must be used within the SDK development environment, i.e. it depends on the shrLog facility and the associated CPU timers. I have also included a Makefile that works with the SDK environment, so it should be easy to run this example within the SDK examples directory.

Timings for two of the NVIDIA GPUs I have are also included:

../../bin/linux/release/streamsp Starting...

Running on...

Device 0: GeForce GTX 260
 Array size (single precision) = 2000000
 using 128 threads per block, 15625 blocks
 device STREAM_Copy:            Pass
 device STREAM_Copy_Optimized:  Pass
 device STREAM_Scale:           Pass
 device STREAM_Add:             Pass
 device STREAM_Triad:           Pass

Function    Rate (GB/s)    Avg time      Min time      Max time
Copy:      3.482137e+01  4.603663e-04  4.594880e-04  4.617280e-04
Copy Opt.: 8.869966e+01  1.848782e-04  1.803840e-04  2.036160e-04
Scale:     3.455903e+01  4.640000e-04  4.629760e-04  4.662080e-04
Add:       4.920614e+01  4.890560e-04  4.877440e-04  4.947200e-04
Triad:     4.853426e+01  4.954027e-04  4.944960e-04  4.979840e-04

[streamBenchmark] - results:    PASSES

Press <Enter> to Quit...

-----------------------------------------------------------

and

../../bin/linux/release/streamsp Starting...

Running on...

Device 1: GeForce 8400 GS
 Array size (single precision) = 2000000
 using 128 threads per block, 15625 blocks
 device STREAM_Copy:            Pass
 device STREAM_Copy_Optimized:  Pass
 device STREAM_Scale:           Pass
 device STREAM_Add:             Pass
 device STREAM_Triad:           Pass

Function    Rate (GB/s)    Avg time      Min time      Max time
Copy:      1.594210e+00  1.003930e-02  1.003632e-02  1.004387e-02
Copy Opt.: 4.315181e+00  3.724751e-03  3.707840e-03  3.738144e-03
Scale:     1.554243e+00  1.029760e-02  1.029440e-02  1.030131e-02
Add:       2.245959e+00  1.069060e-02  1.068586e-02  1.069514e-02
Triad:     2.224984e+00  1.079055e-02  1.078659e-02  1.079536e-02

[streamBenchmark] - results:    PASSES

Press <Enter> to Quit...

-----------------------------------------------------------

Note that the optimized copy implementation depends on the device’s grid and thread-block configuration.
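For what it is worth, one common grid-dependent pattern is a grid-stride loop launched with a fixed, device-sized grid; this is only a guess at the flavor of the optimization, not necessarily what STREAM_Copy_Optimized in the attached file actually does:

[codebox]
// Hedged sketch of a grid-stride copy: the launch uses a fixed, device-sized
// grid and each thread strides through the array. This is an assumption about
// what a "Copy Optimized" kernel might look like, not the attached code.
__global__ void copy_gridstride_sketch(const float *a, float *c, int n)
{
    int stride = gridDim.x * blockDim.x;  // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i];
}
[/codebox]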

dpe
Makefile.txt (2.08 KB)
streamsp.cu (15.4 KB)

I have fixed some of my own errors (gasp) and the implementation now works in a manner consistent with the original implementation idea. The previously posted results have not changed appreciably with this update.

Running on my GTX460 (compute capability 2.1) card results in the following:

[Single-Precision Device-Only STREAM Benchmark implementation in CUDA]

../../bin/linux/release/streamsp Starting...

Running on...

Device 0: GeForce GTX 460
 Array size (single precision) = 2000000
 using 192 threads per block, 10417 blocks
 device STREAM_Copy:            Pass
 device STREAM_Copy_Optimized:  Pass
 device STREAM_Scale:           Pass
 device STREAM_Add:             Pass
 device STREAM_Triad:           Pass

Function    Rate (MB/s)    Avg time      Min time      Max time
Copy:       68549.4869     0.000234      0.000233      0.000235
Copy Opt.:  75711.6877     0.000212      0.000211      0.000214
Scale:      68418.1717     0.000235      0.000234      0.000238
Add:        89637.8610     0.000268      0.000268      0.000269
Triad:      89595.0289     0.000269      0.000268      0.000269

[streamBenchmark] - results:    PASSES

Note that the implementation uses 128 threads per block for GF100/GT200 (compute capability 2.0 or below) architectures and 192 threads per block for the GTX 460 (compute capability 2.1); in the limited experimentation I have done with this parameter, these thread counts yielded the highest STREAM rates.
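In other words, the block size is chosen per architecture and the block count derived from it; a minimal sketch of such a selection (the variable names are illustrative, not the actual code in streamsp.cu):

[codebox]
// Hedged sketch: choose threads per block by compute capability, then
// derive the block count from the array size N used by the benchmark.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int threadsPerBlock = 128;            // CC 2.0 and below
if (prop.major == 2 && prop.minor == 1)
    threadsPerBlock = 192;            // CC 2.1 (e.g. GTX 460)

int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
[/codebox]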

It is interesting to note that, at least on a compute capability 2.x system, the performance difference between the regular Copy() and the optimized Copy() is not as dramatic as on compute capability 1.x systems. It is not really clear to me why this is; any thoughts would be appreciated. I am using CUDA 3.2, so it is not a question of different software versions.

dpe
streamsp.cu (16.8 KB)

Some additional double-precision results for my GeForce GTX 460, including optimized versions of the regular STREAM operations:

[Double-Precision Device-Only STREAM Benchmark implementation in CUDA]

../../bin/linux/release/streamdp Starting...

Device 0: GeForce GTX 460
 Array size (double precision) = 2000000
 using 192 threads per block, 10417 blocks
 device STREAM_Copy:                    Pass
 device STREAM_Copy_Optimized:          Pass
 device STREAM_Scale:                   Pass
 device STREAM_Scale_Optimized:         Pass
 device STREAM_Add:                     Pass
 device STREAM_Add_Optimzied:           Pass
 device STREAM_Triad:                   Pass
 device STREAM_Triad_Optimized:         Pass

Function    Rate (MB/s)    Avg time     Min time     Max time
Copy:      100755.6574     0.000320     0.000318     0.000322
Copy Opt:  101502.2289     0.000318     0.000315     0.000319
Scale:     100090.0690     0.000323     0.000320     0.000325
Scale Opt: 100928.5353     0.000320     0.000317     0.000321
Add:       101180.4315     0.000477     0.000474     0.000479
Add Opt:   101385.6003     0.000476     0.000473     0.000477
Triad:     100948.9154     0.000479     0.000475     0.000481
Triad Opt: 101119.0476     0.000477     0.000475     0.000479

[streamBenchmark] - results:    PASSES

Using CUDA 3.2. It is pretty amazing that all these results are within ~85% of the card’s peak bandwidth and above 100 GB/s.

[quote name=‘ndv14’]

Regarding the CUDA port of the STREAM benchmark, could someone please explain the purpose of skipping the first iteration when calculating the average, min, and max times?

[/quote]

STREAM skips the first iteration when calculating the average, min, and max times because the amount of clean and dirty data in the caches is different when the first iteration executes, and this might systematically bias the results. The contents of the caches should be (statistically) the same after the first iteration, so the execution of the second and subsequent iterations should result in the same number of cast-outs/victims.

This could also be used to intentionally “cheat”. For example, if all the caches were invalidated before a kernel test, the caches would not need to perform any cast-outs (clean or dirty victims) until the cache overflowed due to capacity. If the array size is at the minimum allowed (4x the aggregate cache size), this could produce a small reduction in execution time. The reason this is not allowed is that the benchmark cannot force all of the data to be written to memory: the kernel ends (and the timing is recorded) when the final data is stored into the cache. The benchmark depends on dirty cast-outs of data that was in the cache before the kernel began execution to balance the lack of cast-outs of the final cache state, which contains some fraction of modified data.

This is probably less important on GPUs and when compiling CPU code with streaming stores, but it seems like a reasonable precaution to leave in the code.

I am just beginning to learn about CUDA programming on NVIDIA GPUs, but my variant of Massimiliano Fatica’s CUDA STREAM code was giving up to 105 GB/s using “float” and up to 108 GB/s using “double” on an M2050 system this weekend.

Hi John,

This is dpephd (Doug Enright), who produced the updated streamsp.cu implementation and results posted to this thread. It is really great to see the originator of the STREAM benchmark starting to examine GPUs for their STREAM processing capabilities.

I would be happy to share my single- and double-precision STREAM implementations with you. I am somewhat concerned that Massimiliano’s implementation is too “optimized” and cannot process the various STREAM kernels for arbitrarily sized inputs, which I would think you would want when reporting standard STREAM results. Do you have any thoughts about this?

Also, has anyone reported GPU STREAM results to you for inclusion on the various results lists you maintain? The results I was able to get with my consumer GeForce GTX 460 card would place it in the top 20 of all machines reported on your “top 20” standard STREAM benchmark results list.

Best,

Doug (dpephd-nvidia@yahoo.com)

We (Wen Zheng and I) just ran the STREAM benchmark on a GTX 480 with 177 GB/s peak bandwidth … below are the single- and double-precision results:

Device 0: GeForce GTX 480
 Array size (single precision) = 8000000
 using 192 threads per block, 41667 blocks
 device STREAM_Copy:                    Pass
 device STREAM_Copy_Optimized:          Pass
 device STREAM_Scale:                   Pass
 device STREAM_Scale_Optimized:         Pass
 device STREAM_Add:                     Pass
 device STREAM_Add_Optimzied:           Pass
 device STREAM_Triad:                   Pass
 device STREAM_Triad_Optimized:         Pass

Function    Rate (MB/s)    Avg time      Min time      Max time
Copy:      148544.2617     0.000432      0.000431      0.000433
Copy Opt:  147645.0564     0.000435      0.000433      0.000440
Scale:     147863.3671     0.000434      0.000433      0.000438
Scale Opt: 146810.5303     0.000436      0.000436      0.000438
Add:       160136.6409     0.000603      0.000599      0.000606
Add Opt:   159379.4777     0.000603      0.000602      0.000604
Triad:     159957.3337     0.000602      0.000600      0.000603
Triad Opt: 159218.7519     0.000603      0.000603      0.000603

Device 0: GeForce GTX 480
 Array size (double precision) = 2000000
 using 128 threads per block, 15625 blocks
 device STREAM_Copy:                    Pass
 device STREAM_Copy_Optimized:          Pass
 device STREAM_Scale:                   Pass
 device STREAM_Scale_Optimized:         Pass
 device STREAM_Add:                     Pass
 device STREAM_Add_Optimzied:           Pass
 device STREAM_Triad:                   Pass
 device STREAM_Triad_Optimized:         Pass

Function    Rate (MB/s)    Avg time     Min time     Max time
Copy:      156079.2772     0.000206     0.000205     0.000209
Copy Opt:  155811.7711     0.000206     0.000205     0.000207
Scale:     155666.2449     0.000206     0.000206     0.000206
Scale Opt: 155255.3928     0.000207     0.000206     0.000207
Add:       160496.4640     0.000301     0.000299     0.000302
Add Opt:   160496.4640     0.000301     0.000299     0.000303
Triad:     159320.2333     0.000302     0.000301     0.000302
Triad Opt: 159320.2333     0.000302     0.000301     0.000303

~160 GB/s Triad bandwidth … truly remarkable! This result would place the GTX 480 as the 18th fastest machine on the top 20 list.

Note that to PASS, each result must pass a strict ULP test against the same computation performed on a CPU, so we are not observing any issues w.r.t. ECC.
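The verification step is essentially a comparison of each device result against a host reference; below is a minimal sketch of an ULP-style check (an assumption about the flavor of the test, since the exact tolerance and method in the benchmark code may differ):

[codebox]
#include <math.h>

/* Hedged sketch of an ULP-style comparison: accept the GPU value if it is
   within maxUlps units in the last place of the CPU reference value.
   The actual test in the benchmark code may differ. */
static int close_enough(double cpu, double gpu, int maxUlps)
{
    if (cpu == gpu) return 1;
    double scale  = fabs(cpu);
    double oneUlp = nextafter(scale, INFINITY) - scale;  /* ULP at the reference value */
    return fabs(cpu - gpu) <= (double)maxUlps * oneUlp;
}
[/codebox]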

Some new results from my brand new GTX 460M card (60 GB/s peak bandwidth):

Single-precision:

./streamsp Starting...

Running on...

Device 0: GeForce GTX 460M
 Array size (single precision) = 8000000
 using 192 threads per block, 41667 blocks
 device STREAM_Copy:                    Pass
 device STREAM_Copy_Optimized:          Pass
 device STREAM_Scale:                   Pass
 device STREAM_Scale_Optimized:         Pass
 device STREAM_Add:                     Pass
 device STREAM_Add_Optimzied:           Pass
 device STREAM_Triad:                   Pass
 device STREAM_Triad_Optimized:         Pass

Function    Rate (MB/s)    Avg time      Min time      Max time
Copy:       36338.5279     0.001765      0.001761      0.001769
Copy Opt:   37972.9998     0.001689      0.001685      0.001693
Scale:      36062.0260     0.001777      0.001775      0.001780
Scale Opt:  37688.2052     0.001701      0.001698      0.001704
Add:        45023.4090     0.002135      0.002132      0.002138
Add Opt:    47089.8466     0.002045      0.002039      0.002049
Triad:      45155.5597     0.002132      0.002126      0.002140
Triad Opt:  47114.2486     0.002045      0.002038      0.002049

[streamBenchmark] - results:    PASSES

-----------------------------------------------------------

Double-precision:

./streamdp Starting...

Device 0: GeForce GTX 460M
 Array size (double precision) = 8000000
 using 192 threads per block, 41667 blocks
 device STREAM_Copy:                    Pass
 device STREAM_Copy_Optimized:          Pass
 device STREAM_Scale:                   Pass
 device STREAM_Scale_Optimized:         Pass
 device STREAM_Add:                     Pass
 device STREAM_Add_Optimzied:           Pass
 device STREAM_Triad:                   Pass
 device STREAM_Triad_Optimized:         Pass

Function    Rate (MB/s)    Avg time     Min time     Max time
Copy:       13534.4098     0.009485     0.009457     0.009522
Copy Opt:   13549.6302     0.009484     0.009447     0.009514
Scale:      13519.9083     0.009495     0.009468     0.009533
Scale Opt:  13536.9279     0.009485     0.009456     0.009519
Add:        13461.6371     0.014279     0.014263     0.014311
Add Opt:    13470.7648     0.014261     0.014253     0.014292
Triad:      13438.1505     0.014295     0.014288     0.014334
Triad Opt:  13465.9272     0.014275     0.014258     0.014316

[streamBenchmark] - results:    PASSES

Results were taken using the CUDA 4.0 RC2 release on a Fedora 15 alpha system … the discrepancy in the double-precision STREAM results between the GTX 460 and the GTX 460M (mobile) is somewhat curious.

I’ve installed the CUDA 4.0 RC2 SDK and recompiled my STREAM benchmark implementation. The single-precision results haven’t changed, but I am now seeing substantially better double-precision results:

./streamdp Starting...

Device 0: GeForce GTX 460M
 Array size (double precision) = 8000000
 using 192 threads per block, 41667 blocks
 device STREAM_Copy:                    Pass
 device STREAM_Copy_Optimized:          Pass
 device STREAM_Scale:                   Pass
 device STREAM_Scale_Optimized:         Pass
 device STREAM_Add:                     Pass
 device STREAM_Add_Optimzied:           Pass
 device STREAM_Triad:                   Pass
 device STREAM_Triad_Optimized:         Pass

Function    Rate (MB/s)    Avg time     Min time     Max time
Copy:       49670.9253     0.002592     0.002577     0.002596
Copy Opt:   50025.6394     0.002573     0.002559     0.002577
Scale:      49402.2294     0.002606     0.002591     0.002610
Scale Opt:  49777.2442     0.002590     0.002571     0.002594
Add:        49305.6104     0.003902     0.003894     0.003905
Add Opt:    49397.7549     0.003895     0.003887     0.003900
Triad:      49195.2450     0.003909     0.003903     0.003913
Triad Opt:  49336.8262     0.003900     0.003892     0.003903

[streamBenchmark] - results:    PASSES

-----------------------------------------------------------

I’ve converted the streamsp.cu code supplied above to double precision.
Results for a C2050 are:

[Double-Precision Device-Only STREAM Benchmark implementation in CUDA]
./streamdp Starting…

Running on…

Device 0: Tesla C2050
 Array size (double precision) = 2000000
 using 128 threads per block, 15625 blocks
 device STREAM_Copy:            Pass
 device STREAM_Copy_Optimized:  Pass
 device STREAM_Scale:           Pass
 device STREAM_Add:             Pass
 device STREAM_Triad:           Pass

Function    Rate (MB/s)    Avg time     Min time     Max time
Copy:      121684.1021     0.000264     0.000263     0.000265
Copy Opt.: 122399.0072     0.000262     0.000261     0.000263
Scale:     121080.0277     0.000266     0.000264     0.000267
Add:       128161.3133     0.000375     0.000375     0.000375
Triad:     127888.1294     0.000376     0.000375     0.000376

[streamBenchmark] - results: PASSES

But I get the following warning when building streamdp:

ptxas /tmp/tmpxft_00005ed3_00000000-2_streamdp.compute_10.ptx, line 80; warning : Double is not supported. Demoting to float

Dear Douglas, could you please post your double-precision version here, so I can be sure that I converted your source correctly?

Mikhail

Hi Mikhail,

Your results look good … i.e. not much below the device’s maximum bandwidth of 144 GB/s. Please remember that the SDK generates executables that can run on all CUDA-capable devices, and double-precision floating point is not available on compute capability (CC) 1.0, 1.1, and 1.2 devices … see Appendix G.1 of the CUDA C Programming Guide (v. 3.2). So when generating CC 1.0 binaries, doubles are demoted to floats, just as the warning says.

Given you are using a Tesla C2050, this isn’t a big deal unless for some perverse reason you would want to run a CC 1.0 binary on your Tesla C2050. (I’m not even sure that is possible …)

For the sake of completeness, my double-precision implementation is attached.

Doug
streamdp.cu (22.1 KB)

Dear Douglas, thanks !

I obtained the same warning :-))

Results are very close:

./streamdp_orig

[Double-Precision Device-Only STREAM Benchmark implementation in CUDA]
./streamdp_orig Starting…

Device 0: Tesla C2050
 Array size (double precision) = 8000000
 using 128 threads per block, 62500 blocks
 device STREAM_Copy:                    Pass
 device STREAM_Copy_Optimized:          Pass
 device STREAM_Scale:                   Pass
 device STREAM_Scale_Optimized:         Pass
 device STREAM_Add:                     Pass
 device STREAM_Add_Optimzied:           Pass
 device STREAM_Triad:                   Pass
 device STREAM_Triad_Optimized:         Pass

Function    Rate (MB/s)    Avg time     Min time     Max time
Copy:      124362.6414     0.001031     0.001029     0.001032
Copy Opt:  124968.7585     0.001026     0.001024     0.001027
Scale:     123789.1843     0.001035     0.001034     0.001037
Scale Opt: 124633.8815     0.001028     0.001027     0.001031
Add:       130375.2548     0.001473     0.001473     0.001474
Add Opt:   130468.8095     0.001472     0.001472     0.001473
Triad:     130363.9292     0.001474     0.001473     0.001474
Triad Opt: 130465.9816     0.001473     0.001472     0.001473

[streamBenchmark] - results: PASSES

Mikhail

Hi,

I downloaded your streamdp.cu code, compiled it, and ran it on a Tesla M2090. However, when I compile I get the following warning:

ptxas /tmp/tmpxft_00007f65_00000000-5_stream.compute_10.ptx, line 78; warning : Double is not supported. Demoting to float

My assumption was that the M2090 supports double precision. Is that true?

This is how I am compiling the streamdp code:

# Add source files here
EXECUTABLE := stream

# Cuda source files (compiled with cudacc)
CUFILES := stream.cu

# C/C++ source files (compiled with gcc / c++)
CCFILES :=

CUDACCFLAGS := -arch sm_20

################################################################################
# Rules and targets

include ../../common/common.mk

Looking for feedback. Thanks!

Hello Indraneilmg,

According to the Tesla M-Class GPU Computing Modules product brief, the Tesla M2090 GPU is claimed to offer 665 GFLOPS of double-precision peak performance. However, the Tesla M2090 is not listed in Appendix A (“CUDA-Enabled GPUs”) of the 4.1 RC2 release of the CUDA C Programming Guide (v. 4.1, 11/18/2011). It is also not in Appendix A of the 4.0 production release of the CUDA C Programming Guide (v. 4.0, 5/6/2011). Given this lack of documentation concerning the compute capability of the Tesla M2090, it is somewhat difficult to determine whether the device does indeed support double-precision operations.

However, it is likely that the device does support double precision given the cited product brief. I would recommend running deviceQuery on your device to see what compute capability it reports. According to Table F-1 of the 4.1 RC2 CUDA C Programming Guide, compute capability 1.3 and greater devices do support “double-precision floating-point numbers”. So if the reported compute capability is 1.3 or greater, then your device is capable of performing double-precision operations.
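The same check can also be done programmatically with the runtime API; here is a minimal sketch, independent of the deviceQuery sample:

[codebox]
#include <stdio.h>
#include <cuda_runtime.h>

// Minimal sketch: report the compute capability and whether the device
// can be expected to support double precision (CC 1.3 or greater).
int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    if (prop.major > 1 || (prop.major == 1 && prop.minor >= 3))
        printf("Double precision supported\n");
    else
        printf("Double precision NOT supported (doubles are demoted to float)\n");
    return 0;
}
[/codebox]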

If this is the case, please see my previous comment about the SDK compiling code for all compute devices, including those which do not support double precision (compute capability 1.0, 1.1, and 1.2 devices). The results of the double-precision STREAM implementation are not valid on compute capability 1.0, 1.1, and 1.2 devices due to the lack of double-precision support.

dpe

Thank you for alerting us to the omission of the Tesla M2090 from Table A-1 of the CUDA C Programming Guide. The M2090 is a compute capability 2.0 part and supports double-precision computation (to the tune of 665 GFLOPS). See http://www.nvidia.com/docs/IO/105880/DS-Tesla-M-Class-Aug11.pdf

I want to build the stream application for CUDA 5.5 under openSUSE 12.3.
I put the stream directory into the 0_Simple directory of the samples tree and set up the Makefile according to (as I think) the 5.5 sample rules.
But after issuing “make” I see:

"/usr/local/cuda-5.5"/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_10,code=sm_10 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=\"sm_35,compute_35\" -o streamdp_orig.o -c streamdp_orig.cu
streamdp_orig.cu:26:26: fatal error: cutil_inline.h: No such file or directory
compilation terminated.
make[1]: *** [streamdp_orig.o] Error 1
make[1]: Leaving directory `/root/NVIDIA_CUDA-5.5_Samples/NVIDIA_CUDA-5.5_Samples/0_Simple/stream’
make: *** [0_Simple/stream/Makefile.ph_build] Error 2

I don’t see the cutil_inline.h file anywhere in the Samples directory tree, in particular not in the include directory.

How should I correct the situation (by changing the stream.cu source?)?

Mikhail