Stream Benchmark
I recently discovered the STREAM benchmark (http://www.cs.virginia.edu/stream/). Is there a CUDA port of this benchmark?

Would it make sense to write a CUDA port and compare the results? If so, questions about coalesced memory access and shared memory arise. Any thoughts?

#1
Posted 11/29/2007 05:00 PM   
I did a quick port.
These are the initial results on a C870.

STREAM Benchmark implementation in CUDA
Array size (single precision)=2000000
using 128 threads per block, 15625 blocks
Function Rate (MB/s) Avg time Min time Max time
Copy: 64021.5140 0.0003 0.0002 0.0003
Scale: 64273.1863 0.0003 0.0002 0.0003
Add: 65413.3757 0.0004 0.0004 0.0004
Triad: 65411.1342 0.0004 0.0004 0.0004

If you want to increase the array size, you can increase the number of threads per block, or change the mapping from the local thread index to the global one (see the sketch below).

The source code is attached (rename it to stream.cu and compile it with nvcc -O3 -o stream stream.cu). It uses a Unix timer.
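For reference, the kind of kernel and index mapping involved looks roughly like the sketch below; the kernel body and argument names are assumptions in the spirit of the attachment, not the attached source itself. With 128 threads per block and 15625 blocks, each thread handles exactly one of the 2,000,000 elements, so a larger array needs either more threads in total or a mapping where each thread processes several elements (e.g. a grid-stride loop).

[code]
// Sketch (assumed, not the attachment verbatim): one element per thread.
__global__ void STREAM_Triad(float *a, float *b, float *c, float scalar, int len)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;   // local -> global index
    if (idx < len)
        c[idx] = a[idx] + scalar * b[idx];
}

// Alternative mapping for larger arrays: a fixed grid covers any array size
// because each thread strides through several elements.
__global__ void STREAM_Triad_strided(float *a, float *b, float *c, float scalar, int len)
{
    for (int idx = threadIdx.x + blockIdx.x * blockDim.x; idx < len;
         idx += blockDim.x * gridDim.x)
        c[idx] = a[idx] + scalar * b[idx];
}
[/code]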
Attachments

stream.cu.txt

#2
Posted 12/03/2007 02:03 AM   
Regarding the CUDA port of the STREAM benchmark, could someone please explain the purpose of skipping the first iteration when calculating the average, min, and max times?

Here is the relevant section of code from mfatica's port:

[codebox]/* --- MAIN LOOP --- repeat test cases NTIMES times --- */

scalar=3.0f;
for (k=0; k<NTIMES; k++)
{
    times[0][k]= mysecond();
    STREAM_Copy<<<dimGrid,dimBlock>>>(d_a, d_c, N);
    cudaThreadSynchronize();
    times[0][k]= mysecond() - times[0][k];

    times[1][k]= mysecond();
    STREAM_Scale<<<dimGrid,dimBlock>>>(d_b, d_c, scalar, N);
    cudaThreadSynchronize();
    times[1][k]= mysecond() - times[1][k];

    times[2][k]= mysecond();
    STREAM_Add<<<dimGrid,dimBlock>>>(d_a, d_b, d_c, N);
    cudaThreadSynchronize();
    times[2][k]= mysecond() - times[2][k];

    times[3][k]= mysecond();
    STREAM_Triad<<<dimGrid,dimBlock>>>(d_b, d_c, d_a, scalar, N);
    cudaThreadSynchronize();
    times[3][k]= mysecond() - times[3][k];
}

/* --- SUMMARY --- */

for (k=1; k<NTIMES; k++) /* note -- skip first iteration */
{
    for (j=0; j<4; j++)
    {
        avgtime[j] = avgtime[j] + times[j][k];
        mintime[j] = MIN(mintime[j], times[j][k]);
        maxtime[j] = MAX(maxtime[j], times[j][k]);
    }
}

printf("Function Rate (MB/s) Avg time Min time Max time\n");
for (j=0; j<4; j++) {
    avgtime[j] = avgtime[j]/(double)(NTIMES-1);

    printf("%s%11.4f %11.4f %11.4f %11.4f\n", label[j],
           1.0E-06 * bytes[j]/mintime[j],
           avgtime[j],
           mintime[j],
           maxtime[j]);
}
[/codebox]

#3
Posted 06/16/2009 05:11 PM   
It is the same in the original CPU code; you may want to ask the author of the CPU code.

#4
Posted 06/16/2009 06:01 PM   
[quote name='ndv14' post='553327' date='Jun 16 2009, 07:11 PM']Regarding the CUDA port for the stream benchmark, could someone please explain the purpose for skipping the first iteration when calculating the average, min, and max times?[/quote]

The first call to CUDA in a program is usually slow, as it needs to load the kernels to the GPU, initialize the drivers, etc. That's why it makes sense to skip it.
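A minimal, self-contained sketch of paying that initialization cost up front, before any timed work (illustrative only, not part of the posted benchmark):

[code]
// Hypothetical warm-up sketch: force context creation and module loading
// with an untimed dummy launch so the first timed iteration is not penalized.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warmup_kernel() { }

int main()
{
    cudaFree(0);                    // triggers lazy CUDA context initialization
    warmup_kernel<<<1, 1>>>();      // untimed dummy launch loads the module
    cudaThreadSynchronize();        // wait until initialization has completed
    printf("context initialized; start timing the real kernels here\n");
    return 0;
}
[/code]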

#5
Posted 06/18/2009 11:05 AM   
[quote name='mfatica' date='02 December 2007 - 06:03 PM' timestamp='1196647394' post='288293']
I did a quick port.
These are the initial results on a C870.

... snip ...

If you want to increase the array size, you can increase the number of threads per block, or change the mapping from local thread idx to global.

The source code is attached (rename it to stream.cu and compile it with nvcc -O3 -o stream stream.cu). It is using a Unix timer.
[/quote]

I have gone ahead and fixed some errors in the original poster's implementation; namely, timing device-side operations with host-side timers is tricky given the two concurrent execution contexts. See section 6.3, "Measuring Performance with Events," in "CUDA by Example" by Sanders and Kandrot.
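For reference, a minimal sketch of the event-based timing described there; the kernel name, arguments, and launch configuration are placeholders borrowed from the codebox above, not the attached code:

[code]
// Sketch: timing one kernel launch with CUDA events instead of host timers.
// STREAM_Copy, dimGrid, dimBlock, d_a, d_c and N are assumed to be defined
// as in the codebox posted earlier in the thread.
float time_copy_ms(float *d_a, float *d_c, int N, dim3 dimGrid, dim3 dimBlock)
{
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                        // enqueue start marker
    STREAM_Copy<<<dimGrid, dimBlock>>>(d_a, d_c, N);  // kernel under test
    cudaEventRecord(stop, 0);                         // enqueue stop marker
    cudaEventSynchronize(stop);                       // wait for the kernel to finish

    cudaEventElapsedTime(&elapsed_ms, start, stop);   // GPU-side elapsed time in ms

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsed_ms;
}
[/code]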

The implementation must be used within the SDK development environment, i.e., it depends on the shrLog facility and the associated CPU timers. I have also included a Makefile that works with the SDK environment, so it should be easy to run this example from the SDK examples directory.

Some timings for two of the NVIDIA GPUs I have are also included

[code]
../../bin/linux/release/streamsp Starting...

Running on...

Device 0: GeForce GTX 260
Array size (single precision) = 2000000
using 128 threads per block, 15625 blocks
device STREAM_Copy: Pass
device STREAM_Copy_Optimized: Pass
device STREAM_Scale: Pass
device STREAM_Add: Pass
device STREAM_Triad: Pass
Function Rate (GB/s) Avg time Min time Max time
Copy: 3.482137e+01 4.603663e-04 4.594880e-04 4.617280e-04
Copy Opt.: 8.869966e+01 1.848782e-04 1.803840e-04 2.036160e-04
Scale: 3.455903e+01 4.640000e-04 4.629760e-04 4.662080e-04
Add: 4.920614e+01 4.890560e-04 4.877440e-04 4.947200e-04
Triad: 4.853426e+01 4.954027e-04 4.944960e-04 4.979840e-04

[streamBenchmark] - results: PASSES


Press <Enter> to Quit...
-----------------------------------------------------------
[/code]

and

[code]
../../bin/linux/release/streamsp Starting...

Running on...

Device 1: GeForce 8400 GS
Array size (single precision) = 2000000
using 128 threads per block, 15625 blocks
device STREAM_Copy: Pass
device STREAM_Copy_Optimized: Pass
device STREAM_Scale: Pass
device STREAM_Add: Pass
device STREAM_Triad: Pass
Function Rate (GB/s) Avg time Min time Max time
Copy: 1.594210e+00 1.003930e-02 1.003632e-02 1.004387e-02
Copy Opt.: 4.315181e+00 3.724751e-03 3.707840e-03 3.738144e-03
Scale: 1.554243e+00 1.029760e-02 1.029440e-02 1.030131e-02
Add: 2.245959e+00 1.069060e-02 1.068586e-02 1.069514e-02
Triad: 2.224984e+00 1.079055e-02 1.078659e-02 1.079536e-02

[streamBenchmark] - results: PASSES


Press <Enter> to Quit...
-----------------------------------------------------------
[/code]

The optimized Copy implementation is dependent on the device's grid and thread-block configuration.

dpe

#6
Posted 12/05/2010 01:03 AM   
[quote name='dpephd' date='04 December 2010 - 05:03 PM' timestamp='1291511033' post='1155908']
I have gone ahead and fixed some errors in the original poster's implementation; namely, timing device-side operations with host-side timers is tricky given the two concurrent execution contexts. See section 6.3, "Measuring Performance with Events," in "CUDA by Example" by Sanders and Kandrot.
[/quote]

I have fixed some of my own errors (*gasp*), and the implementation now works in a manner consistent with the original idea. The previously posted results have not really changed with this update.

Running on my GTX460 (compute capability 2.1) card results in the following:

[code]
[Single-Precision Device-Only STREAM Benchmark implementation in CUDA]
../../bin/linux/release/streamsp Starting...

Running on...

Device 0: GeForce GTX 460
Array size (single precision) = 2000000
using 192 threads per block, 10417 blocks
device STREAM_Copy: Pass
device STREAM_Copy_Optimized: Pass
device STREAM_Scale: Pass
device STREAM_Add: Pass
device STREAM_Triad: Pass
Function Rate (MB/s) Avg time Min time Max time
Copy: 68549.4869 0.000234 0.000233 0.000235
Copy Opt.: 75711.6877 0.000212 0.000211 0.000214
Scale: 68418.1717 0.000235 0.000234 0.000238
Add: 89637.8610 0.000268 0.000268 0.000269
Triad: 89595.0289 0.000269 0.000268 0.000269

[streamBenchmark] - results: PASSES
[/code]

Note that the implementation uses 128 threads per block for GF100/GT200 (compute capability 2.0 or below) architectures and 192 threads per block for the GTX 460 (compute capability 2.1), since those thread counts yielded the highest STREAM rates in the limited tuning of this parameter that I have done.
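A sketch of how such an architecture-dependent block size could be picked at run time (an illustration using the CUDA runtime API, not the attached implementation):

[code]
// Sketch: choose threads per block based on the device's compute capability.
#include <cuda_runtime.h>

static int threads_per_block(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    // CC 2.1 parts (e.g. GTX 460) did best with 192 threads/block in the
    // limited tuning described above; everything else falls back to 128.
    if (prop.major == 2 && prop.minor == 1)
        return 192;
    return 128;
}
[/code]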

It is interesting to note that, at least on a compute capability 2.x device, the performance difference between the regular Copy() and the optimized Copy() is not as dramatic as on compute capability 1.x devices. It is not really clear to me why. Any thoughts would be appreciated. I am using CUDA 3.2, so it is not a question of different software versions.

dpe
Attachments

streamsp.cu

#7
Posted 12/07/2010 04:35 PM   
Some additional double-precision results for my GeForce GTX 460, including optimized versions of the regular STREAM operations.

[code]
[Double-Precision Device-Only STREAM Benchmark implementation in CUDA]
../../bin/linux/release/streamdp Starting...

Device 0: GeForce GTX 460
Array size (double precision) = 2000000
using 192 threads per block, 10417 blocks
device STREAM_Copy: Pass
device STREAM_Copy_Optimized: Pass
device STREAM_Scale: Pass
device STREAM_Scale_Optimized: Pass
device STREAM_Add: Pass
device STREAM_Add_Optimzied: Pass
device STREAM_Triad: Pass
device STREAM_Triad_Optimized: Pass
Function Rate (MB/s) Avg time Min time Max time
Copy: 100755.6574 0.000320 0.000318 0.000322
Copy Opt: 101502.2289 0.000318 0.000315 0.000319
Scale: 100090.0690 0.000323 0.000320 0.000325
Scale Opt: 100928.5353 0.000320 0.000317 0.000321
Add: 101180.4315 0.000477 0.000474 0.000479
Add Opt: 101385.6003 0.000476 0.000473 0.000477
Triad: 100948.9154 0.000479 0.000475 0.000481
Triad Opt: 101119.0476 0.000477 0.000475 0.000479

[streamBenchmark] - results: PASSES

[/code]

Using CUDA 3.2. It is pretty amazing that all these results are within ~85% of the card's peak bandwidth and over 100 GB/s.

#8
Posted 12/11/2010 11:42 PM   
[quote name='ndv14' date='16 June 2009 - 11:11 AM' timestamp='1245172306' post='553327']
Regarding the CUDA port of the STREAM benchmark, could someone please explain the purpose of skipping the first iteration when calculating the average, min, and max times?[/quote]

STREAM skips the first iteration when calculating the average, min, and max times because the amount of clean and dirty data in the caches is different when the first iteration executes and this might systematically bias the results. The contents of the caches should be (statistically) the same after the first iteration, so the execution of the second and subsequent iterations should result in the same number of cast-outs/victims.

This could also be used to intentionally "cheat". For example if all the caches were invalidated before a kernel test, the caches would not need to perform any cast-outs (clean or dirty victims) until the cache overflowed due to capacity. If the array size is at the minimum allowed (4x the aggregate cache size), this could produce a small reduction in execution time. The reason that this is not allowed is that the benchmark cannot force all of the data written to memory -- the kernel ends (and the timing is recorded) when the final data is stored into the cache. The benchmark depends on dirty cast-outs of data that was in the cache before the kernel begins execution to balance the lack of cast-outs of the final cache state containing some fraction of modified data.

This is probably less important on GPUs and when compiling CPU code with streaming stores, but it seems like a reasonable precaution to leave in the code.

I am just beginning to learn about CUDA programming on NVIDIA GPUs, but my variant of Massimiliano Fatica's CUDA STREAM code was giving up to 105 GB/s using "float" and up to 108 GB/s using "double" on an M2050 system this weekend.

#9
Posted 02/07/2011 04:35 PM   
[quote name='John D. McCalpin' date='07 February 2011 - 08:35 AM' timestamp='1297096529' post='1190144']

I am just beginning to learn about CUDA programming on NVIDIA GPUs, but my variant of Massimiliano Fatica's CUDA STREAM code was giving up to 105 GB/s using "float" and up to 108 GB/s using "double" on an M2050 system this weekend.
[/quote]

Hi John,

This is dpephd (Doug Enright), who produced the updated streamsp.cu implementation and the results posted in this thread. It is really great to see the originator of the STREAM benchmark starting to examine GPUs for their STREAM processing capabilities.

I would be happy to share my single- and double-precision STREAM implementations with you. I am somewhat concerned that Massimiliano's implementation is too "optimized" and cannot process the various STREAM kernels for arbitrarily sized inputs, which I would think you would want when reporting "standard" STREAM results. Do you have any thoughts about this?

Also, has anyone reported GPU STREAM results to you for inclusion on the various results lists you maintain? The results I was able to get with my consumer GeForce GTX 460 card would place the card in the top 20 of all machines reported on your [url="http://www.cs.virginia.edu/stream/top20/Bandwidth.html"]"top 20" standard STREAM benchmark results[/url].

Best,

Doug (dpephd-nvidia@yahoo.com)

#10
Posted 02/08/2011 07:02 AM   
[quote name='John D. McCalpin' date='07 February 2011 - 08:35 AM' timestamp='1297096529' post='1190144']
I am just beginning to learn about CUDA programming on NVIDIA GPUs, but my variant of Massimiliano Fatica's CUDA STREAM code was giving up to 105 GB/s using "float" and up to 108 GB/s using "double" on an M2050 system this weekend.
[/quote]

We (myself and Wen Zheng) just ran the STREAM benchmark on a GTX 480 with 177 GB/s peak bandwidth ... below are the single- and double-precision results:

[code]
Device 0: GeForce GTX 480
Array size (single precision) = 8000000
using 192 threads per block, 41667 blocks
device STREAM_Copy: Pass
device STREAM_Copy_Optimized: Pass
device STREAM_Scale: Pass
device STREAM_Scale_Optimized: Pass
device STREAM_Add: Pass
device STREAM_Add_Optimzied: Pass
device STREAM_Triad: Pass
device STREAM_Triad_Optimized: Pass
Function Rate (MB/s) Avg time Min time Max time
Copy: 148544.2617 0.000432 0.000431 0.000433
Copy Opt: 147645.0564 0.000435 0.000433 0.000440
Scale: 147863.3671 0.000434 0.000433 0.000438
Scale Opt: 146810.5303 0.000436 0.000436 0.000438
Add: 160136.6409 0.000603 0.000599 0.000606
Add Opt: 159379.4777 0.000603 0.000602 0.000604
Triad: 159957.3337 0.000602 0.000600 0.000603
Triad Opt: 159218.7519 0.000603 0.000603 0.000603
[/code]

[code]
Device 0: GeForce GTX 480
Array size (double precision) = 2000000
using 128 threads per block, 15625 blocks
device STREAM_Copy: Pass
device STREAM_Copy_Optimized: Pass
device STREAM_Scale: Pass
device STREAM_Scale_Optimized: Pass
device STREAM_Add: Pass
device STREAM_Add_Optimzied: Pass
device STREAM_Triad: Pass
device STREAM_Triad_Optimized: Pass
Function Rate (MB/s) Avg time Min time Max time
Copy: 156079.2772 0.000206 0.000205 0.000209
Copy Opt: 155811.7711 0.000206 0.000205 0.000207
Scale: 155666.2449 0.000206 0.000206 0.000206
Scale Opt: 155255.3928 0.000207 0.000206 0.000207
Add: 160496.4640 0.000301 0.000299 0.000302
Add Opt: 160496.4640 0.000301 0.000299 0.000303
Triad: 159320.2333 0.000302 0.000301 0.000302
Triad Opt: 159320.2333 0.000302 0.000301 0.000303
[/code]

~160 GB/s Triad bandwidth ... truly remarkable! This result would place the GTX 480 as the 18th fastest machine on the top 20 list.

Note that to PASS, each result must pass a strict ULP comparison against the same computation on a CPU, so we are not observing any issues w.r.t. ECC.
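The exact check is in the attached source; a rough sketch of the general idea, with a tight relative tolerance standing in for a true ULP comparison, might look like:

[code]
// Sketch only: compare the device result against a host reference element-wise.
// The attached code's actual ULP-based criterion may differ.
#include <cmath>

static bool verify(const float *host_ref, const float *device_result, int n)
{
    const float rel_tol = 1.0e-6f;   // stand-in for a strict ULP bound
    for (int i = 0; i < n; ++i) {
        float err = std::fabs(device_result[i] - host_ref[i]);
        float mag = std::fabs(host_ref[i]);
        if (err > rel_tol * (mag > 1.0f ? mag : 1.0f))
            return false;            // mismatch beyond tolerance -> Fail
    }
    return true;                     // all elements within tolerance -> Pass
}
[/code]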

#11
Posted 02/24/2011 06:30 AM   
Some new results from my brand new GTX 460M card (60 GB/s peak bandwidth):

Single-precision:

[code]
./streamsp Starting...

Running on...

Device 0: GeForce GTX 460M
Array size (single precision) = 8000000
using 192 threads per block, 41667 blocks
device STREAM_Copy: Pass
device STREAM_Copy_Optimized: Pass
device STREAM_Scale: Pass
device STREAM_Scale_Optimized: Pass
device STREAM_Add: Pass
device STREAM_Add_Optimzied: Pass
device STREAM_Triad: Pass
device STREAM_Triad_Optimized: Pass
Function Rate (MB/s) Avg time Min time Max time
Copy: 36338.5279 0.001765 0.001761 0.001769
Copy Opt: 37972.9998 0.001689 0.001685 0.001693
Scale: 36062.0260 0.001777 0.001775 0.001780
Scale Opt: 37688.2052 0.001701 0.001698 0.001704
Add: 45023.4090 0.002135 0.002132 0.002138
Add Opt: 47089.8466 0.002045 0.002039 0.002049
Triad: 45155.5597 0.002132 0.002126 0.002140
Triad Opt: 47114.2486 0.002045 0.002038 0.002049

[streamBenchmark] - results: PASSES

-----------------------------------------------------------
[/code]

Double-precision:
[code]
./streamdp Starting...

Device 0: GeForce GTX 460M
Array size (double precision) = 8000000
using 192 threads per block, 41667 blocks
device STREAM_Copy: Pass
device STREAM_Copy_Optimized: Pass
device STREAM_Scale: Pass
device STREAM_Scale_Optimized: Pass
device STREAM_Add: Pass
device STREAM_Add_Optimzied: Pass
device STREAM_Triad: Pass
device STREAM_Triad_Optimized: Pass
Function Rate (MB/s) Avg time Min time Max time
Copy: 13534.4098 0.009485 0.009457 0.009522
Copy Opt: 13549.6302 0.009484 0.009447 0.009514
Scale: 13519.9083 0.009495 0.009468 0.009533
Scale Opt: 13536.9279 0.009485 0.009456 0.009519
Add: 13461.6371 0.014279 0.014263 0.014311
Add Opt: 13470.7648 0.014261 0.014253 0.014292
Triad: 13438.1505 0.014295 0.014288 0.014334
Triad Opt: 13465.9272 0.014275 0.014258 0.014316

[streamBenchmark] - results: PASSES
[/code]

Results were taken using the CUDA 4.0 RC2 release on a Fedora 15 alpha system ... the discrepancy in double-precision STREAM results between the GTX 460 and the GTX 460M (mobile) is somewhat curious.

#12
Posted 04/25/2011 02:48 AM   
[quote name='dpephd' date='24 April 2011 - 07:48 PM' timestamp='1303699690' post='1229918']
Some new results from my brand new GTX 460M card (60GB/s peak bandwidth):

...snip ...

Double-precision:
[code]
./streamdp Starting...

Device 0: GeForce GTX 460M
...snip ...
Function Rate (MB/s) Avg time Min time Max time
... snip ...
Triad: 13438.1505 0.014295 0.014288 0.014334
Triad Opt: 13465.9272 0.014275 0.014258 0.014316

[streamBenchmark] - results: PASSES
[/code]
[/quote]

I've installed the CUDA 4.0 RC2 SDK and recompiled my STREAM benchmark implementation. The single-precision results haven't changed, but I am now seeing substantially better double-precision STREAM results:

[code]
./streamdp Starting...

Device 0: GeForce GTX 460M
Array size (double precision) = 8000000
using 192 threads per block, 41667 blocks
device STREAM_Copy: Pass
device STREAM_Copy_Optimized: Pass
device STREAM_Scale: Pass
device STREAM_Scale_Optimized: Pass
device STREAM_Add: Pass
device STREAM_Add_Optimzied: Pass
device STREAM_Triad: Pass
device STREAM_Triad_Optimized: Pass
Function Rate (MB/s) Avg time Min time Max time
Copy: 49670.9253 0.002592 0.002577 0.002596
Copy Opt: 50025.6394 0.002573 0.002559 0.002577
Scale: 49402.2294 0.002606 0.002591 0.002610
Scale Opt: 49777.2442 0.002590 0.002571 0.002594
Add: 49305.6104 0.003902 0.003894 0.003905
Add Opt: 49397.7549 0.003895 0.003887 0.003900
Triad: 49195.2450 0.003909 0.003903 0.003913
Triad Opt: 49336.8262 0.003900 0.003892 0.003903

[streamBenchmark] - results: PASSES
-----------------------------------------------------------
[/code]

#13
Posted 04/26/2011 02:52 PM   
I've converted the streamsp.cu code supplied above to double precision.
Results for C2050 are:

[Double-Precision Device-Only STREAM Benchmark implementation in CUDA]
./streamdp Starting...

Running on...

Device 0: Tesla C2050
Array size (double precision) = 2000000
using 128 threads per block, 15625 blocks
device STREAM_Copy: Pass
device STREAM_Copy_Optimized: Pass
device STREAM_Scale: Pass
device STREAM_Add: Pass
device STREAM_Triad: Pass
Function Rate (MB/s) Avg time Min time Max time
Copy: 121684.1021 0.000264 0.000263 0.000265
Copy Opt.: 122399.0072 0.000262 0.000261 0.000263
Scale: 121080.0277 0.000266 0.000264 0.000267
Add: 128161.3133 0.000375 0.000375 0.000375
Triad: 127888.1294 0.000376 0.000375 0.000376

[streamBenchmark] - results: PASSES

But I get a warning when building streamdp:
ptxas /tmp/tmpxft_00005ed3_00000000-2_streamdp.compute_10.ptx, line 80; warning : Double is not supported. Demoting to float

Dear Douglas, could you please post your double-precision version here, so I can be sure that I converted your source correctly?

Mikhail

#14
Posted 04/26/2011 05:40 PM   
[quote name='MikhailK' date='26 April 2011 - 10:40 AM' timestamp='1303839659' post='1230529']
I've converted the streamsp.cu code supplied above to double precision.

... snip ...
But I get a warning when building streamdp:
ptxas /tmp/tmpxft_00005ed3_00000000-2_streamdp.compute_10.ptx, line 80; warning : Double is not supported. Demoting to float

Dear Douglas, could you please post your double-precision version here, so I can be sure that I converted your source correctly?

Mikhail
[/quote]

Hi Mikhail,

Your results look good, i.e., not far below the device's peak bandwidth of 144 GB/s. Please remember that the SDK build generates executables that can run on all CUDA-capable devices, and double-precision floating point is not available on compute capability (CC) 1.0, 1.1, and 1.2 devices; see Appendix G.1 of the CUDA C Programming Guide (v3.2). So when generating the CC 1.0 binary, doubles are demoted to floats, just as the warning says.
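If you want the double-precision path built as real double precision, one option (a sketch; the flags actually used by the SDK Makefile may differ) is to target an architecture that supports doubles explicitly:

[code]
# CC 1.3 is the first architecture with double-precision support
nvcc -O3 -arch=sm_13 -o streamdp streamdp.cu
# or target Fermi directly, e.g. for a Tesla C2050
nvcc -O3 -arch=sm_20 -o streamdp streamdp.cu
[/code]

Built that way, the "Demoting to float" warning should go away.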

Given that you are using a Tesla C2050, this isn't a big deal unless, for some perverse reason, you wanted to run the CC 1.0 binary on it. (I'm not even sure that is possible.)

For the sake of completeness, my double-precision implementation is attached.

Doug
Attachments

streamdp.cu

#15
Posted 04/27/2011 07:34 AM   