Hello,
“RAMGATE” seems to be going on at full storm.
Meanwhile I found Nai’s benchmark code at this website:
I copied the text/source code, cleaned it up with TextPad, pasted the helper_math.h code into it, and studied it.
Then I compiled, built and ran it via Visual Studio 2010 (incremental linking disabled) with the CUDA 6.5 toolkit, probably a beta install.
My conclusion is the following:
- The benchmarking tool behaves weirdly on my GT 520… it reports unbelievably high numbers. So this is a clear indication something is wrong, at least with Visual Studio and CUDA 6.5.
- Perhaps the division code is flawed… perhaps parentheses must be added to make the divisions happen in the proper order, though this doesn’t seem to be the problem.
- The benchmark is too short… only 10 loops over a 128 MB block are tested. Almost no GPU load is applied according to GPU-Z.
- The kernel code itself seems suspicious… all inputs are added into a single temporary variable.
- Perhaps the Visual Studio compiler or the CUDA compiler detects that the code doesn’t do anything useful and simply removes it as dead code.
- I was not able to view any generated PTX. Is there a setting in Visual Studio that allows this? I found this weird. I guess I could modify the command-line parameters to generate PTX… but this should be on by default, at least as an output file?! Weird! I feel sorry for C/C++ programmers having to deal with this runtime stuff… it’s nice to write fast, simple CUDA programs like this… except they have no idea what the hell is going on… the PTX probably gets embedded inside the executable somewhere… How do you know the kernel is actually loaded and running successfully? I guess you don’t…
- I do see the memory being allocated.
- There does seem to be some slight GPU load activity, but barely.
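On the dead-code worry: the usual way to stop the compiler from deleting the loads is to make the summed result observable, i.e. write it back to global memory. A minimal sketch of the idea (the kernel name and signature here are my guesses, not necessarily his actual code):

```cuda
// Hypothetical bandwidth kernel in the spirit of the benchmark.
// The key point is the final store of Temp: without a side effect that
// depends on the loads, nvcc may remove the loads as dead code, and the
// "bandwidth" you then measure is just kernel-launch overhead.
__global__ void BenchMarkDRAMKernel(float4* In, float4* Out)
{
    int ThreadID = blockDim.x * blockIdx.x + threadIdx.x;

    float4 Temp = In[ThreadID];

    // ... more loads and adds over the chunk would go here ...

    // Observable side effect that keeps the loads alive:
    Out[ThreadID] = Temp;
}
```

To check what actually survives compilation, `nvcc -ptx file.cu` writes the PTX to a .ptx file, and `nvcc --keep` keeps all intermediate files; in Visual Studio you can add these under the CUDA C/C++ command-line properties.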
This makes me conclude there is something fishy going on with this benchmark… at least with Visual Studio 2010 and the CUDA 6.5 toolkit and/or my system… but this doesn’t surprise me at all, after my “CUDA turned into crap” YouTube video, which also included OpenGL interaction.
Therefore I will also not be running his executable, just in case he is trying to infect systems.
I am not saying his benchmark is totally flawed or anything like that… it’s just not producing the expected results on my C/C++ system as he posted it?!
I also tried changing BenchmarkCount from 10 to 100 or 1000; this completely freaks out the benchmark, sometimes returning 0 or negative numbers.
I also tried changing the float to a double where the gigabytes-per-second calculation is done.
I wonder if the rapid launch of multiple kernel calls is maybe affecting the output. (I do have many browsers open, though… maybe that is interfering with CUDA… or maybe CUDA is just totally failing on the GT 520… my own benchmark from long ago does seem to work a bit.)
Anyway, maybe those rapid kernel calls are not syncing properly or whatever.
Or perhaps the copy & paste operation from HTML to text screwed something up… to me it doesn’t seem like it.
I write this posting for any interested CUDA coder. It should be quite easy to copy & paste his source code from that German link I gave you.
Could you try compiling/building his code on your system… then running it… examining the results… maybe posting here…
And then later modify it a little bit, like:
int BenchmarkCount = 1000;
I’d be interested to know if there is a difference in the results when it is increased to 100 or 1000, or maybe even beyond that… but maybe the milliseconds will overflow… or maybe the watchdog (the Windows TDR timeout, about 2 seconds by default) will kick in.
On my system at least the results are totally whack.
Perhaps later when I have some more time I may write my own benchmark… but might take a different approach just to make sure all the rules of the cuda/driver api architecture and such are followed… I will have to brush up a bit on my cuda programming skills… fortunately I can probably look at my old code…
Anyway, some questions: is his way of coding safe? In other words: performing 10 kernel calls between event start and event stop?
cudaEventRecord(start);
for (int j = 0; j < BenchmarkCount; j++)
    BenchMarkDRAMKernel<<<BlockCount, BlockSize>>>(Pointers[i]);
cudaEventRecord(stop);
I have no further time to look into this right now… but maybe the synchronization belongs around this loop and not inside it… just some hints, which may or may not be wrong.
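For reference, this is how I would expect the event timing to be done. Kernel launches and event records are asynchronous, so the crucial part is synchronizing on the stop event before reading the elapsed time (the kernel name and launch parameters are taken from the snippet above):

```cuda
cudaEventRecord(start);
for (int j = 0; j < BenchmarkCount; j++)
    BenchMarkDRAMKernel<<<BlockCount, BlockSize>>>(Pointers[i]);
cudaEventRecord(stop);

// Without this, cudaEventElapsedTime can be called while the kernels are
// still in flight (it returns cudaErrorNotReady in that case), which would
// explain tiny or garbage timings:
cudaEventSynchronize(stop);

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
```

If his code skips the cudaEventSynchronize, that alone could produce the absurd bandwidth numbers.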
I do believe there is a problem with the GTX 970 though… because of the many gamers mentioning stutter…
So let’s consider the issues I am having with this code and the CUDA 6.5 toolkit… something bizarre by itself, unless others are also having problems re-creating his benchmark and getting believable results.
Just to be clear, here are the results of the current build from Visual Studio 2010:
Nai’s Benchmark
Allocating Memory . . . Chunk Size = 134217728 Byte
Press any key to continue . . .
Allocated 7 Chunks
Benchmarking DRAM
Press any key to continue . . .
DRAM-Bandwidth of 0. Chunk: 838860.812500 GByte/s
DRAM-Bandwidth of 1. Chunk: 699050.687500 GByte/s
DRAM-Bandwidth of 2. Chunk: 822412.562500 GByte/s
DRAM-Bandwidth of 3. Chunk: 822412.562500 GByte/s
DRAM-Bandwidth of 4. Chunk: 762600.750000 GByte/s
DRAM-Bandwidth of 5. Chunk: 806596.937500 GByte/s
DRAM-Bandwidth of 6. Chunk: 806596.937500 GByte/s
Press any key to continue . . .
Benchmarking L2-Cache
Press any key to continue . . .
L2-Cache-Bandwidth of 0. Chunk: 2567941.250000 GByte/s
L2-Cache-Bandwidth of 1. Chunk: 2567941.250000 GByte/s
L2-Cache-Bandwidth of 2. Chunk: 2567941.250000 GByte/s
L2-Cache-Bandwidth of 3. Chunk: 2621440.000000 GByte/s
L2-Cache-Bandwidth of 4. Chunk: 2567941.250000 GByte/s
L2-Cache-Bandwidth of 5. Chunk: 2621440.000000 GByte/s
L2-Cache-Bandwidth of 6. Chunk: 2621440.000000 GByte/s
Press any key to continue . . .
I don’t think my GT 520 can do 838860 GByte/s, do you?
Maybe he updated his source code to correct errors… I don’t know… I don’t think so…
If you do believe this number is correct, then hmmm…