I have a short device program that reads %clock before and after some code and reports the difference.
Here is the relevant part:
if ( num )
{
    // Timed case: read %clock, loop num times in PTX, read %clock again.
    asm volatile (
        ".reg .pred %p;\n\t"
        "mov.u32 %0, %%clock;\n\t"
        "loop1:\n\t"
        "sub.u32 %2, %2, 1;\n\t"
        //"sub.u32 %2, %2, 0;\n\t"
        //"sub.u32 %2, %2, 0;\n\t"
        //"sub.u32 %2, %2, 0;\n\t"
        "setp.ne.s32 %p, %2, 0;\n\t"
        "@%p bra.uni loop1;\n\t"
        "mov.u32 %1, %%clock;"
        :
        "=r"(start),
        "=r"(stop),
        "+r"(num));   // %2 is decremented inside the asm, so it is passed read-write
}
else
{
    // Baseline case: two back-to-back %clock reads with nothing in between.
    asm volatile (
        "mov.u32 %0, %%clock;\n\t"
        "mov.u32 %1, %%clock;"
        :
        "=r"(start),
        "=r"(stop)
        :
        "r" (num));
}
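For context, the asm block sits inside a minimal kernel along these lines (simplified; timer_test, d_out and the launch details below are stand-ins, not my exact code):

// Simplified scaffolding around the asm block above (names are illustrative).
__global__ void timer_test(unsigned int num, unsigned int *out)
{
    unsigned int start = 0, stop = 0;

    // ... the if/else asm block shown above goes here ...

    *out = stop - start;   // elapsed %clock counts for this run
}

// Host side, launched with a single thread so only one warp is timed:
//   timer_test<<<1, 1>>>(num, d_out);
//   cudaMemcpy(&elapsed, d_out, sizeof(elapsed), cudaMemcpyDeviceToHost);
//   printf("Timer %u: %u (cycles)\n", num, elapsed);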
Results (extra sub instructions commented out):
Timer 0: 51 (cycles)
Timer 10: 890 (cycles)
Timer 20: 1726 (cycles)
Timer 30: 2546 (cycles)
Timer 40: 3375 (cycles)
Timer 50: 4219 (cycles)
Timer 60: 5040 (cycles)
Timer 70: 5866 (cycles)
Timer 80: 6706 (cycles)
Timer 90: 7527 (cycles)
Timer 100: 8356 (cycles)
Results with one, two, and three of the commented-out sub.u32 instructions re-enabled, respectively:

One extra sub.u32 per iteration:
Timer 0: 51 (cycles)
Timer 10: 1060 (cycles)
Timer 20: 2060 (cycles)
Timer 30: 3063 (cycles)
Timer 40: 4060 (cycles)
Timer 50: 5060 (cycles)
Timer 60: 6056 (cycles)
Timer 70: 7063 (cycles)
Timer 80: 8060 (cycles)
Timer 90: 9060 (cycles)
Timer 100: 10060 (cycles)

Two extra sub.u32 per iteration:
Timer 0: 51 (cycles)
Timer 10: 1226 (cycles)
Timer 20: 2405 (cycles)
Timer 30: 3569 (cycles)
Timer 40: 4731 (cycles)
Timer 50: 5905 (cycles)
Timer 60: 7076 (cycles)
Timer 70: 8246 (cycles)
Timer 80: 9415 (cycles)
Timer 90: 10596 (cycles)
Timer 100: 11768 (cycles)

Three extra sub.u32 per iteration:
Timer 0: 51 (cycles)
Timer 10: 1406 (cycles)
Timer 20: 2737 (cycles)
Timer 30: 4089 (cycles)
Timer 40: 5419 (cycles)
Timer 50: 6759 (cycles)
Timer 60: 8111 (cycles)
Timer 70: 9442 (cycles)
Timer 80: 10792 (cycles)
Timer 90: 12119 (cycles)
Timer 100: 13467 (cycles)
What I read from this (per-iteration arithmetic below) is:
- Two back-to-back %clock reads consistently differ by 51.
- The empty loop costs roughly 83 counts per iteration.
- Each extra sub.u32 adds about 17 counts per iteration.
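Taking the 100-iteration rows:

(8356 - 51) / 100 ≈ 83 counts per empty-loop iteration
(10060 - 8356) / 100 ≈ 17 extra counts per iteration for the first sub.u32
(11768 - 10060) / 100 ≈ 17 for the second
(13467 - 11768) / 100 ≈ 17 for the third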
I have a GeForce GT 710; its core clock idles at 135 MHz and quickly ramps up to 953 MHz under load. If I run the test while the clock is already at 953 MHz, the results are the same.
Is %clock actually counting real GPU cycles? If so, why does such a simple loop cost so many of them per iteration? Or is it incrementing at some rate faster than the core clock?
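One sanity check I can think of is to read the nanosecond-resolution %globaltimer special register around the same loop and compare its delta with the %clock delta: if %clock really counts core cycles, the ratio should come out near the current core clock in GHz. A rough sketch of the idea (not the code I ran; the loop in between is the one shown above):

// Sketch: bracket the timed region with both %globaltimer (ns) and %clock (counts).
unsigned int c0, c1;
unsigned long long t0, t1;
asm volatile ("mov.u64 %0, %%globaltimer;" : "=l"(t0));
asm volatile ("mov.u32 %0, %%clock;"       : "=r"(c0));
// ... the loop being timed ...
asm volatile ("mov.u32 %0, %%clock;"       : "=r"(c1));
asm volatile ("mov.u64 %0, %%globaltimer;" : "=l"(t1));
// (c1 - c0) / (double)(t1 - t0) should approximate the core clock in GHz.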
What I want is to determine, exactly or at least on average, the number of cycles between one point in the program and another. Any suggestions for a different mechanism would be welcome.
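For instance, would something based on the clock64() device intrinsic be any more reliable than raw %clock reads? A rough sketch of what I have in mind (not tested):

// Sketch: time a region with the clock64() intrinsic instead of inline PTX.
// clock64() reads a per-SM 64-bit cycle counter, so it also avoids the 32-bit
// wraparound of %clock. The compiler may move independent instructions across
// the calls, so the generated SASS would still need checking.
__global__ void time_region(long long *elapsed)
{
    long long t0 = clock64();
    // ... code to be timed goes here ...
    long long t1 = clock64();
    *elapsed = t1 - t0;
}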
Thanks.