computeprof "active cycles" counter "active cycles" value doesn't make sense to
Hi,

I'm profiling my application using the computeprof profiler supplied with the CUDA Toolkit 4.0 (not nvvp, which ships with CUDA 4.1), on a Tesla C2075.
One of the profiler's counters is "active cycles", which is used to calculate IPC, average occupancy, and SM efficiency.

I don't understand the value I get for "active cycles". According to the profiler's help/guide, SM efficiency is calculated as "active cycles" / "elapsed clocks", and I assume that "elapsed clocks" is "GPU Time" / "GPU cycle time" = "GPU Time" * "GPU frequency". The profiler reports high SM efficiency (95%-100%), which means "GPU Time" * "GPU frequency" should be roughly equal to "active cycles", but I find that "GPU Time" * "GPU frequency" is about 2x higher.
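To make it concrete, this is roughly the check I am doing (all the values below, and the microsecond unit for "GPU Time", are just assumptions for illustration):

[code]
# Rough sanity check (made-up numbers, units assumed).
gpu_time_us   = 1000.0      # hypothetical "GPU Time" from the profiler, in microseconds
gpu_freq_hz   = 1.15e9      # the 1.15 GHz clock I assume the profiler uses
active_cycles = 575000.0    # hypothetical "active cycles" counter value

elapsed_clocks = gpu_time_us * 1e-6 * gpu_freq_hz   # = 1,150,000 cycles
sm_efficiency  = active_cycles / elapsed_clocks     # = 0.5, not the ~0.95 the profiler reports

print(elapsed_clocks, sm_efficiency)
[/code]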

Can anyone clear this up for me?

BTW - the Tesla C2075 frequency is 1.15 GHz (the scheduler's frequency); I assume this is the frequency used in the calculation.

Thanks,
Natan

#1
Posted 05/05/2012 06:47 PM   
I never tried to put these profiler numbers together, but here is a guess: pre-Kepler GPUs have two clock domains - the higher one is used by the arithmetic pipelines, the lower one by the rest of the multiprocessor, such as instruction issue. The difference between them is a factor of 2. Maybe active cycles are reported at the lower frequency.

#2
Posted 05/07/2012 10:05 PM   
[quote name='vvolkov' date='07 May 2012 - 05:05 PM' timestamp='1336428357' post='1405385']
I never tried to put these profiler numbers together, but here is a guess: pre-Kepler GPUs have two clock domains - the higher one is used by the arithmetic pipelines, the lower one by the rest of the multiprocessor, such as instruction issue. The difference between them is a factor of 2. Maybe active cycles are reported at the lower frequency.
[/quote]
The SM PM counters increment at the Graphics Clock == 1/2 Processor Clock.
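So, as a rough sketch (same made-up numbers as above), the comparison works out once the elapsed clocks are computed at half the 1.15 GHz processor clock:

[code]
# Same check, but with the counter in the graphics-clock domain
# (graphics clock = processor clock / 2 on Fermi). Values are made up.
gpu_time_us    = 1000.0
processor_hz   = 1.15e9
graphics_hz    = processor_hz / 2.0                  # 575 MHz counter domain

elapsed_clocks = gpu_time_us * 1e-6 * graphics_hz    # = 575,000 cycles
active_cycles  = 575000.0                            # same hypothetical counter value
sm_efficiency  = active_cycles / elapsed_clocks      # ~1.0, matching the reported 95-100%

print(sm_efficiency)
[/code]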

#3
Posted 05/09/2012 04:20 AM   
But the graphics clock of the Tesla C2075 is 1.15 GHz, isn't it?
According to NVIDIA, this GPU can do up to ~1 Tflops -> 1 Tflops ≈ 14 SMs * 32 cores/SM * 1.15 GHz * 2 (IPC), which means the IPC is calculated relative to the 1.15 GHz frequency, which would mean that this is the lower frequency, i.e. the processor frequency is 2.3 GHz.

#4
Posted 05/10/2012 02:31 PM   
[quote name='natan88a' date='10 May 2012 - 08:31 AM' timestamp='1336660274' post='1406485']
But the graphics clock of the Tesla C2075 is 1.15 GHz, isn't it?
According to NVIDIA, this GPU can do up to ~1 Tflops -> 1 Tflops ≈ 14 SMs * 32 cores/SM * 1.15 GHz * 2 (IPC), which means the IPC is calculated relative to the 1.15 GHz frequency, which would mean that this is the lower frequency, i.e. the processor frequency is 2.3 GHz.
[/quote]

No, the clock rate for the CUDA cores is 1.15 GHz. The factor of 2 in the quoted FLOPS number is not an IPC factor. Each CUDA core pipeline finishes 1 instruction per clock (for most floating-point instructions), but there happens to be one instruction (the fused multiply-add, FMA) that does two floating-point operations at the same time. The theoretical ~1.03 TFLOPS for the Tesla C2075 assumes that your instruction sequence is nothing but FMA instructions. In real programs, the throughput will be lower because other instructions do not perform 2 floating-point operations at once.
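As arithmetic, the datasheet number comes from counting each FMA as two floating-point operations while each core still finishes only one instruction per clock (the 14 SM / 448 core configuration is the published C2075 spec):

[code]
# Peak single-precision throughput of a Tesla C2075, counting FMA as 2 flops.
num_sms       = 14          # C2075 multiprocessor count
cores_per_sm  = 32
core_clock_hz = 1.15e9      # CUDA core (processor) clock
flops_per_fma = 2           # one FMA instruction = one multiply + one add

peak_flops = num_sms * cores_per_sm * core_clock_hz * flops_per_fma
print(peak_flops / 1e12)    # ~1.03 TFLOPS, still only 1 instruction per core per clock
[/code]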

#5
Posted 05/13/2012 08:55 PM   
Wow, if that is true, I was way off.
So the FMA instruction is counted as two instructions in the profiler? What about SCADD (shift + add) and MAD - are they also counted as two instructions?

Thanks!

#6
Posted 05/14/2012 04:46 PM   
[quote name='natan88a' date='14 May 2012 - 10:46 AM' timestamp='1337013965' post='1408193']
Wow, if that is true, I was way off.
So the FMA instruction is counted as two instructions in the profiler? What about SCADD (shift + add) and MAD - are they also counted as two instructions?
[/quote]

No, I don't think FMA is counted as two instructions in the profiler, but FMA is counted as two floating point operations (not instructions!) in NVIDIA marketing materials.
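For example, for a kernel where every thread does a single a*x + y, the two counts would differ by a factor of 2 (illustrative numbers only):

[code]
# a*x + y compiles to a single FMA instruction per thread,
# so instruction counts and flop counts differ by a factor of 2.
threads          = 1000000       # hypothetical thread count, one FMA each
fma_instructions = threads * 1   # what an instruction count reflects
flops            = threads * 2   # what a flops estimate counts (FMA = 2 flops)

print(fma_instructions, flops)
[/code]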

#7
Posted 05/14/2012 07:54 PM   
Thank you, you were very helpful.

#8
Posted 05/15/2012 06:50 AM   