computeprof "active cycles" counter value doesn't make sense to me

Hi,

I’m profiling my application with the computeprof profiler supplied with the CUDA Toolkit 4.0 (not nvvp, which is supplied with CUDA 4.1), on a Tesla C2075.
One of the profiler’s counters is “active cycles”, which is used to calculate IPC, average occupancy, and SM efficiency.

I don’t understand the value I get for “active cycles”. According to the profiler’s help/guide, SM efficiency is calculated as “active cycles” / “elapsed clocks”. I assume that “elapsed clocks” is “GPU Time” / GPU cycle period = “GPU Time” * “GPU Freq.”. The profiler reports high SM efficiency (95%–100%), which would mean that “GPU Time” * “GPU Freq.” ≈ “active cycles”, but the value I compute for “GPU Time” * “GPU Freq.” is about 2x higher than “active cycles”.
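To make the gap concrete, here is a rough sketch of the arithmetic I am doing, with made-up placeholder numbers (these are not my actual profiler values, just the same kind of calculation):

```cuda
// Sketch of how I turn the profiler output into "elapsed clocks" and SM efficiency.
// All values below are illustrative placeholders, not real profiler output.
#include <cstdio>

int main()
{
    // Hypothetical profiler readings:
    double gpu_time_us   = 1000.0;    // "GPU Time" in microseconds
    double active_cycles = 575000.0;  // "active cycles" counter

    // Tesla C2075 scheduler/shader frequency:
    double gpu_freq_hz = 1.15e9;

    // elapsed clocks = GPU Time * GPU Freq.
    double elapsed_clocks = gpu_time_us * 1e-6 * gpu_freq_hz;   // = 1,150,000

    // SM efficiency = active cycles / elapsed clocks
    double sm_efficiency = active_cycles / elapsed_clocks;      // = 0.5

    printf("elapsed clocks = %.0f\n", elapsed_clocks);
    printf("SM efficiency  = %.2f\n", sm_efficiency);
    return 0;
}
```

With numbers like these I would expect roughly 50% SM efficiency, yet the profiler reports 95%–100%.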

Can anyone clear this up for me?

BTW - the Tesla C2075 frequency is 1.15 GHz (the scheduler’s frequency); I assume this is the frequency used for the calculation.

Thanks,
Natan

I have never tried to put these profiler numbers together, but here is a guess: pre-Kepler GPUs have two clock domains; the higher one is used by the arithmetic pipelines, and the lower one is used by the rest of the multiprocessor, such as instruction issue. The difference between them is a factor of 2. Maybe active cycles are reported at the lower frequency.

The SM PM counters increment at the Graphics Clock == 1/2 Processor Clock.
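So, taking the same placeholder numbers as the sketch above and expressing the elapsed time in graphics-clock cycles (half the 1.15 GHz shader clock), the ratio comes out near 1.0:

```cuda
// Same placeholder numbers as before, but converting elapsed time into
// graphics-clock cycles before comparing against the counter.
#include <cstdio>

int main()
{
    double gpu_time_us   = 1000.0;    // "GPU Time" in microseconds (placeholder)
    double active_cycles = 575000.0;  // counted at the graphics clock (placeholder)

    double shader_clock_hz   = 1.15e9;               // processor (CUDA core) clock
    double graphics_clock_hz = shader_clock_hz / 2;  // clock at which the SM PM counters increment

    // Elapsed cycles in the same clock domain as the counter:
    double elapsed_graphics_clocks = gpu_time_us * 1e-6 * graphics_clock_hz;  // = 575,000

    double sm_efficiency = active_cycles / elapsed_graphics_clocks;           // = 1.0
    printf("SM efficiency = %.2f\n", sm_efficiency);
    return 0;
}
```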

But the graphics clock of the Tesla C2075 is 1.15 GHz, isn’t it?
According to NVIDIA, this GPU can deliver up to ~1 TFLOPS → 1 TFLOPS ≈ 14 SMs * 32 cores/SM * 1.15 GHz * 2 (IPC), which means the IPC is calculated relative to the 1.15 GHz frequency, which would mean that this is the lower frequency, i.e. the processor frequency is 2.3 GHz.

No, the clock rate for the CUDA cores is 1.15 GHz. The factor of 2 in the quoted FLOPS numbers is not an IPC factor. Each CUDA core pipeline finishes 1 instruction per clock (for most floating-point instructions), but there happens to be one instruction (the fused multiply-add) that performs two floating-point operations at once. The theoretical ~1.03 TFLOPS for the Tesla C2075 assumes that your instruction sequence is nothing but FMA instructions. In real programs the throughput will be lower, as other instructions do not perform 2 floating-point operations at once.
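As a minimal (hypothetical) illustration, the kernel below does one fused multiply-add per thread; the FFMA instruction it compiles to counts as one issued instruction but two floating-point operations:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// One fused multiply-add per thread. fmaf(a, b, c) compiles to a single FFMA
// instruction: one issued instruction, two floating-point operations.
__global__ void fma_example(const float *a, const float *b,
                            const float *c, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaf(a[i], b[i], c[i]);
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host data: every output element should become 1*2 + 3 = 5.
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n], *h_out = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; h_c[i] = 3.0f; }

    float *d_a, *d_b, *d_c, *d_out;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, h_c, bytes, cudaMemcpyHostToDevice);

    fma_example<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("out[0] = %f\n", h_out[0]);  // expect 5.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); cudaFree(d_out);
    delete[] h_a; delete[] h_b; delete[] h_c; delete[] h_out;
    return 0;
}
```

That factor of 2 per FMA is exactly where the ×2 in the peak number comes from: 448 cores * 1.15 GHz * 2 FLOPs per FMA ≈ 1.03 TFLOPS.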

Wow, if that is true, I was way off.
So is an FMA instruction counted as two instructions in the profiler? What about SCADD (shift + add) and MAD? Are they also counted as two instructions?

Thanks!

No, I don’t think FMA is counted as two instructions in the profiler, but FMA is counted as two floating point operations (not instructions!) in NVIDIA marketing materials.

Thank you, you were very helpful.