cost of clock64()

Hi,

We use clock64() to measure the time (in clock cycles) of some code in a kernel, but I noticed that the call to clock64() itself is costly. So I'm wondering: what is the cost of a clock64() call itself, and how can we minimize it?

thanks in advance.

How did you establish that calls to clock64() are costly? And exactly how costly is “costly”? As far as I am aware, the device-side clock functionality is implemented as reads from special registers, which is about as fast as it gets. The 64-bit variant needs a short emulation sequence to provide for the possibility of carry propagation between the individually read high and low words of the counter. You can see the details when you disassemble the binary executable into SASS with cuobjdump --dump-sass.

If the sequences to be timed are short, my testing suggests to me that the use of clock() instead of clock64() can be about twice as fast. The cost in clock cycles will vary depending on the GPU.

Intuitively that makes sense, because a call to clock() only requires one read of the special register clock_lo (so one S2R instruction), while a call to clock64() requires three consecutive reads from special registers (clock_hi, clock_lo, clock_hi) followed by a check for carry-propagation between the two halves of the counter.
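To make that emulation sequence concrete, here is a minimal sketch of how a 64-bit counter can be assembled from the 32-bit %clock and %clock_hi special registers. This illustrates the hi/lo/hi pattern described above; it is not necessarily the exact sequence the compiler emits:

__device__ unsigned long long clock64_sketch()
{
    unsigned int hi0, lo, hi1;
    do {
        asm volatile("mov.u32 %0, %%clock_hi;" : "=r"(hi0));
        asm volatile("mov.u32 %0, %%clock;"    : "=r"(lo));
        asm volatile("mov.u32 %0, %%clock_hi;" : "=r"(hi1));
    } while (hi0 != hi1);   // the low word wrapped between reads; try again
    return ((unsigned long long)hi1 << 32) | lo;
}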

Thanks for the reply.

I was trying to measure the time spent on a piece of code inside my kernel, the simplified version is like this:

...

total_time = 0;
c_start = clock64();
while ( some_condition… ) {
    c_1 = clock64();                  // (1)
    my_code_to_be_measured…
    c_2 = clock64();                  // (2)
    total_time += ( c_2 - c_1 );      // (3)
}
c_end = clock64();

My goal is to get “total_time”.

If I comment out lines 1/2/3, I get used_time = c_end - c_start; but if I uncomment lines 1/2/3, then my used_time (which is c_end - c_start) is much bigger than its value with lines 1/2/3 commented out, so I assumed the calls to clock64() on lines 1/2 are expensive (unless line 3 is expensive?).

Note that in my real code I don’t need c_start and c_end; I added them only for testing purposes, to see if lines 1/2/3 are expensive.

Do you know how many clock cycles clock64() itself costs?

clock64() will require 3 registers and total_time will require 5 registers. The total instruction count will be ~10 per loop iteration, which is too expensive. clock() returns clock_t, which is also not ideal.

If you need to measure inside the loop, then your best option is to write inline PTX that reads %clock into a uint32_t, to limit your processing to 32-bit integers.

If my_code_to_be_measured… is small, then you may want to try measuring outside the loop first. After that, remove as much of the code in the loop as you can (either my_code_to_be_measured or the other code), measure again, and take the difference, as in the sketch below. If you do try elimination, make sure you diff the SASS code (nvdisasm) to verify that the compiler didn’t dead-strip a lot more than what you intended.
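A minimal sketch of that elimination approach (illustrative only; the kernel and variable names are not from the original post). As noted above, check the SASS of both variants, because the compiler may remove a loop body whose result is unused:

__global__ void timed_loop(long long *elapsed, int *sink, int n, bool include_body)
{
    int acc = 0;
    long long t0 = clock64();
    for (int i = 0; i < n; ++i) {
        if (include_body)
            acc += i * i;            // stand-in for my_code_to_be_measured
        acc ^= i;                    // stand-in for the rest of the loop
    }
    long long t1 = clock64();
    *elapsed = t1 - t0;
    *sink = acc;                     // consume acc so the loop is not dead-stripped
}

Run once with include_body = true and once with false; the difference approximates the cost of the body, without any per-iteration instrumentation.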

In the case of calls to clock64() inside the loop, it is also conceivable that these calls interfere with compiler optimizations otherwise applied to the loop. Check the SASS (machine code) with cuobjdump to see whether this is the case.

FWIW I don’t see any SASS difference between:

unsigned temp = clock();

and:

static __device__ __inline__ uint32_t __myclock(){
  uint32_t mclk;
  asm volatile("mov.u32 %0, %%clock;" : "=r"(mclk));
  return mclk;}

Here’s a sample test case:

$ cat t810.cu
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#define DSIZE 4

static __device__ __inline__ uint32_t __myclock(){
  uint32_t mclk;
  asm volatile("mov.u32 %0, %%clock;" : "=r"(mclk));
  return mclk;}

__global__ void kernel(long *data){

  unsigned temp1 = clock();
  unsigned temp2 = clock();
//  long temp2 = 0;
  unsigned temp3 = clock();
  data[0] = temp3-temp1;
  data[1] = temp2;
  uint32_t temp4 = __myclock();
  uint32_t temp5 = __myclock();
  uint32_t temp6 = __myclock();
  data[2] = temp6-temp4;
  data[3] = temp5;
}

int main(){

  long hdata[DSIZE];
  long *ddata;
  cudaMalloc(&ddata, DSIZE*sizeof(long));
  kernel<<<1,1>>>(ddata);
  cudaMemcpy(hdata, ddata, DSIZE*sizeof(long), cudaMemcpyDeviceToHost);
  printf("elapsed: %ld\n", hdata[0]);
  return 0;
}

$ nvcc -o t810 t810.cu
$ cuobjdump -sass t810

Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_20

Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_20
                Function : _Z6kernelPl
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];  /* 0x2800440400005de4 */
        /*0008*/         MOV R2, c[0x0][0x20];   /* 0x2800400080009de4 */
        /*0010*/         MOV R3, c[0x0][0x24];   /* 0x280040009000dde4 */
        /*0018*/         S2R R0, SR_CLOCKLO;     /* 0x2c00000140001c04 */
        /*0020*/         SHL.W R0, R0, 0x1;      /* 0x6000c00004001e03 */
        /*0028*/         S2R R4, SR_CLOCKLO;     /* 0x2c00000140011c04 */
        /*0030*/         SHL.W R6, R4, 0x1;      /* 0x6000c00004419e03 */
        /*0038*/         S2R R4, SR_CLOCKLO;     /* 0x2c00000140011c04 */
        /*0040*/         SHL.W R4, R4, 0x1;      /* 0x6000c00004411e03 */
        /*0048*/         ISUB R4, R4, R0;        /* 0x4800000000411d03 */
        /*0050*/         MOV R5, RZ;             /* 0x28000000fc015de4 */
        /*0058*/         MOV R7, RZ;             /* 0x28000000fc01dde4 */
        /*0060*/         ST.E.64 [R2], R4;       /* 0x9400000000211ca5 */
        /*0068*/         ST.E.64 [R2+0x8], R6;   /* 0x9400000020219ca5 */
        /*0070*/         S2R R0, SR_CLOCKLO;     /* 0x2c00000140001c04 */
        /*0078*/         SHL.W R0, R0, 0x1;      /* 0x6000c00004001e03 */
        /*0080*/         S2R R4, SR_CLOCKLO;     /* 0x2c00000140011c04 */
        /*0088*/         SHL.W R6, R4, 0x1;      /* 0x6000c00004419e03 */
        /*0090*/         S2R R4, SR_CLOCKLO;     /* 0x2c00000140011c04 */
        /*0098*/         SHL.W R4, R4, 0x1;      /* 0x6000c00004411e03 */
        /*00a0*/         ISUB R4, R4, R0;        /* 0x4800000000411d03 */
        /*00a8*/         MOV R5, RZ;             /* 0x28000000fc015de4 */
        /*00b0*/         MOV R7, RZ;             /* 0x28000000fc01dde4 */
        /*00b8*/         ST.E.64 [R2+0x10], R4;  /* 0x9400000040211ca5 */
        /*00c0*/         ST.E.64 [R2+0x18], R6;  /* 0x9400000060219ca5 */
        /*00c8*/         EXIT;                   /* 0x8000000000001de7 */
                ............................



Fatbin ptx code:
================
arch = sm_20
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$

I’m not sure why the shift left by one bit is in there.

Just a guess regarding the left shift: sm_2x uses a hot clock for the CUDA cores that runs at 2x the base clock. If the clock counter increments at the base clock frequency, that would explain the factor of two. Do you see this doubling in SASS generated for sm_3x or sm_5x?

As to why this doubling is accomplished by left shift rather than by add, as I recall on sm_2x these have the same throughput, only on later architectures would add be preferred over shift.

Genius, as usual. Compiling for -arch=sm_35 gives:

$ cuobjdump -sass t810

Fatbin elf code:
================
arch = sm_35
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_35

Fatbin elf code:
================
arch = sm_35
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_35
                Function : _Z6kernelPl
        .headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                           /* 0x0880000000001000 */
        /*0008*/                   MOV R1, c[0x0][0x44];   /* 0x64c03c00089c0006 */
        /*0010*/                   MOV R4, c[0x0][0x140];  /* 0x64c03c00281c0012 */
        /*0018*/                   MOV R5, c[0x0][0x144];  /* 0x64c03c00289c0016 */
        /*0020*/                   S2R R0, SR_CLOCKLO;     /* 0x86400000281c0002 */
        /*0028*/                   S2R R6, SR_CLOCKLO;     /* 0x86400000281c001a */
        /*0030*/                   S2R R3, SR_CLOCKLO;     /* 0x86400000281c000e */
        /*0038*/                   ISUB R2, R3, R0;        /* 0xe0880000001c0c0a */
                                                           /* 0x0800000181b8a010 */
        /*0048*/                   MOV R7, RZ;             /* 0xe4c03c007f9c001e */
        /*0050*/                   MOV R3, RZ;             /* 0xe4c03c007f9c000e */
        /*0058*/                   ST.E.64 [R4], R2;       /* 0xe5800000001c1008 */
        /*0060*/                   ST.E.64 [R4+0x8], R6;   /* 0xe5800000041c1018 */
        /*0068*/                   S2R R0, SR_CLOCKLO;     /* 0x86400000281c0002 */
        /*0070*/                   S2R R6, SR_CLOCKLO;     /* 0x86400000281c001a */
        /*0078*/                   S2R R3, SR_CLOCKLO;     /* 0x86400000281c000e */
                                                           /* 0x0801b811b8a08010 */
        /*0088*/                   ISUB R2, R3, R0;        /* 0xe0880000001c0c0a */
        /*0090*/                   MOV R7, RZ;             /* 0xe4c03c007f9c001e */
        /*0098*/                   MOV R3, RZ;             /* 0xe4c03c007f9c000e */
        /*00a0*/                   ST.E.64 [R4+0x10], R2;  /* 0xe5800000081c1008 */
        /*00a8*/                   ST.E.64 [R4+0x18], R6;  /* 0xe58000000c1c1018 */
        /*00b0*/                   EXIT;                   /* 0x18000000001c003c */
        /*00b8*/                   BRA 0xb8;               /* 0x12007ffffc1c003c */
                ............................

Fatbin ptx code:
================
arch = sm_35
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$

And, I see the shortest back-to-back clock times using the above code on a cc3.5 device (no surprise). Again, there is no apparent SASS difference between shoving clock() into an unsigned vs. the inline PTX.

Thanks all, I appreciate your responses.

The piece of code that I wanted to measure actually moves data between global memory and shared memory (meaning both reads and writes). I wonder if there is a better way to measure it?

I think it’s a little complicated due to coalesced memory access in a multi-threaded program. I read the programming guide; it seems we need to consider the memory segments (for example, for short integers, each segment is 32 short integers, or 64 bytes), so it varies with the data type and the data access pattern. I wonder if there is an easier and more general way to do it (that’s why I read the clock before and after the memory access and take the difference).
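For reference, a minimal sketch of the access pattern the programming guide describes as coalesced (the kernel name is illustrative, not from this thread): consecutive threads touching consecutive elements lets the hardware combine a warp’s accesses into a small number of memory segments:

__global__ void coalesced_copy(const short *in, short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // thread i touches element i: fully coalesced
}

By contrast, a strided pattern such as out[i * stride] with stride > 1 spreads each warp’s accesses over more segments and wastes bandwidth.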

If your goal is to assess the efficiency of the data transfers, I would suggest using the CUDA profiler to do this, rather than instrumenting your code with calls to clock() or clock64().
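For example, something along these lines (the application name is a placeholder; gld_efficiency and gst_efficiency are the global load/store efficiency metrics in nvprof, though the available metric names vary by CUDA version):

$ nvprof --metrics gld_efficiency,gst_efficiency ./my_app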

There is not a good method to measure the latency of blocks of device code. The CUDA profilers do not support this form of measurement. Your current approach of using clock() is the best option.

True enough. But I would claim that measuring the latency of anything on a GPU does not make much sense in terms of trying to improve application performance. GPUs use a throughput-oriented architecture, and the CUDA profiler does a pretty good job of highlighting the bottlenecks in that context, including the efficient use of various forms of memory. This is especially true for modern architectures with better HW hooks for the profiler to exploit.

I would further claim that CPU optimization efforts in high-performance contexts are heading in exactly the same direction, by becoming throughput-oriented. It has been many years since I last worried about the latency of anything on x86 platforms, whereas I was (and had to be) keenly aware of latencies back in the days of the Pentium MMX.

Hello!

I am trying to calculate the minimum synchronization cost for 256 threads per block, varying the number of blocks. I am doing a very simple global write and storing the cycle count when the memory operation starts, when synchronization starts, and when synchronization ends. Max sync cost = (max cycle count among the counts captured when sync ends) - (min cycle count among the counts captured when sync starts).
When I do this, the difference I obtain is very large (8525002 cycles for 10 blocks, 256 threads per block). I am running this on a Fermi-architecture card (compute capability 2.0). I am not sure where I am going wrong.

Here is my kernel code:
__global__ void global_mem_write_kernel(float* data_dev, unsigned int * start_dev, unsigned int * mid_dev, unsigned int * end_dev) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int start_reg, mid_reg, end_reg;

    // Start time for mem-op
    asm volatile("mov.u32 %0, %%clock;" : "=r"(start_reg));

    // memory write
    data_dev[index] = index;

    // Synchronization start time and mem-op end time
    asm volatile("mov.u32 %0, %%clock;" : "=r"(mid_reg));

    __syncthreads();

    // Synchronization end time
    asm volatile("mov.u32 %0, %%clock;" : "=r"(end_reg));

    start_dev[index] = start_reg;
    mid_dev[index] = mid_reg;
    end_dev[index] = end_reg;
}

On the CPU, I am calculating the max cost as:
max_sync_cost = end_max - mid_min;

Thanks in advance!

Is your device large enough that all blocks can run in parallel? If not, you will naturally pick up a large difference between maximum and minimum.
Even if all blocks can run in parallel, the clocks of different multiprocessors are not synchronized.
So you need to calculate your max sync cost for each block separately and then take the maximum over all blocks. Otherwise your result will be dominated by the maximum clock offset between the multiprocessors.
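A minimal host-side sketch of that per-block reduction (the array and variable names are assumptions for illustration, not from the original posts):

// For each block b, compute (max end within b) - (min mid within b),
// then take the maximum over blocks; never compare clocks across blocks.
unsigned int max_sync_cost = 0;
for (int b = 0; b < num_blocks; ++b) {
    unsigned int mid_min = 0xFFFFFFFFu, end_max = 0;
    for (int t = 0; t < threads_per_block; ++t) {
        int i = b * threads_per_block + t;
        if (mid_host[i] < mid_min) mid_min = mid_host[i];
        if (end_host[i] > end_max) end_max = end_host[i];
    }
    if (end_max - mid_min > max_sync_cost)
        max_sync_cost = end_max - mid_min;
}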

Thanks! That did help. I captured the sync times for each block separately.
However, even though I am performing a very basic global-memory write and shared-memory write, the shared-memory write takes more time than the global-memory write. Do you know what the potential reason could be? Any suggestion on how I could make the speed advantage of shared memory over global memory visible through code changes? I am just writing into shared memory and not copying the data back to global memory, as I only need to capture the time for the shared-memory write. Here is my shared memory kernel:

__global__ void shared_mem_write_kernel(unsigned int * start_dev, unsigned int * mid_dev, unsigned int * end_dev, unsigned int * b_id) {

    __shared__ float data_s[1024];

    // Index of the current thread (1D grid, 1D block)
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int id2 = threadIdx.x;
    unsigned int start_reg, mid_reg, end_reg;

    // Start time for mem-op
    asm volatile("mov.u32 %0, %%clock;" : "=r"(start_reg));

    // memory write
    data_s[id2] = index;

    // Synchronization start time and mem-op end time
    asm volatile("mov.u32 %0, %%clock;" : "=r"(mid_reg));

    __syncthreads();

    // Synchronization end time
    asm volatile("mov.u32 %0, %%clock;" : "=r"(end_reg));

    start_dev[index] = start_reg;
    mid_dev[index] = mid_reg;
    end_dev[index] = end_reg;
    b_id[index] = blockIdx.x;
}

For your purpose, I suggest using native assembly instead of CUDA C. For Maxwell-architecture GPUs you can use maxas (GitHub - NervanaSystems/maxas: Assembler for NVIDIA Maxwell architecture), and for Kepler there is an online assembler: https://hpc.aliyun.com/tools/assembler?spm=0.0.0.0.7CKjKC
(you can take a look at this post: Kepler Assembler - CUDA Programming and Performance - NVIDIA Developer Forums)

It seems to me that you are expecting to measure the time it takes the data to reach global memory. This is not the case, however: execution of each thread continues immediately with the instructions after the write, while it can take many hundreds or thousands of cycles for the data to be transported to global memory.

You stand a better chance of seeing the time needed to write to global memory if you insert a __threadfence_system() call after the write and before the __syncthreads().
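Applied to the kernel above, the suggested change would look roughly like this (a sketch only; as noted below, it is not a guaranteed way to time the write):

    // memory write
    data_dev[index] = index;
    __threadfence_system();   // orders the preceding write system-wide before continuing

    // Synchronization start time and mem-op end time
    asm volatile("mov.u32 %0, %%clock;" : "=r"(mid_reg));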

Note, however, that the documentation of the threadfence functions is deliberately vague, mentioning only the ordering of memory accesses, rather than the memory access actually having been performed and being visible to all threads on the device.
In particular, there is no guarantee that the data is in off-chip memory after a __threadfence_system() call: all memory accesses (even those from the host) go via the L2 cache, so it is sufficient to ensure the data is placed in L2.