any way to know on which SM a thread is running?

Is there any way to know exactly which SM a thread is running on? Actually, I just want to confirm that all the SMs have some work to do.

You can read the %smid PTX special register from device code with a bit of inline assembly:

__device__ uint get_smid(void) {
     uint ret;
     asm("mov.u32 %0, %%smid;" : "=r"(ret) );
     return ret;
}

And you only need to call it in one thread of your block.

Well… nice code there. But all the output I get for the smid is just “1”. I have 8 blocks, so it gives me eight “1”s.

Riedijk’s code works flawlessly on my Linux boxes, with:

8600GT and CUDA 3.2

GTX460 and CUDA 3.2

GTX480 and CUDA 3.2

and also with an older CUDA version:

GTX280 and CUDA 2.3

Try the code below, compiling it with nvcc code.cu and executing it with ./a.out 100.

If it fails, describe your hardware, OS and CUDA stack.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

/* E.D. Riedijk */

__device__ uint get_smid(void) {
     uint ret;
     asm("mov.u32 %0, %%smid;" : "=r"(ret) );
     return ret;
}

__global__ void kern(int *sm){
   if (threadIdx.x==0)
      sm[blockIdx.x] = get_smid();   // one thread per block records its SM id
}

int main(int argc, char *argv[]){
   int N = atoi(argv[1]);            // number of blocks (and threads per block)
   int *sm, *sm_d;

   sm = (int *) malloc(N*sizeof(*sm));
   cudaMalloc((void**)&sm_d, N*sizeof(*sm_d));

   kern<<<N,N>>>(sm_d);
   cudaMemcpy(sm, sm_d, N*sizeof(int), cudaMemcpyDeviceToHost);

   for (int i=0;i<N;i++)
      printf("%d %d\n", i, sm[i]);

   cudaFree(sm_d);
   free(sm);
   return 0;
}

That might very well be the case if your kernel doesn’t use a lot of registers and shared memory: up to 8 blocks can run concurrently on one multiprocessor. Try with more blocks.
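
If you want to force the blocks to spread out instead, here is a rough sketch of one way to do it (untested here; kern_padded and the 2304-element size are illustrative, assuming a compute 1.x part with 16 KB of shared memory per SM): pad the kernel with enough static shared memory that only one block can be resident per SM at a time.

__global__ void kern_padded(int *sm){
   __shared__ int pad[2304];       // ~9 KB per block: at most one block fits in a 16 KB SM
   if (threadIdx.x == 0)
      pad[0] = get_smid();         // route the result through shared memory so pad is kept
   __syncthreads();
   if (threadIdx.x == 0)
      sm[blockIdx.x] = pad[0];
}

With at most one resident block per multiprocessor, even a small grid has to be spread over several SMs (the remaining blocks simply wait their turn).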

I am running my card (compute capability 1.1), which has 4 SMs, with CUDA 3.2, and I get the following:

$ ./a.out 4

0 0

1 4

2 1

3 5

Could anybody explain why I get those four numbers? Thanks!

Hi benetion,
cudaMemcpy(sm, sm_d, N*sizeof(int), cudaMemcpyDeviceToHost);
lacks a synchronization call after it. Add cudaThreadSynchronize() and see if there is any change, though I doubt this is actually the reason the smids appear incorrect for you. Maybe set N to at least twenty and report back the result.
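
For reference, a minimal sketch of where that suggested call would sit in the program above (cudaThreadSynchronize() is the pre-CUDA 4.0 name; newer toolkits use cudaDeviceSynchronize()):

   kern<<<N,N>>>(sm_d);
   cudaMemcpy(sm, sm_d, N*sizeof(int), cudaMemcpyDeviceToHost);
   cudaThreadSynchronize();   /* wait for all preceding device work to complete */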

Thanks for the suggestion. Adding cudaThreadSynchronize() does not help.

$ ./a.out 4

0 0

1 4

2 1

3 5

$ ./a.out 20

0 0

1 4

2 1

3 5

4 0

5 4

6 1

7 5

8 0

9 4

10 1

11 5

12 0

13 4

14 1

15 5

16 0

17 4

18 1

19 5

Well, clearly you do have 4 SMs; it’s just that the labelling seems weird: 0, 1, 4 and 5. Nvidia seems to have a consistent problem with their SM management. I was getting some even weirder results with block launching on a GTX 460. Perhaps you just have to live with that weird labelling. Or maybe somebody else can prove me wrong.

The GTX 285 also reports numbers higher than 29, and there are gaps in the reported numbers there too. I just made a translation table from the smids to an array of 0…29.
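
Something along these lines (a sketch of the idea rather than my exact code; MAX_SMID and remap are illustrative names), run on the host after the sm[] array from the program above has been copied back:

#define MAX_SMID 64                      /* raw smids on these parts fit well below this */

int remap[MAX_SMID];
for (int i = 0; i < MAX_SMID; i++) remap[i] = -1;

int next = 0;
for (int i = 0; i < N; i++)              /* sm[i] is the raw smid reported by block i */
   if (remap[sm[i]] == -1)
      remap[sm[i]] = next++;             /* first occurrence gets the next dense index 0,1,2,... */

for (int i = 0; i < N; i++)
   printf("block %d ran on SM %d (raw smid %d)\n", i, remap[sm[i]], sm[i]);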

SMID is a bitfield. On Tesla with CUDA 3.x/4.0, the 2 low-order bits indicate the SM number within a TPC, and the high-order bits point to the TPC number.

So SMID=4 means TPC 1, SM 0.

The code above is compiled to:

S2R R0, SR0;                    // Read PhysID (http://forums.nvidia.com/index.php?showtopic=45458)
LOP.AND R1, R0, c [0x1] [0x0];  // Extract TPC ID (c[1][0]=0x0ff00000)
LOP.AND R2, R0, c [0x1] [0x1];  // Extract SM ID  (c[1][1]=0x000f0000)
SHR R1, R1, 0x12;               // Align to lsb+2
SHR R2, R2, 0x10;               // Align to lsb
LOP.OR R1, R1, R2;              // OR them together

Note that G9x and later GPUs do “breadth-first” scheduling, by first scattering the workload across the first SM of multiple TPCs, then filling the other SMs, and then finally scheduling multiple blocks per SM.
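
For what it’s worth, a small helper that inverts this packing on Tesla-class (pre-Fermi) parts, following the layout described above (the function name is just illustrative):

static void decode_smid(unsigned int smid, unsigned int *tpc, unsigned int *sm_in_tpc){
   *sm_in_tpc = smid & 0x3;   /* SM number within its TPC: the two low-order bits */
   *tpc       = smid >> 2;    /* TPC number: the remaining high-order bits        */
}
/* e.g. smid = 4  ->  tpc = 1, sm_in_tpc = 0, i.e. “TPC 1, SM 0” as above */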

Hi Sylvain

Thanks for the clarification! No wonder %smid is incorrect for the non-Fermi cards. ptxas is such a lazy boy! It doesn’t even take the number of SMs per TPC on CC 1.x into account when calculating the smid.

Do you have more info on SRx and the content of c?

I do not see why %smid is incorrect… It behaves as specified in the documentation. The PTX manual even warns that “The SM identifier numbering is not guaranteed to be contiguous.”

c are just locations in const memory that cuobjdump does not display by default (it does with the -elf option).

Or with good old Decuda:

mov.b32 $r0, %physid
and.b32 $r1, $r0, c1[0x0000]
and.b32 $r2, $r0, c1[0x0004]
shr.u32 $r1, $r1, 0x00000012
shr.u32 $r2, $r2, 0x00000010
or.b32 $r1, $r1, $r2

// segment: const (1:0000)
0000: 0ff00000 000f0000

SRx are described in decuda’s README. They are mostly the same as (a subset of) PTX special registers.

I’m not familiar with the PTX stuff, but I’ve done some searching and found that %smid is a read-only predefined register. Is there a way to control which SM a block will go to, or is the only way to use the queried smid to distribute the work somehow?

Thanks

This is the easiest way.

The secret behind writing good CUDA code is not to attempt to know murky/underground/undocumented things.