execution ID
Hi everyone,

I'm quite new with the CUDA programming and I have a question. Is it possible to have an identifier of the current thread running on the device. What would be best for me, is that this identifier would be unique across blocks (but not needed across devices) and would be between 0 and the maximum number of threads that could be run on the device.

The reason for this is the following:
I'm trying to transform a sequential algorithm on CUDA. In this algorithm, I need to update counters. What I would like to do, is that each thread have its own counter. At the end, I would just need to compute the sum of the counters and everything would be fine. It means that if I have k counters, I would create a matrix containing k*nbthreads counters. The only problem is that I need a lot of counters (~60000) and the number of threads can be quite important also (much more than the maximum of threads allowed on the device). This is why I was thinking that if I could make a mapping between each thread and an identifier between 0 and the maximum number of threads, my matrix would be much smaller and could fit on the memory of the device.

Do you know how I can have such identifier? Or do you have any hint that would avoid using such identifier?

Thanks a lot,

Benoit
Hi everyone,



I'm quite new with the CUDA programming and I have a question. Is it possible to have an identifier of the current thread running on the device. What would be best for me, is that this identifier would be unique across blocks (but not needed across devices) and would be between 0 and the maximum number of threads that could be run on the device.



The reason for this is the following:

I'm trying to transform a sequential algorithm on CUDA. In this algorithm, I need to update counters. What I would like to do, is that each thread have its own counter. At the end, I would just need to compute the sum of the counters and everything would be fine. It means that if I have k counters, I would create a matrix containing k*nbthreads counters. The only problem is that I need a lot of counters (~60000) and the number of threads can be quite important also (much more than the maximum of threads allowed on the device). This is why I was thinking that if I could make a mapping between each thread and an identifier between 0 and the maximum number of threads, my matrix would be much smaller and could fit on the memory of the device.



Do you know how I can have such identifier? Or do you have any hint that would avoid using such identifier?



Thanks a lot,



Benoit

#1
Posted 05/02/2012 03:07 PM   
[quote name='bhoessen' date='02 May 2012 - 04:07 PM' timestamp='1335971262' post='1403438']
Hi everyone,

I'm quite new with the CUDA programming and I have a question. Is it possible to have an identifier of the current thread running on the device. What would be best for me, is that this identifier would be unique across blocks (but not needed across devices) and would be between 0 and the maximum number of threads that could be run on the device.

The reason for this is the following:
I'm trying to transform a sequential algorithm on CUDA. In this algorithm, I need to update counters. What I would like to do, is that each thread have its own counter. At the end, I would just need to compute the sum of the counters and everything would be fine. It means that if I have k counters, I would create a matrix containing k*nbthreads counters. The only problem is that I need a lot of counters (~60000) and the number of threads can be quite important also (much more than the maximum of threads allowed on the device). This is why I was thinking that if I could make a mapping between each thread and an identifier between 0 and the maximum number of threads, my matrix would be much smaller and could fit on the memory of the device.

Do you know how I can have such identifier? Or do you have any hint that would avoid using such identifier?

Thanks a lot,

Benoit
[/quote]

Hello,

In each block each thread has a unique identifier threadIdx which has 3 coordinates .x .y .z. Each block has a unique identifier blockIdx with also 3 coordinates .x .y .z.

If you submit a kernel with the following <<<<Nblocks,tpbl>>> then in the kernel you obtaine and unique number: [code] int idx=threadIx.x + blockIdx.x*blockDim.x; [/code] where blockDim.x is the size of threads per block (tpbl).
[quote name='bhoessen' date='02 May 2012 - 04:07 PM' timestamp='1335971262' post='1403438']

Hi everyone,



I'm quite new with the CUDA programming and I have a question. Is it possible to have an identifier of the current thread running on the device. What would be best for me, is that this identifier would be unique across blocks (but not needed across devices) and would be between 0 and the maximum number of threads that could be run on the device.



The reason for this is the following:

I'm trying to transform a sequential algorithm on CUDA. In this algorithm, I need to update counters. What I would like to do, is that each thread have its own counter. At the end, I would just need to compute the sum of the counters and everything would be fine. It means that if I have k counters, I would create a matrix containing k*nbthreads counters. The only problem is that I need a lot of counters (~60000) and the number of threads can be quite important also (much more than the maximum of threads allowed on the device). This is why I was thinking that if I could make a mapping between each thread and an identifier between 0 and the maximum number of threads, my matrix would be much smaller and could fit on the memory of the device.



Do you know how I can have such identifier? Or do you have any hint that would avoid using such identifier?



Thanks a lot,



Benoit





Hello,



In each block each thread has a unique identifier threadIdx which has 3 coordinates .x .y .z. Each block has a unique identifier blockIdx with also 3 coordinates .x .y .z.



If you submit a kernel with the following <<<<Nblocks,tpbl>>> then in the kernel you obtaine and unique number:
int idx=threadIx.x + blockIdx.x*blockDim.x;
where blockDim.x is the size of threads per block (tpbl).

#2
Posted 05/02/2012 04:17 PM   
[quote name='pasoleatis' date='02 May 2012 - 05:17 PM' timestamp='1335975423' post='1403455']
Hello,

In each block each thread has a unique identifier threadIdx which has 3 coordinates .x .y .z. Each block has a unique identifier blockIdx with also 3 coordinates .x .y .z.

If you submit a kernel with the following <<<<Nblocks,tpbl>>> then in the kernel you obtaine and unique number: [code] int idx=threadIx.x + blockIdx.x*blockDim.x; [/code] where blockDim.x is the size of threads per block (tpbl).
[/quote]

Thanks for the answer. Unfortunately, this solution doesn't match my needs. idx might vary between from 0 to nbblock*blocksize which will be a bigger range than 0 to the maximum number of threads that can be run concurrently on a device.

What I would like is more something like the identifier of the ALU running the thread. That way, I won't waste too much memory, having only one copy of my counters for each ALU. Do you know something like that?

Benoît
[quote name='pasoleatis' date='02 May 2012 - 05:17 PM' timestamp='1335975423' post='1403455']

Hello,



In each block each thread has a unique identifier threadIdx which has 3 coordinates .x .y .z. Each block has a unique identifier blockIdx with also 3 coordinates .x .y .z.



If you submit a kernel with the following <<<<Nblocks,tpbl>>> then in the kernel you obtaine and unique number:
int idx=threadIx.x + blockIdx.x*blockDim.x;
where blockDim.x is the size of threads per block (tpbl).





Thanks for the answer. Unfortunately, this solution doesn't match my needs. idx might vary between from 0 to nbblock*blocksize which will be a bigger range than 0 to the maximum number of threads that can be run concurrently on a device.



What I would like is more something like the identifier of the ALU running the thread. That way, I won't waste too much memory, having only one copy of my counters for each ALU. Do you know something like that?



Benoît

#3
Posted 05/02/2012 04:28 PM   
Hello,


Why not just make the counter to reside in the registers (or local memory) or shared memory?
Hello,





Why not just make the counter to reside in the registers (or local memory) or shared memory?

#4
Posted 05/02/2012 04:39 PM   
60,000 counter values will fit on the device easily.

The typical way to sum one counter per thread is to perform a first pass reduction in each block and write out n_blocks totals. Then run a second kernel to complete the final sum reduction.
60,000 counter values will fit on the device easily.



The typical way to sum one counter per thread is to perform a first pass reduction in each block and write out n_blocks totals. Then run a second kernel to complete the final sum reduction.

#5
Posted 05/02/2012 11:01 PM   
Thanks for the answers!

[quote name='pasoleatis' date='02 May 2012 - 05:39 PM' timestamp='1335976786' post='1403465']
Hello,


Why not just make the counter to reside in the registers (or local memory) or shared memory?
[/quote]

That's a way I could look up, thanks. But will 1.2mB fit into the memory of each thread?

[quote name='DrAnderson42' date='03 May 2012 - 12:01 AM' timestamp='1335999718' post='1403595']
60,000 counter values will fit on the device easily.

The typical way to sum one counter per thread is to perform a first pass reduction in each block and write out n_blocks totals. Then run a second kernel to complete the final sum reduction.
[/quote]
If there was only 60k counters it wouldn't be a problem, I agree. The problem is that if I don't try to make any optimization, I would need 60k counters for each of my 10 millions threads. This means a lot of memory...

Benoit
Thanks for the answers!



[quote name='pasoleatis' date='02 May 2012 - 05:39 PM' timestamp='1335976786' post='1403465']

Hello,





Why not just make the counter to reside in the registers (or local memory) or shared memory?





That's a way I could look up, thanks. But will 1.2mB fit into the memory of each thread?



[quote name='DrAnderson42' date='03 May 2012 - 12:01 AM' timestamp='1335999718' post='1403595']

60,000 counter values will fit on the device easily.



The typical way to sum one counter per thread is to perform a first pass reduction in each block and write out n_blocks totals. Then run a second kernel to complete the final sum reduction.



If there was only 60k counters it wouldn't be a problem, I agree. The problem is that if I don't try to make any optimization, I would need 60k counters for each of my 10 millions threads. This means a lot of memory...



Benoit

#6
Posted 05/02/2012 11:21 PM   
The PTX manual (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_2.3.pdf) defines the following special registers:
[code]
%laneid - threadid in warp (0-31)
%warpid - unique ID per SM (0-%nwarpid, Fermi => 0-47)
%nwarpid - maximum warps per SM ( , Fermi = 48)
%smid - unique ID per Device (0-%nsmid, Fermi => 0-15)
[/code]
On GF100 you can have at most 48 warps/SM and 16 SMs.
[code]
unique_warpid = (%smid * %nwarpid) + %warpid
[/code]
These special registers can be queried using inline PTX (see http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Using_Inline_PTX_Assembly_In_CUDA.pdf)

I wasn't able to test this function but for SM2.x I think it would be
[code]
__device__ __inline__ unsigned int __unique_threadid()
{
unsigned int landid;
unsigned int warpid;
unsigned int nwarpid;
unsigned int smid;

asm volatile("mov.u32 %0, %%laneid;" : "=r"(laneid));
asm volatile("mov.u32 %0, %%warpid;" : "=r"(warpid));
asm volatile("mov.u32 %0, %%nwarpid;" : "=r"(nwarpid));
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));

return (smid * nwarpid * 32) + (warpid * 32) + laneid;
}
[/code]
On GF100 the maximum range would be 0 - 24575.
The PTX manual (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_2.3.pdf) defines the following special registers:



%laneid - threadid in warp (0-31)

%warpid - unique ID per SM (0-%nwarpid, Fermi => 0-47)

%nwarpid - maximum warps per SM ( , Fermi = 48)

%smid - unique ID per Device (0-%nsmid, Fermi => 0-15)


On GF100 you can have at most 48 warps/SM and 16 SMs.



unique_warpid = (%smid * %nwarpid) + %warpid


These special registers can be queried using inline PTX (see http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Using_Inline_PTX_Assembly_In_CUDA.pdf)



I wasn't able to test this function but for SM2.x I think it would be



__device__ __inline__ unsigned int __unique_threadid()

{

unsigned int landid;

unsigned int warpid;

unsigned int nwarpid;

unsigned int smid;



asm volatile("mov.u32 %0, %%laneid;" : "=r"(laneid));

asm volatile("mov.u32 %0, %%warpid;" : "=r"(warpid));

asm volatile("mov.u32 %0, %%nwarpid;" : "=r"(nwarpid));

asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));



return (smid * nwarpid * 32) + (warpid * 32) + laneid;

}


On GF100 the maximum range would be 0 - 24575.

#7
Posted 05/03/2012 02:23 AM   
Just a suggestion, try dividing your work in 10 parts for example. This way you divide your need for memory.
Just a suggestion, try dividing your work in 10 parts for example. This way you divide your need for memory.

#8
Posted 05/03/2012 06:59 AM   
[quote name='bhoessen' date='02 May 2012 - 06:21 PM' timestamp='1336000876' post='1403600']
If there was only 60k counters it wouldn't be a problem, I agree. The problem is that if I don't try to make any optimization, I would need 60k counters for each of my 10 millions threads. This means a lot of memory...
[/quote]
Ah, I see now. Your original post was not clear that there were to be 60k counters per thread. That is indeed far too much for either registers or shared memory.

Does [b]every[/b] thread contribute to [b]all[/b] 60k counters? Or does each hit only a few scattered counters? For the scattered writes, you might actually get decent performance by storing only one instance of counters in device memory and then using atomicAdd in the threads. Atomics are fast in Fermi and even faster in Kepler. If you have too many collisions for that to be a viable solution, you could store one set of counters per block in shared memory - threads in that block would use shared memory atomics to update the counters. Unfortunately, 60k counters will not fit in 48k of shared memory, so you will need to run multiple passes to collect all the results.
[quote name='bhoessen' date='02 May 2012 - 06:21 PM' timestamp='1336000876' post='1403600']

If there was only 60k counters it wouldn't be a problem, I agree. The problem is that if I don't try to make any optimization, I would need 60k counters for each of my 10 millions threads. This means a lot of memory...



Ah, I see now. Your original post was not clear that there were to be 60k counters per thread. That is indeed far too much for either registers or shared memory.



Does every thread contribute to all 60k counters? Or does each hit only a few scattered counters? For the scattered writes, you might actually get decent performance by storing only one instance of counters in device memory and then using atomicAdd in the threads. Atomics are fast in Fermi and even faster in Kepler. If you have too many collisions for that to be a viable solution, you could store one set of counters per block in shared memory - threads in that block would use shared memory atomics to update the counters. Unfortunately, 60k counters will not fit in 48k of shared memory, so you will need to run multiple passes to collect all the results.

#9
Posted 05/03/2012 09:30 AM   
Thanks again for every responses!

[quote name='Greg @ NV' date='03 May 2012 - 03:23 AM' timestamp='1336011830' post='1403645']
The PTX manual (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_2.3.pdf) defines the following special registers:
[code]
%laneid - threadid in warp (0-31)
%warpid - unique ID per SM (0-%nwarpid, Fermi => 0-47)
%nwarpid - maximum warps per SM ( , Fermi = 48)
%smid - unique ID per Device (0-%nsmid, Fermi => 0-15)
[/code]
On GF100 you can have at most 48 warps/SM and 16 SMs.
[code]
unique_warpid = (%smid * %nwarpid) + %warpid
[/code]
These special registers can be queried using inline PTX (see http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Using_Inline_PTX_Assembly_In_CUDA.pdf)

I wasn't able to test this function but for SM2.x I think it would be
[code]
__device__ __inline__ unsigned int __unique_threadid()
{
unsigned int laneid;
unsigned int warpid;
unsigned int nwarpid;
unsigned int smid;

asm volatile("mov.u32 %0, %%laneid;" : "=r"(laneid));
asm volatile("mov.u32 %0, %%warpid;" : "=r"(warpid));
asm volatile("mov.u32 %0, %%nwarpid;" : "=r"(nwarpid));
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));

return (smid * nwarpid * 32) + (warpid * 32) + laneid;
}
[/code]
On GF100 the maximum range would be 0 - 24575.
[/quote]

This is doing exactly what I wanted. I just needed to specify -arch compute_20 to nvcc and it worked without any problem. On my card (Quadro 1000M) the range is between 0 - 3071. (Note: 3071 = (nb of multiprocessors * max threads per multiprocessor) - 1 )

[quote name='pasoleatis' date='03 May 2012 - 07:59 AM' timestamp='1336028359' post='1403684']
Just a suggestion, try dividing your work in 10 parts for example. This way you divide your need for memory.
[/quote]

That is a great suggestion. I'm planning to divide the work, but I wanted a "simple" version to start. It will be easier to explain to the rest of the team. And once the team will validate my work, I will start every optimization like this one, and also the ones defined in the Best practice guide.

[quote name='DrAnderson42' date='03 May 2012 - 10:30 AM' timestamp='1336037452' post='1403718']
Ah, I see now. Your original post was not clear that there were to be 60k counters per thread. That is indeed far too much for either registers or shared memory.

Does [b]every[/b] thread contribute to [b]all[/b] 60k counters? Or does each hit only a few scattered counters? For the scattered writes, you might actually get decent performance by storing only one instance of counters in device memory and then using atomicAdd in the threads. Atomics are fast in Fermi and even faster in Kepler. If you have too many collisions for that to be a viable solution, you could store one set of counters per block in shared memory - threads in that block would use shared memory atomics to update the counters. Unfortunately, 60k counters will not fit in 48k of shared memory, so you will need to run multiple passes to collect all the results.
[/quote]

Sorry if my explanation wasn't clear enough. It's not easy to be very clear :-)
Not each thread will contribute to every counters. I had never heard about atomicAdd before. Do you know how does it work internally? Is there some kind of cuda mutex? I will make some measurments and then I will implement both versions and see which is faster. I will look up for this function, thanks for the tip!

Benoit
Thanks again for every responses!



[quote name='Greg @ NV' date='03 May 2012 - 03:23 AM' timestamp='1336011830' post='1403645']

The PTX manual (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_2.3.pdf) defines the following special registers:



%laneid - threadid in warp (0-31)

%warpid - unique ID per SM (0-%nwarpid, Fermi => 0-47)

%nwarpid - maximum warps per SM ( , Fermi = 48)

%smid - unique ID per Device (0-%nsmid, Fermi => 0-15)


On GF100 you can have at most 48 warps/SM and 16 SMs.



unique_warpid = (%smid * %nwarpid) + %warpid


These special registers can be queried using inline PTX (see http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Using_Inline_PTX_Assembly_In_CUDA.pdf)



I wasn't able to test this function but for SM2.x I think it would be



__device__ __inline__ unsigned int __unique_threadid()

{

unsigned int laneid;

unsigned int warpid;

unsigned int nwarpid;

unsigned int smid;



asm volatile("mov.u32 %0, %%laneid;" : "=r"(laneid));

asm volatile("mov.u32 %0, %%warpid;" : "=r"(warpid));

asm volatile("mov.u32 %0, %%nwarpid;" : "=r"(nwarpid));

asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));



return (smid * nwarpid * 32) + (warpid * 32) + laneid;

}


On GF100 the maximum range would be 0 - 24575.





This is doing exactly what I wanted. I just needed to specify -arch compute_20 to nvcc and it worked without any problem. On my card (Quadro 1000M) the range is between 0 - 3071. (Note: 3071 = (nb of multiprocessors * max threads per multiprocessor) - 1 )



[quote name='pasoleatis' date='03 May 2012 - 07:59 AM' timestamp='1336028359' post='1403684']

Just a suggestion, try dividing your work in 10 parts for example. This way you divide your need for memory.





That is a great suggestion. I'm planning to divide the work, but I wanted a "simple" version to start. It will be easier to explain to the rest of the team. And once the team will validate my work, I will start every optimization like this one, and also the ones defined in the Best practice guide.



[quote name='DrAnderson42' date='03 May 2012 - 10:30 AM' timestamp='1336037452' post='1403718']

Ah, I see now. Your original post was not clear that there were to be 60k counters per thread. That is indeed far too much for either registers or shared memory.



Does every thread contribute to all 60k counters? Or does each hit only a few scattered counters? For the scattered writes, you might actually get decent performance by storing only one instance of counters in device memory and then using atomicAdd in the threads. Atomics are fast in Fermi and even faster in Kepler. If you have too many collisions for that to be a viable solution, you could store one set of counters per block in shared memory - threads in that block would use shared memory atomics to update the counters. Unfortunately, 60k counters will not fit in 48k of shared memory, so you will need to run multiple passes to collect all the results.





Sorry if my explanation wasn't clear enough. It's not easy to be very clear :-)

Not each thread will contribute to every counters. I had never heard about atomicAdd before. Do you know how does it work internally? Is there some kind of cuda mutex? I will make some measurments and then I will implement both versions and see which is faster. I will look up for this function, thanks for the tip!



Benoit

#10
Posted 05/03/2012 02:18 PM   
Scroll To Top