I had a program working with a variable "a" in video memory, and it works fine. Now I want to make that variable shared, in the way shown below. The scope of "a" would then be a single block, so if I run multiple blocks as in the example, "a" would be allocated multiple times, once in each block. But each block needs almost all the shared memory of an SMX. So, even though each SMX has 192 cores, does it run only one block of 32 threads, because it only has enough shared memory for that one block? Is that what happens?
I am asking because it is giving bad results. It seems there is some kind of race problem, and things are not happening the way I described above.
line 4: you likely do not need __device__ together with __shared__
it is not entirely clear what you are saying/asking
“I want to pass that variable as shared”
i presume “a” originates in global memory, and is read into shared memory
“must be allocated multiple times one in each block”
i.e. “a” is allocated on a per-block basis: it is allocated once for each resident block, and its size does not depend on the total number of blocks
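as a minimal sketch of what i mean (the kernel name and array size are my own assumptions, not taken from your program): each block that becomes resident gets its own instance of the shared array, and fills it from its own slice of global memory

```cuda
#define N 256  // assumed per-block array size, for illustration only

__global__ void per_block_demo(const float *g_in)
{
    // one instance of this array exists per resident thread block;
    // blocks cannot see each other's copy
    __shared__ float a[N];

    int t = threadIdx.x;
    if (t < N)
        a[t] = g_in[blockIdx.x * N + t];  // each block loads its own slice

    __syncthreads();  // make the loads visible to all threads of the block

    // ... block-local work on a[] ...
}
```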
“But each block needs almost all shared memory of a smx”
perhaps.
an alternative is to loop over the variable within the block, in order to fully cover it
if “a” has a length of x, you can cover it with a smaller block size, by looping within the block
it depends on the reason why you wish to use shared memory in the first place
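to sketch the looping idea (again, names and sizes here are assumptions for illustration): a block of BLOCK threads covers a shared array of LEN > BLOCK elements by having each thread stride through it

```cuda
#define LEN   2048  // assumed shared-array length
#define BLOCK 256   // assumed block size, smaller than LEN

__global__ void loop_in_block(const float *g_in, float *g_out)
{
    __shared__ float a[LEN];

    // each thread strides through the array in steps of blockDim.x,
    // so BLOCK threads together cover all LEN elements
    for (int i = threadIdx.x; i < LEN; i += blockDim.x)
        a[i] = g_in[blockIdx.x * LEN + i];

    __syncthreads();  // the whole array must be loaded before use

    for (int i = threadIdx.x; i < LEN; i += blockDim.x)
        g_out[blockIdx.x * LEN + i] = a[i] * 2.0f;  // placeholder work
}
```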
“Then despite each smx has 192 cores it runs only one block of 32 threads because it has only shared memory for that block”
A: each smx has 192 cores
B: it runs only one block of 32 threads because it has only shared memory for that block
i am not sure there is a direct link between A and B
Previously, I declared “a” in video memory (global memory). In the new, upgraded program I want that variable to be of type shared, and I remove it from global memory. The variable is transitory.
My question is: if I have only one SMX, what happens in the program above?
A: Across the 192 cores, do 192/32 blocks run at the same time, with the shared memory “a” written by all of them concurrently? In that case, how do I run one block at a time, so that each block has exclusive access to “a”?
B: Or is only one block run at a time, because there is only enough shared memory for one block at a time?
if “a” was in global memory, and “a” is transitory, how did you prevent races at the global memory level between thread blocks?
the potential for and treatment of races on the variable are more or less the same, regardless of whether the variable is in shared, or in global memory
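as a sketch of what i mean (the names here are mine, not from your program): within a block you order shared-memory accesses with __syncthreads() and atomics; between blocks, on a global variable, you likewise need atomics (or separate kernel launches)

```cuda
#include <cstdio>

__global__ void race_demo(int *g_counter)
{
    __shared__ int s_counter;

    if (threadIdx.x == 0)
        s_counter = 0;
    __syncthreads();                      // order the write within the block

    atomicAdd(&s_counter, 1);             // shared-memory race: use an atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(g_counter, s_counter);  // global-memory race between blocks
}

int main()
{
    int *d_counter, h_counter = 0;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);

    race_demo<<<4, 32>>>(d_counter);

    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("total: %d\n", h_counter);     // 4 blocks x 32 threads = 128
    cudaFree(d_counter);
    return 0;
}
```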
what is the depth (array length) of a?
do you initialize it at the very beginning with input data?
do all thread blocks access the entire variable array, or only parts thereof?
with regards to B:
to view it differently, the gpu looks at the local and shared memory requirements per block, among other things, and then decides how many blocks can be resident on a particular sm at the same time
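you can actually ask the runtime for this number; a sketch (the kernel and its shared-memory footprint are assumptions for illustration):

```cuda
#include <cstdio>

// assumed kernel using ~44 KB of static shared memory, close to the
// 48 KB per-block limit on kepler-class (smx) devices
__global__ void my_kernel()
{
    __shared__ float a[11 * 1024];
    a[threadIdx.x] = 0.0f;  // touch the array so it is not optimized away
}

int main()
{
    int blocks_per_sm = 0;

    // how many blocks of 32 threads can be resident per sm, given
    // my_kernel's shared memory and register usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, /*blockSize=*/32, /*dynamicSmem=*/0);

    printf("resident blocks per SM: %d\n", blocks_per_sm);
    return 0;
}
```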
-“a” in global memory is a[54721614] (with cudaMalloc). The program runs fine.
i presume this is followed by a h2d memory copy soon afterwards?
if you use shared memory, you still need the initial h2d memory copy, and a read from global into shared
also, i am not sure about your ‘coupling’
essentially, i would think that, if you use a structure, you must read in a structure; otherwise, you need to deep copy at some point
(i can not believe i am about to reference the debugger)
you could also use the debugger and peek at the data in shared memory, to see whether it is valid/garbage
The shared memory (or global memory in the first program) is initialized within the threads; it is transitory memory. I think I do not need to copy it from anywhere.
The shared memory is organized as 32 x complex1, 32 x complex1, ... (x171).
Each thread (of the block) has one complex1 in each group of 32 x complex1.
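If I describe that layout in code, it would be something like the following (the definition of complex1 here is only an assumption; it is my own type in the real program):

```cuda
struct complex1 { float re, im; };  // assumed definition, for illustration

__global__ void layout_demo()
{
    // 171 groups of 32 complex1 values: 171 * 32 * 8 bytes ~= 43 KB,
    // close to the full 48 KB of shared memory of one SMX
    __shared__ complex1 a[171][32];

    // thread t of the 32-thread block owns a[g][t] in every group g
    int t = threadIdx.x;  // 0..31
    for (int g = 0; g < 171; ++g) {
        a[g][t].re = 0.0f;  // transitory: initialized inside the kernel,
        a[g][t].im = 0.0f;  // no host-to-device copy needed
    }
    __syncthreads();

    // ... computation using a[][] ...
}
```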
Taking advantage of the topic… does a variable declared in shared memory have the same content for all blocks? For example:
__global__ void kernel() {
    __shared__ int var;
    var = -1;
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        var = 0;
    }
    else if (blockIdx.x == 1 && threadIdx.x == 0) {
        var = 1;
    }
    printf("Block: %d thread %d value %d\n", blockIdx.x, threadIdx.x, var);
}

int main() {
    kernel<<<2, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}
If blockIdx.x == 0, do all threads of block 0 see var == 0 at the same time that all threads of block 1 see var == 1? Or is the content of var changed for both blocks?