What happens if memory is full?
Hi,

All in the title. I mean, if the shared memory of a block is full, what happens? Does the program stop, or is global memory used instead?

And I guess that when global memory is full we get a memory error, such as 'unspecified launch failure'?

#1
Posted 03/28/2012 02:31 PM   
How do you expect shared memory to become "full"? There are no dynamic allocations of shared memory from within a kernel. The only way in which shared memory might not be large enough is when the static allocation + the memory allocation at kernel launch exceed the shared memory per SM, in which case the kernel does not launch at all.
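For illustration, a sketch (identifiers and sizes are my own) of the only two ways shared memory gets reserved — a static declaration and the per-launch dynamic size — and of the launch-time check described above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Static shared memory: size fixed at compile time.
__global__ void kernelStatic(float *out)
{
    __shared__ float tile[256];               // 1 kB, known to the compiler
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

// Dynamic shared memory: size supplied as the third launch parameter.
__global__ void kernelDynamic(float *out)
{
    extern __shared__ float dynTile[];
    dynTile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = dynTile[threadIdx.x];
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    // If 256 * sizeof(float) plus the kernel's static allocations exceeded
    // the per-SM shared memory, this launch would simply fail (no partial
    // run, no fallback to global memory).
    kernelDynamic<<<1, 256, 256 * sizeof(float)>>>(d_out);
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_out);
    return 0;
}
```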

EDIT: Correct typo that distorted the meaning.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.
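A minimal error-checking helper along those lines (the macro name is my own) might look like:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so a failure is reported immediately,
// with the file and line where it happened.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage pattern:
//   CUDA_CHECK(cudaMalloc(&d_buf, bytes));
//   myKernel<<<grid, block>>>(d_buf);
//   CUDA_CHECK(cudaGetLastError());        // catches launch failures
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution
```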

#2
Posted 03/28/2012 02:58 PM   
[quote name='Dext' date='28 March 2012 - 08:31 AM' timestamp='1332945086' post='1389047']
Hi,

All in the title. I mean, if the shared memory of a block is full, what happens? Does the program stop, or is global memory used instead?

And I guess that when global memory is full we get a memory error, such as 'unspecified launch failure'?
[/quote]

Assuming you aren't using malloc() in your kernel code, you will get a specific out of memory error code (cudaErrorMemoryAllocation, I believe) from cudaMalloc() if your request exceeds the available global memory. (Remember to check all those return codes, even if you just plan to abort!) An unspecified launch failure indicates something more like a memory access violation.

The only way a kernel can hit a shared memory limit is if it requests more than the total configured shared memory per multiprocessor (16 kB on compute capability < 2.0, 16 or 48 kB on compute capability 2.x, and 16, 32, or 48 kB on compute capability 3.0). In that case, the kernel will fail to launch, and the next CUDA function should return cudaErrorInvalidConfiguration. Because shared memory is not dynamically allocated, it is impossible for a kernel to hit an "out of shared memory" condition while running.
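A small host-side sketch of that distinction (the sizes here are made up): an impossible cudaMalloc() request fails cleanly with an error code rather than anything like an unspecified launch failure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    float *d_buf = NULL;

    // Deliberately absurd request: cudaMalloc() returns
    // cudaErrorMemoryAllocation instead of crashing anything.
    cudaError_t err = cudaMalloc((void **)&d_buf, (size_t)1 << 44); // 16 TiB
    printf("huge cudaMalloc: %s\n", cudaGetErrorString(err));

    // A reasonable request succeeds -- but check the code anyway.
    err = cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_buf);
    return 0;
}
```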

#3
Posted 03/28/2012 03:05 PM   
@tera
[quote]
The only way in which static memory might not be large enough is when the static allocation + memory allocation at kernel launch exceed the shared memory per SM.
[/quote]

Yes, I think that's it. I am not sure what SM means? I still have some trouble with memory (as you can see); I hope I am being clear enough.

I did some calculations, and I can launch the kernel even if the static allocation + memory allocation is greater than the total amount of shared memory per block reported by deviceQuery (49152 bytes). Indeed, I allocate (with malloc, so dynamically? and with float, for example) at least 1740 bytes per thread and launch 54 threads per block (XBLOCK=54), which gives 54*1740 = 93960 bytes per block (so in shared memory?) > 49152 bytes.

I may have misunderstood the way shared memory is used.

@seibert
I tried to do a memCpy after my kernel, and it returns 'unspecified launch failure' if I increase XBLOCK or XGRID (I have YBLOCK and YGRID equal to 1). But with small values of XBLOCK and XGRID (54 and 35, for example) my program works (slowly, but it works).
For each thread I make 2 mallocs of 206*sizeof(float) = 1664 bytes, so maybe it is too big? If I increase XBLOCK, some mallocs fail and I get 'unspecified launch failure' after the kernel (you will say: of course your array is not defined, because malloc failed).
I really suspect malloc failed because of a lack of memory.
What do you think? Is there a solution to this problem, or do I just have to reduce my memory usage?


I really appreciate your answers, and I hope I am as clear as you are. Sorry if I am unclear.

#4
Posted 03/28/2012 03:28 PM   
[quote name='Dext' date='28 March 2012 - 03:28 PM' timestamp='1332948480' post='1389076']
Yes, I think that's it. I am not sure what SM means? I still have some trouble with memory (as you can see); I hope I am being clear enough.
[/quote]
Here SM stands for Streaming Multiprocessor, i.e. the instance that shares a shared memory block.

[quote name='Dext' date='28 March 2012 - 03:28 PM' timestamp='1332948480' post='1389076']
I did some calculations, and I can launch the kernel even if the static allocation + memory allocation is greater than the total amount of shared memory per block reported by deviceQuery (49152 bytes). Indeed, I allocate (with malloc, so dynamically? and with float, for example) at least 1740 bytes per thread and launch 54 threads per block (XBLOCK=54), which gives 54*1740 = 93960 bytes per block (so in shared memory?) > 49152 bytes.
[/quote]
Sorry, I had a meaning-distorting typo in that post. You can have more than 49152 bytes of [i]static[/i] memory, but not more than 49152 bytes of [i]shared[/i] memory per block.

In-kernel malloc() (or any dynamic allocation for that matter) doesn't give you shared memory, but global memory.
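A short sketch of that distinction (a hypothetical kernel; note that in-kernel malloc() can fail and must be checked):

```cuda
#include <cstdlib>

__global__ void demo(float *out)
{
    // Shared memory: on-chip, one copy per block, size fixed before launch.
    __shared__ float tile[64];
    tile[threadIdx.x % 64] = (float)threadIdx.x;
    __syncthreads();

    // In-kernel malloc(): served from a heap carved out of *global* memory,
    // one allocation per calling thread -- and it can return NULL.
    float *p = (float *)malloc(64 * sizeof(float));
    if (p != NULL) {
        p[0] = tile[0];
        out[threadIdx.x] = p[0];
        free(p);
    }
}
```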

#5
Posted 03/28/2012 03:44 PM   
Thanks a lot. I will think about all of this to improve my program, and I will ask more questions if I need to.

[quote name='tera' date='28 March 2012 - 05:44 PM' timestamp='1332949449' post='1389090']
In-kernel malloc() (or any dynamic allocation for that matter) doesn't give you shared memory, but global memory.
[/quote]

And what about the "usual allocation", for example
[code]float a;
[/code]
Is it allocated in global memory, as a static allocation?

#6
Posted 03/28/2012 04:07 PM   
That will either give you a register, or "local" memory (basically global memory, with a different layout to improve coalescing).
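For instance (a hypothetical kernel; whether the compiler actually spills is its own decision):

```cuda
__global__ void demo(float *out, int i)
{
    float a = 2.0f;     // a plain scalar: almost certainly kept in a register

    float buf[256];     // indexed with a runtime value below, so the compiler
                        // will typically place it in "local" memory, i.e.
                        // thread-private global memory with a layout chosen
                        // so neighboring threads' accesses coalesce
    for (int k = 0; k < 256; ++k)
        buf[k] = a * k;

    out[threadIdx.x] = buf[i];  // runtime index forces addressable storage
}
```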

#7
Posted 03/28/2012 04:27 PM   
[quote name='tera' date='28 March 2012 - 06:27 PM' timestamp='1332952034' post='1389112']
That will either give you a register, or "local" memory (basically global memory, with a different layout to improve coalescing).
[/quote]

Once again, thanks a lot, tera. Your answers are really clear.

#8
Posted 03/29/2012 08:39 AM   