How to choose how many threads/blocks to have?
So I have a Jetson TX1, which as I understand has 2 SMs, each with 128 cores.
I read that each SM can have a maximum of 16 active blocks and 64 active warps (i.e. 2048 active threads).

Now I have copied an example that has chosen block sizes, numbers of threads, etc., but coming into it blind, how would I know how many to choose?

Do I want to reach both of these maxima for best performance? I.e. use 128 threads per block? Or is it better to have more threads per block but then not reach the active-block maximum? Or even fewer threads per block?

Sorry, follow-up question: why are the maximums for blocks per SM and warps per SM what they are? I know this is pretty much a "how does this work?" question, but I genuinely can't find a decent answer in a concise form. A link would suffice.

Many thanks! :)

#1
Posted 11/24/2017 05:10 PM   
When it comes to occupancy (and to performance, if performance depends on occupancy), the only thing that really matters is warps. You want to have as close to 64 active warps as possible, all other factors being equal.

However, this does not necessarily mean that your code is somehow deficient if you do not have 64 active warps. On the other hand, a really low number of active warps, say fewer than 32 or fewer than 16, may be a strong indicator that occupancy (i.e. a low level of achieved occupancy) is a factor to consider in the performance of your code.

The maximums are hardware maximums. Each open block requires a certain amount of "state" to be maintained for it, so it's not possible to create a HW design that supports an infinite number of open blocks per SM. And it's not desirable to burden the HW design with maintaining state for 64 blocks when 16 blocks suffice for nearly all purposes - simply make sure to choose at least 128 threads per block for your code, if this aspect of performance/occupancy is an issue.

Therefore, very small block sizes (e.g. 32 threads per block) may limit performance due to occupancy. Very large block sizes, for example 1024 threads per block, may also limit performance if there are resource limits (e.g. registers-per-thread usage, or shared memory usage) that prevent 2 thread blocks (in this example of 1024 threads per block) from being resident on an SM.

Thread block size choices in the range of 128-512 are less likely to run into the aforementioned issues. Usually there are not huge differences in performance between, say, a choice of 128 threads per block and a choice of 256 threads per block. Due to warp granularity, it's always recommended to choose a size that is a multiple of 32; power-of-2 thread block sizes are also pretty common, but not necessary.

A good basic sequence of CUDA courses would be a CUDA 101-type class, which will familiarize you with CUDA syntax, followed by an "optimization" class, which will teach the first two most important optimization objectives:

1. Choosing enough threads to saturate the machine and give the machine the best chance to hide latency
2. Efficient use of the memory subsystem(s)

Such classes/presentations can readily be found by searching for, e.g., "gtc cuda optimization".
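
If you want a concrete starting point rather than hand-tuning, the occupancy API in the CUDA runtime can suggest a block size for a particular kernel. A minimal sketch (myKernel here is just a stand-in for your own kernel, not anything from your example):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    int minGridSize = 0, blockSize = 0;
    // ask the runtime for a block size that maximizes theoretical occupancy
    // for this kernel, taking its register and shared memory usage into account
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;   // enough blocks to cover n elements

    float *data;
    cudaMalloc(&data, n * sizeof(float));
    myKernel<<<gridSize, blockSize>>>(data, n);
    cudaDeviceSynchronize();

    printf("suggested block size: %d (min grid size for full occupancy: %d)\n", blockSize, minGridSize);
    cudaFree(data);
    return 0;
}

The suggestion only maximizes theoretical occupancy; it is still worth benchmarking a couple of nearby sizes, as discussed above.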

#2
Posted 11/24/2017 06:02 PM   
Many thanks for your reply. That clears up a lot. :)

#3
Posted 11/27/2017 10:32 AM   
I actually had a very similar issue/question. I followed a relatively detailed table collecting information on individual CUDA-enabled GPUs, available at https://en.wikipedia.org/wiki/CUDA (mid-page). I use a 780 Ti for development work (compute capability 3.5) and have been looking for any indication of how to select optimum values for the block size and thread count for my application. At this time I settled (through trial and error) on 1024 threads and 64 blocks, but it only gives me ~95% execution success. Sometimes the application just crashes for no reason at all. What I am trying to do is obviously squeeze every single cycle out of the GPU for compute purposes.

Looking at the referenced Wiki page, for my GPU I can see that the parameter "Maximum number of threads per block" is equal to 1024 (the value I already use in my application), and then maximum block sizes are listed as "Maximum x-dimension of a grid of thread blocks" = 2^31-1, and "Maximum y-, or z-dimension of a grid of thread blocks" = 65535. This makes no sense to me in any way. At the same time, "Maximum number of resident grids per device" = 32, which seems to be closer to the stable operating conditions (1024 x 32) I observe through trial and error.

To help clarify the concepts, I spent the better part of the last several days going through white papers, guides, implementation examples, etc., and still there is no single reference (apart from the Wiki page) where the information is collected in any organized fashion. If there is a better reference, please let me know. Otherwise, which of the values from the Wiki table should be taken as the maximums for thread count and block count?

As a bonus, is there any way to discover these values during execution time and set them dynamically? I am doing development on the 780 Ti, but at the end of the day execution will be done on a cluster of V100s, so I am trying to discover all possible parameters dynamically at execution time (CPU type, number of cores, threads, GPU type, etc.).

Many thanks in advance, and I hope this makes some sense ...

#4
Posted 11/28/2017 12:06 AM   
[quote=""]Sometimes application just crashes for no reason at all.[/quote] Well, you should probably debug that rather than looking for a special set of operating conditions to avoid the failure. If you don't know what is causing the failure, you don't really know if you have a fix. [quote=""] then maximum block sizes are listed as "Maximum x-dimension of a grid of thread blocks" = 2^31-1, and "Maximum y-, or z-dimension of a grid of thread blocks" = 65635. This makes no sense to me in any way.[/quote] A CUDA kernel launch: mykernel<<<A,B,C,D>>>(...); has a set of launch configuration parameters, contained in the triple-chevron <<<...>>> syntax. The first parameter (A) is the number of blocks to launch, expressed as a dim3 (3-dimensional) variable. The second parameter (B) is the number of threads per block, which can also be expressed 3-dimensionally. For a thread block, you have a limit on the total number of threads (1024) as well as a limit on each dimension. The total number of threads in a block is the product of the 3 thread block dimensions. The set of all blocks associated with a kernel launch is referred to as the [b]grid[/b]. As already mentioned, the grid size is expressed using the first kernel launch config parameter, and it has relevant limits for each dimension, which is where the 2^31-1 and 65535 numbers are coming from. [quote=""] "Maximum number of resident grids per device" = 32[/quote] This refers to concurrent kernels. Probably you are not dealing with concurrent kernels. There is a 1:1 correspondence between a kernel launch and its appropriate grid, so having multiple grids resident means concurrent kernels. [quote=""]As a bonus, is there any way to discover these values during execution time and set dynamically? [/quote] Take a look at the CUDA concurrent kernels sample code. Regarding not being able to find answers about this, it is fairly basic CUDA 101 type info. Here is an introductory CUDA talk that will expose you to the difference between threads and blocks (which you seem to have not grasped): [url]http://www.nvidia.com/content/GTC-2010/pdfs/2131_GTC2010.pdf[/url] Here's a recent one that also covers threads, blocks and grid dimensions: [url]https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/[/url]
said: Sometimes the application just crashes for no reason at all.

Well, you should probably debug that rather than looking for a special set of operating conditions to avoid the failure. If you don't know what is causing the failure, you don't really know if you have a fix.

said: then maximum block sizes are listed as "Maximum x-dimension of a grid of thread blocks" = 2^31-1, and "Maximum y-, or z-dimension of a grid of thread blocks" = 65535. This makes no sense to me in any way.

A CUDA kernel launch:

mykernel<<<A,B,C,D>>>(...);

has a set of launch configuration parameters, contained in the triple-chevron <<<...>>> syntax. The first parameter (A) is the number of blocks to launch, expressed as a dim3 (3-dimensional) variable. The second parameter (B) is the number of threads per block, which can also be expressed 3-dimensionally. For a thread block, you have a limit on the total number of threads (1024) as well as a limit on each dimension. The total number of threads in a block is the product of the 3 thread block dimensions. The set of all blocks associated with a kernel launch is referred to as the grid. As already mentioned, the grid size is expressed using the first kernel launch config parameter, and it has relevant limits for each dimension, which is where the 2^31-1 and 65535 numbers are coming from.
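
For example, a hypothetical 2D launch that stays within these limits might look like this (fill2D is only an illustration, not anything from your code):

#include <cuda_runtime.h>

__global__ void fill2D(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height)
        out[y * width + x] = 1.0f;
}

int main()
{
    const int width = 4096, height = 3000;

    // block (2nd launch parameter): 16 x 16 = 256 threads, well under the
    // 1024 threads-per-block limit and a multiple of the 32-thread warp size
    dim3 block(16, 16);

    // grid (1st launch parameter): enough blocks to cover the whole problem;
    // each grid dimension only has to respect its own limit
    // (2^31-1 in x, 65535 in y and z)
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    float *out;
    cudaMalloc(&out, width * height * sizeof(float));
    fill2D<<<grid, block>>>(out, width, height);   // 3rd/4th params (shared mem, stream) default to 0
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}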


said: "Maximum number of resident grids per device" = 32


This refers to concurrent kernels. Probably you are not dealing with concurrent kernels. There is a 1:1 correspondence between a kernel launch and its corresponding grid, so having multiple grids resident means concurrent kernels.


said: As a bonus, is there any way to discover these values during execution time and set them dynamically?


Take a look at the CUDA concurrent kernels sample code.
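
Regarding run-time discovery of the hardware limits themselves, cudaGetDeviceProperties() is the usual route (the deviceQuery sample in the CUDA toolkit prints the same information in more detail). A minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaGetDevice(&dev);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    printf("device %d: %s (compute capability %d.%d)\n", dev, prop.name, prop.major, prop.minor);
    printf("SM count:                %d\n", prop.multiProcessorCount);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max block dimensions:    %d x %d x %d\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dimensions:     %d x %d x %d\n", prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}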

Regarding not being able to find answers about this: it is fairly basic CUDA 101-type info. Here is an introductory CUDA talk that will expose you to the difference between threads and blocks (which you seem not to have grasped yet):

http://www.nvidia.com/content/GTC-2010/pdfs/2131_GTC2010.pdf

Here's a recent one that also covers threads, blocks and grid dimensions:

https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/

#5
Posted 11/28/2017 12:29 AM   
As far as I can see, multiple issues are getting conflated here.

(1) There are architecture-dependent, hardware-imposed limits on grid and block dimensions. There are also other architecture-dependent resource limits, e.g. on shared memory size or register usage. These are documented in the CUDA Programming Guide.

(2) Within the limitations imposed by hardware, which thread and block configuration results in the highest performance for a given GPU depends on the code that is being run. Block configuration in particular interacts with other resource limits in terms of occupancy. There is no universal formula for the "best" configuration, which explains why you cannot find one online or in published articles. Some basic heuristics for reasonable performance in many use cases are: 10K+ total threads, 500+ blocks, 128-256 threads/block.

One can find the "optimal" configuration for a given code on a given GPU by experimentation, in particular an automated search of the space of possible configurations. Such an auto-tuning approach has been used by widely-used applications on CPUs for at least 20 years, e.g. ATLAS and FFTW.
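
A bare-bones sketch of such an experiment (the saxpy kernel is only a placeholder for whatever kernel is being tuned; each candidate block size is timed with CUDA events and the fastest one wins):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int candidates[] = {64, 128, 192, 256, 384, 512, 1024};
    for (int c = 0; c < (int)(sizeof(candidates) / sizeof(candidates[0])); c++) {
        int block = candidates[c];
        int grid = (n + block - 1) / block;
        saxpy<<<grid, block>>>(2.0f, x, y, n);   // warm-up launch
        cudaEventRecord(start);
        saxpy<<<grid, block>>>(2.0f, x, y, n);   // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %4d: %.3f ms\n", block, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}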

(3) Supposed stability issues are most often the result of unidentified software bugs, in particular race conditions and out-of-bounds accesses. cuda-memcheck is a tool for first-line defense against these. Actual instability due to hardware does happen occasionally, most often on consumer-grade GPUs (e.g. lack of ECC), and in particular vendor-overclocked GPUs that run at frequencies higher than NVIDIA's reference cards. The GTX 780Ti seems to be a special case, however, because it is the only GPU specifically advised against by the AMBER project because of stability issues:

http://ambermd.org/gpus/
"GTX-780TI Caution: With the exception of Exxact Amber certified cards the GTX-780Ti cards are NOT recommended at this time due to instability and numerical accuracy during MD simulations which we have tracked down to a specific hardware design flaw."

#6
Posted 11/28/2017 12:37 AM   
Thank you, njuffa - in other words, the 780 Ti is good for gaming but not really advisable for any serious dev work in CUDA, if I read that right. I just checked the vendor and it is PNY. Just my luck :(

All the stability issues I am observing might then be related to the said flaw in the GPU itself and not to any issue in the code.

#7
Posted 11/28/2017 01:29 AM   
I have no specific insights into the GTX 780 Ti as I have never used that GPU. I simply pointed to what looked like potentially pertinent information about an issue from people who I believe have reasonable insights into that issue.

There is general risk when using vendor-overclocked parts (regardless of the vendor): while the vendors appear to guarantee proper operation for graphics applications, I have seen no information that gives me reason to believe that compute applications are part of their qualification process. Graphics applications (and games in particular) tend to have a different usage profile, relative to the various functional units in a GPU, than compute applications. Also, any minor errors in a graphics application will likely last for the duration of a frame and never be noticed, while errors may propagate in compute applications.

The amount of risk differs by the nature of the computation (e.g. a Monte Carlo computation may tolerate an occasional error, as it contributes very little to the final result), and also by the aggressiveness of the vendor-provided overclocking. Some vendors appear to provide up to three levels of overclocking: mild overclocking, ambitious overclocking, and insane overclocking, usually readily distinguished by price level. From what I have seen, GPUs with only mild overclocking tend to be stable for most compute applications.

Excessive heat and especially an insufficient power supply can also contribute to hardware instability (modern processors tend to have occasional power spikes of very short duration), as does operation in an environment with lots of electromagnetic noise (e.g. a factory floor) or with increased radiation (e.g. extreme altitudes), which can affect the reliability of DRAM.

Overall, instances of true hardware instability seem to be much rarer than cases of latent software bugs. For example, with race conditions, software may seem to work perfectly on one GPU for months on end, only to fail once moved to a slightly different GPU model. Running cuda-memcheck provides good first-line protection against such issues, but it cannot find all bugs.

#8
Posted 11/28/2017 03:06 AM   
The 780 Ti was indeed the primary reason for the stability issues on my system. Last night I replaced it with a GTX 1080 (NVIDIA-manufactured; I made sure it comes with stock settings) and now all the crashes I was experiencing before are just gone. Thank you njuffa - that was a very good pointer, and I would never even have looked for a hardware-specific issue as far as the GPU goes. Lesson learned and KUDOS to you.

#9
Posted 11/28/2017 11:16 PM   
I am back and still struggling with memory access errors for some reason. For varGpuBlockSize = 256 * 512 = 131072 (256 blocks are used, 512 threads per block), with varDelimiterSize = 66 (the length of the bit sequence) and parallelisation factor varGpuBlockMultiplier = 64, I allocate 553,648,128 bool elements (one byte each, so 528 MB even).

bool * varSequenceDelimiter = NULL;
cudaMallocManaged(&varSequenceDelimiter, varGpuBlockMultiplier * varGpuBlockSize * varDelimiterSize * sizeof(bool));
subIntializeDelimiters <<<varGpuBlockCount, varGpuThreadCount>>> (varSequenceDelimiter, varDelimiterSize, varSpaceSizeReserved, varGpuBlockCount, varGpuThreadCount, varGpuBlockSize, varGpuBlockMultiplier);
cudaDeviceSynchronize();

The kernel is very simple: it just loops over the parallelisation factor for each thread and prints the computed parameters, without even modifying memory content.

__global__ void subIntializeDelimiters(bool * varSequenceDelimiter, int varDelimiterSize, int varSpaceSizeReserved, int varGpuBlockCount, int varGpuThreadCount, int varGpuBlockSize, int varGpuBlockMultiplier)
{
    // loop over the parallelisation factor: each thread handles one element per pass
    for (int varGpuBlock = 0; varGpuBlock < varGpuBlockMultiplier; varGpuBlock++)
    {
        // calculate the element number and its starting position for this thread and pass
        unsigned long long int varElementNumber = varGpuThreadCount * blockIdx.x + threadIdx.x + varGpuBlockSize * varGpuBlock;
        unsigned long long int varPositionStart = varDelimiterSize * varElementNumber;

        printf("\n[B%d,T%d,BC%d] - position start: %llu for element: %llu", blockIdx.x, threadIdx.x, varGpuBlock, varPositionStart, varElementNumber);
    }
}

There are 512 * 256 = 131072 threads in total, so the thread number range is [0; 131071], calculated as varGpuThreadCount * blockIdx.x + threadIdx.x:
thread number min => 512 * 0 + 0 = 0 (OK)
thread number max => 512 * 255 + 511 = 131071 (OK)

The element number varElementNumber is calculated as a function of the thread number and the parallelisation index varGpuBlock: (varGpuThreadCount * blockIdx.x + threadIdx.x) + varGpuBlockSize * varGpuBlock. With a parallelisation factor of 64, there are 131072 * 64 = 8,388,608 elements in total to examine, ranging over [0; 8,388,607].

thread block 1 start => (512 * 0 + 0) + 131072 * 0 = 0 (OK)
thread block 1 end => (512 * 255 + 511) + 131072 * 0 = 131071 (OK)
thread block 2 start => (512 * 0 + 0) + 131072 * 1 = 131072 (OK)
thread block 2 end => (512 * 255 + 511) + 131072 * 1 = 262143 (OK)
thread block 3 start => (512 * 0 + 0) + 131072 * 2 = 262144 (OK)
thread block 3 end => (512 * 255 + 511) + 131072 * 2 = 393215 (OK)
...
thread block 64 end => (512 * 255 + 511) + 131072 * 63 = 8,388,607 (OK)

Each element is 66 bits long, so the starting position for each element in each pass is varDelimiterSize * varElementNumber:
thread block 1 start => 66 * [(512 * 0 + 0) + 131072 * 0] = 0 (OK)
thread block 1 end => 66 * [(512 * 255 + 511) + 131072 * 0] = 8,650,686 (OK)
thread block 2 start => 66 * [(512 * 0 + 0) + 131072 * 1] = 8,650,752 (OK)
thread block 2 end => 66 * [(512 * 255 + 511) + 131072 * 1] = 17,301,438 (OK)
thread block 3 start => 66 * [(512 * 0 + 0) + 131072 * 2] = 17,301,504 (OK)
thread block 3 end => 66 * [(512 * 255 + 511) + 131072 * 2] = 25,952,190 (OK)
...
thread block 64 end => 66 * [(512 * 255 + 511) + 131072 * 63] = 553,648,062 (OK)

The math for calculating the position within the 553,648,128-element block allocated for the kernel therefore works out fine. However, when running the code with Nsight and the memory checker enabled, this is all I get:

CUDA context created : 1dd4917bea0
CUDA module loaded: 1dd5dd77b20 kernel.cu
CUDA grid launch failed: CUcontext: 2049925693088 CUmodule: 2050273803040 Function: _Z22subIntializeDelimitersPbiiiiii
CUDART error: cudaLaunch returned cudaErrorLaunchFailure

CUDART error: cudaDeviceSynchronize returned cudaErrorLaunchFailure

CUDART error: cudaGetLastError returned cudaErrorLaunchFailure

Here is the output from cout:

=========================

GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

Device 0 GeForce GTX 1080 with Compute 6.1 capabilities will be used

CUDA kernel launch (initialize delimiters) with 256 blocks and 512 threads per block, thread block multiplier 64

[B39,T32,BC0] - position start: 1320000 for element: 20000
[B39,T33,BC0] - position start: 1320066 for element: 20001
(... truncated, long output ... )
[B26,T157,BC0] - position start: 888954 for element: 13469
[B26,T158,BC0] - position start: 889020 for element: 13470
[B26,T159,BC0] - position start: 889086 for element: 13471

The position start and element number calculations are correct, and given that there is no direct access to the memory allocated for the kernel, there should be no overrun problem at all. However, the kernel still returns a launch failure for some reason.

Any ideas what the issue might be?

thanks !

#10
Posted 12/03/2017 12:54 AM   
Compile your code with the -lineinfo switch, then run your code with cuda-memcheck from a command line/command prompt, and inspect the output.

If necessary (i.e. a kernel code issue), follow the methodology outlined here:

https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218

#11
Posted 12/03/2017 01:02 AM   
Thank you.

I followed this article (http://docs.nvidia.com/nsight-visual-studio-edition/3.2/Content/CUDA_Properties_Config.htm) and set Generate Line Number Information to Yes,

but then in the compiler output I get this:

1>CUDACOMPILE : nvcc warning : '--device-debug (-G)' overrides '--generate-line-info (-lineinfo)'

I assume this means I need to disable --device-debug (-G) to have access to the -lineinfo option? The referenced page indicates that "If Generate GPU Debug Information is on (-G), line information (-lineinfo) is automatically generated as well.", so it seems line information is already enabled and does not need to be set manually for the project. Is that correct?

#12
Posted 12/03/2017 01:15 AM   
And the launch with cuda-memcheck produced this output:

========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuTexRefSetAddress + 0x2cae4e) [0x2d849b]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\cudart64_90.dll (cudaDeviceSynchronize + 0x10e) [0x1b22e]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (subCalculate + 0x7a3) [0x5aa3]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (main + 0x48a) [0x698a]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (invoke_main + 0x34) [0x200d4]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (__scrt_common_main_seh + 0x127) [0x1ffc7]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (__scrt_common_main + 0xe) [0x1fe8e]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (mainCRTStartup + 0x9) [0x20169]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x11fe4]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6ef91]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaGetLastError.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuTexRefSetAddress + 0x2cae4e) [0x2d849b]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\cudart64_90.dll (cudaGetLastError + 0x107) [0x1cf07]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (subCalculate + 0x7a8) [0x5aa8]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (main + 0x48a) [0x698a]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (invoke_main + 0x34) [0x200d4]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (__scrt_common_main_seh + 0x127) [0x1ffc7]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (__scrt_common_main + 0xe) [0x1fe8e]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (mainCRTStartup + 0x9) [0x20169]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x11fe4]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6ef91]
=========
========= ERROR SUMMARY: 2 errors

#13
Posted 12/03/2017 01:16 AM   
If you are creating a debug build, you don't need to specify -lineinfo. The equivalent of -lineinfo is already included in a debug build.

My best guess right now would be that your program is hitting the Windows kernel timeout mechanism.

Try reducing the size of the kernel until this issue goes away (e.g. reduce the number of blocks launched). Then profile the code to estimate the kernel duration. If it is long (more than half a second or so), there's a good chance the larger/failing case is hitting the timeout.

Debug-build kernels usually run slower than release builds, so switching to a release build may drop your kernel execution time under the limit, if that is the issue you are hitting here.
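
If the timeout does turn out to be the culprit, one workaround is to split the work across several smaller launches so that no single launch runs long enough to trip the watchdog. A rough sketch with a placeholder kernel and sizes (not your actual code):

#include <cuda_runtime.h>

// placeholder kernel: takes a block offset so each small launch knows which
// slice of the overall problem it covers
__global__ void work(float *buf, int blockOffset, int n)
{
    int globalBlock = blockOffset + blockIdx.x;          // block index within the full problem
    int i = globalBlock * blockDim.x + threadIdx.x;      // global element index
    if (i < n) buf[i] = (float)i;
}

int main()
{
    const int threadsPerBlock = 512;
    const int totalBlocks = 16384;                       // the full problem size
    const int n = totalBlocks * threadsPerBlock;

    float *buf;
    cudaMalloc(&buf, (size_t)n * sizeof(float));

    const int chunk = 256;                               // blocks per launch; tune as needed
    for (int offset = 0; offset < totalBlocks; offset += chunk) {
        int blocks = (totalBlocks - offset < chunk) ? (totalBlocks - offset) : chunk;
        work<<<blocks, threadsPerBlock>>>(buf, offset, n);
        cudaDeviceSynchronize();                         // each short launch stays under the display timeout
    }

    cudaFree(buf);
    return 0;
}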

#14
Posted 12/03/2017 01:24 AM   
If I were to read this explicitly, the problem is with the cudaDeviceSynchronize function called immediately after the kernel is launched:

subIntializeDelimiters <<<varGpuBlockCount, varGpuThreadCount>>> (varSequenceDelimiter, varDelimiterSize, varSpaceSizeReserved, varGpuBlockCount, varGpuThreadCount, varGpuBlockSize, varGpuBlockMultiplier);
cudaDeviceSynchronize();

and then with the following cudaGetLastError call to confirm execution completed just fine. Is it possible that execution of the kernel just did not finish by the time cudaDeviceSynchronize was called, and hence the error?

// process any CUDA kernel errors
cudaError_t varCudaError = cudaGetLastError();
if (varCudaError != cudaSuccess)
{
    std::cout << "Failed to launch subIntializeDelimiters kernel (error code: " << cudaGetErrorString(varCudaError) << ")!" << std::endl;
    exit(EXIT_FAILURE);
}
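
As a point of reference, this is the checking pattern I understand to be the usual one (someKernel is just a placeholder); please correct me if I have the semantics of cudaGetLastError() versus the value returned by cudaDeviceSynchronize() wrong:

#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>

__global__ void someKernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 0.0f;
}

int main()
{
    const int n = 1 << 20;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));

    someKernel<<<(n + 511) / 512, 512>>>(buf, n);

    // error from the launch itself (e.g. an invalid configuration)
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess)
    {
        std::cout << "kernel launch failed: " << cudaGetErrorString(launchErr) << std::endl;
        exit(EXIT_FAILURE);
    }

    // blocks until the kernel has finished; its return value reports errors
    // hit during execution (memory faults, or a watchdog-killed kernel)
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (syncErr != cudaSuccess)
    {
        std::cout << "kernel execution failed: " << cudaGetErrorString(syncErr) << std::endl;
        exit(EXIT_FAILURE);
    }

    cudaFree(buf);
    return 0;
}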

#15
Posted 12/03/2017 01:24 AM   