Hello CUDA users,
We’ve just posted a new tool on the CUDA site, the CUDA Occupancy Calculator. This tool is a Microsoft Excel spreadsheet that helps you choose a thread block size for your kernel in order to achieve the highest occupancy of the GPU. The CUDA Occupancy Calculator can be found on the CUDA homepage.
Here is a direct link: CUDA Occupancy Calculator
Hopefully this will answer your questions about occupancy, register file size, how register and shared memory usage affect efficiency, and how to tune your thread block size. The documentation for this tool follows, and it is also included in the Excel spreadsheet on the “help” tab.
As we release new GPUs with different parameters, we’ll add them to this tool so it can be used for any GPU.
If you have questions about or problems with the CUDA Occupancy Calculator, please post them in the forums.
Cheers,
Mark
------------------------------------ Documentation ------------------------------------------
Overview
The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA thread programs. These registers are a shared resource that is allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail.
The size of N on G80 is 8192 32-bit registers per multiprocessor.
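As a rough illustration of the launch condition above, here is a minimal host-side sketch. The helper name `block_fits_in_registers` is hypothetical (it is not part of CUDA or the calculator), and it ignores any allocation granularity the hardware may apply; only the 8192 figure comes from the text.

```cpp
#include <cassert>

// Register file size per multiprocessor on G80, as stated in the text.
constexpr int kRegistersPerMultiprocessorG80 = 8192;

// Hypothetical helper: returns true when a single thread block's register
// demand fits in the register file, i.e.
//   regs_per_thread * threads_per_block <= N.
// Assumption: registers are counted per thread with no rounding.
bool block_fits_in_registers(
    int regs_per_thread, int threads_per_block,
    int registers_per_multiprocessor = kRegistersPerMultiprocessorG80) {
    assert(regs_per_thread >= 0 && threads_per_block > 0);
    const long long needed =
        static_cast<long long>(regs_per_thread) * threads_per_block;
    return needed <= registers_per_multiprocessor;
}
```

For instance, a 512-thread block at 16 registers per thread needs exactly 8192 registers and fits, while the same block at 17 registers per thread would need 8704 and, by the rule above, the launch would fail.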
Maximizing the occupancy can help to cover latency during global memory loads that are followed by a __syncthreads(). The occupancy is determined by the amount of shared memory and registers used by each thread block. Because of this, programmers need to choose the size of thread blocks with care in order to maximize occupancy. The CUDA Occupancy Calculator can assist in choosing thread block size based on shared memory and register requirements.
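To make the definition concrete, the following is a simplified sketch of how occupancy could be estimated by hand. It is not the spreadsheet's actual formula: it ignores register and shared memory allocation granularity, so its results may differ from the calculator's. Apart from the 8192 registers quoted above, the G80 limits in `kG80` (warp size 32, 24 warps, 8 blocks, and 16384 bytes of shared memory per multiprocessor) are assumptions based on commonly published G80 figures, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. Hypothetical struct for illustration only.
struct GpuLimits {
    int warp_size;
    int max_warps_per_sm;
    int max_blocks_per_sm;
    int registers_per_sm;
    int shared_mem_per_sm;  // bytes
};

// Assumed G80 values; only registers_per_sm (8192) is stated in the text.
constexpr GpuLimits kG80{32, 24, 8, 8192, 16384};

// Occupancy = active warps / maximum warps per multiprocessor.
// The number of active blocks is the tightest of four limits:
// the block cap, the warp cap, registers, and shared memory.
double occupancy(int threads_per_block, int regs_per_thread,
                 int smem_per_block, const GpuLimits& gpu = kG80) {
    assert(threads_per_block > 0 && regs_per_thread >= 0 &&
           smem_per_block >= 0);
    // Partial warps still occupy a full warp slot.
    const int warps_per_block =
        (threads_per_block + gpu.warp_size - 1) / gpu.warp_size;
    const int by_warps = gpu.max_warps_per_sm / warps_per_block;
    const int regs_per_block = regs_per_thread * threads_per_block;
    const int by_regs = regs_per_block > 0
                            ? gpu.registers_per_sm / regs_per_block
                            : gpu.max_blocks_per_sm;
    const int by_smem = smem_per_block > 0
                            ? gpu.shared_mem_per_sm / smem_per_block
                            : gpu.max_blocks_per_sm;
    const int active_blocks =
        std::min({gpu.max_blocks_per_sm, by_warps, by_regs, by_smem});
    const int active_warps = active_blocks * warps_per_block;
    return static_cast<double>(active_warps) / gpu.max_warps_per_sm;
}
```

Under these assumed limits, `occupancy(256, 5, 2072)` works out to 1.0 (three blocks of eight warps fill all 24 warp slots), while raising register usage to 16 per thread with `occupancy(256, 16, 0)` leaves room for only two blocks, or 16 of 24 warps.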
Instructions
Using the CUDA Occupancy Calculator is as easy as 1-2-3. Change to the calculator sheet and follow these three steps.
1.) First select your GPU in the green box.
2.) For the kernel you are profiling, enter the number of threads per thread block, the registers used per thread, and the total shared memory used per thread block in bytes in the orange box. See below for how to find the registers used per thread.
3.) Examine the blue box and the graph to the right. These will tell you the occupancy, as well as the number of active threads, warps, and thread blocks per multiprocessor, and the maximum number of active blocks on the GPU. The graph will show you the occupancy for your chosen block size as a red triangle, and for all other possible block sizes as a line graph.
You can now experiment with how different thread block sizes, register counts, and shared memory usages can affect your GPU occupancy.
Determining Registers Per Thread and Shared Memory Per Thread Block
To determine the number of registers used per thread in your kernel, simply compile the kernel code using the -cubin option to nvcc. This will generate a .cubin file, which you can open in a text editor. Look for the “code” section with your kernel’s name. Within the curly braces (“{ … }”) for that code block, you will see a line with “reg = X”, where X is the number of registers used by your kernel. You can also see the amount of shared memory used as “smem = Y”. However, if your kernel declares any external shared memory that is allocated dynamically, you will need to add the number in the .cubin file to the amount you dynamically allocate at run time to get the correct shared memory usage. An example is below:
code {
name = my_kernel
lmem = 0
smem = 24
reg = 5
bar = 0
bincode { … }
const { … }
}
Let’s say “my_kernel” contains an external shared memory array which is allocated to be 2048 bytes at run time. Then our total shared memory usage is 2072 bytes. We enter this into the box labeled “shared memory per block (bytes)”, and we enter the number of registers used by my_kernel, 5, in the box labeled registers per thread. We then enter our thread block size and the calculator will display the occupancy.
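If you would rather not read the .cubin file by eye, the lookup and the addition described above could be scripted along these lines. This is only a sketch: `read_cubin_usage` and `total_shared_bytes` are hypothetical helpers, and the parser assumes the text layout shown in the example, with one “key = value” pair per line and “name” appearing before “reg” and “smem” inside each code block.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Values read from one kernel's "code" section; -1 means "not found".
struct KernelResources {
    int registers_per_thread = -1;
    int static_shared_bytes = -1;
};

// Hypothetical helper: scans .cubin text for "reg = X" and "smem = Y"
// lines that follow "name = <kernel_name>".
KernelResources read_cubin_usage(const std::string& cubin_text,
                                 const std::string& kernel_name) {
    KernelResources found;
    std::istringstream lines(cubin_text);
    std::string line;
    bool in_kernel = false;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string key, equals;
        // Skip lines that are not of the form "key = value".
        if (!(fields >> key >> equals) || equals != "=") continue;
        if (key == "name") {
            std::string name;
            fields >> name;
            in_kernel = (name == kernel_name);
        } else if (in_kernel && key == "reg") {
            fields >> found.registers_per_thread;
        } else if (in_kernel && key == "smem") {
            fields >> found.static_shared_bytes;
        }
    }
    return found;
}

// Total shared memory per block = static amount from the .cubin file
// plus whatever is allocated dynamically at run time.
int total_shared_bytes(const KernelResources& kernel,
                       int dynamic_shared_bytes) {
    assert(kernel.static_shared_bytes >= 0 && dynamic_shared_bytes >= 0);
    return kernel.static_shared_bytes + dynamic_shared_bytes;
}
```

Fed the example section for my_kernel and a dynamic allocation of 2048 bytes, this gives 5 registers per thread and 24 + 2048 = 2072 bytes of shared memory, the same numbers entered into the calculator above.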
For more information on NVIDIA CUDA, visit http://developer.nvidia.com/cuda