Basics re: memory & performance - total newbie to CUDA

First of all thanks to NVIDIA for the creation of CUDA.
This may be a total newbie question but you will soon understand
my confusion about the issues re: memory and threading in CUDA.

Below is a deviceQuery printout from someone else with an adapter identical to the one I have,
namely the ZOTAC GeForce GTX 570.

The issues I have are the following:

A) The terminology!

Some sources state you can use 4 SMs at a time; others have said 5.
Given the maximum of 1536 threads per SM, it appears as if only 1.5 blocks
can run at full threading concurrently, even though 1536 threads may be scheduled?

Per the specs below, the card lets one issue 1024 threads to one block but only 1536 to any one SM.

The number of warps per SM is 32, and so is the number of cores per SM.
Are these equivalent, and thus ambiguous statements/terms?

The specs further state: “Run time limit on kernels: Yes”.
To me this is like saying “Q: When will we arrive at point X? A: Yes!”
In short, what is the timeout in s, ms or µs, or alternatively in clock ticks?

B) The maximum number of concurrent threads?
Is this 32 × 1.5 × the thread limit above, or is it much more, and if so, by how much?

C) Finally: Unified Memory.
On the card specified below, is the Unified Memory maximum equal to the L2 cache size (655360 bytes),
or to the total amount of constant memory (65536 bytes),
or to the total amount of shared memory per block (49152 bytes)?

That is a lot of questions, and the near-constant variation in terminology used in the manuals and discussions is very confusing, so I for one hope some helpful soul can shed light on this.

/Thanks in advance!

—GTX 570 specs—
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GeForce GTX 570”
CUDA Driver Version / Runtime Version 8.0 / 7.5
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1280 MBytes (1342177280 bytes)
(15) Multiprocessors, ( 32) CUDA Cores/MP: 480 CUDA Cores
GPU Max Clock rate: 1560 MHz (1.56 GHz)
Memory Clock rate: 1900 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Answering myself on question C: “Unified Memory”.

No such support is available at all on the GPU given above.
It also appears that SM_2.0 and SM_2.1 are deprecated in CUDA 8.0.
There does in fact not exist any Unified Memory on this GPU, per my test run today.

(It is only available from the Kepler GPU architecture onward.)

One more experience along the same lines as always in this industry.

SM (Streaming Multiprocessor) and multiprocessor are the same thing. The printout indicates 15 SMs. A CUDA kernel with enough blocks (15 or more) can use all available SMs.

The size of a block is not fixed at 1024 threads; that is just the maximum. For example, a CUDA kernel could launch blocks of 512 threads each, and 3 of those blocks could fully “occupy” a Fermi SM of the GTX 570 (1536 threads total).
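
As an illustration only (the kernel and sizes here are made up), a launch along these lines would give every one of the 15 SMs work and allow 3 resident blocks of 512 threads per SM:

[code]
// Illustrative sketch: 512-thread blocks, far more than 15 of them, so all
// SMs on a GTX 570 get work and 3 blocks (3 x 512 = 1536 threads) can be
// resident on each Fermi SM at once.
#include <cuda_runtime.h>

__global__ void scale(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        out[i] = 2.0f * i;                           // trivial per-element work
}

int main()
{
    const int n = 1 << 20;                           // 1M elements (arbitrary)
    const int threadsPerBlock = 512;                 // 3 such blocks fill one SM
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    scale<<<numBlocks, threadsPerBlock>>>(d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
[/code]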

No, they are not the same. Warp is defined in the programming guide. You may want to start reading it:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

A warp is a collection of 32 threads. CUDA cores are execution resources. Execution resources and threads are not the same thing.

This may vary by OS and is controlled by the OS. For example, on Windows the exact number is determined by registry settings and may change. Typical default values are around 2 seconds.

This would be the maximum thread complement per SM multiplied by the number of SMs, so 15 × 1536 = 23040 for this particular GPU.
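
If you would rather compute that number programmatically than read it off the deviceQuery printout, a small sketch like this (device 0 assumed) uses the same properties the runtime exposes:

[code]
// Sketch: per-SM resident-thread limit times SM count (1536 x 15 on a GTX 570).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // device 0 assumed

    int maxResident = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
    printf("%d SMs x %d threads/SM = %d maximum resident threads\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, maxResident);
    return 0;
}
[/code]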

That card doesn’t support Unified Memory. You may want to start reading about Unified Memory in the programming guide linked above.
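
You can also ask the runtime directly whether a device supports managed (Unified) memory; this sketch should report “not supported” on a cc 2.0 part like the GTX 570:

[code]
// Sketch: check the managedMemory property before attempting cudaMallocManaged;
// Fermi (cc 2.0) devices report 0 here.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // device 0 assumed

    printf("%s: Unified Memory %s\n",
           prop.name, prop.managedMemory ? "supported" : "not supported");
    return 0;
}
[/code]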

As per usual, thanks a million txbob.
I did dig into the manual and suddenly noticed it was the wrong version, but I am looking forward, in due time, to getting a Kepler-class GPU or better to get around the eternally more complex coding using memcpy.
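
Roughly, the difference in coding I mean is the following (just my own sketch with a placeholder kernel; the managed version needs Kepler or later, so it will not run on my GTX 570):

[code]
// Sketch only (placeholder kernel "work"): explicit copies today vs.
// managed memory on a Kepler-or-later GPU.
#include <cuda_runtime.h>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;                             // placeholder work
}

void with_memcpy(float *h_data, int n)               // what Fermi requires
{
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    work<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}

void with_unified(int n)                             // Kepler or later only
{
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));     // visible to host and device
    work<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                         // wait before the host touches data
    cudaFree(data);
}
[/code]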

The only issue I really have right now is the TTL of the threads, estimated at 2 seconds.
I had hoped to have unlimited time to execute on my own system, but I guess I can break the work down into smaller chunks instead of letting the code run through all of my N*10^M steps at once.
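
Something like this is what I have in mind for the chunking (a sketch with made-up names and sizes; each launch only advances the state a little, so no single kernel runs anywhere near the 2-second limit):

[code]
// Sketch (made-up kernel and sizes): split a long run of steps across many
// short kernel launches so each one finishes well under the watchdog timeout.
#include <cuda_runtime.h>

__global__ void advance(double *state, int n, int stepsPerLaunch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        double x = state[i];
        for (int s = 0; s < stepsPerLaunch; ++s)
            x = 0.5 * x + 1.0;                       // placeholder iteration
        state[i] = x;
    }
}

int main()
{
    const int       n              = 1 << 20;        // state elements (made up)
    const long long totalSteps     = 1000000;        // stand-in for N*10^M
    const int       stepsPerLaunch = 10000;          // small enough per launch

    double *d_state;
    cudaMalloc(&d_state, n * sizeof(double));
    cudaMemset(d_state, 0, n * sizeof(double));

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    for (long long done = 0; done < totalSteps; done += stepsPerLaunch)
    {
        advance<<<blocks, threads>>>(d_state, n, stepsPerLaunch);
        cudaDeviceSynchronize();                     // each launch returns quickly
    }

    cudaFree(d_state);
    return 0;
}
[/code]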

/Mike

You can modify the timeout to make it longer, including unlimited. However, if the GPU is driving the display, that may not be a good idea. On Windows, the easiest method (IMO) to modify the timeout is the one given in the Nsight VSE manual:

[url]http://docs.nvidia.com/nsight-visual-studio-edition/Nsight_Visual_Studio_Edition_User_Guide.htm#Timeout_Detection_Recovery.htm[/url]

Thanks again txbob.

I was a bit hesitant, but the thing I am trying to work out is closely related to “dynamic parallelism”, as specified in the link below.

[url]http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf[/url]

It would be difficult to achieve my nested calls in under 2 seconds without putting immense stress on the host code, and then the whole purpose of parallelism is nearly lost.
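
For reference, the kind of nested call I am after looks roughly like this (just a sketch with placeholder kernels; it needs a compute capability 3.5+ GPU and nvcc -arch=sm_35 -rdc=true -lcudadevrt, so it will not run on my Fermi card):

[code]
// Sketch of dynamic parallelism: a parent kernel launches child kernels
// directly on the device, so the nesting never round-trips through the host.
#include <cuda_runtime.h>

__global__ void child(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                             // placeholder child work
}

__global__ void parent(float *data, int n, int pieces)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        int per = n / pieces;
        for (int p = 0; p < pieces; ++p)             // nested launches from the GPU
            child<<<(per + 255) / 256, 256>>>(data + p * per, per);
    }
}

int main()
{
    const int n = 1 << 20, pieces = 16;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    parent<<<1, 32>>>(d_data, n, pieces);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
[/code]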

I may have to modify my code accordingly: use the built-in GPU on the motherboard for the display and use the GeForce solely for CUDA somehow.

/Mike