[SOLVED] Coding my own shared memory with device memory!

Hello!

I’m a beginner in CUDA and new to this forum. (Sorry, my English is pretty poor.)

I’m looking for a concept like shared memory in CUDA, but using device memory.

My kernel function needs a huge amount of memory to compute its result, something like 5000 doubles per thread. This memory is only used during the thread’s execution; I don’t need the data stored in it afterwards.

Currently all this memory is allocated before the kernel call in one big cudaMalloc. But this way, the VRAM of my Titan X is nearly full (about 11 of 12 GB), and I will soon need more space.

Well, the easy answer is to malloc in device code inside each thread, but that hits the execution time really hard.

So I was thinking about a way to allocate only the memory needed by the threads that are actually running. Exactly like shared memory, but handled in code, because the shared memory size per block is not enough.

Go from

size_t nb_total_thread = 1500000;
size_t nb_double_required_per_thread = 5000;
size_t nb_compute_double = nb_total_thread * nb_double_required_per_thread;
double *compute_double;

cudaMalloc((void **) &compute_double, nb_compute_double * sizeof(double));

To something like this

size_t nb_total_thread_per_block = 1024;
size_t nb_double_required_per_thread = 5000;
size_t max_block_executed_at_the_same_time = 8; // (I don't yet know how to get that info)

size_t nb_compute_double = nb_total_thread_per_block
      * nb_double_required_per_thread * max_block_executed_at_the_same_time;

double *compute_double;

cudaMalloc((void **) &compute_double, nb_compute_double * sizeof(double));

I assume that all threads of a block have to finish their work before a new block starts.
Is this a viable solution? I don’t know yet how I will work out where the current thread can use the memory, but I think it’s possible. The current kernel looks like this:

__global__
void my_kernel(double *compute_double)
{
    int index = threadIdx.x + blockIdx.x*blockDim.x;
        // I will try to find a smart index so I do not overlap the doubles
        // already used by other running blocks
    double *thread_compute_double = compute_double + index;
    // do some stuff here with compute_double, like storing
    // intermediate results and multiplying them
}

I don’t know if I’m being very clear.

It should be possible to reduce temporary memory usage in the way you are suggesting. However it will involve concepts that would not typically be used at the beginner level in CUDA.

You will need to ascertain (somehow) the maximum carrying capacity (instantaneous capacity) of your device in terms of blocks. You could do this analytically with e.g. the occupancy calculator, experimentally with e.g. a profiler, or experimental code you write yourself, or perhaps programmatically with the occupancy API.
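
For the programmatic route, a minimal sketch with the occupancy API might look like this (assuming a kernel named my_kernel, 1024-thread blocks, and no dynamic shared memory):

#include <cstdio>

__global__ void my_kernel(double *compute_double) { /* placeholder for the real kernel */ }

int main()
{
    // How many blocks of my_kernel can be resident on one SM, given its actual
    // register and shared memory usage at this block size.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel, 1024, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Maximum instantaneous capacity of the device, in blocks.
    int max_resident_blocks = blocks_per_sm * prop.multiProcessorCount;
    printf("At most %d blocks resident at once\n", max_resident_blocks);
    return 0;
}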

Once you have established the maximum instantaneous capacity (the maximum number of blocks that can be executing, which will likely be your SM count times some number like 2, 3, or 4), then you will need to allocate whatever temporary space that implies (blocks*threads per block * temporary space per thread), and if you are following the “experimental” approach above you may want to allocate a little bit more.

Finally, as you indicate, you will need a method to allocate a temporary “chunk” to each thread (or each block) as it becomes resident and goes into execution. I would suggest doing this on a block-by-block basis. One approach that comes to mind would be to allocate a sub-chunk per SM, and then use the SMID (can be retrieved using an intrinsic, or inline assembly) to allow each block to identify the sub-chunk it will work out of. This still leaves open the question of “which block in the SM space am I?”. I think you could come up with a number of different exclusive enumeration schemes, probably using atomics, to assign a unique ID per block in the “SM space” to each block that is executing.
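
A minimal sketch of reading the SM ID via inline PTX (one way to do it; note the PTX ISA warns that the %smid value is volatile, which comes up again further down):

#include <cstdio>

__device__ unsigned int get_smid(void)
{
    unsigned int smid;
    // %smid is a PTX special register holding the ID of the SM this thread is running on
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void print_smid(void)
{
    if (threadIdx.x == 0)
        printf("block %d is on SM %u\n", blockIdx.x, get_smid());
}

int main()
{
    print_smid<<<8, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}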

Thanks for your answer, txbob.

That helps me a lot.

(Forget it, I read you too fast.) (At first I didn’t really understand why you said I would only get 2, 3 or 4 SMs; I equated that with the number of my multiprocessors. Am I wrong?)

I wrote some code to get the theoretical maximum size of my grid and the maximum number of threads running at the same time. I don’t take into account things like the number of assembly instructions, the maximum stack allowed per thread, etc.
That’s how I will estimate the maximum amount of memory.

I think I will figure out how to identify where each thread can work in that memory.

Can you tell me if my reasoning is OK? (Sorry, this is somewhat verbose code.)

typedef struct cudaDeviceProp t_cuda_prop;

void	set_device_prop(t_cuda_prop *prop)
{
	int		device;
	
	cudaGetDevice(&device);
	cudaGetDeviceProperties(prop, device);
}

// Pick the largest block size (a multiple of warpSize) that divides
// maxThreadsPerMultiProcessor evenly, so no thread slots on the SM are wasted.
int		get_nb_thread_per_block(t_cuda_prop *prop)
{
	int	nb_thread;

	nb_thread = prop->maxThreadsPerBlock;
	while (prop->maxThreadsPerMultiProcessor % nb_thread
		&& nb_thread > prop->warpSize)
		nb_thread -= prop->warpSize;
	return (nb_thread);
}

void	get_kernel_launch(t_cuda_prop *prop, int nb_task, int *nb_block, int *block_size)
{
	printf("Nb of processors: %d\n", prop->multiProcessorCount);
	printf("Each processor can run %d threads\n", prop->maxThreadsPerMultiProcessor);
	printf("Each block can run %d threads\n", prop->maxThreadsPerBlock);
	printf("Warps are %d threads wide\n", prop->warpSize);
	*block_size = get_nb_thread_per_block(prop);
	*nb_block = (nb_task + *block_size - 1) / *block_size; // round up
	printf("== Launch kernel with <<<%d, %d>>>\n", *nb_block, *block_size);
}

int		nb_concurrent_thread(t_cuda_prop *prop, int block_size)
{
	int nb_block_per_processor;
	int	nb_thread_per_processor;
	int ret; 

	nb_block_per_processor = prop->maxThreadsPerMultiProcessor / block_size;
	nb_thread_per_processor = nb_block_per_processor * block_size;
	ret = prop->multiProcessorCount * nb_thread_per_processor;
	printf("Simultaneaous %d thread\n", ret);
	return (ret);
}

extern "C"
void cuda_foo()
{
	t_cuda_prop	prop;
	size_t		memory_per_thread = sizeof(double) * 4000;
	size_t		require_threads = 1500000;
	int			block_dim;
	int			thread_dim;
	int			max_thread;

	set_device_prop(&prop);
	get_kernel_launch(&prop, require_threads, &block_dim, &thread_dim);
	max_thread = nb_concurrent_thread(&prop, thread_dim);
	printf("Maximum memory in use will be %lu octets\n", max_thread * memory_per_thread);
}

With my Titan X I get this output:

Nb of processors: 24
Each processor can run 2048 threads
Each block can run 1024 threads
Warps are 32 threads wide
== Launch kernel with <<<1465, 1024>>>
Simultaneous threads: 49152
Maximum memory in use will be 1572864000 bytes

I didn’t say you will only get 2, 3, or 4 SM. I intended to say that the total number of blocks that are resident will be:

(the number of SMs in your GPU) * (some other number like 2,3, or 4)

The “some other number” is just how many blocks of your kernel will be resident on each SM, which I don’t know so I am just guessing at. It should be a number between 1 and 16.

Generically, (number of SMs per GPU) * (number of concurrently executing thread blocks per SM), where the latter is limited by resource constraints (e.g. number of registers, size of shared memory used by each thread block) and by the absolute maximum number of concurrent blocks per SM supported by a given GPU architecture. The absolute architecture limits are documented in an appendix of the CUDA Programming Guide; the limits derived from resource usage we can’t know without knowing the resource usage of the kernel. You can use the occupancy calculator to find that out.

Your code sample suggests that the size of the per-thread arrays is known at compile time. In this case it would be simpler to just use automatic variables, which get allocated on the (device) stack:

__global__
void my_kernel(double *compute_double)
{
    double big_array[5000];
    // do some stuff here with big_array, like storing
    // intermediate results and multiplying them
}

This may require adjustment of the device (per-thread) stack size via a call to cudaDeviceSetLimit() on the host.
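
A minimal sketch of that adjustment (the 5000-double figure is taken from the question above; the headroom amount is an arbitrary guess):

#include <cstdio>

int main()
{
    // Room for the 5000-double automatic array plus some headroom for the rest of the stack.
    size_t per_thread_stack = 5000 * sizeof(double) + 1024;
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, per_thread_stack);
    if (err != cudaSuccess)
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));

    size_t granted = 0;
    cudaDeviceGetLimit(&granted, cudaLimitStackSize);  // read back what was actually set
    printf("Device stack size is now %zu bytes per thread\n", granted);
    return 0;
}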

I think I have enough information to experiment with a few solutions. By the way, the amount of memory is not known at compile time; it’s dynamic and can change at each call of the kernel. Thank you all.

Well, I was trying to adapt txbob’s idea: allocate a chunk of memory dedicated to each SM, then split it between the blocks running on that SM. But I’m stuck, because the CUDA documentation says the smid is volatile (it could change during kernel execution!) and it’s not safe to rely on it…

Tying the allocations to SMs is probably not necessary as long as you can establish the maximum instantaneous capacity of your kernel in blocks. You will then need a method by which spawning threadblocks can request a storage chunk ID and retiring threadblocks can release their storage chunk ID. Conceptually it doesn’t matter too much whether you do this at the SM level, where each SM manages a pool of, say, 4-16 storage chunks, or at the device level (which removes any dependence on SMID), where one single mechanism manages a pool of, say, 256 storage chunks.

You might use a method involving (global) atomics to request and release a storage chunk ID.
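
One possible shape for that (just a sketch, with hypothetical names): an array with one int per chunk, 0 = free and 1 = taken, where thread 0 of each block claims and releases a slot with atomics:

__device__ int claim_chunk(int *block_table, int max_chunks)
{
    // Scan the table until atomicCAS flips some entry from 0 (free) to 1 (taken);
    // seeing the old value 0 proves this block won that slot.
    for (int i = 0; ; i = (i + 1) % max_chunks)
        if (atomicCAS(&block_table[i], 0, 1) == 0)
            return i;
}

__device__ void free_chunk(int *block_table, int i)
{
    __threadfence();                 // make prior writes to the chunk visible first
    atomicExch(&block_table[i], 0);  // mark slot i free again
}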

Well, I did it. It looks like crappy code and it hits the execution time hard.

block_table is an integer array initialized to 0. The thread with an id of 0 in each block looks in this array to find an available slot. This part is so ugly, omg. I will try it in my real code this week to find out how much time it eats, but it seems heavy; I’m really disappointed.

Then the available index is shared between all threads of the block through an integer in shared memory.
(Of course, at the end of the whole block’s execution, I set the table entry back to zero.)

Again, thanks txbob.

__device__
void		lock_block_memory(int max_block, int *block_table, int *shared_index)
{
	if (!threadIdx.x)
	{
		*shared_index = 0;
		while (1)
		{
			if (!block_table[*shared_index])
			{
				atomicAdd(block_table + *shared_index, 1);
				if (block_table[*shared_index] == 1)
				{
					printf("Block id = %d LOCK block_table[%d]\n", blockIdx.x, *shared_index);
					break ;
				}
				atomicSub(block_table + *shared_index, 1);
			}
			(*shared_index) += 1;
			if (*shared_index >= max_block)
				*shared_index = 0;
		}
	}
	__syncthreads();
}

I really think the proper way to do this is with sub-pools associated with each SM, because then the fight to get an available index is between 8 blocks max (in practice more like 4 max).
Indeed, my blocks are usually 1024 threads wide, with a minimum of 256 in a few cases, and an SM can run 2048 threads max (so 2-8 blocks).
With the Titan X, in my test there were 46 blocks struggling at the same time to get memory; I guess that is not negligible.

You may not want to do it this way:

atomicAdd(block_table + *shared_index, 1);
				if (block_table[*shared_index] == 1)

This can lead to a “forever” race condition, i.e. deadlock. If two parties both do the atomicAdd before either reads the block_table location, then they will both read 2, and both will back off. But in reality one of them was first and should have won the resource.

The return value of the atomicAdd operation gives the old value. For a given requester, if it was zero, then that requester can be confident that they “won” the resource bidding. I would suggest this instead:

int ret = atomicAdd(block_table + *shared_index, 1);
				if (ret == 0)

I think you may also want to mark both the shared_index and block_table as volatile pointers.

I believe there may be a possibly better approach (circular queue) using atomicInc (or atomicDec).

Suppose I have a known maximum threadblock support of 48 (say, 4 threadblocks * 12 SMs).

Then allocate, say, 60 chunks, and initially populate a 64-entry table with the chunk IDs:

0 1 2 3 4 ... 59 -1 -1 -1 -1
^                 ^
|                 |
rqi               rli

We’ll define two indices. A request index (rqi) and a release index (rli). The request index will initially “point” to the first entry in the table (it will be an index of 0), and the release index will initially “point” to the first “-1” value in the table. (The starting value for rli in this example would be 60).

When a chunk is requested, the requesting thread will perform an atomicInc on rqi, with a rollover value of 64. The requesting thread will then use the return value from the atomicInc operation as an index into the chunk ID table. Using that index, it will retrieve the chunk ID, and (for safety) replace that chunk ID with -1.

When a threadblock is ready to release its chunk, it will perform an atomicInc (again, with rollover of 64) on the rli index, and it will take the return value of that atomicInc operation and use it as an index into the chunk ID table. That index should always be pointing to a -1 value, and the retiring threadblock will take the chunk ID that it retrieved at request time, and place it into the table at the index location specified by the return value of the atomicInc operation.

Something like this:

#define TSIZE 63
__device__ volatile int chunkID[TSIZE+1];
__device__ unsigned int rqi = 0;
__device__ unsigned int rli = TSIZE-3;

setup/preamble, in host code:

int htable[TSIZE+1];
for (int i=0; i < TSIZE-3; i++) htable[i] = i;
for (int i=TSIZE-3; i < TSIZE+1; i++) htable[i] = -1;
cudaMemcpyToSymbol(chunkID, htable, (TSIZE+1)*sizeof(int));

device code:

__device__  int request_chunk(){
  int ret = atomicInc(&rqi, TSIZE);  // next request slot; index cycles through 0..TSIZE
  int my_chunk = chunkID[ret];       // the chunk ID stored there now belongs to this block
  chunkID[ret] = -1;                 // mark the slot empty, for safety
  __threadfence();
  return my_chunk;
}

__device__  int release_chunk(int myID){
  int ret = atomicInc(&rli, TSIZE);  // next release slot
  if (chunkID[ret] != -1) return -1; // should always be -1 here; otherwise report an error
  chunkID[ret] = myID;               // put the chunk ID back into the pool
  __threadfence();
  return 0;
}

In each case, a -1 return value indicates an error. You will still need to condition the request and release on thread 0 in each block, just as you have done. I have written a simple test of this; it seems to work.

#include <stdio.h>
#include <assert.h>

#define TDELAY 1000000ULL
// TSIZE will need to be modified depending on the specific kernel and device tested
#define TSIZE 28
__device__ volatile int chunkID[TSIZE+1];
__device__ unsigned int rqi = 0;
__device__ unsigned int rli = TSIZE-3;

__device__  int request_chunk(){
  int ret = atomicInc(&rqi, TSIZE);
  int my_chunk = chunkID[ret];
  chunkID[ret] = -1;
  __threadfence();
  return my_chunk;
}

__device__  int release_chunk(int myID){
  int ret = atomicInc(&rli, TSIZE);
  if (chunkID[ret] != -1) return -1;
  chunkID[ret] = myID;
  __threadfence();
  return 0;
}

__global__ void tkern(){

  if (!threadIdx.x){
    int my_chunk;
    my_chunk = request_chunk();
    assert (my_chunk != -1);
    unsigned long long my_start = clock64();
    while (clock64() < my_start+TDELAY);
    assert(release_chunk(my_chunk) != -1);
    }
#ifdef TEST_SYNC
    __syncthreads();
#endif
}

int main(){

  int htable[TSIZE+1];
  for (int i=0; i < TSIZE-3; i++) htable[i] = i;
  for (int i=TSIZE-3; i < TSIZE+1; i++) htable[i] = -1;
  cudaMemcpyToSymbol(chunkID, htable, (TSIZE+1)*sizeof(int));
  tkern<<<4096,256>>>();
  cudaDeviceSynchronize();
  return 0;
}

txbob, this looks brilliant.
I’m not comfortable with the keyword volatile. Well, I understand why I should use it with a shared memory variable. And I still use __syncthreads() instead of __threadfence(), so I need to read some documentation on those two.

By the way, using assert in device code was only for testing purposes, I guess? I’m not familiar with it either; I always prefer a simple if.

If I have time today, I will try to adapt your solution and bench it. (E.g. I will probably need to turn the global chunkID array into a dynamic array.) I’ll keep you posted.

Hello there.

I did implement your way.
chunkID is now a dynamic array, allocated before the kernel launch. I take the theoretical maximum number of concurrent blocks and round it up to the next power of 2, so the number of sub-chunks is not the exact number of concurrent blocks, but this looks safer. For 4000 doubles per thread I allocate 2000 MB of “custom shared memory”. My way uses 500 MB less.

I use a dummy kernel to set the index rli dynamically before the main computation. The storage chunk ID is still an integer stored in shared memory.
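
Roughly like this (a rough sketch only, not my real code; the names are made up):

__device__ unsigned int rqi;
__device__ unsigned int rli;

// chunk_table is cudaMalloc'ed on the host with table_size entries (the next power of 2
// above the number of concurrent blocks); the first nb_chunks entries hold real chunk IDs.
__global__ void init_chunk_queue(int *chunk_table, int nb_chunks, int table_size)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;

	if (i < table_size)
		chunk_table[i] = (i < nb_chunks) ? i : -1;  // free chunk IDs, then -1 padding
	if (i == 0)
	{
		rqi = 0;          // next slot to request from
		rli = nb_chunks;  // next slot to release into
	}
}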

Moreover, I used __syncthreads() instead of __threadfence(); it looks more appropriate in my code.

So I did a few tests. I will call my way “the naive way”, txbob’s way “the rollover way”, and the third “no memory management”. I ran each test 10 times and averaged the computation time (+ selection of the chunk ID). It does not include the allocation time!

This is the code executed between the selection of the chunk id and its release.

// get_thread_memory returns a void * to the current thread's memory
// global_mem is a void * that points to the big cudaMalloc
//    holding all the required memory for all concurrent threads/blocks
// s_i[0] is the storage chunk id (in shared memory)
// storage is the memory where the result of the computation is stored
double * my_custom_ptr = (double *)get_thread_memory(global_mem, memory_per_thread, memory_per_block, s_i[0]);
int i = 0;
my_custom_ptr[0] = thread_index;
while (++i < NB_DOUBLE_PER_THREAD)
   my_custom_ptr[i] = i;
i = -1;
storage[thread_index] = 0;
while (++i < NB_DOUBLE_PER_THREAD)
   storage[thread_index] += my_custom_ptr[i];
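
(For reference, get_thread_memory is basically just pointer arithmetic; this is a rough reconstruction of the idea, not the exact code:)

__device__ void	*get_thread_memory(void *global_mem, size_t memory_per_thread,
			size_t memory_per_block, int chunk_id)
{
	// chunk_id picks the block's slice of the big allocation,
	// threadIdx.x picks this thread's slice inside it
	return ((char *)global_mem
		+ (size_t)chunk_id * memory_per_block
		+ (size_t)threadIdx.x * memory_per_thread);
}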

Summary of the tests:

  1. Total threads = 1.5 million, block dim = 1024 threads, 1000 doubles computed per thread
  • No memory management : 1355.794 ms (and 11.4 GB of memory used)
  • Naive way : 957.602 ms
  • Rollover way : 949.702 ms
  2. Total threads = 1.5 million, block dim = 1024 threads, 4000 doubles computed per thread
  • No memory management : OUT OF MEMORY
  • Naive way : 6309.100 ms
  • Rollover way : 6324.266 ms
  3. Total threads = 5 million, block dim = 1024 threads, 1000 doubles computed per thread
  • No memory management : OUT OF MEMORY
  • Naive way : 3184.195 ms
  • Rollover way : 3179.07 ms
  4. Total threads = 5 million, block dim = 1024 threads, 4000 doubles computed per thread
  • No memory management : OUT OF MEMORY
  • Naive way : 21101.849 ms
  • Rollover way : 21126.92 ms

I really don’t understand how the way without memory management is so slow. In the two other ways I have 2 __syncthreads() in the kernel, and I guess the time to select and release the chunk ID is not negligible. I really don’t get it.

Bob’s way and mine are worth about the same, I think. The rollover way is safer, I think, but uses more memory (2000 MB vs 1500 MB). During all the tests I did not get a timeout error, so both ways avoid the race condition. In some cases the naive way is the fastest, but with only 10 runs I don’t think I can take that as a deciding factor.

I can’t choose one of them, so I will adapt the two prototypes in the real code and bench them again (more work to do…).

What do you think about these tests, txbob?

Assuming you have conditioned the request/release functions on thread 0, then switching from __threadfence() to __syncthreads() creates illegal code.

It’s covered in the documentation on syncthreads. Other than that I don’t have any comments.
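
In other words, the safe shape is roughly this (a sketch using the request/release helpers from earlier):

__global__ void my_kernel(/* ... */)
{
    __shared__ int s_chunk;

    if (threadIdx.x == 0)
        s_chunk = request_chunk();  // only thread 0 touches the queue
    __syncthreads();                // reached by every thread of the block -- legal
    // ... all threads of the block work out of chunk s_chunk ...
    __syncthreads();                // make sure the whole block is done with the chunk
    if (threadIdx.x == 0)
        release_chunk(s_chunk);
}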

Sure, the __syncthreads() is outside of the if (threadIdx.x == 0) statement.
OK, thanks again txbob for your time.