Best solutions for working memory?

Hi,

To avoid an XY problem, here is my actual situation.

My problem:
I have a kernel that needs some memory to work with, but the size of that memory depends on parameters only determined at run time. My grids are 3D and my blocks are 2D.
How should I handle this? What is the best solution?

First idea:
My first idea was simply to do a dynamic allocation inside the kernel, but in-kernel memory allocation (malloc) and freeing take a lot of time.

Other idea:
Allocate, up front, all the memory that every core will ever need.
But with this solution, I have no idea how to determine the physical core id inside the kernel.

Does someone have any other (or better) idea?

Thanks in advance

“But the size of this memory depends on parameters determined during running time”

meaning what exactly?
what is the variance of the parameters, and what do the parameters depend on?
to what extent can the parameters be known in advance, and at what point are the parameters fully resolved?

“To do a memory allocation of all the memory that will be required by every core.
But with this solution, I have no idea on how to know the physical core id in the kernel.”

a core being…?

The size of the memory for each kernel is x*y, with x <= w (width) and y <= h (height). The problem is that the size (x, y) is determined by each kernel; in fact, each kernel works on a specific sub-image.

What I mean by allocating the whole memory is allocating w*h for each kernel.

But with this solution, the problem is that w and h depend on the loaded image.

So we can only determine w and h after loading the picture, just before launching the CUDA code.

By core I mean an atomic computational unit.
Basically, when I launch 1,000 threads (out of the 10,000 I have to run) at the same time, each thread is on a core, so a “core id” would be < 1,000.
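(For reference: CUDA does not expose a stable physical core id, and threads are not pinned to cores. What you can compute is a unique *logical* thread index from the built-in variables, which is what you need to assign each thread its own workspace slot. A sketch for a 3D grid of 2D blocks, as described above:)

```cuda
// Unique logical index of the calling thread across the whole launch.
// This is NOT a hardware core id; it is simply unique per thread,
// which is enough to index a per-thread workspace slot.
__device__ size_t globalThreadIndex()
{
    // linear index of this block within the 3D grid
    size_t blockId = blockIdx.x
                   + (size_t)blockIdx.y * gridDim.x
                   + (size_t)blockIdx.z * gridDim.x * gridDim.y;

    // linear index of this thread within its 2D block
    size_t threadId = threadIdx.x + (size_t)threadIdx.y * blockDim.x;

    return blockId * ((size_t)blockDim.x * blockDim.y) + threadId;
}
```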

in-kernel malloc can be “slow” for a variety of reasons

I don’t think you have too many options here.

  1. Load the image, compute w and h, do a cudaMalloc based on that, and pass that pointer to the kernel for its workspace.

  2. Extract the code from the kernel that figures out what x and y should be for that kernel, run it first (in a separate set-up kernel, perhaps, or on the host if it is simple), then do a single cudaMalloc based on your x and y values, then launch the main kernel. Alternatively, come up with some upper-bound estimation method.

  3. Use in-kernel malloc or new. If you go this route, and can figure out a synchronization method, it will be beneficial to have one thread issue a single malloc that will then supply a pointer to the remaining threads and blocks for use, rather than have each thread issue a malloc operation for its own little piece. Don’t forget that the device “heap” used by in-kernel malloc is by default limited to 8MB and you may want to use API functions to increase this.
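As a sketch of option 3 (illustrative only, not a drop-in implementation): one thread per block issues the malloc and shares the pointer with the rest of the block through shared memory, after the host has enlarged the device heap:

```cuda
// host side, before launching: raise the device heap limit (default 8 MB)
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256 * 1024 * 1024);

__global__ void worker(int x, int y)
{
    // one allocation per block, shared by all of the block's threads
    __shared__ float *ws;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        ws = (float *)malloc((size_t)blockDim.x * blockDim.y
                             * x * y * sizeof(float));
    __syncthreads();            // everyone waits for the pointer
    if (ws == NULL) return;     // in-kernel malloc can fail

    // each thread works on its own x*y slice of ws
    size_t tid  = threadIdx.x + (size_t)threadIdx.y * blockDim.x;
    float *mine = ws + tid * (size_t)x * y;
    // ... do the actual work on `mine` ...

    __syncthreads();            // make sure all threads are finished
    if (threadIdx.x == 0 && threadIdx.y == 0)
        free(ws);
}
```

Since `ws` is shared, either all threads in the block see a valid pointer or all see NULL, so the early return does not cause divergent synchronization.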

I would add the following option to an already comprehensive list:

if you do not allocate on the fly, as suggested above, then you need to preallocate
if x <= w and y <= h, then an allocation of w*h per thread would always be safe
if w*h is too big for your liking, perhaps take better control of the implied kernel memory footprint via the kernel dimensions: launch multiple smaller blocks in separate streams to retain flow whilst reducing the implied footprint
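A sketch of that last variant (all names and sizes are illustrative, and `grid`/`block` are assumed to be chosen so that each launch runs `threadsPerLaunch` threads): give each stream its own slice of a preallocated workspace, so only the threads actually in flight need w*h memory at any one time:

```cuda
const int nStreams         = 4;
const int threadsPerLaunch = 1000;   // threads resident per launch
size_t perLaunch = (size_t)threadsPerLaunch * w * h;

// one w*h-per-thread slice for each stream
float *workspace;
cudaMalloc(&workspace, nStreams * perLaunch * sizeof(float));

cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i)
    cudaStreamCreate(&streams[i]);

// 10,000 threads total, issued as smaller launches across the streams.
// Launches sharing a stream serialize, and concurrent launches never
// share a slice, so workspaces cannot collide.
for (int chunk = 0; chunk < 10; ++chunk) {
    int s = chunk % nStreams;
    worker<<<grid, block, 0, streams[s]>>>(workspace + s * perLaunch,
                                           w, h, chunk);
}
```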