malloc memory in kernel linked via in/out variable

This program is just a test to see if I can malloc an array on the device in a kernel, have that new array outlive the kernel, and transfer it back to the host.

I try it with both malloc and cudaMalloc on the device. With malloc, at least, I can successfully read on the device a value that was assigned to the malloc’d array. In-kernel cudaMalloc doesn’t seem to work at all: I can’t even read back on the device a value that I assigned. According to cuda-memcheck, the second cudaMemcpyAsync fails, even though the printed addresses of the device-malloc’d array (retVal[0] and matrix_dev) match.

#include <stdio.h>
__global__ void testRetVal(float **retVal) {
  retVal[0] = (float*)malloc(sizeof(float)*5);
  // cudaMalloc((void**)(&(retVal[0])), 5*sizeof(float));
  retVal[0][0] = 0;
  retVal[0][1] = 1;
  retVal[0][2] = 2;
  retVal[0][3] = 3;
  retVal[0][4] = (float)4;
  printf("retVal[0][4] = %f\n", retVal[0][4]);
  printf("retVal[0] = %llx\n", (unsigned long long)retVal[0]);
}

int main(int argc, char **argv) {
  float **retVal;
  cudaMalloc(&retVal, sizeof(float*));
  cudaStream_t strm;
  cudaStreamCreate(&strm);
  testRetVal<<<1,1,0,strm>>>(retVal);
  float *matrix_dev;
  float matrix[5];
  cudaMemcpyAsync(&matrix_dev, retVal, sizeof(float*),
                  cudaMemcpyDeviceToHost, strm);
  cudaStreamSynchronize(strm); // make sure matrix_dev has arrived before use
  printf("matrix_dev = %llx\n", (unsigned long long)matrix_dev);
  cudaMemcpyAsync(matrix, matrix_dev, sizeof(float)*5,
             cudaMemcpyDeviceToHost, strm);
  cudaStreamSynchronize(strm);
  printf("matrix[4] = %f\n", matrix[4]);
  return 0;
}

You cannot use pointers allocated via in-kernel malloc in host-based cudaMemcpy operations.
In-kernel cudaMalloc has a similar limitation.

So if you want a memory region to be able to participate in a host-based cudaMemcpy* operation, then allocate it with the host-based API.
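As a concrete contrast, here is a minimal sketch (kernel and variable names are mine, not from the thread) of the host-based pattern that does work: the host allocates with cudaMalloc, the kernel fills the buffer, and the host copies it back.

```cuda
#include <stdio.h>

// Kernel fills a buffer that was allocated on the HOST with cudaMalloc,
// so the host may legally cudaMemcpy from it afterwards.
__global__ void fillKernel(float *buf, int n) {
  for (int i = 0; i < n; i++) buf[i] = (float)i;
}

int main() {
  const int n = 5;
  float *buf_dev;                         // device pointer, host-allocated
  cudaMalloc(&buf_dev, n * sizeof(float));
  fillKernel<<<1,1>>>(buf_dev, n);
  float buf[n];
  // Legal: the source was allocated via host-side cudaMalloc.
  cudaMemcpy(buf, buf_dev, n * sizeof(float), cudaMemcpyDeviceToHost);
  printf("buf[4] = %f\n", buf[4]);
  cudaFree(buf_dev);
  return 0;
}
```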

What about the other problem, namely that if you allocate with cudaMalloc in the kernel, the kernel printf statement doesn’t correctly print the assigned value? I compiled with “nvcc -arch=sm_35 prog.cu”.

This appears to be a limitation, at the current time, of in-kernel usage of the runtime API. I’m still investigating.

In the meantime, there are at least two possible workarounds:

  1. Use in-kernel malloc instead of in-kernel cudaMalloc.
  2. Actually include a child-kernel launch in your code, triggering compilation of the CDP mechanisms. In that case, you will also want to add the necessary switches for CDP compilation, including -rdc=true and -lcudadevrt (for your example case).

@txbob, I couldn’t figure out how to get your workaround option #2 above to work, after trying a few different things relating to compilation with -rdc=true, -dc, and -dlink, and using a child-kernel launch.

I still think that, as the documentation states it, device malloc or device cudaMalloc should allocate global memory that can outlive the kernel that allocated it, and that this memory can be used in later kernels within the same host application, provided the pointer to the memory is passed back to the host. Here is a related thread, which provides another example of such allocation:

https://devtalk.nvidia.com/default/topic/413183/dynamic-memory-allocation-during-kernel-execution-is-it-posible-/

I’m not sure any of that applies to compute capability 3.5, though, which is what I am using.

Maybe the architects no longer support such allocation for compute capability 3.5, for unavoidable efficiency reasons (that would be fine); I don’t know, but then the documentation should say so.

Thanks for any more thoughts on this.

To be clear, let’s separate the notion of using a device-allocated pointer (whether from malloc or cudaMalloc, which is claimed in the documentation to be a thin wrapper around device malloc) in a host-based cudaMemcpy-type operation, from the notion of whether or not cudaMalloc can be made to work correctly in-kernel. My “workaround #2” was suggested to show that cudaMalloc can be used in-kernel, not that the pointer returned could participate in a host-based cudaMemcpy-type operation (it cannot).

Here is a worked example of the notion that cudaMalloc can be made to work in kernel, in other words addressing your question beginning with “What about the other problem…”:

$ cat t949.cu
#include <stdio.h>
#include <assert.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void dummy_child_kernel(){
}

__global__ void testRetVal(float **retVal) {
  //retVal[0] = (float*)malloc(sizeof(float)*5);
  cudaError_t ret = cudaMalloc((void**)(&(retVal[0])), 5*sizeof(float));
  assert(ret == cudaSuccess);

  retVal[0][0] = 0;
  retVal[0][1] = 1;
  retVal[0][2] = 2;
  retVal[0][3] = 3;
  retVal[0][4] = (float)4;
  printf("retVal[0][4] = %f\n", retVal[0][4]);
  printf("retVal[0] = %llx\n", (unsigned long long)retVal[0]);
  dummy_child_kernel<<<1,1>>>();
}

int main(int argc, char **argv) {
  float **retVal;
  cudaMalloc(&retVal, sizeof(float*));
  cudaStream_t strm;
  cudaStreamCreate(&strm);
  testRetVal<<<1,1,0,strm>>>(retVal);
  float *matrix_dev;
//  float matrix[5];
  cudaMemcpyAsync(&matrix_dev, retVal, sizeof(float*), cudaMemcpyDeviceToHost, strm);
  cudaStreamSynchronize(strm);
  printf("matrix_dev = %llx\n", (unsigned long long)matrix_dev);

  // cudaMemcpyAsync(matrix, matrix_dev, sizeof(float)*5, cudaMemcpyDeviceToHost, strm);
  //cudaStreamSynchronize(strm);
  //printf("matrix[4] = %f\n", matrix[4]);
  cudaCheckErrors("some error");
  return 0;
}
$ nvcc -arch=sm_35 -rdc=true -lcudadevrt -o t949 t949.cu
$ ./t949
retVal[0][4] = 4.000000
retVal[0] = 402a3f920
matrix_dev = 402a3f920
$

Notes:

  1. If you comment out the child kernel launch, the first assert will be hit.
  2. Note the compile command; you have to run CDP codes on a cc3.5 or higher device.
  3. Yes, you can and should do proper error checking in-kernel, for any runtime API usage. Also, it’s suggested that when using in-kernel malloc, you test the returned pointer for NULL, as an error-checking method. A device allocation failure will return a NULL pointer, just as host-side malloc does.
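Note 3 might look like the following minimal sketch (kernel name is hypothetical):

```cuda
#include <stdio.h>

// In-kernel malloc returns NULL on failure, just like host malloc;
// check the pointer before dereferencing it.
__global__ void checkedAllocKernel(float **retVal) {
  retVal[0] = (float*)malloc(5 * sizeof(float));
  if (retVal[0] == NULL) {
    printf("device heap allocation failed\n");
    return;               // bail out rather than dereference NULL
  }
  retVal[0][0] = 42.0f;
}

int main() {
  float **retVal;
  cudaMalloc(&retVal, sizeof(float*));
  checkedAllocKernel<<<1,1>>>(retVal);
  cudaDeviceSynchronize();
  cudaFree(retVal);
  return 0;
}
```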

That is true, I don’t believe I ever said otherwise, and as far as I can tell, that is the first time that particular concept has popped up in this thread. What I said was the in-kernel dynamically allocated pointer cannot participate in a host-based cudaMemcpy-type operation. But the pointer (and the data it points to) can live from one kernel launch to the next, until it is explicitly freed (using in-kernel free, or cudaFree, corresponding to the original allocation), or until application/process termination.

In fact the pointer need not be “passed back to the host”, as long as you pass the same double-pointer (**retVal) to your next kernel.
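A minimal sketch of that pattern (kernel names are mine): the pointer never leaves the device; the second kernel simply reads it back out of the same retVal slot.

```cuda
#include <stdio.h>

// First kernel allocates from the device heap and stashes the pointer.
__global__ void allocKernel(float **retVal) {
  retVal[0] = (float*)malloc(5 * sizeof(float));
  if (retVal[0]) retVal[0][4] = 4.0f;
}

// Second kernel retrieves the SAME pointer from retVal; the allocation
// outlives the first kernel, with no host round-trip needed.
__global__ void useKernel(float **retVal) {
  printf("retVal[0][4] = %f\n", retVal[0][4]);
  free(retVal[0]);        // in-kernel free matches in-kernel malloc
}

int main() {
  float **retVal;
  cudaMalloc(&retVal, sizeof(float*));
  allocKernel<<<1,1>>>(retVal);   // default stream orders the two launches
  useKernel<<<1,1>>>(retVal);
  cudaDeviceSynchronize();
  cudaFree(retVal);
  return 0;
}
```

Since only in-kernel malloc/free are used, this should compile with a plain "nvcc -arch=sm_35 prog.cu", with no CDP switches.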

Ok, thanks, I got that example to work. But first:

It seems appropriate here to ask “why not?” Is there some deeper reason why it isn’t supported to use the newly-allocated pointer to device global memory to cudaMemcpy that array back to the host?

Second, I’ve reduced the problem using device cudaMalloc (as documented in [1]) somewhat further, to show that, strangely, it seems to work only when either a dummy_child_kernel is launched, or when cudaFree appears in another compiled kernel, even though that kernel is never called.

#include <stdio.h>
#include <assert.h>

#define cudaCheckErrors(msg)                    \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
              (msg), cudaGetErrorString(__err),          \
              __FILE__, __LINE__); \
      fprintf(stderr, "*** FAILED - ABORTING\n"); \
      exit(1); \
    } \
  } while (0)

__global__ void dummy_child_kernel() {}

__global__ void testRetValKernel(float **retVal) {
  // retVal[0] = (float*)malloc(sizeof(float)*5);
  cudaError_t err = cudaMalloc((void**)(&(retVal[0])), 5*sizeof(float));
  assert(err == cudaSuccess);
  retVal[0][0] = 0;
  retVal[0][1] = 1;
  retVal[0][2] = 2;
  retVal[0][3] = 3;
  retVal[0][4] = (float)4;
  printf("retVal[0][4] = %f\n", retVal[0][4]);
  printf("retVal[0] = %llx\n", (unsigned long long)retVal[0]);
  // dummy_child_kernel<<<1,1>>>();
}

__global__ void freeMemKernel(float *ptr) {
  //  free(ptr);
  cudaFree(ptr);
}

int main(int argc, char **argv) {
  float **retVal;
  cudaStream_t strm;
  float *matrix_dev;
  // float matrix[5];
  cudaMalloc(&retVal, sizeof(float*));
  cudaStreamCreate(&strm);
  testRetValKernel<<<1,1,0,strm>>>(retVal);
  cudaStreamSynchronize(strm);
  cudaMemcpyAsync(&matrix_dev, retVal, sizeof(float*),
                  cudaMemcpyDeviceToHost, strm);
  // cudaMemcpyAsync(matrix, matrix_dev, sizeof(float)*5,
  //         cudaMemcpyDeviceToHost, strm);
  // freeMemKernel<<<1,1,0,strm>>>(matrix_dev);
  cudaStreamSynchronize(strm);
  cudaCheckErrors("some error");
  printf("matrix_dev = %llx\n", (unsigned long long)matrix_dev);
  // printf("matrix[4] = %f\n", matrix[4]);
  return 0;
}

I compiled the above using “nvcc -arch=sm_35 -rdc=true prog.cu”. I believe -lcudadevrt might be automatically implied when using nvcc for linking.

[1] http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY

I can’t give you a detailed answer. The memory associated with host side cudaMalloc is a logically separate space from the memory that is the “device heap”, i.e. the memory space that in-kernel malloc (or in-kernel cudaMalloc) carves its allocations out of. You can get a hint of this separation, although not a detailed explanation for it, if you read the documentation (Programming Guide, dynamic memory allocations).
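One practical consequence of that separation: the device heap backing in-kernel malloc has its own fixed size (8 MB by default, per the Programming Guide), adjustable from the host before any allocating kernel is launched. A hedged sketch:

```cuda
#include <stdio.h>

int main() {
  // The device heap (backing in-kernel malloc/cudaMalloc) is sized
  // separately from ordinary host-side cudaMalloc memory; enlarge it
  // BEFORE launching any kernel that allocates from it.
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024); // 64 MB
  size_t heap = 0;
  cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
  printf("device malloc heap size = %zu bytes\n", heap);
  return 0;
}
```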

Yes, I already indicated that in my response. The observation about cudaFree in another kernel is interesting, but I think merely confirms my point (later) that usage of the cuda runtime API in-kernel, in an app that otherwise makes no use of CDP, is suspect.

That is incorrect. If you study any of the cuda sample codes that use the runtime API in-kernel, they all explicitly specify -lcudadevrt in the makefiles. If you delete the -lcudadevrt from any of those makefiles, compilation will fail. (In fact, just delete the -lcudadevrt from the example I gave, and see what happens.) What we’re talking about here, where there is an attempt to use the cuda runtime API, in a kernel/app that otherwise makes no use of CDP, is a special case (in case that’s not obvious already). Yes, this special case can be linked without -lcudadevrt. However the usage of the cuda runtime API in this fashion, as I’ve already indicated, is suspect, at least based on your original example, or my example with the child kernel launch commented out.

If you have no intention of using CDP, my suggestion is to use in-kernel malloc (and free) rather than cudaMalloc (and cudaFree).
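Applied to the listing above, that substitution is just the following (a sketch; the surrounding code is unchanged, and no -rdc=true or -lcudadevrt should be needed since nothing else uses CDP):

```cuda
__global__ void testRetValKernel(float **retVal) {
  retVal[0] = (float*)malloc(5 * sizeof(float)); // was: cudaMalloc(...)
  if (retVal[0]) retVal[0][4] = 4.0f;            // NULL-check on failure
}

__global__ void freeMemKernel(float *ptr) {
  free(ptr);                                     // was: cudaFree(ptr)
}
```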

If you are using CDP, then most of this thread is moot. The cuda runtime API works as advertised, with respect to the programming guide section on CDP.

I have already filed a bug internally at NVIDIA to discuss this usage of the cuda runtime API in-kernel in an app that otherwise makes no use of CDP. I suspect something will eventually come of that, and so the specific behavior here may change in the future, since this behavior, in my opinion, is squirrely.

You can insulate yourself from any such change, if it were to occur in the future, by not using the cuda runtime API (in-kernel) in any app that otherwise makes no usage of CDP. (To me, it appears to be broken, anyway, although usage of proper cuda error checking, which is always recommended with usage of the cuda runtime API, should prevent anyone from going astray here.) I don’t think there is any necessary use case for such, anyway. Certainly with respect to cudaMalloc/cudaFree, the suggestion is to use malloc/free instead.

Thanks for the great answer, txbob.

However, you might expect similar complaints in the future from programmers who find it convenient to do a host cudaMemcpy(DeviceToHost) from arbitrary locations that weren’t allocated by host cudaMalloc. In addition to the issues with device-side usage of the runtime API, there is probably something wrong even with parts of the host runtime API, since host cudaMemcpy and friends don’t seem to work with source pointers that weren’t allocated by host cudaMalloc. I think that is specifically what should be fixed, in either the implementation or the documentation of the host runtime API functions.

My suggestion is that if you have recommendations for how to improve CUDA, you file a (bug) feature request at the developer portal, developer.nvidia.com

Ok, I didn’t see that link before… done. Thank you!