changing L1 cache configuration using "cudaFuncSetCacheConfig" not working

Hi,

I am trying to change the L1 cache configuration using the runtime API call "cudaFuncSetCacheConfig", as specified in the Programming Guide.

However, it does not seem to work. My dummy code is below:

// Dummy code

// CUDA device properties
cudaDeviceProp deviceProp;

__global__ void MyKernel(....) {
    ....
}

void host_func() {
    CUDA_SAFE_CALL_NO_SYNC(cudaGetDeviceProperties(&deviceProp, dev));

    // Print device properties before changing the L1 cache configuration
    printf("  Total amount of global memory:                 %llu bytes\n", (unsigned long long) deviceProp.totalGlobalMem);
    printf("  Total amount of constant memory:               %u bytes\n", (unsigned) deviceProp.totalConstMem);
    printf("  Total amount of shared memory per block:       %u bytes\n", (unsigned) deviceProp.sharedMemPerBlock);

    // Change the L1 cache configuration to 48 KB L1 / 16 KB shared memory
    cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);
    printf("\nMSG - Configured L1 cache to 48KB size\n");

    // Print device properties after changing the L1 cache configuration.
    // This should print 16 KB of shared memory.
    printf("  Total amount of global memory:                 %llu bytes\n", (unsigned long long) deviceProp.totalGlobalMem);
    printf("  Total amount of constant memory:               %u bytes\n", (unsigned) deviceProp.totalConstMem);
    printf("  Total amount of shared memory per block:       %u bytes\n", (unsigned) deviceProp.sharedMemPerBlock);
}

The output I get from the above code:

Using device 0: GeForce GTX 480

  Total amount of global memory:                 1576468480 bytes

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       49152 bytes

MSG - Configured L1 cache to 48KB size

Using device 0: GeForce GTX 480

  Total amount of global memory:                 1576468480 bytes

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       49152 bytes

I don’t understand where I am going wrong.

Thanks and Regards,

Ninad B Kothari

At the very minimum you need to call cudaGetDeviceProperties() again to actually see the result of the change…

However, I wouldn't be surprised if the change isn't reflected until the cache configuration has actually been applied on the GPU, or even if it never shows up there at all.

Hi tera, you are right… and I apologize for not posting my dummy code correctly. Let me paste the correct code below:

// Dummy code

// CUDA device properties
cudaDeviceProp deviceProp;

void queryDevice(); // forward declaration

__global__ void MyKernel(....) {
    ....
}

void host_func() {
    queryDevice();

    // Change the L1 cache configuration to 48 KB L1 / 16 KB shared memory
    cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);
    printf("\nMSG - Configured L1 cache to 48KB size\n");

    // Print device properties after changing the L1 cache configuration.
    // This should print 16 KB of shared memory.
    queryDevice();

    // Call the kernel
    dim3 dimBlock(16, 16); // just an example
    dim3 dimGrid(4, 4);    // just an example
    MyKernel<<<dimGrid, dimBlock>>>();

    printf("\nMSG - Check if L1 configuration has been taken after running kernel\n");

    // Query the device to check whether the L1 configuration has taken effect
    queryDevice();
}

void queryDevice() {
    int dev = 0;
    CUDA_SAFE_CALL_NO_SYNC(cudaGetDeviceProperties(&deviceProp, dev));
    printf("  Total amount of global memory:                 %llu bytes\n", (unsigned long long) deviceProp.totalGlobalMem);
    printf("  Total amount of constant memory:               %u bytes\n", (unsigned) deviceProp.totalConstMem);
    printf("  Total amount of shared memory per block:       %u bytes\n", (unsigned) deviceProp.sharedMemPerBlock);
}

The results are below (please excuse any typos):

Using device 0: GeForce GTX 480

  Total amount of global memory:                 1576468480 bytes

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       49152 bytes

MSG - Configured L1 cache to 48KB size

Using device 0: GeForce GTX 480

  Total amount of global memory:                 1576468480 bytes

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       49152 bytes

MSG - Check if L1 configuration has been taken after running kernel

Using device 0: GeForce GTX 480

  Total amount of global memory:                 1576468480 bytes

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       49152 bytes

However, in all three cases it does not show the changed L1 configuration. Am I still missing something? Could it be that my GTX 480 does not support this? How can I be sure?

Thanks and Regards,

embeddeduser

cudaFuncSetCacheConfig() configures the L1 / shared memory split for a particular global function, i.e. a compute kernel. In other words, it configures a per-function property that can override a per-device property. Note that the request for large L1 may be ignored by the driver if it would prevent the kernel from executing due to lack of shared memory.

The online documentation for cudaFuncSetCacheConfig() states (emphasis mine):

“On devices where the L1 cache and shared memory use the same hardware resources, this sets through cacheConfig the preferred cache configuration for the function specified via func. This is only a preference. The runtime will use the requested configuration if possible, but it is free to choose a different configuration if required to execute func.”
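Since the call only expresses a preference and can also fail outright (e.g. for an invalid function pointer), it is worth checking its return code. A minimal sketch of that, with a hypothetical empty kernel and untested here since it needs a Fermi-or-later GPU:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void MyKernel() { }

int main() {
    // Ask for the large-L1 split for this kernel only.
    // This is a preference; the runtime may still pick another split
    // if the kernel needs more shared memory than the small partition.
    cudaError_t err = cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);
    if (err != cudaSuccess)
        printf("cudaFuncSetCacheConfig failed: %s\n", cudaGetErrorString(err));

    MyKernel<<<1, 1>>>();      // the split is applied at launch time
    cudaDeviceSynchronize();   // wait for the launch to complete
    return 0;
}
```

Note that the configuration is applied when the kernel is launched, not when cudaFuncSetCacheConfig() returns.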

[later]

The online documentation for cudaGetDeviceProperties() explains that the property sharedMemPerBlock refers to the maximum amount available, meaning this value is fixed for a given device (emphasis again mine):

“sharedMemPerBlock is the maximum amount of shared memory available to a thread block in bytes; this amount is shared by all thread blocks simultaneously resident on a multiprocessor;”

Any update on this issue? I am facing the same problem.


Yes. cudaGetDeviceProperties() only reflects the maximum cache available per SM, not the current setting.
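To inspect the current preference programmatically, cudaDeviceGetCacheConfig() (available since CUDA 4.0) can be used instead of cudaGetDeviceProperties(). A minimal sketch, setting a device-wide preference and reading it back; untested here since it needs a GPU with a configurable L1/shared split:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Set a device-wide preference for large L1.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    // Read back the current preference. Note this reflects the requested
    // preference, not the split actually chosen for a particular launch.
    cudaFuncCache cfg;
    cudaDeviceGetCacheConfig(&cfg);
    printf("device cache preference: %d (cudaFuncCachePreferL1 == %d)\n",
           (int) cfg, (int) cudaFuncCachePreferL1);
    return 0;
}
```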

Execute your application in the nvvp profiler. It reports the requested and actual cache configuration of every kernel launch.