cudaMalloc extremely slow on GTX 980 and Titan

I have 2 GTX 980s and 2 Titans in my server, and cudaMalloc is extremely slow. The two GTX 980s and one Titan run at x8, and the other Titan runs at x16. My CUDA program runs faster on my MacBook Pro than it does on this server. When I run my code and check nvidia-smi, no matter which device I choose with cudaSetDevice, it starts allocating memory both on the device I chose and on the Titan running at x16. My code isn’t doing anything fancy; it’s CUDA with some OpenCV. Has anyone run into a bug like this? I know that nvidia-smi doesn’t really work on GTX 980s, but I run into this problem even if I run it on the other Titan.

Call cudaSetDevice FIRST, and then do the malloc.

That’s what I’m doing in my code:

cudaSetDevice(2);
cudaMalloc((void**)&dev_result, result_size);
cudaMalloc((void**)&dev_dest_lines, line_size);
cudaMalloc((void**)&dev_src_lines, line_size);

cudaMemcpy(dev_dest_lines, g_dest_lines, line_size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_src_lines, g_src_lines, line_size, cudaMemcpyHostToDevice);

Are you also doing cudaMallocHost? It has to be called after cudaSetDevice as well.

I was using a normal malloc, but I tried cudaMallocHost and it does the same thing.
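For reference, a minimal sketch of the ordering being suggested (buffer names and sizes are placeholders, not from my actual code):

#include <cuda_runtime.h>

int main()
{
    float *h_buf = nullptr;   // pinned host buffer (placeholder name)
    float *d_buf = nullptr;   // device buffer (placeholder name)
    size_t bytes = 1 << 20;   // placeholder size

    cudaSetDevice(2);                        // select the device first
    cudaMallocHost((void**)&h_buf, bytes);   // then allocate pinned host memory
    cudaMalloc((void**)&d_buf, bytes);       // and device memory

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}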

It is not clear how you are measuring the performance of cudaMalloc(). If cudaMalloc() is the first CUDA API call [other than cudaSetDevice()] in your code, it will trigger the creation of the CUDA context. It is my understanding that the mapping activities needed for UVM support can lengthen CUDA context creation time considerably on machines with large amounts of system memory, but this is a one-time startup cost.

Try calling cudaFree(0) to trigger the CUDA context creation, then measure the duration of the cudaMalloc() calls.
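A minimal sketch of that measurement (the device index and allocation size are placeholders, not taken from the original program):

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main()
{
    cudaSetDevice(2);   // placeholder device index
    cudaFree(0);        // force CUDA context creation up front

    void *buf = nullptr;
    size_t bytes = 64 * 1024 * 1024;   // placeholder allocation size

    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&buf, bytes);
    auto t1 = std::chrono::steady_clock::now();

    printf("cudaMalloc: %s, %.3f ms\n", cudaGetErrorString(err),
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    cudaFree(buf);
    return 0;
}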

njuffa, you were right. cudaMalloc was taking a long time because of the context creation; now cudaFree is the one that takes the longest. Is there a way to reduce the context creation cost? Also, there is still the issue of my program allocating memory on a different device than the one I set with cudaSetDevice; would you happen to know anything about that as well?

If your program only intends to use a single device, you can limit the CUDA runtime for that session/run to only use a single device with the CUDA_VISIBLE_DEVICES environment variable, documented here:

Programming Guide :: CUDA Toolkit Documentation
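For example, launching with CUDA_VISIBLE_DEVICES=2 ./my_app (device index and program name are placeholders) exposes only that GPU to the process. Note that the devices which remain visible are re-enumerated starting at 0 inside the process, so a hard-coded cudaSetDevice(2) would then fail. A small check like this shows what the runtime actually sees:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("visible CUDA devices: %d\n", count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("  device %d: %s\n", dev, prop.name);
    }
    return 0;
}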

txbob CUDA_VISIBLE_DEVICES worked! Thanks! Do you know why the CUDA runtime has this weird behaviour?

What specifically do you consider “weird behavior”?

Without setting CUDA_VISIBLE_DEVICES, when I run my program and use cudaSetDevice to select a GTX 980, the runtime allocates memory on the Titan as well. Memory shows up on the GTX 980 that is running my code and also on the Titan: the 980 uses about 300 MB and the Titan uses about 100 MB. That is what I see in nvidia-smi. After I set CUDA_VISIBLE_DEVICES, this problem doesn’t happen.

Which device indices (as enumerated by the CUDA runtime) are the GPUs in question?

The process of creating a CUDA context on a particular device consumes memory. It’s likely that the runtime is creating a context of some sort on the “unused” Titan, which consumes the 100 MB. My guess is that the Titan in question is enumerated as device 0, in which case the behavior doesn’t surprise me, although I can’t give you chapter and verse of the documentation which describes exactly why this should be the case.

Nevertheless, the CUDA runtime has all “exposed” devices in its view. This has widespread implications for UVA, UM, SLI, P2P, and many other mechanisms under the CUDA umbrella. If you want to limit this “view”, use CUDA_VISIBLE_DEVICES.

As txbob says, the combined footprint of the CUDA driver and CUDA runtime context on each device is in the 90 MB to 100 MB range. So even if you do not run anything on a device, this much memory is going to be occupied by the CUDA software stack itself.

Unified memory requires CUDA to map the memory from each GPU in the system and all of host memory into a single unified virtual address space. My understanding is that the vast majority of the time required for this is spent in OS calls, and it increases with the total amount of GPU + host memory that needs to be mapped.

As far as the enumeration of devices by the CUDA runtime goes, my understanding is that the CUDA runtime contains a heuristic that tries to assign the “most capable” device in a system as device 0. If the GPUs in question are GTX 980 and GTX Titan it stands to reason that the Titan would wind up as device 0 since it is the “more capable” device.

CUDA_VISIBLE_DEVICES can be used to exclude specific GPUs from both the enumeration and the memory mapping process performed by the CUDA runtime.
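If you would rather not depend on the enumeration order at all, another option is to look the GPU up by its PCI bus ID as reported by nvidia-smi (a sketch; the bus ID string below is a placeholder):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = -1;
    // Bus ID as shown by nvidia-smi, e.g. "0000:02:00.0" (placeholder value)
    cudaError_t err = cudaDeviceGetByPCIBusId(&dev, "0000:02:00.0");
    if (err != cudaSuccess) {
        printf("lookup failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaSetDevice(dev);
    printf("using CUDA device %d\n", dev);
    return 0;
}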

Oh man! Thanks guys! I’ve learned so much!