CUDA samples fail to allocate memory after running a few pieces of code

Hi,

I’m running Ubuntu 16.04 with CUDA 7.5 and the NVIDIA 361 driver for my two Tesla K40c GPUs.

I was able to run the CUDA samples (specifically vectorAdd).

I then ran a few CUDA programs using cutorch, and now when I try to run vectorAdd it says:

$ sudo ./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code out of memory)!

If I restart the machine, things work again, and then stop after a few runs of my code. This was also happening earlier on Ubuntu 14.

The debug log is here: http://sprunge.us/hhaM

Thanks in advance!

A process of yours (presumably in your cutorch workflow) is terminating abnormally and not freeing its memory.

Normal process termination should release any device allocations.

You could try the reset facility in nvidia-smi to reset the GPUs in question; if that works, it should fix the issue without a reboot. You could also use nvidia-smi to identify any processes associated with the GPU in question and kill them manually.
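For reference, that sequence would look something like this (the GPU index and PID are placeholders; the reset needs root and only succeeds when nothing is currently using the GPU):

$ sudo nvidia-smi --gpu-reset -i 0   # reset GPU 0; repeat with -i 1 for the second K40c
$ nvidia-smi                         # the Processes table at the bottom lists PIDs per GPU
$ sudo fuser -v /dev/nvidia*         # also catches processes that merely hold the device nodes open
$ sudo kill -9 <PID>                 # substitute a PID from the listings above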

Otherwise you’ll need to identify and fix whatever is making your processes terminate abnormally, or else reboot the system.

There was a bug in certain drivers where memory was not released if the process was terminated.
Try the latest 361 driver; I don’t remember in which version it was fixed.

I assume you meant “There was a bug in certain drivers where the memory was not released if the process was terminated abnormally”?

Thanks for the replies, guys. Still no luck.

  1. I successfully reset both GPUs in my machine using nvidia-smi.

  2. According to nvidia-smi, there are no processes using the GPU.

I am using NVIDIA-SMI 361.42 … which I installed just a few days ago.
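Is there a way to confirm that the loaded kernel module matches the userspace tools? My understanding is that something like:

$ cat /proc/driver/nvidia/version   # version of the kernel module actually loaded
$ nvidia-smi | head -n 3            # driver version the userspace tools report

would show a mismatch if an old module were still loaded.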

I cannot reboot the machine since many others are logged in.

I tried ‘rmmod’ followed by ‘modprobe’ of the nvidia driver. Even that didn’t fix it.
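From what I can tell, the full reload sequence should look something like this (assuming nothing holds the device files open; please correct me if I’m missing a module):

$ lsmod | grep nvidia      # check which nvidia modules are loaded
$ sudo rmmod nvidia_uvm    # CUDA’s unified-memory module; must come out before the core module
$ sudo rmmod nvidia        # core driver; reports “in use” if a process still holds the device
$ sudo modprobe nvidia
$ sudo modprobe nvidia_uvm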

Is there something else I can do to refresh everything and emulate the effect of rebooting? Thanks!

bump