A process of yours (presumably in your cutorch workflow) is terminating abnormally and not freeing its GPU memory. Normal process termination should release any allocations.
You could try the reset facility in nvidia-smi on the GPUs in question. If the reset succeeds, it should fix the issue without a reboot. You could also use nvidia-smi to identify any processes still attached to the affected GPU and kill them manually, as shown below.
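A minimal sketch of that approach, assuming a Linux system with root access; the GPU index and PID below are placeholders, and GPU reset is not supported on every GPU or driver version:

```
# List processes currently holding GPU memory (bottom table of the output)
nvidia-smi

# If nvidia-smi doesn't show the culprit, fuser can list anything
# still holding the NVIDIA device nodes
sudo fuser -v /dev/nvidia*

# Kill a stuck process by PID (12345 is a hypothetical example)
sudo kill -9 12345

# Reset a specific GPU (requires root; the GPU must be idle,
# and reset is only supported on some GPUs)
sudo nvidia-smi --gpu-reset -i 0
```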
Otherwise, you’ll need to identify why your processes are terminating abnormally and fix that, or else reboot the system.
There was also a bug in certain drivers where memory was not released when a process was terminated. Try the latest 361-series driver; I don’t remember in which version it was fixed.
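If you’re not sure which driver you’re running, nvidia-smi can report it directly; the query flags below are standard nvidia-smi options:

```
# Print the installed driver version for each GPU
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```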