nvidia-smi failed to initialize NVML with unknown error

Hi,

This is my first post. If this topic belongs under another category, please advise.

I have a number of Quadro (M4000, P2000, P4000) and Tesla (M10, M60, P4) GPUs to validate on a Dell R720 server. Under Windows Server 2008 R2 SP1, CUDA is unavailable on all of the GPUs except the Tesla P4, even though Device Manager shows the driver working properly. I tried a range of drivers from both NVIDIA and Dell, from the earliest to the latest release, and all behave the same. All of these drivers claim to be compatible with Windows Server 2008 R2. I tested two physical servers side by side to rule out a hardware-specific issue.

nvidia-smi.exe reports “Failed to initialize NVML: Unknown Error”.
GPU-Z either reports a GPU with no CUDA support or crashes the machine.
CUDA-Z reports that no CUDA-capable devices were found.
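
Since the tools above only give a generic message, one possible way to get a numeric result code is to call the CUDA driver API directly. The following is a minimal sketch, not something verified on these servers. It assumes a Python 3 interpreter is available on the machine and that the NVIDIA driver has installed nvcuda.dll on the DLL search path. The helper name probe_cuda_driver and the small table of result names are illustrative; cuInit and cuDeviceGetCount are the standard driver API entry points.

```python
import ctypes

# A few CUDA driver API result codes (values as defined in cuda.h).
CUDA_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def probe_cuda_driver(lib_name="nvcuda.dll"):
    """Hypothetical helper: load the CUDA driver library and call cuInit(0).

    Returns a (stage, detail) tuple describing how far the probe got.
    """
    # The driver API is declared __stdcall on Windows (this only matters for
    # 32-bit builds), so prefer WinDLL where it exists and fall back to CDLL.
    loader = getattr(ctypes, "WinDLL", ctypes.CDLL)
    try:
        lib = loader(lib_name)
    except OSError as exc:
        # The driver DLL is missing or could not be loaded at all.
        return ("load_failed", str(exc))

    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    rc = lib.cuInit(0)
    if rc != 0:
        return ("cuInit_failed", CUDA_RESULT_NAMES.get(rc, "code %d" % rc))

    count = ctypes.c_int(0)
    lib.cuDeviceGetCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
    lib.cuDeviceGetCount.restype = ctypes.c_int
    rc = lib.cuDeviceGetCount(ctypes.byref(count))
    if rc != 0:
        return ("cuDeviceGetCount_failed", CUDA_RESULT_NAMES.get(rc, "code %d" % rc))

    # Initialization succeeded; report how many CUDA devices the driver sees.
    return ("ok", count.value)


if __name__ == "__main__":
    print(probe_cuda_driver())
```

Running this on the bare-metal 2008 R2 install and again inside the ESXi guest would show whether the driver fails at load time, at cuInit, or only when enumerating devices, which may narrow down where the two setups differ.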

If I install Windows Server 2012 or later on the same machine, all GPUs work fine with CUDA. So this does not seem to be a BIOS or hardware-specific issue.
Furthermore, if I install VMware ESXi on the server and pass the GPU through to a VM running Windows Server 2008 R2 SP1, CUDA works fine. So it does not seem to be an OS-specific issue either. However, moving to a VM or upgrading to a newer OS is not a viable solution for the hundreds of existing customer machines we have.

Is there any utility or log that would give me insight into what is causing the issue? Any comments are much appreciated.

Regards,
Chunguang

We have hundreds of systems running Windows Server 2008 R2 and want to upgrade their hardware to the latest GPUs.