PCI passthrough KVM for CUDA usage

Hello,

I am trying to pass through a Tesla K40m to a virtual machine (qemu-kvm hypervisor) via vfio.

I downloaded all the drivers and CUDA libraries, and I compiled all the sample files successfully. However, when I run them they start but in the end they do not finish =(. For example, here is the run log of deviceQuery:

deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K40m"
// INFO ABOUT IT
Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = Tesla K40m
Result = PASS

And then it just hangs; the only option is Ctrl+C. Moreover, I installed everything on the host too, and there it finished successfully without any problems. Any help will be appreciated.
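For anyone trying to reproduce this: rather than waiting indefinitely, the hang can be confirmed by running the sample under a timeout. A minimal sketch, with the samples path assumed:

# exit status 124 means the binary was still running after 2 minutes and was killed
cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
timeout 120 ./deviceQuery; echo "exit status: $?"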

dmesg on the VM says only:
[ 1475.225692] nvidia 0000:00:08.0: irq 51 for MSI/MSI-X

dmesg on the host:
kernel: [ 2897.503162] vfio-pci 0000:02:00.0: irq 324 for MSI/MSI-X

Moreover, any call to the PCI device takes too much time. For example, I ran nvidia-smi in the VM and on the host system and traced it via strace. Here is the output from the VM:

+------------------------------------------------------+
| NVIDIA-SMI 346.59 Driver Version: 346.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40m Off | 0000:00:06.0 Off | 0 |
| N/A 54C P0 64W / 235W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.67 4.688353 275785 17 open
1.08 0.051337 3020 17 close
0.23 0.010722 104 103 ioctl
0.01 0.000261 22 12 read
0.00 0.000235 9 26 mmap
0.00 0.000177 15 12 write
0.00 0.000127 16 8 munmap
0.00 0.000107 11 10 mprotect
0.00 0.000094 19 5 1 stat
0.00 0.000070 5 15 fstat
0.00 0.000055 8 7 7 access
0.00 0.000030 30 1 execve
0.00 0.000018 5 4 fcntl
0.00 0.000015 8 2 1 futex
0.00 0.000013 4 3 brk
0.00 0.000007 4 2 rt_sigaction
0.00 0.000006 6 1 getrlimit
0.00 0.000005 5 1 lseek
0.00 0.000004 4 1 set_robust_list
0.00 0.000003 3 1 rt_sigprocmask
0.00 0.000003 3 1 arch_prctl
0.00 0.000003 3 1 set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00 4.751645 250 9 total

Here is the output when I run nvidia-smi from the host (I detached it from the VM beforehand):

+------------------------------------------------------+
| NVIDIA-SMI 346.59 Driver Version: 346.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40m Off | 0000:02:00.0 Off | 0 |
| N/A 48C P0 64W / 235W | 55MiB / 11519MiB | 60% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
82.25 0.571723 33631 17 open
15.70 0.109104 6418 17 close
1.76 0.012264 119 103 ioctl
0.10 0.000664 44 15 read
0.05 0.000370 14 26 mmap
0.02 0.000155 16 10 mprotect
0.02 0.000152 22 7 7 access
0.02 0.000134 9 15 fstat
0.01 0.000100 13 8 munmap
0.01 0.000078 26 3 brk
0.01 0.000070 6 12 write
0.01 0.000069 17 4 fcntl
0.01 0.000062 62 1 execve
0.00 0.000029 6 5 1 stat
0.00 0.000021 11 2 rt_sigaction
0.00 0.000021 11 2 1 futex
0.00 0.000010 10 1 rt_sigprocmask
0.00 0.000010 10 1 getrlimit
0.00 0.000010 10 1 arch_prctl
0.00 0.000010 10 1 set_tid_address
0.00 0.000009 9 1 set_robust_list
0.00 0.000000 0 1 lseek
------ ----------- ----------- --------- --------- ----------------
100.00 0.695065 253 9 total

As you can see, "open" from the VM takes far too long, and I have no idea why.
Can anybody help me? Sorry for so much text.
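In case it helps anyone dig further: the per-call latency can be seen directly with strace's timing option. A small sketch (exact invocation assumed; the summaries above came from the -c summary mode):

# print the time spent inside each open() while nvidia-smi talks to the driver nodes
strace -T -e trace=open nvidia-smi 2>&1 | grep -i nvidia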

Hi,
I think I have the same problem.
The physical host is CentOS 7 (unbind the device, all that).

First with a CentOS 7 VM (KVM), and later I also tried an Ubuntu 14.04 VM.
In the VM, in both cases, I installed CUDA 7.5, the samples, and the NVIDIA-Linux-x86_64-352.39 driver.

nvidia-smi
Mon Dec 21 14:55:58 2015
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40m On | 0000:00:05.0 Off | 0 |
| N/A 26C P8 22W / 235W | 22MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

The CUDA samples:

1_Utilities/deviceQuery/deviceQuery
1_Utilities/deviceQuery/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K40m"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11520 MBytes (12079136768 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla K40m
Result = PASS

But others, such as
0_Simple/cdpSimplePrint/cdpSimplePrint
starting Simple Print (CUDA Dynamic Parallelism)
Running on GPU 0 (Tesla K40m)


The CPU launches 2 blocks of 2 threads each. On the device each thread will
launch 2 blocks of 2 threads each. The GPU we will do that recursively
until it reaches max_depth=2

In total 2+8=10 blocks are launched!!! (8 from the GPU)


^C
I have to press Ctrl+C.

I have made some strace runs:

strace -o aacuda -ff -r -ttt -x -y -s 1024 /usr/local/cuda/samples/0_Simple/cdpSimpleQuicksort/cdpSimpleQuicksort

This spawns 2 processes; the first one gets stuck in:

 0.000020 futex(0x7ffe5fdcc7d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1450707640, 875071000}, ffffffff) = 0
 0.000027 ioctl(3</dev/nvidiactl>, 0xc020462a, 0x7ffe5fdcc6e0) = 0
 0.000327 clock_gettime(CLOCK_MONOTONIC_RAW, {849, 316384059}) = 0
 0.000109 clock_gettime(CLOCK_MONOTONIC_RAW, {849, 316479220}) = 0

followed by infinite calls to clock_gettime(CLOCK_MONOTONIC_RAW, …).

In the 2nd process:

 0.000039 read(8<pipe:[17877]>, "\xab", 1) = 1
 0.000039 futex(0x7ffe5fdcc7d0, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000034 clock_gettime(CLOCK_MONOTONIC_RAW, {849, 315902405}) = 0
 0.000041 clock_gettime(CLOCK_MONOTONIC_RAW, {849, 315985134}) = 0
 0.000063 poll([{fd=8<pipe:[17877]>, events=POLLIN}, {fd=10</dev/nvidia0>, events=POLLIN}, {fd=11</dev/nvidia0>, events=POLLIN}, {fd=12</dev/nvidia0>, events=POLLIN}, {fd=13pipe:1640, events=POLLIN}], 5, 77) = 0 (Timeout)

and the two clock_gettime(CLOCK_MONOTONIC_RAW, …) calls plus the poll timeout are repeated endlessly.

From both processes it looks like there is a pipe to the /dev/nvidia0 device: something gets written to the device but nothing comes back, and then it keeps retrying forever.

If you or anyone has figured this out I would very much appreciate any hints. Note that I can boot/reboot the host machine, and can insert boot parameters in GRUB if necessary.
The GRUB config contains:
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1"

(I think the last allow_unsafe_interrupts option may not be necessary or advisable, but…)
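For reference, the host-side preparation ("unbind the dev, all that") was roughly the usual vfio-pci binding. A minimal sketch; the 0000:02:00.0 address and the 10de:1023 vendor:device ID of the K40m are assumptions to adjust for your system:

# load vfio-pci, detach the GPU from its current host driver, and let vfio-pci claim it
modprobe vfio-pci
echo 0000:02:00.0 > /sys/bus/pci/devices/0000:02:00.0/driver/unbind
echo 10de 1023 > /sys/bus/pci/drivers/vfio-pci/new_id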

The only thing I would not like to try is a kernel reconfiguration/recompilation.
For info, the host machine is a Dell PE R730 with one NVIDIA Tesla K40.

best and tia
Mario

Just to let everyone know, we solved this problem; it was either one or both of two things:

update the kernel from 3.10.0-229.11.1 to 3.10.0-327.3.1

update qemu-kvm and qemu-kvm-common, from qemu-kvm-1.5.3-86 to qemu-kvm-ev-2.3.0-31
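For anyone on CentOS 7 wanting to do the same, the updates would be roughly the following (the centos-release-qemu-ev repo package and the -ev package names are assumptions, check your repos):

# newer kernel from the standard repos, newer qemu from the Virt SIG repository
yum update kernel
yum install centos-release-qemu-ev
yum install qemu-kvm-ev qemu-kvm-common-ev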

Hi Mario,
I am also facing the same problem when I try to pass through my Tesla K40m on KVM using vfio-pci. The qemu version I'm using is 2.2.0 on Ubuntu 14.04 (kernel version 4.2.0).

$ kvm --version
QEMU emulator version 2.2.0 (Debian 1:2.2+dfsg-5expubuntu9.3~cloud0), Copyright (c) 2003-2008 Fabrice Bellard

I've also noted that deviceQuery succeeds, albeit after a long time (~16 secs), but any other CUDA sample has the same problem: 100% CPU utilization, lots of clock_gettime calls, and it never completes. I debugged it and have the same findings as yours: the CUDA sample application continuously makes some ioctl call, presumably to detect a change in some state of the card, but it never sees that change and so keeps waiting there, making the clock_gettime calls to track how much time it has spent in that operation.
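For what it's worth, the loop is easy to watch live by attaching strace to the hung sample. A rough sketch (the process name is an assumption; use whichever sample is stuck):

# attach to the hung process and watch the ioctl/poll/clock_gettime pattern with per-call timing
sudo strace -f -T -e trace=ioctl,poll,clock_gettime -p "$(pgrep -f cdpSimpleQuicksort)"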

I went through the nvidia driver code hoping to get more insight into which ioctl the application is making and what it expects to change, but alas it led me to rm_ioctl(), which is part of the nvidia binary driver.

I'm glad to hear that your problem went away after upgrading to qemu 2.3. Since I'm already using 2.2, I hope there is some relevant change that went in between 2.2 and 2.3, since my kernel version is already up to date.

Since Ubuntu 14.04 does not have a qemu package for versions > 2.2, I'll have to compile it by hand. Let me try and I'll update with my findings.
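For the record, the by-hand build I have in mind is roughly the following (version, URL and configure options are assumptions):

# fetch and build a newer qemu with KVM support, then install it
sudo apt-get build-dep qemu
wget https://download.qemu.org/qemu-2.4.1.tar.bz2
tar xjf qemu-2.4.1.tar.bz2 && cd qemu-2.4.1
./configure --target-list=x86_64-softmmu --enable-kvm
make -j"$(nproc)" && sudo make install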

Thanks,
Tomar

Mario, I also wanted to check what firmware you are using for your virtual machines: SeaBIOS (plain old BIOS) or OVMF (virtual UEFI firmware).
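If it helps, a quick way to tell from inside a running guest which firmware it booted with:

# /sys/firmware/efi only exists when the guest was booted via UEFI (e.g. OVMF)
[ -d /sys/firmware/efi ] && echo "OVMF/UEFI" || echo "SeaBIOS/legacy BIOS"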

Thanks,
Tomar

I can also confirm that upgrading qemu to version 2.4.1 solved this problem.
I didn't try 2.3, so maybe that also solves the problem, as Mario reported.
This means the problem exists in qemu 2.2.0 but not in 2.3.x and later versions.

Thanks,
Tomar