My system is a Lenovo nx360 M5 with two Tesla M40 GPUs, running CentOS 7 with CUDA 8.0 (driver 375.39). When I train on both GPUs with TensorFlow, the system reboots, but training on a single GPU works fine. I suspect a peer-to-peer memory access issue, so I ran simpleP2P, and it also makes the system reboot. Can anyone help with this?
Thu Feb 23 13:50:55 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 0000:0D:00.0     Off |                    0 |
| N/A   30C    P0    65W / 250W |      0MiB / 11443MiB |     43%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           Off  | 0000:0E:00.0     Off |                    0 |
| N/A   31C    P0    60W / 250W |      0MiB / 11443MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
GPU0 = " Tesla M40" IS capable of Peer-to-Peer (P2P)
GPU1 = " Tesla M40" IS capable of Peer-to-Peer (P2P)
Checking GPU(s) for support of peer to peer memory access...
Peer access from Tesla M40 (GPU0) -> Tesla M40 (GPU1) : Yes
Peer access from Tesla M40 (GPU1) -> Tesla M40 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
Tesla M40 (GPU0) supports UVA: Yes
Tesla M40 (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
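
To narrow down which step triggers the reboot, a smaller test than simpleP2P might help. The following is only a rough sketch, not the sample's actual code: it copies 64 MB from GPU0 to GPU1 once before peer access is enabled and once after, and prints and flushes a line before every step. The file name p2p_min.cu, the CHECK macro, and the step() helper are names made up for this illustration; the CUDA runtime calls are the standard ones.

```cuda
// p2p_min.cu (hypothetical file name): minimal GPU0 -> GPU1 copy test.
// Build (assumed command line): nvcc p2p_min.cu -o p2p_min
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Illustrative helper: print a progress line and flush it immediately,
// so the last line shown before a reboot points at the failing step.
static void step(const char *msg) {
    printf("%s\n", msg);
    fflush(stdout);
}

int main() {
    const size_t bytes = 64u * 1024u * 1024u;  // 64 MB, as in the simpleP2P log

    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        step("Fewer than two CUDA devices found, nothing to test");
        return 0;
    }

    // Ask the runtime whether each GPU can map the other's memory.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    printf("Peer access 0->1: %d, 1->0: %d\n", can01, can10);
    fflush(stdout);

    step("Allocating one buffer on each GPU");
    void *buf0 = nullptr, *buf1 = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&buf0, bytes));
    CHECK(cudaMemset(buf0, 0, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&buf1, bytes));

    // With peer access not yet enabled, the runtime is expected to stage
    // this copy through host memory instead of going GPU to GPU directly.
    step("Copy GPU0 -> GPU1 WITHOUT peer access enabled");
    CHECK(cudaMemcpyPeer(buf1, 1, buf0, 0, bytes));
    CHECK(cudaDeviceSynchronize());

    if (can01 && can10) {
        step("Enabling peer access in both directions");
        CHECK(cudaSetDevice(0));
        CHECK(cudaDeviceEnablePeerAccess(1, 0));
        CHECK(cudaSetDevice(1));
        CHECK(cudaDeviceEnablePeerAccess(0, 0));

        step("Copy GPU0 -> GPU1 WITH peer access enabled");
        CHECK(cudaMemcpyPeer(buf1, 1, buf0, 0, bytes));
        CHECK(cudaDeviceSynchronize());
    }

    step("Freeing buffers");
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(buf0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaFree(buf1));

    step("Done, no reboot");
    return 0;
}
```

If the first copy completes and the machine only goes down at or after the "Enabling peer access" line, that would point at the direct GPU-to-GPU path rather than at multi-GPU use in general. Running it from a local console or over ssh is probably safer than redirecting to a file, since output written to disk may be lost when the system resets.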