As a follow-up to an earlier thread on the official NVIDIA forums:
What about MVAPICH2 1.8a2? Does this line in the changelog mean that P2P should now work between ranks?
- Efficient GPU-GPU transfers within a node using CUDA IPC (for CUDA 4.1)
Because I sure can’t seem to get it to work.
I’ve got some Dell 6100 hosts with M2070s in attached C410x enclosures, connected in an 8:1 configuration. They don’t have any InfiniBand cards.
In case you’re wondering why: we run single-GPU jobs 99.99% of the time here, the 8:1 configuration doesn’t slow those runs at all, and it saves us from buying lots of otherwise-idle host nodes. However, the combination of P2P transfers and the convenience of passing device pointers directly to MPI calls seems like a perfect fit for trying some multi-GPU code.
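To be concrete, this is the CUDA-aware pattern we’d like to use: hand a cudaMalloc’d pointer straight to MPI and let MVAPICH2 move the data (via CUDA IPC within the node, if the changelog means what I think it means). A minimal sketch, assuming MVAPICH2 built with CUDA support, MV2_USE_CUDA=1, two ranks on one host, and one GPU per rank:

```c
/* Sketch of the CUDA-aware MPI pattern: pass device pointers
 * directly to MPI_Send/MPI_Recv, no host staging buffers.
 * Assumes MVAPICH2 built with CUDA support and MV2_USE_CUDA=1. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(rank);           /* one GPU per rank */
    size_t n = 1 << 22;            /* 4 MB payload */
    char *d_buf;
    cudaMalloc((void **)&d_buf, n);

    if (rank == 0) {
        cudaMemset(d_buf, 1, n);
        /* device pointer passed directly to MPI */
        MPI_Send(d_buf, (int)n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, (int)n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %zu bytes into device memory\n", n);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

If the IPC path were active, I’d expect a transfer like this to go well beyond the ~500 MB/s I’m measuring.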
All I’m able to get is about 500 MB/s in the “osu_bw D D” test:
CMA: no RDMA devices found
CMA: no RDMA devices found
# OSU MPI-CUDA Bandwidth Test v3.5.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.01
2 0.02
4 0.05
8 0.09
16 0.18
32 0.36
64 0.73
128 1.46
256 2.92
512 5.82
1024 11.63
2048 23.08
4096 45.74
8192 89.64
16384 172.07
32768 315.15
65536 549.34
131072 525.78
262144 516.37
524288 513.17
1048576 516.52
2097152 517.11
4194304 518.28
I’ve got the following environment variables set:
setenv MV2_USE_CUDA 1
setenv MV2_USE_SHARED_MEM 1
and no amount of fiddling with the MV2_CUDA_IPC* variables makes a difference.
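For completeness, here is the full launch setup I’m using (csh, matching the setenv lines above). Setting MV2_CUDA_IPC=1 explicitly is my reading of the MV2_CUDA_IPC* family of knobs, so treat that name as an assumption; node01 is a placeholder hostname:

```shell
# Environment for the osu_bw runs above (csh syntax).
# MV2_CUDA_IPC=1 is an assumption based on the MV2_CUDA_IPC* variables
# mentioned above; check the MVAPICH2 user guide for the exact name.
setenv MV2_USE_CUDA 1
setenv MV2_USE_SHARED_MEM 1
setenv MV2_CUDA_IPC 1

# Two ranks on the same node (node01 is a placeholder),
# device-to-device bandwidth test:
mpirun_rsh -np 2 node01 node01 MV2_USE_CUDA=1 ./osu_bw D D
```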
Surely the internal PCIe switches in the C410x are capable of more than 500 MB/s?