MVAPICH P2P woes
As a follow-up to http://forums.nvidia.com/index.php?showtopic=215373&view=findpost&p=1326421

What about mvapich2 1.8a2? Does this line in the change log mean that P2P should work between ranks now?
[code]
- Efficient GPU-GPU transfers within a node using CUDA IPC (for CUDA 4.1)
[/code]
Because I sure can't seem to get it to work.

I've got some Dell 6100 hosts with M2070s in attached C410x boxes, connected in an 8:1 configuration. The hosts don't have any IB cards.

In case you're wondering why: we run single-GPU jobs 99.99% of the time here, so the 8:1 configuration doesn't slow those runs at all and saves us from buying lots of host nodes that would sit idle. However, the combination of P2P transfers and the convenience of passing device pointers directly in MPI calls seems like a perfect fit for trying some multi-GPU code.
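To be concrete, the pattern I'm after is roughly the following (a minimal sketch, not my actual application; the buffer size, tag, and one-GPU-per-rank mapping are just placeholders):
[code]
/* Two ranks on one host, one M2070 each: with MV2_USE_CUDA=1, MVAPICH2
 * accepts device pointers directly in MPI calls, so no explicit
 * cudaMemcpy staging through host memory is needed. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    cudaSetDevice(rank);                       /* one GPU per rank */

    const size_t n = 1 << 22;                  /* 4M floats = 16 MB */
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    /* Device pointers go straight into MPI_Send / MPI_Recv. */
    if (rank == 0)
        MPI_Send(d_buf, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else
        MPI_Recv(d_buf, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("sent %zu bytes device-to-device via MPI\n", n * sizeof(float));

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
[/code]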

All I'm able to get is ~500 MB/s in the "osu_bw D D" test:
[code]
CMA: no RDMA devices found
CMA: no RDMA devices found
# OSU MPI-CUDA Bandwidth Test v3.5.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.01
2 0.02
4 0.05
8 0.09
16 0.18
32 0.36
64 0.73
128 1.46
256 2.92
512 5.82
1024 11.63
2048 23.08
4096 45.74
8192 89.64
16384 172.07
32768 315.15
65536 549.34
131072 525.78
262144 516.37
524288 513.17
1048576 516.52
2097152 517.11
4194304 518.28
[/code]

I've got the following env vars set
[code]
setenv MV2_USE_CUDA 1
setenv MV2_USE_SHARED_MEM 1
[/code]
and no amount of fiddling with the MV2_CUDA_IPC* variables makes a difference.
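For what it's worth, my (possibly wrong) mental model of the CUDA IPC path is: the sending process exports a handle to its device allocation, the receiving process opens that handle, and the copy then goes GPU-to-GPU without staging through host memory. A rough sketch of that mechanism, with the handle shipped over MPI purely for illustration:
[code]
/* Sketch of the CUDA IPC mechanism (CUDA 4.1+) that I believe MVAPICH2
 * uses for intra-node GPU-GPU transfers.  Two ranks on one node, one
 * GPU each; error checking omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank);                       /* one GPU per rank */

    const size_t bytes = 1 << 24;              /* 16 MB */
    char *d_local;
    cudaMalloc((void **)&d_local, bytes);

    if (rank == 0) {
        /* Export a handle to our allocation and ship it to rank 1. */
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_local);
        MPI_Send(&handle, (int)sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, (int)sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        /* Map rank 0's device buffer into this process... */
        char *d_remote;
        cudaIpcOpenMemHandle((void **)&d_remote, handle,
                             cudaIpcMemLazyEnablePeerAccess);

        /* ...and copy GPU-to-GPU, which should ride the PCIe switch
         * rather than bouncing through host memory. */
        cudaMemcpy(d_local, d_remote, bytes, cudaMemcpyDefault);
        cudaIpcCloseMemHandle(d_remote);
    }

    cudaFree(d_local);
    MPI_Finalize();
    return 0;
}
[/code]
If that's more or less what MVAPICH2 does internally, I'd expect the large-message numbers to come out well above 500 MB/s.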

Surely the internal PCIe switches in the C410x are capable of more than 500 MB/s?
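To sanity-check the hardware independently of MPI, something like this (single process, two devices, timed with CUDA events; the size and iteration count are arbitrary) should show what the switch itself can deliver:
[code]
/* Quick single-process check of raw P2P copy bandwidth between GPU 0
 * and GPU 1 behind the C410x switch, with no MPI involved. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 64UL << 20;           /* 64 MB per copy */
    const int iters = 50;

    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    printf("GPU0 can access GPU1 directly: %d\n", can_access);

    float *d0, *d1;
    cudaSetDevice(0);
    cudaMalloc((void **)&d0, bytes);
    if (can_access) cudaDeviceEnablePeerAccess(1, 0);

    cudaSetDevice(1);
    cudaMalloc((void **)&d1, bytes);
    if (can_access) cudaDeviceEnablePeerAccess(0, 0);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpyPeer(d1, 1, d0, 0, bytes);       /* warm-up copy */

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeer(d1, 1, d0, 0, bytes);   /* GPU0 -> GPU1 */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("P2P bandwidth: %.1f MB/s\n",
           (double)bytes * iters / (ms / 1000.0) / 1.0e6);
    return 0;
}
[/code]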

#1
Posted 03/09/2012 05:44 PM   
Can you please try the 1.8RC1 version of MVAPICH2? It has improved CUDA IPC-based designs for nodes with multiple GPUs. Let us know if you see any performance issues.

Sreeram Potluri

#2
Posted 04/07/2012 03:32 AM   
[quote name='potluri' date='06 April 2012 - 10:32 PM' timestamp='1333769538' post='1392913']
Can you please try the 1.8RC1 version of MVAPICH2? It has improved CUDA IPC-based designs for nodes with multiple GPUs. Let us know if you see any performance issues.
[/quote]

Yep, with 1.8RC1 everything is working great: >6 GB/s in the bandwidth benchmark, and decent performance in my application.

#3
Posted 04/09/2012 01:43 PM   