Process blocks when calling NCCL reduce

I have three nodes, each with one GPU (M40).
I used the code example from the NCCL developer guide (http://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html#examples) to test whether NCCL works across multiple nodes.

I set the following environment variables before running the process:
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
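
A minimal sketch of how these settings could be collected into a small script and checked before launching the test on each node. NCCL_DEBUG=INFO is an assumption added here for illustration and was not part of the original run; it is NCCL's documented switch for informational logging, which can show which interface and transport each rank selects.

```shell
#!/bin/sh
# Settings from the post: use eth0 for NCCL's socket transport
# and disable the InfiniBand transport.
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1

# Assumption (not in the original post): turn on NCCL's informational
# logging so each rank reports the interface and transport it picks.
export NCCL_DEBUG=INFO

# Print the NCCL-related environment so the values can be compared
# across nodes before the test binary is started.
env | grep '^NCCL_' | sort
```

Running the same script on every node and comparing the printed values is a quick way to rule out a node that was started with different settings.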

Below is the stack trace at the point where the process blocks:
(cuda-gdb) bt
#0 0x00007fffe7bd6497 in sched_yield () from /lib64/libc.so.6
#1 0x00007fffedafb155 in ncclCpuBarrierWait (comm=comm@entry=0x463bc90) at misc/enqueue.cu:71
#2 0x00007fffedafbb65 in ncclEnqueueCheck (func=func@entry=0x7fffedb4ab10 <ncclReduceFunc(void const*, void*, unsigned long, ncclDataType_t, ncclRedOp_t, int, ncclComm*, CUstream_st*)>,
primName=primName@entry=0x7fffedbc62db <ncclIbGetMr(ncclIbVerbs*, void*, int, ncclIbMr**)::__PRETTY_FUNCTION__+9691> "Reduce", sendbuff=0x810a5c0400, recvbuff=0x810a5c0400, count=100,
type=ncclInt8, op=ncclSum, root=0, comm=comm@entry=0x463bc90, stream=stream@entry=0x2bd85f0) at misc/enqueue.cu:119
#3 0x00007fffedb4ab00 in ncclReduce (sendbuff=, recvbuff=, count=, datatype=, op=, root=, comm=0x463bc90,
stream=0x2bd85f0) at collectives/reduce.cu:236
#4 0x000000000040a02d in RunTest (sendbuff=0x2bd8870, recvbuff=0x2bd87f0, N=100, type=ncclInt8, op=ncclSum, root=0, comms=0x6aefe0, dList=…) at src/reduce_test.cu:134
#5 0x0000000000407980 in RunTests (N=100, type=ncclInt8, comms=0x6aefe0, dList=…) at src/reduce_test.cu:193
#6 0x00000000004033fd in main (argc=3, argv=0x7fffffffe358) at src/reduce_test.cu:294

I also encountered this problem. Did you eventually solve it, and if so, how? Thank you.