How can I use 2 GPUs and split the work between them?
Hi all,

I'm trying to do matrix multiplication with two GPUs, so that device 0 computes the upper half of matrix C and device 1 computes the lower half, using zero copy.

First, I don't know whether I have to use one kernel or two kernels.
Second, how can I control the upper part and the lower part?
Third, do I have to use cudaMemcpyAsync()?

What I did is like this:

[code]

// device 0 //

cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);

// float* a_h;
// ... (declarations elided)
cudaHostAlloc(&a_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc(&b_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc(&c_h, nBytes, cudaHostAllocMapped);

// float* a_map;
// ... (declarations elided)
cudaHostGetDevicePointer(&a_map, a_h, 0);
cudaHostGetDevicePointer(&b_map, b_h, 0);
cudaHostGetDevicePointer(&c_map, c_h, 0);

kernel<<<gridSize, blockSize>>>(a_map, b_map, c_map);


// device 1 //

cudaGetDeviceProperties(&prop, 1);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);

// float* a_h;
// ... (declarations elided)
cudaHostAlloc(&a_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc(&b_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc(&c_h, nBytes, cudaHostAllocMapped);

// float* a_map;
// ... (declarations elided)
cudaHostGetDevicePointer(&a_map, a_h, 0);
cudaHostGetDevicePointer(&b_map, b_h, 0);
cudaHostGetDevicePointer(&c_map, c_h, 0);

kernel<<<gridSize, blockSize>>>(a_map, b_map, c_map);


[/code]

Looking forward to some help.

Thanks

#1
Posted 04/08/2012 09:36 AM   
Hi,
Is it normal that I don't see any cudaSetDevice() in your code?

#2
Posted 04/08/2012 09:45 AM   
You need to use streams and cudaSetDevice() to issue kernel calls on different devices.
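
For example, just a rough sketch (it assumes your kernel, and that the device pointers a_map0/b_map0/c_map0 and a_map1/b_map1/c_map1 were already obtained on device 0 and device 1):

[code]
// rough sketch: a_map0/... and a_map1/... are assumed to be the mapped
// device pointers obtained on device 0 and device 1 respectively
cudaStream_t s0, s1;

cudaSetDevice(0);                 // subsequent calls target device 0
cudaStreamCreate(&s0);
kernel<<<gridSize, blockSize, 0, s0>>>(a_map0, b_map0, c_map0);

cudaSetDevice(1);                 // switch to device 1
cudaStreamCreate(&s1);
kernel<<<gridSize, blockSize, 0, s1>>>(a_map1, b_map1, c_map1);

// wait for both GPUs to finish before using the results
cudaSetDevice(0);
cudaStreamSynchronize(s0);
cudaSetDevice(1);
cudaStreamSynchronize(s1);
[/code]

Both kernel launches are asynchronous, so the two GPUs work at the same time; the host only blocks at the synchronization calls.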

#3
Posted 04/08/2012 11:40 AM   
Isn't it enough to use cudaSetDevice()?

Like this:

[code]
/// Device 0 ///

cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);
[/code]

Then...

[code]
/// Device 1 ///

cudaGetDeviceProperties(&prop, 1);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);
[/code]

Or just:

[code]
cudaSetDevice(0);
// do something
cudaSetDevice(1);
[/code]

Then how can I assign each device to do something? I still don't get the idea.
Could anyone please give me the steps in order?

Thank you

#4
Posted 04/08/2012 12:05 PM   
Hello,

You can do something like this:

[code]
cudaSetDevice(0);

// kernel calls with pointers for device 0

cudaSetDevice(1);

// kernel calls with pointers for device 1

// collect the results
[/code]
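
For the upper/lower split itself, here is a rough sketch only (it assumes a hypothetical matMulKernel that takes a row offset and a row count, and that a_map0/b_map0/c_map0 and a_map1/b_map1/c_map1 are the mapped pointers you obtained on each device in your first post):

[code]
// Hypothetical kernel: computes 'rows' rows of C starting at row 'rowOffset'.
__global__ void matMulKernel(const float *A, const float *B, float *C,
                             int N, int rowOffset, int rows)
{
    int row = rowOffset + blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rowOffset + rows && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Host side: gridHalf covers N/2 rows of C.
// Device 0 does rows [0, N/2), device 1 does rows [N/2, N).
cudaSetDevice(0);
matMulKernel<<<gridHalf, blockSize>>>(a_map0, b_map0, c_map0, N, 0, N / 2);

cudaSetDevice(1);
matMulKernel<<<gridHalf, blockSize>>>(a_map1, b_map1, c_map1, N, N / 2, N - N / 2);

// wait for both halves before reading C on the host
cudaSetDevice(0);
cudaDeviceSynchronize();
cudaSetDevice(1);
cudaDeviceSynchronize();
[/code]

With zero copy, both devices read the same A and B host buffers and each one writes only its own rows of C, so no cudaMemcpy of the result is needed.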
#5
Posted 04/08/2012 02:00 PM   
You might also find that ArrayFire makes multi-GPU usage much easier (handles the streams & synchronization automatically for you and automatically scales to the number of GPUs in the system). [url="http://www.accelereyes.com/arrayfire/c/group__device__mat.htm"]Details are here[/url].

John Melonakos ([email="john.melonakos@accelereyes.com"]john.melonakos@accelereyes.com[/email])

#6
Posted 04/08/2012 09:47 PM   
Thanks for the reply, I still don't know how to do it.

If it is zero copy, I don't have to use cudaMalloc or cudaMemcpy; I just use cudaHostAlloc and cudaHostGetDevicePointer. Then what should I do to make the upper half of C computed on one device and the lower half on the other?

Any ordered steps would help me so much.
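
My understanding of the zero-copy part so far is something like this (just a rough sketch with placeholder names; I think cudaHostAllocPortable is needed in addition to cudaHostAllocMapped so the same buffers can be mapped from both devices):

[code]
// set the map-host flag on each device before anything creates its context
cudaSetDevice(0);
cudaSetDeviceFlags(cudaDeviceMapHost);
cudaSetDevice(1);
cudaSetDeviceFlags(cudaDeviceMapHost);

// allocate the host buffers once (mapped + portable; placeholder names)
float *a_h, *b_h, *c_h;
cudaSetDevice(0);
cudaHostAlloc(&a_h, nBytes, cudaHostAllocMapped | cudaHostAllocPortable);
cudaHostAlloc(&b_h, nBytes, cudaHostAllocMapped | cudaHostAllocPortable);
cudaHostAlloc(&c_h, nBytes, cudaHostAllocMapped | cudaHostAllocPortable);

// device pointers to the buffers on device 0
float *a_map0, *b_map0, *c_map0;
cudaHostGetDevicePointer(&a_map0, a_h, 0);
cudaHostGetDevicePointer(&b_map0, b_h, 0);
cudaHostGetDevicePointer(&c_map0, c_h, 0);

// device pointers to the same buffers on device 1
float *a_map1, *b_map1, *c_map1;
cudaSetDevice(1);
cudaHostGetDevicePointer(&a_map1, a_h, 0);
cudaHostGetDevicePointer(&b_map1, b_h, 0);
cudaHostGetDevicePointer(&c_map1, c_h, 0);
[/code]

If that much is right, I still don't see where the split of C between the two devices comes in.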

#7
Posted 04/08/2012 11:41 PM   
What about this: make your code work on one GPU, show it to us, and we'll help you port it to multiple GPUs. Giving hints blindly isn't very effective.

#8
Posted 04/09/2012 06:09 AM   