Data transfer between CPU and GPU
Hi there, I have two questions:

First question: I need to transfer data from GPU to CPU and from CPU to GPU. To compute the transfer rate I'm timing the transfers with OpenCL events. It looks like the GPU-to-CPU transfer is faster than the CPU-to-GPU transfer (12.2 GB/s vs 11 GB/s). I read somewhere that this behavior is normal, but I don't know why: is it because of restrictions imposed by PCIe or by the GPU? Any explanation and links would be useful. BTW: I'm using an NVIDIA C2070 GPU on PCIe x16 Gen2, and the host buffer is pinned memory.

Second question: what I actually need is to transfer data from GPU1 to GPU2, so I'm doing it in two transfers: GPU-to-CPU and then CPU-to-GPU, using pinned memory. Is there any way to transfer GPU-to-GPU directly? Both GPUs are C2070s.

Thanks.

#1
Posted 11/08/2011 10:53 PM   
Max bandwidth for your PCIe x16 Gen2 Tesla C2070 is 8 GB/s each way, so both reported bandwidths are suspect to me. I seem to recall running into reporting issues in the NVIDIA oclBandwidthTest application last year; perhaps there is still a bug there, if that is what you are using.

#2
Posted 12/07/2011 08:54 PM   
The actual practical limit is around 5.5 GB/s with pinned memory; 8 GB/s is theoretical and doesn't take communication protocol overhead into account.

As for GPU-GPU transfers: if both GPUs are on the same PCIe bus, the OS is 64-bit, the application is compiled for 64 bits, and (under Windows 7) the GPUs are in TCC mode (under Linux/XP this isn't needed), you can do a direct copy between GPUs that bypasses CPU memory (check that the device reports Unified Virtual Addressing (UVA) as enabled). This is with CUDA 4.
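A rough sketch of what that looks like with the CUDA 4 runtime API (illustrative and untested here, since it needs two peer-capable GPUs; device numbers and the buffer size are made up, and error checking is omitted):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int can01 = 0, can10 = 0;
    size_t nbytes = 64 << 20;            /* example size: 64 MiB */
    float *buf0, *buf1;

    /* Check that each device can map the other's memory (requires UVA). */
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        fprintf(stderr, "peer access not available\n");
        return 1;
    }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);    /* flags argument must be 0 */
    cudaMalloc(&buf0, nbytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&buf1, nbytes);

    /* Direct device-to-device copy over PCIe, bypassing host memory. */
    cudaMemcpyPeer(buf1, 1, buf0, 0, nbytes);
    cudaDeviceSynchronize();
    return 0;
}
```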

#3
Posted 12/11/2011 02:17 PM   
[quote name='laughingrice' date='11 December 2011 - 09:17 AM' timestamp='1323613040' post='1340738']
The actual practical limit is around 5.5 GB/s with pinned memory; 8 GB/s is theoretical and doesn't take communication protocol overhead into account.

As for GPU-GPU transfers: if both GPUs are on the same PCIe bus, the OS is 64-bit, the application is compiled for 64 bits, and (under Windows 7) the GPUs are in TCC mode (under Linux/XP this isn't needed), you can do a direct copy between GPUs that bypasses CPU memory (check that the device reports Unified Virtual Addressing (UVA) as enabled). This is with CUDA 4.
[/quote]

Thanks, but how do I implement this in OpenCL? I was running into issues.

#4
Posted 12/20/2011 05:08 PM   
Having an issue with what?

#5
Posted 12/22/2011 12:01 AM   
[quote name='laughingrice' date='22 December 2011 - 05:31 AM' timestamp='1324512082' post='1345505']
Having an issue with what?
[/quote]


Respected Sir,
Recently I am facing a problem in OpenCL for which I am not able to find a solution at the moment. I will explain the situation with an example below.

int previous_pixel;
fs = get_global_id(0);
us = fs % frame_width;
if (us == 0)
    previous_pixel = 0;
if (input_buffer[fs] == some_value)
    previous_pixel = fs;
else if (previous_pixel != 0)
{
    // operation being done
    H = fs + some_value;
    // calculate some value "h" here, then:
    previous_pixel = h;
}

Well, this is the problem I am facing: how can I solve this dependency problem for previous_pixel? It is taking the value zero for all threads.
Thanks in advance
Best regards
Megharaj

#6
Posted 01/24/2012 10:30 AM   
I'm not fully following the code, but previous_pixel doesn't seem to be initialized anywhere except in the us == 0 case, so in every other work-item it starts out as a junk value, which might very well be zero.

#7
Posted 01/29/2012 01:49 PM   
[quote name='laughingrice' date='29 January 2012 - 07:19 PM' timestamp='1327844958' post='1362193']
I'm not fully following the code, but previous_pixel doesn't seem to be initialized anywhere except in the us == 0 case, so in every other work-item it starts out as a junk value, which might very well be zero.
[/quote]


I got the solution for that. I am now running one row per thread, because the dependency exists only within a row; each row is independent. So I got the solution as shown below.

Initially I was doing this:

global_worksize = width * height;
__kernel(…)
{
    id = get_global_id(0);
    // operation
}

Now:

global_worksize = height; // only height threads
__kernel(…)
{
    id = get_global_id(0);
    for (i = 0; i < width; i++)
    {
        // operation
    } // end of for loop
}

#8
Posted 01/30/2012 05:27 AM   