Real-time GPU processing: peer-to-peer data copy, Linux kernel memory, kernels in kernel
Hello,

I am using a GPU (a Tesla for the time being) for radar data processing. In this scenario some bespoke hardware, a PCIe plug-in card, captures and accumulates radar data. Once a certain amount of data has accumulated (4MB), it has to be transferred to the GPU for processing (every 8ms). The challenge is both to get the data onto the GPU and to carry out the algorithms in a timely fashion.

Ideally I would like to trigger the data transfer from the data capture card to the GPU directly and be notified once the transfer has completed, so that the CPU only orchestrates and is not involved in data shuffling. I understand that there is no such support for peer-to-peer data transfer in the NVidia driver so far, but could there be? I imagine that many applications face a similar problem, and it is simply neither efficient nor elegant to transfer all data to RAM first.

Failing a peer-to-peer data transfer, my driver for the data capture card transfers data to reserved, page-locked kernel memory, which I can mmap into user space. Now, how do I efficiently copy the data to the GPU?

1. Host-to-device copy directly from the mmap'ed region. This might be slow.
2. Copy all data from the mmap'ed region into a cudaMallocHost'ed area and then do the host-to-device copy.

Is there a more elegant way of transferring data that lives in a page-locked memory area?
Could I tell the NVidia driver to use my mmap'ed region as if it were cudaMallocHost'ed?
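
For concreteness, this is roughly what the two options above would look like (just a sketch; /dev/capture, the 4MB frame size and the omitted error handling are placeholders of mine):

[code]/* Sketch of options 1 and 2; device name and frame size are placeholders. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <cuda_runtime.h>

#define FRAME_BYTES (4 * 1024 * 1024)

int main(void)
{
    int fd = open("/dev/capture", O_RDWR | O_SYNC);
    char *h_mmap = (char *)mmap(0, FRAME_BYTES, PROT_READ, MAP_SHARED, fd, 0);
    char *h_pinned, *d_frame;
    cudaMallocHost((void **)&h_pinned, FRAME_BYTES);   /* pinned and known to CUDA */
    cudaMalloc((void **)&d_frame, FRAME_BYTES);

    /* Option 1: copy straight from the mmap'ed region.
       CUDA does not know these pages are locked, so it treats them as pageable. */
    cudaMemcpy(d_frame, h_mmap, FRAME_BYTES, cudaMemcpyHostToDevice);

    /* Option 2: stage through the cudaMallocHost'ed buffer first. */
    memcpy(h_pinned, h_mmap, FRAME_BYTES);
    cudaMemcpy(d_frame, h_pinned, FRAME_BYTES, cudaMemcpyHostToDevice);

    return 0;
}[/code]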

Last, what if I wanted to do all of that from the kernel? Could I talk to the GPU from a kernel module, i.e. is the API available in kernel space? Could I launch CUDA kernels from a kernel module?

My aim is to write an as-real-time-as-it-gets application that transfers data from the data capture card to the GPU and launches a few kernels every 8ms. It won't be much code, but it has to execute in a deterministic fashion. That's why I would prefer to stay in the kernel, synchronised to the data capture card.

Many thanks for any hints.
Kind regards
Peter

#1
Posted 06/09/2010 02:32 PM   
Hello,

are my questions too unusual, strange, or out of range?

Nvidia, can I copy data from a bespoke PCIe card onto the Tesla, e.g. by setting up a DMA transfer or some prefetch mechanism? Perhaps not now, but real soon?

This is vital to our application and I would be very glad if it were possible.

Cheers,
peter

#2
Posted 06/10/2010 02:01 PM   
[quote name='PeterW' post='1071344' date='Jun 10 2010, 04:01 PM']Hello,

are my questions too unusual, strange, or out of range?

Nvidia, can I copy data from a bespoke PCIe card onto the Tesla, e.g. by setting up a DMA transfer, or some prefetch mechanisms? Perhaps not now, but real soon?

This is vital to our application and I would be very glad if it were possible.

Cheers,
peter[/quote]
I would check the Infiniband thread from a few days ago. For Infiniband something like this is possible, and someone posted that on Linux a small modification of the open-source part of the NVidia kernel driver should also allow this.

greets,
Denis

#3
Posted 06/10/2010 03:16 PM   
Unfortunately I am not able to find the thread you mention, neither by looking through the past few weeks nor via the "Search this forum" mechanism. Would you have a link to the thread, or its title?
Cheers,
peter

#4
Posted 06/10/2010 04:06 PM   
[quote name='PeterW' post='1071403' date='Jun 10 2010, 06:06 PM']Unfortunately I am not able to find the thread you mention, neither by looking through the past few weeks, nor by the Search this forum mechanism. Would you have a link to the thread, or its title?
Cheers,
peter[/quote]

google works better than the forum search :) : [url="https://www.google.com/search?hl=en&lr=&as_qdr=all&q=site:forums.nvidia.com+infiniband&aq=f&aqi=&aql=&oq=&gs_rfai="]https://www.google.com/search?hl=en&lr=...q=&gs_rfai=[/url]

This is the topic: [url="http://forums.nvidia.com/index.php?showtopic=170188&pid=1069704&st=0&#entry1069704"]http://forums.nvidia.com/index.php?showtop...p;#entry1069704[/url]

greets,
Denis

#5
Posted 06/11/2010 08:40 PM   
Many thanks for the link; it looks interesting, though mainly for people using Infiniband.

It still does not address what I would like to achieve: get a data acquisition card to copy captured data directly onto the GPU card. I may be completely mistaken and this kind of peer-to-peer copy is not possible on PCIe, or it depends on the PCIe hub and on who can become bus master. But the bottom line is: GPUs can be used in many ways, and one important one is the processing of captured data, where the capture in many cases is done by some PCIe peripheral or FPGA card. What is the point of having 1 TFLOPS if you cannot get the data onto the system?

As mentioned earlier, if I cannot have a peer-to-peer copy, then I have to find an efficient way of transferring captured data onto the GPU with a minimum amount of copying. Since I capture the data into memory reserved at boot time, and efficient transfer of data to the GPU only works well with cudaMallocHost'ed memory, how can I tell CUDA something like cudaUseMemoryAsPinned()?

Cheers,
peter

#6
Posted 06/14/2010 01:10 PM   
[quote name='PeterW' post='1073246' date='Jun 14 2010, 03:10 PM']Many thanks for the link; it looks interesting, for people using Infiniband though.

It still does not address what I would like to achieve: get a data acquisition card to copy captured data directly onto the GPU card. I may be completely mistaken and this kind of peer-to-peer copy is not possible on PCIe, or it depends on the PCIe hub and on who can become bus master. But the bottom line is: GPUs can be used in many ways, and one important one is the processing of captured data, where the capture in many cases is done by some PCIe peripheral or FPGA card. What is the point of having 1 TFLOPS if you cannot get the data onto the system?

As mentioned earlier, if I cannot have a peer-to-peer copy, then I have to find an efficient way of transferring captured data onto the GPU with a minimum amount of copying. Since I capture the data into memory reserved at boot time, and efficient transfer of data to the GPU only works well with cudaMallocHost'ed memory, how can I tell CUDA something like cudaUseMemoryAsPinned()?

Cheers,
peter[/quote]
Did you read the post about changing the low-level allocator in the driver? There is a post describing how the Infiniband trick was already possible by changing some part of the driver. I think it should then be possible to map a region of your capture card into (pinned) memory and use that pinned memory either with cudaMemcpyAsync or directly in a kernel by means of zero-copy.
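
For reference, the zero-copy path with ordinary mapped pinned memory looks roughly like this (a sketch only; the GPU kernel and the sizes are made up, and the card has to report canMapHostMemory):

[code]/* Zero-copy sketch: the GPU kernel reads mapped pinned host memory over PCIe. */
#include <cuda_runtime.h>

__global__ void consume(const float *src, int n) { /* hypothetical processing kernel */ }

int main(void)
{
    const int N = 1 << 20;
    float *h_buf, *d_alias;

    cudaSetDeviceFlags(cudaDeviceMapHost);                   /* must precede context creation */
    cudaHostAlloc((void **)&h_buf, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);   /* device-side alias of h_buf */

    /* ... have the capture card (or its driver) deposit data into h_buf here ... */

    consume<<<N / 256, 256>>>(d_alias, N);                   /* no explicit cudaMemcpy */
    cudaThreadSynchronize();
    return 0;
}[/code]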

greets,
Denis

#7
Posted 06/14/2010 01:37 PM   
Hello again,

[quote name='E.D. Riedijk' post='1073261' date='Jun 14 2010, 02:37 PM']Did you read the post about changing the low-level allocator in the driver? There is a post about how the infiniband trick was already possible by changing some part of the driver.[/quote]
Well, I read the post that mentioned changing the low-level allocator in the driver. But I did not see any links to code, or instructions on how to actually do it.

[quote name='E.D. Riedijk' post='1073261' date='Jun 14 2010, 02:37 PM']I think it should that way be possible to map a region of your capture card into (pinned) memory and use that pinned memory either with cudamemcpyasync ...[/quote]

Do you really mean to mmap a region of my capture card, e.g. by implementing the mmap file ops, so that the PCIe device memory window is exposed to user space? If I then issued a cudaMemcpyAsync from this region onto a CUDA device memory address, would the CUDA driver really carry out a device-to-device copy? How would the cudaMemcpyAsync call know that this is pinned memory? Does cudaMemcpyAsync carry out tests before copying to find out whether the memory is in DMA-able regions and the physical pages are contiguous?
Or did you mean memory allocated by the driver, but still RAM and not I/O space?

[quote name='E.D. Riedijk' post='1073261' date='Jun 14 2010, 02:37 PM']... or directly in the kernel through means of zero-copy.[/quote]
I think I do not quite understand. Does this mean copying from one card onto the GPU without copying data to RAM, or does it mean that the GPU can work off the capture card memory directly, i.e. truly zero copy?

I am more confused than ever.
Cheers,
peter

#8
Posted 06/14/2010 03:59 PM   
Are you still in the planning stage, or do you have some initial results that are problematic? How far are you from meeting the 8ms deadline? Does it break consistently or only sometimes, in which case the RT patches might help (actually, I believe those would be a prerequisite)?

#9
Posted 06/14/2010 05:48 PM   
I'm not sure if it's what you're looking for, but I stumbled upon [url="http://www.opengl.org/registry/specs/NV/video_capture.txt"]this OpenGL extension[/url].
Never tried it myself, but if it works you could use cuda-gl interop.

N.

#10
Posted 06/14/2010 06:46 PM   
As a matter of fact, I am working on the implementation of a smaller proof-of-concept system.

Data have to be collected in smaller chunks from the FPGA card at ~2kHz. I started off with a userIO-based driver that catches an interrupt and wakes up a process that has select()ed on the corresponding device file. Once woken up, the user-space part reads from the capture card's memory window, exposed to user space by the userIO driver. But this turned out not to be reliable enough: at a 2kHz wakeup rate the user-space part missed a few windows at no load and quite a few under higher load.

So I now remain in the kernel and accumulate data into reserved memory. Every 8ms a data frame is complete and I can wake up a "Tesla" process, which then has to transfer the entire frame onto the GPU for processing. You are right, I will have to use some RT measures, set scheduling priorities and so on.

For the small system the memory transfers are not so much of an issue, but the full system will have to cope with 2GB/s, and there I would wish to have the data transferred more or less "automagically" from the data capture card onto the Tesla at a certain frequency, with a user-space program woken up only every 8ms to launch the CUDA kernels on the new data.
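
For what it is worth, the user-space side of that 8ms loop currently looks roughly like this (a sketch; the device file name, the blocking read as the wakeup mechanism and the processing kernel are placeholders of mine, error handling omitted):

[code]/* Sketch of the user-space process woken every 8ms; names are placeholders. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <cuda_runtime.h>

#define FRAME_BYTES (4 * 1024 * 1024)

__global__ void processFrame(char *d) { /* hypothetical radar processing kernel */ }

int main(void)
{
    int fd = open("/dev/radar_capture", O_RDWR | O_SYNC);   /* hypothetical char device */
    char *frame = (char *)mmap(0, FRAME_BYTES, PROT_READ, MAP_SHARED, fd, 0);
    char *d_frame;
    unsigned long seq;

    cudaMalloc((void **)&d_frame, FRAME_BYTES);
    for (;;) {
        read(fd, &seq, sizeof(seq));                         /* blocks until a frame is complete */
        cudaMemcpy(d_frame, frame, FRAME_BYTES, cudaMemcpyHostToDevice);
        processFrame<<<128, 256>>>(d_frame);
        cudaThreadSynchronize();                             /* finish before the next 8ms slot */
    }
    return 0;
}[/code]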

Cheers,
peter

#11
Posted 06/15/2010 09:49 AM   
Thanks for the hint. I am not quite sure whether I understand the purpose of the extension; it must be something around video capture from a video capture device, possibly a device on the system bus and not on the graphics card itself. I could not quite see how a data pipeline between two devices on a bus could be established, in particular if the two devices are quite different: a GPU card with an NVidia driver and, in my case, an FPGA card with its own driver.

For me it starts boiling down to the following questions:

- Does PCIe allow for peer-to-peer data transfers, and if so, what support is required from the participating cards?
- Can I request a transfer of a data block from one PCIe device to another PCIe device, or would this be triggered automagically by a memcpy()?
- If I know the source PCIe memory window of my capture card, how do I get the destination memory window of a preallocated memory region on the GPU?

Cheers,
Peter

#12
Posted 06/15/2010 10:07 AM   
[quote name='PeterW' post='1073717' date='Jun 15 2010, 12:07 PM']- If I know the source PCIe memory window of my capture card, how do I get the destination memory window of a preallocated memory region on the GPU?[/quote]
Think the other way around. If you have a pointer to (non-swappable) memory that actually points to the memory window of your PCIe card, you can have your GPU DMA that memory in one go from the device to the GPU.
I have no idea if you can map that PCIe-card memory into the normal memory space, but I can imagine that it will by default be non-swappable. Then the only thing you need to do is:
- modify cudaMallocHost to also accept a pointer given by you as input, and let CUDA think it is standard pinned memory; that is where the driver mod comes in.

A cudaMemcpy is actually a DMA by the GPU from non-swappable memory. If you have allocated normal (pageable) memory, the runtime makes a small pinned memory buffer, copies some data into that buffer, lets the GPU fire off a DMA, then copies the next bit, and so on.
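
You can see that staging effect quite easily by timing the same copy from pageable and from pinned memory (a sketch; the 16MB just mirrors your frame size, and the absolute numbers will depend on your board and chipset):

[code]/* Sketch: pinned vs. pageable host-to-device copy, timed with CUDA events. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 16 * 1024 * 1024;
    char *h_pageable = (char *)malloc(size);
    char *h_pinned, *d_buf;
    cudaEvent_t t0, t1;
    float ms;

    cudaMallocHost((void **)&h_pinned, size);
    cudaMalloc((void **)&d_buf, size);
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    cudaMemcpy(d_buf, h_pageable, size, cudaMemcpyHostToDevice); /* staged through a driver buffer */
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("pageable: %.2f ms\n", ms);

    cudaEventRecord(t0, 0);
    cudaMemcpy(d_buf, h_pinned, size, cudaMemcpyHostToDevice);   /* direct DMA from pinned pages */
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("pinned:   %.2f ms\n", ms);

    return 0;
}[/code]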

greets,
Denis

#13
Posted 06/16/2010 05:23 AM   
Hello and many thanks for the mind opener!

[quote name='E.D. Riedijk' post='1074210' date='Jun 16 2010, 06:23 AM']Think the other way around. If you have a pointer to (non-swappable) memory that actually points to the memory window of your pcie card, you can have your GPU DMA that memory in one go from the device to the GPU.[/quote]
I always assumed that a DMA transfer from the host to the GPU is controlled by the host, i.e. by the NVidia driver. But yes, the GPU can DMA and be the bus master too. If I understand you correctly, this is what happens in a cudaMemcpy().

[quote name='E.D. Riedijk' post='1074210' date='Jun 16 2010, 06:23 AM']I have no idea if you can map that pcie-card memory into the normal memory space, but I can imagine that it will default be non-swappable.[/quote]
The PCIe-card memory can certainly be ioremap'ed in the kernel to kernel virtual addresses, and mapped into user space by an mmap file operation of a char driver.

[quote name='E.D. Riedijk' post='1074210' date='Jun 16 2010, 06:23 AM']- modify cudaMallocHost to also allow a pointer given by you as input, and let CUDA think it is standard pinned memory, that is where the driver-mod comes in.[/quote]
So memory addresses have "colours", i.e. cudaMemcpy() can tell from the address itself whether it is pinned or swappable?

If the driver for my PCIe card implements the mmap file operation and this function uses remap_pfn_range to map the PCIe-card memory window into user space, would the resulting memory address look like a pinned memory address to cudaMemcpy()? Would cudaMemcpy consequently use a DMA transfer from the source address (on the PCIe card) onto the GPU? In that case I would not need to modify the cudaMallocHost() function?

If not: cudaMallocHost is a library function and I do not have the source code, so wouldn't it be difficult to modify?

Cheers,
peter

#14
Posted 06/16/2010 03:24 PM   
Hello,

following E.D. Riedijk's explanations, I tried out three different types of memory transfers to the GPU for a 16MB block:

1. host malloc'ed memory
2. cudaMallocHost (pinned) memory
3. PCIe device memory

For the last case, I have implemented the mmap file operation of my Xilinx PCIe card driver to map the PCI memory window:

[code]static int xilinxpcie_init(void) {
    :
    pcie_mem_hwaddr = pci_resource_start(pcie_dev, 0);
    pcie_mem_length = pci_resource_len(pcie_dev, 0);
    :
}

static int xilinxpcie_mmap(struct file *filp, struct vm_area_struct *vma) {
    :
    remap_pfn_range(vma, vma->vm_start,
                    pcie_mem_hwaddr >> PAGE_SHIFT,
                    pcie_mem_length, vma->vm_page_prot);
    :
}[/code]

The user space program mmaps the region and then calls a cudaMemcpy:

[code]int length = 16*1024*1024;
int fd = open("/dev/xilinxpcie", O_RDWR | O_SYNC);
h_data_mmap = mmap(0, length, PROT_READ | PROT_WRITE, MAP_FILE | MAP_SHARED, fd, 4096);
h_data_malloc = (char *)malloc(length);
cudaMallocHost((void **)&h_data_cudaMalloc, length);
cutilSafeCall(cudaMalloc((void **)&d_data, length));
:
if (type == MEMORY_TYPE_MMAP) {
    cutilSafeCall(cudaMemcpy(d_data, (void *)h_data_mmap, length, cudaMemcpyHostToDevice));
}
else if (type == MEMORY_TYPE_MALLOC) {
    memcpy(h_data_malloc, h_data_mmap, length);
    cutilSafeCall(cudaMemcpy(d_data, (void *)h_data_malloc, length, cudaMemcpyHostToDevice));
}
else /* if (type == MEMORY_TYPE_CUDAMALLOC) */ {
    memcpy(h_data_cudaMalloc, h_data_mmap, length);
    cutilSafeCall(cudaMemcpy(d_data, (void *)h_data_cudaMalloc, length, cudaMemcpyHostToDevice));
}[/code]

What I have measured for 16MB of data is:

A. To transfer 16MB from the PCIe xilinx card to RAM using memcpy into

1. malloc'ed host memory: [b]1711ms[/b]
2. cudaMallocHost pinned memory: [b]1701ms[/b]

B. Transfer from host to device takes

1. malloc'ed host to device transfer: [b]9ms[/b] = 1777MB/s
2. cudaMallocHost host to device transfer : [b]5ms[/b] = 3200MB/s

C. Put together (1. and 2. require a memcpy first):

1. device to device through host malloc'ed memory: [b]1720ms[/b]
2. device to device through cudaMallocHost (pinned) memory: [b]1706ms[/b]
3. PCIe device to device transfer: [b]880ms[/b]

So it appears that the peer-to-peer transfer is twice as fast as the transfer through the host, even though the transfer speed is still disappointing.

The conclusion is that it might be better to DMA data from a PCIe data acquisition card into RAM first (hopefully into cudaMallocHost'ed pinned memory, usable by the Xilinx driver?) and then transfer it onto the GPU.
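
If I go down that road, the Xilinx driver would have to pin and DMA-map the cudaMallocHost'ed user buffer itself. I imagine something along these lines (an untested sketch of mine with error handling omitted, assuming a page-aligned buffer; I do not know yet whether this coexists with the NVidia driver's own pinning of those pages):

[code]/* Untested sketch: pin a user buffer (e.g. from cudaMallocHost) so the FPGA can DMA into it.
 * Slots into the existing xilinxpcie driver; needs <linux/mm.h>, <linux/pci.h>, <linux/scatterlist.h>.
 * Assumes uaddr is page aligned; returns the number of DMA segments mapped, or <0 on failure. */
static int xilinxpcie_pin_user_buffer(unsigned long uaddr, size_t len,
                                      struct page **pages, struct scatterlist *sg)
{
    int nr_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    int i, got;

    down_read(&current->mm->mmap_sem);
    got = get_user_pages(current, current->mm, uaddr, nr_pages,
                         1 /* write */, 0 /* force */, pages, NULL);
    up_read(&current->mm->mmap_sem);
    if (got != nr_pages)
        return -EFAULT;

    sg_init_table(sg, nr_pages);
    for (i = 0; i < nr_pages; i++)
        sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

    /* map for DMA; the resulting bus addresses are what the FPGA's DMA engine gets programmed with */
    return pci_map_sg(pcie_dev, sg, nr_pages, PCI_DMA_FROMDEVICE);
}[/code]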

I hope these findings are useful to some,
peter

#15
Posted 06/17/2010 02:05 PM   