How do I set the GPU's PCIe max payload size? Please help.

The max payload size (packet size) is the lower of the max payload size supported by the root complex (i.e. the motherboard) and the max payload size supported by the endpoint (i.e. the GPU). You can inspect these values directly using lspci on Linux. On the particular Dell workstation (T3500) that I happened to look at, the root-complex max payload size was not a BIOS-adjustable option (although it may be on some motherboards). Using lspci -vvvx, I could see that the max payload size supported by the root complex was 256 bytes, whereas the max supported by the GPU was 128 bytes, so 128 bytes was the configured value.
The choice of 128 made by the GPU is probably a compromise. If there is a mix of large and small packets, choosing a very large size (like 4096 bytes, the maximum supported by PCIe) would benefit the large transfers but could otherwise “penalize” short-message PCIe traffic.
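If you want to read those two fields programmatically rather than from lspci output, below is a minimal sketch that walks the PCI capability list through Linux sysfs and decodes both the supported and the configured max payload size. It assumes Linux; the BDF 0000:03:00.0 is a placeholder (substitute your GPU's address as shown by lspci), and reading the capability area of config space generally requires root.

```cpp
// Minimal sketch: decode PCIe Max_Payload_Size from config space.
// "0000:03:00.0" is a placeholder BDF; replace it with your GPU's
// address as reported by lspci. Run as root so the full config
// space (beyond the first 64 bytes) is readable via sysfs.
#include <cstdint>
#include <cstdio>
#include <fstream>

static uint32_t rd(std::ifstream &f, int off, int n) {
    uint32_t v = 0;
    f.seekg(off);
    f.read(reinterpret_cast<char *>(&v), n);  // little-endian host assumed
    return v;
}

int main() {
    std::ifstream cfg("/sys/bus/pci/devices/0000:03:00.0/config",
                      std::ios::binary);
    if (!cfg) { std::puts("cannot open config space (need root?)"); return 1; }
    // Walk the capability list (head pointer at offset 0x34) looking
    // for the PCI Express capability (capability ID 0x10).
    int ptr = rd(cfg, 0x34, 1);
    while (ptr && rd(cfg, ptr, 1) != 0x10)
        ptr = rd(cfg, ptr + 1, 1);
    if (!ptr) { std::puts("no PCIe capability found"); return 1; }
    uint32_t devcap = rd(cfg, ptr + 0x04, 4);  // Device Capabilities
    uint32_t devctl = rd(cfg, ptr + 0x08, 2);  // Device Control
    // Encoding in both registers: 0 -> 128B, 1 -> 256B, ..., 5 -> 4096B
    std::printf("max payload supported:  %u bytes\n", 128u << (devcap & 0x7));
    std::printf("max payload configured: %u bytes\n", 128u << ((devctl >> 5) & 0x7));
    return 0;
}
```

lspci -vvv reports the same two values on its DevCap and DevCtl lines.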

We are using a Tesla P4 and want to test the best bandwidth of one P4, but the 128-byte max payload size (we want it to be 256 bytes) lowers the efficiency. I tried to set the Device Control register, which controls PCI Express device-specific parameters, but it does not work. Please give me some help, thanks very much.

To the best of my knowledge, the maximum PCIe payload size of the GPU is not user configurable. Why are you attempting to change this value? What real-world problem are you trying to address?

You can achieve just slightly over 12 GB/sec full duplex across PCIe gen3 x16 if you send the data in large enough blocks. With NVIDIA GPUs, this maximum achievable throughput rate is typically reached when transfer size is in the 8 MB to 16 MB region.
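If you want to reproduce that curve yourself, here is a minimal sketch of a host-to-device bandwidth sweep using pinned host memory, a stripped-down cousin of the bandwidthTest CUDA sample. Throughput should climb with transfer size and flatten out once transfers reach a few megabytes.

```cpp
// Minimal sketch of a host-to-device bandwidth sweep over transfer
// sizes, using pinned memory. Compile with: nvcc bw_sweep.cu -o bw_sweep
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t maxBytes = 32u << 20;   // sweep up to 32 MB
    const int    reps     = 20;          // copies per measurement
    void *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, maxBytes);        // pinned (page-locked) host buffer
    cudaMalloc(&d, maxBytes);
    cudaMemcpy(d, h, maxBytes, cudaMemcpyHostToDevice);  // warm-up copy
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    for (size_t bytes = 1u << 20; bytes <= maxBytes; bytes *= 2) {
        cudaEventRecord(t0);
        for (int i = 0; i < reps; ++i)
            cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%6zu KB : %6.2f GB/s HtoD\n", bytes >> 10,
               (double)bytes * reps / (ms * 1e6));
    }
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```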

Firstly, thanks very much for your reply. We are using a CPU + FPGA + GPU setup to access a remote GPU (on another board), and we want to get the best bandwidth. The max payload size is just 128 bytes, which wastes a large proportion of the link; if we could set it to 256 bytes, the bandwidth would become larger. Can we change the driver to set it?

As I recall, the PCIe packet header is 16 bytes, so increasing the packet payload from 128 to 256 bytes would theoretically provide 5.9% more bandwidth. I have a hard time imagining that this represents a make-or-break scenario for your use case.
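For reference, the arithmetic behind that 5.9% (assuming the 16-byte per-packet overhead above):

    efficiency at 128-byte payload: 128 / (128 + 16) ≈ 0.889
    efficiency at 256-byte payload: 256 / (256 + 16) ≈ 0.941
    0.941 / 0.889 ≈ 1.059, i.e. about 5.9% more effective bandwidth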

I am reasonably certain (it has been quite a few years since I last dealt with this in detail) that the maximum PCIe packet size of the GPU is simply a function of the hardware and cannot be changed. The value could theoretically be different for different GPU architectures, so you might want to check the latest GPUs available to see whether it has been increased. I don’t think it has, but I can’t be sure.

Are you writing your own GPUDirect driver for the FPGA to facilitate direct DMA transfers between GPU and FPGA?

No, we just pass the GPU's DMA traffic through; we access a remote GPU on another board. So each packet costs 128 bytes (payload) + 16 bytes (header) + 16 bytes (CRC) + 24 bytes (MAC). Another question we have run into: on PCIe gen3 x8 we now measure a Host-to-Device bandwidth of 5.5 GB/s and a Device-to-Host bandwidth of 6 GB/s. HtoD is smaller; can we optimize it? We see that about every 20 us the bridge applies back pressure for about 160 ns. Can you help me?
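Working out the link efficiency with those numbers: each 128-byte payload travels with 16 + 16 + 24 = 56 bytes of overhead, so

    128 / (128 + 56) = 128 / 184 ≈ 0.70
    256 / (256 + 56) = 256 / 312 ≈ 0.82

so if the payload could be 256 bytes we would get roughly 18% more usable bandwidth, which is why it matters to us.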

From observation, I know that maximum HtoD and DtoH rates sometimes differ a bit, just as you are observing. I do not know why that is; it seems to vary with GPU and host platform. Maybe it is a function of different amounts of buffering at the two endpoints? Are you using a modern high-end GPU (e.g. GTX 1080 Ti) in your tests?

BTW, your transfer rates look a little lower than I would expect for a x8 interface (around 6.3e9 bytes/second). Could that be due to non-optimal transfer sizes?
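The back-of-the-envelope numbers behind that expectation, with the caveat that the packet-overhead factor is a rough estimate:

    per lane: 8 GT/s × 128/130 (gen3 encoding) ≈ 7.88 Gbit/s ≈ 0.985 GB/s
    x8 raw:   8 × 0.985 GB/s ≈ 7.9 GB/s
    after roughly 20% packet/protocol overhead: ≈ 6.3 GB/s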

First of all, thank you very much. My company is trying to use an FPGA to access a remote GPU, so we want the best performance. In our tests we have found two problems affecting the bandwidth: one is the 128-byte payload, the other is the GPU tags. We obtained the GPU tag information by capturing the waveform. The GPU has 256 tags, but in fact it uses only 128 tags. Can we manually set the actual number of tags used?

I don’t know what these tags are that are being referenced. Presumably some low-level PCIe mechanism (e.g. part of a credit scheme?). How do you know that the “GPU has 256 tags”?

As its name indicates, this forum is for CUDA programming questions. As a consequence, 99% of the readers of this forum are probably software folks. It seems you need to get in touch with someone knowledgeable about GPU hardware, and the GPU’s PCIe interface in particular. Have you considered contacting your closest NVIDIA office and requesting to be put in touch with a field application engineer?

OK, thanks.
From analyzing the PCIe TLP packets, we know the tag field is 8 bits in total, but at most 128 tags are in flight. We are in mainland China and currently in the verification stage; we have only bought a small number of Tesla cards, so technical support is very slow and we could not wait, which is why we came here to ask. Are there any hardware-related forums where we can go for help?
