nvcuvid performance bug on M4000 compared to K2200

Hi NVIDIA dev forum admin or video team,

I am seeing odd nvcuvid performance on the M4000. I suspect either a software (driver) issue or a hardware issue.

This is my summary.

OS : Linux Ubuntu 14.04

GPU : K2200 (GM107) and M4000 (GM204)
Test app: NvTranscoder from the NVIDIA Video SDK 6 (6.0.1) samples
Test video : a 4K video, 1 min long
Driver version : 352.79 (included in CUDA SDK 7.5) and 351.28

Result:

[k2200]

./NvTranscoder -i /home/***/video/exid-updown_60s.mp4 -o test -deviceID 1
Encoding input           : "/home/***/video/exid-updown_60s.mp4"
         output          : "test"
         codec           : "H264"
         size            : 3840x1920
         bitrate         : 5000000 bits/sec
         vbvMaxBitrate   : 0 bits/sec
         vbvSize         : 0 bits
         fps             : 29 frames/sec
         rcMode          : CONSTQP
         goplength       : INFINITE GOP 
         B frames        : 0 
         QP              : 28 
         preset          : LOW_LATENCY_DEFAULT

Total time: 18085.107000ms, Decoded Frames: 1800, Encoded Frames: 1800, Average FPS: 99.529408

[m4000]

./NvTranscoder -i /home/***/video/exid-updown_60s.mp4 -o test -deviceID 0
Encoding input           : "/home/***/video/exid-updown_60s.mp4"
         output          : "test"
         codec           : "H264"
         size            : 3840x1920
         bitrate         : 5000000 bits/sec
         vbvMaxBitrate   : 0 bits/sec
         vbvSize         : 0 bits
         fps             : 29 frames/sec
         rcMode          : CONSTQP
         goplength       : INFINITE GOP 
         B frames        : 0 
         QP              : 28 
         preset          : LOW_LATENCY_DEFAULT

Total time: 22155.221000ms, Decoded Frames: 1800, Encoded Frames: 1800, Average FPS: 81.244958

Observations:

  • The K2200 outperforms the M4000.

  • I tested both drivers listed above → same result.
    I tested with both GPUs in one system, and then with each GPU alone in the same system → same result.

  • NVENC is very cheap here; the decoder dominates the total time.

  • The NvTranscoder source code creates the decoder with cudaVideoCreate_PreferCUVID, but I got the same result after switching to cudaVideoCreate_PreferCUDA.

  • From this forum I learned that cudaVideoCreate_PreferCUDA doesn’t always mean that the app uses CUDA kernels. CUDA is used instead of the VP only under certain conditions.

In NvDecodeGL.cpp

void displayHelp()
{
...
    printf("\t-decodecuda     - Use CUDA kernels for MPEG-2 (Available with 64+ CUDA cores)\n");
    printf("\t-decodecuvid    - Use NVDEC for MPEG-2, VC-1, H.264, or H.265 decode\n");
...
}

So in this case the app seems to use the VP (NVDEC).

  • The weird thing is that the K2200 beats the M4000 when the VP is used. Is the M4000’s VP inferior to the K2200’s, or does the driver not control it correctly?

Per the wiki (Nvidia PureVideo - Wikipedia),
GM107 and GM204 have the same VP6, so I would expect the same performance on either GM107 or GM204.

NVIDIA, could you take a look at this issue?

Please file a bug at developer.nvidia.com

I don’t know the link for bug reporting. Could you point me to it?

If you haven’t already registered, you will need to register:

https://developer.nvidia.com/accelerated-computing-developer (register or log in here)

Once your registration is processed and approved, or if you are already registered, then log in and you will have access to developer resources including the bug portal:

https://developer.nvidia.com/accelerated-computing-developer-program-home (bug report link here)

Thanks txbob. I filed the bug, but I will keep this thread open in case other users have comments.

I haven’t received any updates from NVIDIA,
but my experiments turned up one clue.

The M4000’s clock speed is lower than the K2200’s:

773 MHz vs. 1046 MHz

(Referring to NVIDIA Quadro M4000 Specs | TechPowerUp GPU Database and NVIDIA Quadro K2200 Specs | TechPowerUp GPU Database)

The clock ratio is roughly in line with the performance ratio,
so this looks like not-a-bug.

Hi,

I see the same results with the M2000 and M4000: the M2000 can transcode more channels than the M4000. The M2000’s clock is higher, but the M4000 has 2 NVDEC/NVENC engines, so the result is very interesting.