NVDEC/CUDA/NVENC speed comparison

I would like to know how fast current NVIDIA graphics cards are, NOT for gaming, but for encoding and decoding performance.

OUR TESTS:

[table]
[tr]
[td]card[/td]
[td]NVDEC (FPS)[/td]
[td]NVENC H264 (FPS)[/td]
[td]NVENC H265 (FPS)[/td]
[td]CUDA DEINTERLACE (FPS)[/td]
[/tr]
[tr]
[td]QUADRO M4000:[/td]
[td]1250[/td]
[td]2300*[/td]
[td]1200*[/td]
[td]4000[/td]
[/tr]
[tr]
[td]GTX 960:[/td]
[td]1800[/td]
[td]1800[/td]
[td]900[/td]
[td]3000[/td]
[/tr]
[tr]
[td]GTX 1060:[/td]
[td]2600[/td]
[td]2600[/td]
[td]1800[/td]
[td]4000[/td]
[/tr]
[tr]
[td]GTX 1070:[/td]
[td]2600[/td]
[td]2600[/td]
[td]1800[/td]
[td]5000[/td]
[/tr]
[tr]
[td]GTX 1080:[/td]
[td]2600[/td]
[td]5200*[/td]
[td]2600*[/td]
[td]10000[/td]
[/tr]
[/table]

Encoding and decoding numbers are normalized to 720x576 resolution, and the units are FPS!

If you want to know the speed for:
HD (1280x720) - divide all numbers by 2
FHD (1920x1080) - divide all numbers by 4

For example, encoding to H264 on a GTX 1070 at FHD resolution will run at 650 FPS.
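To make the scaling rule concrete, here is a minimal C sketch of it (the helper is purely illustrative and just restates the divisors above):

#include <stdio.h>

/* Table numbers are FPS at 720x576 (SD); per the rule above,
 * divide by 2 for HD (1280x720) and by 4 for FHD (1920x1080). */
static int scaled_fps(int sd_fps, int divisor)
{
    return sd_fps / divisor;
}

int main(void)
{
    printf("GTX 1070 H264 @ HD:  %d FPS\n", scaled_fps(2600, 2)); /* 1300 */
    printf("GTX 1070 H264 @ FHD: %d FPS\n", scaled_fps(2600, 4)); /*  650 */
    return 0;
}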

  • cards marked with * have 2 NVENC engines, so the speed for a single session (one thread) will be half

Comment 1 - the Pascal generation also has better H265 quality!

There are more numbers available “officially” from NVIDIA for more chips (Kepler, Maxwell Gen 1, Maxwell Gen 2, Pascal) and for many encoding parameter sets (quality vs. speed) - NVIDIA VIDEO CODEC SDK | NVIDIA Developer

I had seen all of those documents before we created this table, but I was unable to find which GTX cards (as opposed to Quadro) have 2x NVENC chips, nor any NVDEC/CUDA performance figures, so this may help somebody learn the true power of these cards…
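For anyone wanting to sanity-check numbers like these, a simple way to measure NVENC throughput (not necessarily the method used for the table above; it assumes an FFmpeg build with NVENC support) is to encode to the null muxer so that no output file is written:

ffmpeg -benchmark -i input.mp4 -c:v h264_nvenc -f null /dev/null

Keep in mind that a single session presumably runs on one NVENC engine, so reaching the doubled totals marked with * would require two sessions running in parallel.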

The GTX 1080 has 2x the NVENC threads for a total of 5200 FPS. Do you know if streaming in OBS (encoding H264 using NVENC) will give double the performance compared to a GTX 1070 with 1 thread at 2600 FPS?

Yes, the GTX 1080 has double the performance of the GTX 1070.

Do you know if the GTX 1050 or 1050 Ti has the same H265 performance as the GTX 1060?

Also, do you think the GTX 1070 Ti will be close to the GTX 1080, or better, for encoding H265?

Just wondering what your thoughts are. Thanks, and sorry for the necro again.

How many NVENC engines does the new GTX 1070 Ti have?
Given that the GTX 1070 Ti is a slightly cut-down GTX 1080, I'm keen to know whether it has 1 or 2 NVENC engines and whether both are enabled.

It's strange: in the Video Encode and Decode GPU Support Matrix, the GeForce GTX 1070 - 1080 are listed with 2 NVENC engines, but in reality the GTX 1070 only contains one?

Hello Thunderm,

How did you test NVDEC?

I tried the steps given in the following link and tried to play back a 5MP video file. It is not even decoding at 20 FPS. Any suggestions?

Thanks,
Subbarao

We were able to achieve the decoding FPS given in the NVIDIA decoder application notes.

The following command gave us the clue:

ffmpeg -i input.mp4 -f null /dev/null

Reference: https://stackoverflow.com/questions/20323640/ffmpeg-deocde-without-producing-output-file/20325676
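Note that the command above uses whatever decoder ffmpeg picks by default; to make sure NVDEC is actually doing the work (assuming an FFmpeg build with CUDA/CUVID support), hardware decoding can be forced:

ffmpeg -hwaccel cuda -i input.mp4 -f null /dev/null

(Older builds spell the hwaccel "cuvid", or the decoder can be selected directly with -c:v h264_cuvid.)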

I further tried the hw_decode.c sample in the ffmpeg/doc/examples folder.
It took about 3x as long to decode the same input.mp4 file as the ffmpeg command given above.
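For reference, the sample takes the hardware device type as its first argument (per the usage string in the FFmpeg versions I have seen), so for NVDEC it is run roughly like this:

./hw_decode cuda input.mp4 output.raw

Note that, unlike the null-muxer ffmpeg command, hw_decode also writes every raw decoded frame to the output file, which adds disk I/O on top of the GPU-to-host transfer.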

Next, I modified hw_decode.c as follows:

    ret = avcodec_receive_frame(avctx, frame);
    if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF) {
        av_frame_free(&frame);
        av_frame_free(&sw_frame);
        return 0;
    } else if (ret < 0) {
        fprintf(stderr, "Error while decoding\n");
        goto fail;
    }

/* QUICK_RELEASE: free the decoded frame immediately and bail out,
 * skipping the GPU-to-host transfer below, so that only the pure
 * decode time is measured. */
#define QUICK_RELEASE
#ifdef QUICK_RELEASE
    av_frame_free(&frame);
    av_frame_free(&sw_frame);
    return 0;
#endif

    if (frame->format == hw_pix_fmt) {
        /* retrieve data from GPU to CPU */
        if ((ret = av_hwframe_transfer_data(sw_frame, frame, 0)) < 0) {
            fprintf(stderr, "Error transferring the data to system memory\n");
            goto fail;
        }
        tmp_frame = sw_frame;
        /* ... rest of the function is unchanged ... */
Here the frame gets decoded and is immediately released, before the decoded frame is ever transferred to the host. After this change the time taken by the program dropped by about 3x and matched the ffmpeg command.

So the conclusion is that the transfer of data from GPU memory to system memory is what takes the time. I feel that shared memory is the only way to overcome this. Any other suggestions?
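Another option might be to keep the frames on the GPU for the whole pipeline, so the copy never happens at all. With FFmpeg this should be possible via -hwaccel_output_format cuda (assuming a CUDA-enabled build), which leaves the decoded frames in device memory:

ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -f null /dev/null

Of course this only helps if the downstream processing (scaling, filtering, encoding) can also run on the GPU.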

Thanks,
Subbarao