What's the situation with vdpau/vaapi/nvdec?

Hi

It seems this whole situation is starting to get a little bit more complex.

VDPAU hasn’t been updated for a while, and is still missing HEVC 10-bit and VP8/VP9 support. It seems this is never going to be updated given that NVIDIA now has nvdec.

The libva-vdpau-driver project gave us a VA-API implementation on top of VDPAU. However, the GLX part of libva has seemingly been deprecated, with the newer EGL/dma-buf method of passing buffers now preferred. That EGL path can’t be supported with the NVIDIA driver yet.

nvdec works great, but since it’s a proprietary API, not everything supports it.

Now, with Chrome and Firefox looking to (finally) use VA-API for hardware video decoding, it would be nice to have it take advantage of the GPU (esp. for laptop users). I’ve tried (somewhat successfully) to implement a VA-API backend for nvdec, however I fall over at getting the decoded image to the screen.

Without support for EGL_EXT_image_dma_buf_import, and some way to export the handle from within nvdec/cuda, it doesn’t seem like there is any way to implement this.
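For reference, this is roughly the path that works today on the Mesa drivers, and the kind of thing I’d need the NVIDIA driver to allow (a minimal sketch: single plane only, no error handling, and the helper name is made up):

#include <va/va.h>
#include <va/va_drmcommon.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

/* Wrap a decoded VA-API surface in a GL texture via a dma-buf. */
static GLuint import_va_surface(VADisplay va_dpy, VASurfaceID surface,
                                EGLDisplay egl_dpy)
{
    /* 1. Export the surface as a DRM PRIME (dma-buf) descriptor. */
    VADRMPRIMESurfaceDescriptor desc;
    vaExportSurfaceHandle(va_dpy, surface,
                          VA_SURFACE_ATTRIB_MEM_TYPE_DRM_PRIME_2,
                          VA_EXPORT_SURFACE_READ_ONLY |
                          VA_EXPORT_SURFACE_SEPARATE_LAYERS,
                          &desc);

    /* 2. Import plane 0 as an EGLImage - this is the
     *    EGL_EXT_image_dma_buf_import step the NVIDIA driver lacks. */
    PFNEGLCREATEIMAGEKHRPROC create_image =
        (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");
    EGLint attribs[] = {
        EGL_WIDTH,  (EGLint)desc.width,
        EGL_HEIGHT, (EGLint)desc.height,
        EGL_LINUX_DRM_FOURCC_EXT, (EGLint)desc.layers[0].drm_format,
        EGL_DMA_BUF_PLANE0_FD_EXT,
            desc.objects[desc.layers[0].object_index[0]].fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, (EGLint)desc.layers[0].offset[0],
        EGL_DMA_BUF_PLANE0_PITCH_EXT,  (EGLint)desc.layers[0].pitch[0],
        EGL_NONE
    };
    EGLImageKHR image = create_image(egl_dpy, EGL_NO_CONTEXT,
                                     EGL_LINUX_DMA_BUF_EXT, NULL, attribs);

    /* 3. Bind the image to a texture the browser can sample from. */
    PFNGLEGLIMAGETARGETTEXTURE2DOESPROC target_tex =
        (PFNGLEGLIMAGETARGETTEXTURE2DOESPROC)
            eglGetProcAddress("glEGLImageTargetTexture2DOES");
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
    target_tex(GL_TEXTURE_EXTERNAL_OES, image);
    return tex;
}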

Will there be a new version of VDPAU?

Is there a plan to implement EGL_EXT_image_dma_buf_import, or is it tied up with the whole EGLStreams/allocator situation?

Will nvdec/cuda gain the ability to export one of these handles? Or would that need to be bounced through OpenGL? (I believe you can already share a buffer between the 2 APIs).

Thanks & Regards
elFarto

It would be nice to have some feedback from Nvidia about this, now that a Chromium patch exists to do hardware video decoding through VA-API.

dma_buf support falls pretty firmly into the same bucket as the allocator disagreement. They have a new vdpau maintainer (which is a big step up from no vdpau maintainer) but it’s unclear if he’s in a position to take on anything very substantive (like new decode format support, 10-bit, or non-GL interop). Right now he’s working on a way for the existing OpenGL interop to return whole frames instead of fields.

CUDA has EGL interop but, of course, it’s based on EGLStreams, so you’d need to add a browser side code path to pull the images from a stream producer. That actually looks pretty simple, in terms of lines of code, but presumably will not be met with much enthusiasm.
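To give a feel for it, the two ends of that stream boil down to something like the following (setup, error handling and most of the CUeglFrame bookkeeping omitted; this is a sketch of the idea, not working code):

#include <stdint.h>
#include <cudaEGL.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

/* Consumer side (browser GL context): the bound external texture
 * receives whatever the producer presents into the stream. */
static void connect_consumer(EGLDisplay dpy, EGLStreamKHR stream, GLuint tex)
{
    PFNEGLSTREAMCONSUMERGLTEXTUREEXTERNALKHRPROC consumer_connect =
        (PFNEGLSTREAMCONSUMERGLTEXTUREEXTERNALKHRPROC)
            eglGetProcAddress("eglStreamConsumerGLTextureExternalKHR");
    glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
    consumer_connect(dpy, stream);
}

/* Producer side (decoder): push a pitched NV12 frame from nvdec into
 * the stream. 'conn' comes from cuEGLStreamProducerConnect(). */
static void present_frame(CUeglStreamConnection *conn,
                          CUdeviceptr nv12, size_t pitch, int w, int h)
{
    CUstream stream = 0;                     /* default stream */
    CUeglFrame frame = {0};
    frame.frameType      = CU_EGL_FRAME_TYPE_PITCH;
    frame.eglColorFormat = CU_EGL_COLOR_FORMAT_YUV420_SEMIPLANAR;
    frame.width      = w;
    frame.height     = h;
    frame.pitch      = pitch;
    frame.planeCount = 2;
    frame.frame.pPitch[0] = (void *)(uintptr_t)nv12;                       /* Y  */
    frame.frame.pPitch[1] = (void *)(uintptr_t)(nv12 + pitch * h);         /* UV */
    cuEGLStreamProducerPresentFrame(conn, frame, &stream);
}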

I’ve worked with both the OpenGL and Vulkan interops in mpv and they’re perfectly functional, although they work backwards from what you’d naturally want (you have to export a buffer/texture from the GL/Vulkan side and then copy the nvdec frame contents on the cuda side).
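Concretely, the GL interop boils down to registering a GL texture with CUDA and copying the decoded frame into it - roughly this (luma plane only, no error handling, and the helper name is mine):

#include <GL/gl.h>
#include <cuda.h>
#include <cudaGL.h>

/* Copy the luma plane of a pitched nvdec output frame into a GL texture. */
static void copy_luma_to_gl(GLuint tex, CUdeviceptr luma,
                            size_t pitch, int width, int height)
{
    CUgraphicsResource res;
    CUarray array;

    /* Register and map the GL-created texture so CUDA sees it as a CUarray. */
    cuGraphicsGLRegisterImage(&res, tex, GL_TEXTURE_2D,
                              CU_GRAPHICS_REGISTER_FLAGS_WRITE_DISCARD);
    cuGraphicsMapResources(1, &res, 0);
    cuGraphicsSubResourceGetMappedArray(&array, res, 0, 0);

    /* Device-to-device copy of the decoded frame into the texture. */
    CUDA_MEMCPY2D copy = {
        .srcMemoryType = CU_MEMORYTYPE_DEVICE,
        .srcDevice     = luma,
        .srcPitch      = pitch,
        .dstMemoryType = CU_MEMORYTYPE_ARRAY,
        .dstArray      = array,
        .WidthInBytes  = (size_t)width,
        .Height        = (size_t)height,
    };
    cuMemcpy2D(&copy);

    cuGraphicsUnmapResources(1, &res, 0);
    cuGraphicsUnregisterResource(res);
}

In practice you’d register the texture once and just map/copy per frame, but that’s the shape of it.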

Given that the browsers are, in practice, using GL for presentation, you could try and come up with a way to use GL interop to get frames out, but having looked at the firefox code for this, I know that there are multiple pretty strong abstractions between the presentation layer and the video decoder so making that direct GL connection would be messy.

So where does that leave us? vdpau definitely feels like a dead end at this point - both Intel and AMD are all in on vaapi, and vdpau has fallen behind in terms of support for the actual hardware (no 10bit, no vp9, etc, etc) and graphics frameworks (no Vulkan, no GL without GLX, etc). nvdec has all of that, but doesn’t support dma_buf based interop, although it has interops that could be used. If the browsers were using Vulkan, the story would actually be pretty good, but we’re not there yet.

So basically, the OP has a VA-API patch almost ready but needs an extension that’s not likely to ever materialize. If the browsers were using Vulkan, a patch to use nvdec would be trivial, but none of that is on the horizon.

Are we screwed then?

You’ve worked on mpv and I have to thank you, I’m watching a stream through mpv (vo=gpu/vdpau) as I write this, so great job! Now I just wish we were back in the days of mplayerplug-in, when we had all the bleeding-edge decoders embedded in the browser, however indirectly…

There is a possibility that we are. There seem to be three problems at this stage:

  • Lack of alignment on interop mechanisms
  • Convergence on VAAPI in the wider ecosystem
  • VDPAU falling away into irrelevance but no clear statements that consumers should switch to nvdec or that nvidia will advance VDPAU feature parity

All of these things make it hard to see what the correct way to approach nvidia support is, and even if you pick one, you can’t actually integrate in a simple way.

Also note that while the cuda/vulkan interop is fully standards based, it’s looking pretty clear that vaapi/vulkan interop will use a slightly different mechanism. Specifically, the cuda pattern is that you export a VkBuffer or VkImage from Vulkan using an OPAQUE_FD, and then copy the frame to the imported buffer/image in cuda code. On the vaapi side, I expect that we’ll see the frame exported using a DMA_BUF (of course), and then imported as a VkImage, and accessed in Vulkan. This puts us on a trajectory for having two code paths and unhappy devs, again.
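For context, the CUDA half of that OPAQUE_FD pattern looks roughly like this (the Vulkan-side export via vkGetMemoryFdKHR and all error handling omitted; the helper is just illustrative):

#include <cuda.h>

/* Import a Vulkan memory allocation that was exported as an OPAQUE_FD,
 * and map it as a device pointer the nvdec frame can be copied into. */
static CUdeviceptr import_vulkan_memory(int opaque_fd, size_t size)
{
    CUDA_EXTERNAL_MEMORY_HANDLE_DESC mem_desc = {
        .type      = CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD,
        .handle.fd = opaque_fd,
        .size      = size,
    };
    CUexternalMemory ext_mem;
    cuImportExternalMemory(&ext_mem, &mem_desc);

    CUDA_EXTERNAL_MEMORY_BUFFER_DESC buf_desc = {
        .offset = 0,
        .size   = size,
    };
    CUdeviceptr ptr;
    cuExternalMemoryGetMappedBuffer(&ptr, ext_mem, &buf_desc);
    return ptr;   /* cuMemcpy2D the frame here, then sample it from Vulkan */
}

The vaapi/Vulkan path would instead start from a dma_buf fd and go through VK_EXT_external_memory_dma_buf on the Vulkan side, which is exactly where the two code paths diverge.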

You’re welcome. Any reason you’re still using vdpau vs nvdec?

Errr…let’s not oversell it. I have a bare-bones VA-API implementation on top of NVDEC, with just enough code to get mpv to decode a single frame of MPEG-2 video in CPU mode. I stopped when I realised that there was no way to get the GPU mode working.

Regards
elFarto

The original reason was tearing, then I moved to ForceFullCompositionPipeline=On to accelerate web browsers, so mpv could move to vo=gl/gpu, but nvdec couldn’t decode HEVC on my GTX 980, while vdpau somehow did (even though it theoretically shouldn’t ;))

I’m sure it’s long fixed, I just never came back to re-examine it (but I’m planning to get a 1080 Ti before they disappear from the market - I promise to revisit this then :))

And I use a browser extension to just launch mpv for YouTube videos - but for example Twitch streamed through streamlink+mpv doesn’t drop loot ;) So there’s obviously a need for nvdec in the browser! :)

A 980 definitely cannot decode HEVC. I know, because I had one, and I ended up buying a 960 as well so I could implement the vdpau HEVC support in ffmpeg. So, I’m pretty sure what’s happening is it’s silently falling back to software decoding and you’re not noticing that happening. If you attach an mpv debug level log file, I can take a look. I definitely want to avoid situations where someone says nvdec isn’t working but vdpau is.

Yeah - it’s just going to be hard to achieve properly.

I believe it was HEVC and I remember checking that it shouldn’t work on a 980, but there was/still is a huge difference in CPU usage between vo=vdpau hwdec=vdpau and vo=opengl/gpu hwdec=nvdec. I assumed this was due to at least something being hardware accelerated, but who knows.

But since then I moved to ForceFullCompositionPipeline=On and compositing WM (which vo=vdpau doesn’t like) and so I’m stuck with vo=gpu anyways, so it doesn’t make a difference, right?

Wrong :) Note that I’m on vo=gpu all the time now. Playing an h264 video with hwdec=vdpau, my GTX 980 goes to the P5 perf level (35W draw according to nvidia-smi, basically 5W over P8, very nice). With hwdec=nvdec, same file, it stays at P2/58W. The difference at the wall is actually greater - 118W vs 145W for the whole system (that’s with a UPS, a few modems and a monitor or two - the computer itself is in double digits).

So over 25W difference when playing a video, and I play a lot of videos, so it’s actually economical to stick to hwdec=vdpau :)

If I get my dream 1080 Ti, I’ll check again. Hopefully it deals with power saving a bit better. But the 980 is truly nvdec unfriendly, so I’m sticking with vdpau for another few days ;)

Bad news - on a 1080 Ti it’s even worse - MUCH worse in fact!

1080p h264 video:

mpv --hwdec=vdpau → 110W at the wall (this one goes to P8, so better than a GTX 980)

mpv --hwdec=nvdec → 155W at the wall! (stays in P2 forever, card’s fans start after a while)

mpv --hwdec=off → 112W

720p h265 video:

mpv --hwdec=off → 111 W (vdpau falls back to software)

mpv --hwdec=nvdec → 155W

4K h265 video scaled down to a 1080p TV:

mpv --hwdec=off → 122W (stays in P5)

mpv --hwdec=nvdec → 156W

and somewhat interestingly:
mpv -vo sdl → 118W ([vo/sdl] Using opengl, card goes to P8 but CPU load is higher)

Of course all this is measured after 45+ seconds of each playback, because this is Nvidia: the card wakes up to P2 or even P0 whenever a GL window pops up and stays like that for 40 seconds, just in case. I only reported the numbers after a minute of continuous playback.

What all this means is that VDPAU and software decoding are both super efficient, while NVDEC for some reason keeps the card in high performance/power mode, hence being super inefficient.

So Nvidia, please bring back VDPAU!

Please retest nvdec with this xorg.conf snippet:

Option          "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerLevel=0x3; PowerMizerDefault=0x3; PowerMizerDefaultAC=0x3"

It’s well known that NVIDIA power management under Linux utterly sucks.

Yes, I’m well aware of the problem. I just explained that the first 40 seconds are always spent in P0/P2, which we all know happens. The thread says 36, but now it’s no less than 40 in P0, then like 5s more to go through P2-P5-P8.

But when watching a movie, that’s not that important - power consumption will quickly average down to the numbers I report above.

So let’s stay on topic. The issue you mention is about how long recent drivers take to switch between power levels. Meanwhile I’m reporting that using NVDEC in mpv prevents the card from EVER using any form of power saving. So even a fix for the 40-second burnout won’t change the situation.

It may still be mpv’s or ffmpeg’s fault somehow - I have no knowledge of other media players/codecs with nvdec support, so can’t really test this hypothesis.

PS. I won’t switch to PowerMizerLevel 3 (which maps to P8 in my post) because I actually game on the card too :)

It would be great if NVIDIA could chime in and reveal what’s going on WRT video decoding acceleration in Linux.

Is VDPAU still alive? Is NVDEC the way to go? What about VA-API support?

@AaronP

Given their silence on the matter, I doubt they want to comment on it. If they have a remedy for this I’m sure they would have already posted, and why bother posting just to confirm they’re not going to support yet another standard API (whether directly, or indirectly).

I’m rather wondering about the future of hw decoding on nvidia in general. Looks like on Windows, anything beyond h.264 doesn’t work anymore either. The additional media extensions seem to only support Intel GPUs.
Does nvdec aka nvcuvid even use the decoder unit, or is it a generic CUDA-based implementation, which the initial name and the power consumption would suggest?

Last time I checked, VP9 and H.265 hw decoding acceleration work just fine on Windows 10 (under MPC-HC/Firefox/Chrome) - both on NVIDIA and Intel GPUs. Don’t know about AMD hardware - no one around me owns AMD.

Everything works on windows. You can decode vp8/vp9/hevc through dxva (or whatever it’s called now) for new enough hardware from all three vendors.

nvdec is just a particular API - all three APIs (nvdec, dxva, vdpau) expose the same hardware decoders. CUDA code is only used for post processing and format conversion.

NVDEC is the way to go.

No word on VAAPI compatibility/support. I guess someone will have to write a compatibility layer between NVDEC and VAAPI, so that we’d finally have one universal video decoding acceleration API.
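Roughly speaking, such a layer would just be another libva driver: a shared object that libva dlopens and whose vtable entries get translated into NVDEC/CUVID work. A very hand-wavy sketch of the entry point (everything here besides the libva types is a placeholder):

#include <va/va_backend.h>

static VAStatus shim_terminate(VADriverContextP ctx)
{
    /* tear down the CUDA/NVDEC state hung off ctx->pDriverData */
    return VA_STATUS_SUCCESS;
}

/* libva looks up this symbol (name depends on the VA-API version) after
 * dlopen()ing the driver named by LIBVA_DRIVER_NAME. */
VAStatus __vaDriverInit_1_0(VADriverContextP ctx)
{
    ctx->str_vendor   = "NVDEC VA-API shim (sketch)";
    ctx->max_profiles = 1;     /* e.g. start with a single codec */

    ctx->vtable->vaTerminate = shim_terminate;
    /* ...vaCreateConfig, vaCreateSurfaces2, vaBeginPicture, vaRenderPicture,
       vaEndPicture and friends would each map onto CUVID/NVDEC calls... */
    return VA_STATUS_SUCCESS;
}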

VDPAU is basically dead.

Unfortunately, since there are no Linux apps that support the NVDEC API, this is just angels dancing on the head of a pin. If you want to use a Linux app like vlc or mplayer with HW acceleration today, then VDPAU and VA-API are the only game in town. And even they may be problematic.

Works great in mpv (I know, I wrote it). No one should be using mplayer in this day and age.