Video Transcoding using multiple GPUs (32 live streaming jobs)

Hello everyone.

I’m having problems with ffmpeg transcoding.

I have the following configuration:

  • 1 Quadro RTX4000
  • 6 Quadro M4000
  • 32 GB RAM
  • Intel i9-9900X (10 cores, 20 threads)
  • Motherboard Asus WS SAGE X299
  • 2x250GB SSD RAID0
  • I know that one Quadro M4000 is capable of transcoding 8 jobs with the following characteristics:

    Video Input:
    1080p, 24 fps, h264

    Video Output:
    720p, 24 fps, h264
    480p, 24 fps, h264
    360p, 24 fps, h264
    180p, 24 fps, h264

    Live streaming Input (Ethernet) → Transcoding → Live streaming output (localhost) (Multi resolution)

    These jobs are live streams, which means I need to maintain 1x (real-time) speed.

    I’m using the following command:

    ffmpeg -stream_loop -1 -hwaccel_device 0 -hwaccel cuda -hwaccel_output_format cuda \
    -ignore_unknown -threads 1 -re -i 'http://dash.akamaized.net/dash264/TestCasesHD/2b/qualcomm/2/MultiRes.mpd' \
    -filter_complex '[0:v:3]yadif_cuda,scale_cuda=1280:720,split=4[720p][v1][v2][v3];[v1]scale_cuda=284:180[180p];[v2]scale_cuda=640:360[360p];[v3]scale_cuda=640:480[480p]' \
    -c:v h264_nvenc -map '[180p]' -b:v 256k -maxrate 256k -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
    -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:10000?pkt_size=1316' \
    -c:v h264_nvenc -map '[360p]' -b:v 1228800 -maxrate 1228800 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
    -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:20000?pkt_size=1316' \
    -c:v h264_nvenc -map '[480p]' -b:v 2048000 -maxrate 2048000 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
    -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:30000?pkt_size=1316' \
    -c:v h264_nvenc -map '[720p]' -b:v 3072000 -maxrate 3072000 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
    -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:40000?pkt_size=1316'
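
    To scale this out, I launch roughly one ffmpeg process like the above per job and pin it to a GPU with -hwaccel_device. Below is a simplified launcher sketch: single 720p rendition only, the test MPD as a stand-in for the real live inputs, and a made-up port scheme, so treat it as an illustration rather than the exact production script:

    #!/bin/bash
    # Simplified launcher sketch: 8 jobs per GPU, each pinned with -hwaccel_device.
    # Single rendition and placeholder input/ports; the real jobs use the full
    # 4-rendition command shown above.
    INPUT='http://dash.akamaized.net/dash264/TestCasesHD/2b/qualcomm/2/MultiRes.mpd'
    for GPU in 0 1 2 3 4 5 6; do
      for JOB in 0 1 2 3 4 5 6 7; do
        PORT=$((10000 + GPU * 1000 + JOB * 10))
        ffmpeg -loglevel error -stream_loop -1 \
          -hwaccel_device "$GPU" -hwaccel cuda -hwaccel_output_format cuda \
          -re -i "$INPUT" \
          -vf 'yadif_cuda,scale_cuda=1280:720' \
          -c:v h264_nvenc -b:v 3072000 -maxrate 3072000 -g 90 -keyint_min 30 -r 24 \
          -c:a copy \
          -f mpegts "udp://127.0.0.1:${PORT}?pkt_size=1316" &
      done
    done
    wait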
    

    I want to transcode 56 of these jobs (7 GPUs with 8 jobs each), but today I'm only able to transcode 32 jobs (4 GPUs with 8 jobs each). If I launch another job, the speed of all jobs starts to drop below 1x.

    With 32 jobs, the CPU load average is very high, but CPU utilization is only about 20%. RAM bandwidth is at 4% of its
    capacity, and I'm not writing to the SSDs.
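
    (For reference, this is roughly how the load can be watched while adding jobs; the tools and intervals are just examples:)

    uptime                        # load average
    mpstat -P ALL 5               # per-core CPU utilization (sysstat)
    nvidia-smi dmon -s u -d 5     # per-GPU SM / memory / encoder / decoder utilization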

    I have run several analyses with VTune, and the results say I have problems in both the front end and the back end, but I'm not sure how to interpret them. I think the nature of the jobs (live streaming) produces cache misses and branch mispredictions, resulting in stalls. VTune also reports that CPI is too high (>2.5) and that instruction retiring accounts for approximately 15% of clock ticks.

    This is an image from VTune:
    https://drive.google.com/file/d/1naqotnGl1pr8osi9p5mLcSH5dGoVF6ZS/view?usp=sharing
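
    (If anyone wants to reproduce this kind of profile, the microarchitecture analysis can be collected roughly like this; analysis and option names vary between VTune versions, so treat it as an example:)

    # Attach to one of the running ffmpeg processes for 60 s, then print the summary.
    # 'uarch-exploration' is the microarchitecture analysis in recent VTune releases.
    vtune -collect uarch-exploration -target-pid "$(pgrep -o ffmpeg)" -duration 60 -result-dir r_32jobs
    vtune -report summary -result-dir r_32jobs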

    Does anybody have a similar configuration? What do you recommend to improve performance? Do you think a dual-socket server could help?

    This is the output of nvidia-smi topo --matrix

    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    CPU Affinity
    GPU0     X      PIX     PIX     PIX     SYS     SYS     SYS     0-19
    GPU1    PIX      X      PIX     PIX     SYS     SYS     SYS     0-19
    GPU2    PIX     PIX      X      PIX     SYS     SYS     SYS     0-19
    GPU3    PIX     PIX     PIX      X      SYS     SYS     SYS     0-19
    GPU4    SYS     SYS     SYS     SYS      X      PIX     PIX     0-19
    GPU5    SYS     SYS     SYS     SYS     PIX      X      PIX     0-19
    GPU6    SYS     SYS     SYS     SYS     PIX     PIX      X      0-19
    
    Legend:
    
      X    = Self
      SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
      NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
      PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
      PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
      PIX  = Connection traversing a single PCIe switch
      NV#  = Connection traversing a bonded set of # NVLinks
    

    Your issue is probably related to my findings here:
    https://devtalk.nvidia.com/default/topic/1049717/video-codec-and-optical-flow-sdk/performance-limit-at-around-2500-fps-/

    You will need to apply the patch I posted there, recompile ffmpeg, and test whether it also fixes your issue.
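
    Roughly, the rebuild looks like this (the patch file name below is a placeholder for the patch from the linked thread, and the configure flags are the usual NVENC/CUDA ones; adjust paths for your setup):

    # Install the NVIDIA codec headers that ffmpeg builds against.
    git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
    make -C nv-codec-headers install

    git clone https://git.ffmpeg.org/ffmpeg.git
    cd ffmpeg
    git apply ../nvdec-unstable-fix.patch   # placeholder name for the patch from the thread above
    ./configure --enable-nonfree --enable-cuda-nvcc --enable-libnpp \
                --extra-cflags=-I/usr/local/cuda/include \
                --extra-ldflags=-L/usr/local/cuda/lib64
    make -j"$(nproc)"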

    Hello malakudi,

    Thanks for your comment. I'm going to patch ffmpeg and report back with the results.

    Hello malakudi,

    I have applied your patch and it is working.
    I'm now transcoding 64 jobs across seven GPUs (RTX 4000: 16 jobs; each M4000: 8 jobs). Have you found an explanation for why this patch helps?

    If you want, follow up on the open FFmpeg ticket #7674 ("ffmpeg with cuvid transcoding after version 3.4.1 work unstable on heavy load CUDA card") with your use case and confirm that the fix I posted works for your case too.

    My understanding of the code is not deep enough to explain why it affects performance, or why commenting it out brings performance back. I found it by cherry-picking commits.
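
    (For anyone hunting a similar regression, git bisect automates that kind of commit search; a sketch, where the test script is something you would write yourself:)

    git bisect start
    git bisect bad HEAD                 # current tree shows the slowdown
    git bisect good n3.4.1              # last release that behaved well
    git bisect run ./check_speed.sh     # hypothetical script: rebuild, launch N jobs, exit 0 if they hold 1x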