Hello everyone.
I’m having problems with ffmpeg transcoding.
I have the following configuration:
I know that one Quadro M4000 is capable of transcoding 8 jobs with the following characteristics:
Video Input:
1080p, 24 fps, h264
Video Output:
720p, 24 fps, h264
480p, 24 fps, h264
360p, 24 fps, h264
180p, 24 fps, h264
Live streaming input (Ethernet) → Transcoding → Live streaming output (localhost, multi-resolution)
These jobs are live streams, which means I need to sustain at least 1x transcoding speed.
I’m using the following command:
ffmpeg -stream_loop -1 -hwaccel_device 0 -hwaccel cuda -hwaccel_output_format cuda \
  -ignore_unknown -threads 1 -re -i 'http://dash.akamaized.net/dash264/TestCasesHD/2b/qualcomm/2/MultiRes.mpd' \
  -filter_complex '[0:v:3]yadif_cuda,scale_cuda=1280:720,split=4[720p][v1][v2][v3];[v1]scale_cuda=284:180[180p];[v2]scale_cuda=640:360[360p];[v3]scale_cuda=640:480[480p]' \
  -c:v h264_nvenc -map '[180p]' -b:v 256k -maxrate 256k -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
  -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:10000?pkt_size=1316' \
  -c:v h264_nvenc -map '[360p]' -b:v 1228800 -maxrate 1228800 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
  -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:20000?pkt_size=1316' \
  -c:v h264_nvenc -map '[480p]' -b:v 2048000 -maxrate 2048000 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
  -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:30000?pkt_size=1316' \
  -c:v h264_nvenc -map '[720p]' -b:v 3072000 -maxrate 3072000 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
  -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:40000?pkt_size=1316'
(Note: I keep the whole -filter_complex argument on one line now, because a backslash line break inside the single-quoted filter string was being passed through to the filter parser.)
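For reference, this is roughly how I launch the jobs across the GPUs. It is a simplified sketch: the real script uses our actual stream URLs and the per-rendition ports shown above, while here I only print one placeholder command per job and use a simplified port numbering.

```shell
#!/bin/sh
# Simplified launcher sketch: 7 GPUs x 8 jobs = 56 jobs.
# Prints the commands instead of running them; URL/ports are placeholders.
NUM_GPUS=7
JOBS_PER_GPU=8
count=0
for gpu in $(seq 0 $((NUM_GPUS - 1))); do
  for job in $(seq 1 $JOBS_PER_GPU); do
    # Each job gets its own base port so the mpegts outputs don't collide
    # (simplified numbering; the real script spaces the 4 renditions apart).
    port=$((10000 + (gpu * JOBS_PER_GPU + job - 1) * 100))
    echo "ffmpeg -hwaccel_device $gpu -hwaccel cuda ... -f mpegts udp://127.0.0.1:$port"
    count=$((count + 1))
  done
done
```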
I want to transcode 56 of these jobs (7 GPUs with 8 jobs each), but today I am only able to run 32 jobs (4 GPUs with 8 jobs each). If I launch one more job, the transcoding speed starts to drop below 1x.
With 32 jobs, the CPU load average is very high even though CPU utilization is only about 20%. RAM bandwidth is at 4% of its capacity, and I am not writing to the SSD.
I have run several analyses with VTune, and the results point to front-end and back-end stalls, but I am not sure how to interpret them. I think the nature of the jobs (live streaming) produces cache misses and branch mispredictions, resulting in stalls. VTune also reports that CPI is too high (>2.5) and that instruction retiring accounts for only about 15% of clock ticks.
This is an image from VTune:
https://drive.google.com/file/d/1naqotnGl1pr8osi9p5mLcSH5dGoVF6ZS/view?usp=sharing
Does anybody have a similar configuration? What do you recommend to improve performance? Do you think a dual-socket server would improve performance?
This is the output of nvidia-smi topo --matrix:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 CPU Affinity
GPU0 X PIX PIX PIX SYS SYS SYS 0-19
GPU1 PIX X PIX PIX SYS SYS SYS 0-19
GPU2 PIX PIX X PIX SYS SYS SYS 0-19
GPU3 PIX PIX PIX X SYS SYS SYS 0-19
GPU4 SYS SYS SYS SYS X PIX PIX 0-19
GPU5 SYS SYS SYS SYS PIX X PIX 0-19
GPU6 SYS SYS SYS SYS PIX PIX X 0-19
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
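Based on that matrix, my assumption is that GPUs 0-3 sit under one socket's host bridge and GPUs 4-6 under the other (SYS between the two groups), even though the reported CPU affinity is 0-19 for everything. Would pinning each job to its GPU's local NUMA node help? Something like this sketch (node numbers are my assumption, and it only prints the command):

```shell
#!/bin/sh
# Sketch (untested): pin each ffmpeg job to the NUMA node local to its GPU.
# Assumption from the topo matrix: GPUs 0-3 are on node 0, GPUs 4-6 on node 1.
gpu=5                                            # example GPU index
if [ "$gpu" -le 3 ]; then node=0; else node=1; fi
# Print the pinned command instead of running it:
echo "numactl --cpunodebind=$node --membind=$node ffmpeg -hwaccel_device $gpu -hwaccel cuda ..."
```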