Multiple cameras cause running slow of API
We built a object detection system of 4 cameras with TX1. If we run application with only 1 camera active and detection. the whole loop of detection can be run 21 times in 1 second. When four cameras are streaming and only one of them are used for detection algorithm, the detection can only be done 11 times in 1 second. I tried nvprof. Below is result of 1 camera streaming: [code] ==7317== Profiling application: ./detection -c1 ==7317== Profiling result: Time(%) Time Calls Avg Min Max Name 21.90% 33.1770s 3747 8.8543ms 7.4923ms 14.441ms void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=7, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int) 15.23% 23.0770s 26229 879.83us 70.940us 2.1807ms maxwell_sgemm_128x64_raggedMn_nn 12.04% 18.2380s 33723 540.82us 59.430us 3.5146ms im2col_gpu_kernel(int, float const *, int, int, int, int, int, int, int, float*) 7.26% 10.9959s 93676 117.38us 1.8750us 1.9100ms fill_kernel(int, float, float*, int) 7.07% 10.7105s 29976 357.30us 28.282us 3.2339ms normalize_kernel(int, float*, float*, float*, int, int, int) 6.70% 10.1537s 71193 142.62us 1.6660us 2.0321ms activate_array_kernel(float*, int, ACTIVATION) 6.08% 9.21664s 33723 273.30us 5.0520us 2.0859ms add_bias_kernel(float*, float*, int, int, int) 5.77% 8.74433s 3747 2.3337ms 2.0663ms 3.9664ms void magma_lds128_sgemm_kernel<bool=0, bool=0, int=6, int=5, int=3, int=3, int=3>(int, int, int, float const *, int, float const *, int, float*, int, int, int, float const *, float const *, float, float, int) 5.39% 8.16327s 33723 242.07us 2.5000us 2.1818ms copy_kernel(int, float*, int, int, float*, int, int) 5.36% 8.11565s 29976 270.74us 28.908us 2.4257ms scale_bias_kernel(float*, float*, int, int) 4.91% 7.43280s 22482 330.61us 57.034us 1.7648ms forward_maxpool_layer_kernel(int, int, int, int, int, int, int, float*, float*, int*) 1.25% 1.89259s 3747 505.09us 353.56us 1.0201ms convertIntToFloatKernelShow(unsigned __int64, int, int, void*, int, char*, int) 1.04% 1.57767s 11451 137.78us 207ns 39.722ms [CUDA memcpy HtoH] 0.02% 29.780ms 3747 7.9470us 7.2930us 12.344us softmax_kernel(float*, int, int, int, int, int, int, float, float*) 0.00% 1.1460us 1 1.1460us 1.1460us 1.1460us [CUDA memcpy HtoD] ==7317== API calls: Time(%) Time Calls Avg Min Max Name 77.96% 124.254s 11452 10.850ms 37.033us 241.49ms cudaMemcpy 15.53% 24.7526s 389689 63.518us 29.950us 14.282ms cudaLaunch 1.66% 2.64968s 3747 707.15us 47.606us 7.9432ms cudaStreamSynchronize 1.15% 1.83160s 2577940 710ns 416ns 2.7011ms cudaSetupArgument 1.00% 1.59649s 3748 425.96us 12.032us 1.46191s cudaFree 0.87% 1.38763s 3747 370.33us 156.93us 5.0750ms cuGraphicsEGLRegisterImage 0.30% 482.49ms 3747 128.77us 56.618us 2.9200ms cuGraphicsUnregisterResource 0.27% 423.85ms 389689 1.0870us 469ns 3.0148ms cudaConfigureCall 0.24% 374.82ms 435021 861ns 468ns 2.6683ms cudaGetLastError 0.22% 357.68ms 2 178.84ms 160.33ms 197.35ms cuCtxCreate 0.21% 341.01ms 318496 1.0700us 572ns 2.1004ms cudaPeekAtLastError 0.13% 211.42ms 7494 28.211us 7.7090us 2.1873ms cudaBindTexture 0.12% 188.53ms 157 1.2008ms 54.743us 78.482ms cudaMallocManaged 0.09% 144.28ms 33724 4.2780us 1.6150us 2.7024ms cudaGetDevice 0.07% 111.71ms 1 111.71ms 111.71ms 111.71ms cuCtxDestroy 0.05% 82.356ms 3747 21.979us 11.511us 1.5091ms cudaStreamCreate 0.04% 67.692ms 3747 18.065us 8.3340us 2.3707ms cudaStreamDestroy 0.03% 45.516ms 7494 6.0730us 1.7190us 1.7276ms cudaUnbindTexture 0.02% 31.411ms 3749 8.3780us 4.2710us 1.0104ms cudaSetDevice 0.01% 23.511ms 2 11.755ms 562.53us 22.948ms cudaMallocHost 0.01% 12.465ms 3747 3.3260us 1.6660us 560.76us cuEGLStreamProducerPresentDevicePtr 0.00% 7.5535ms 1 7.5535ms 7.5535ms 7.5535ms cudaDeviceSynchronize 0.00% 1.5153ms 2 757.64us 595.29us 919.99us cudaFreeHost 0.00% 1.1524ms 3 384.14us 41.565us 681.65us cudaMalloc 0.00% 266.52us 261 1.0210us 364ns 53.649us cuDeviceGetAttribute 0.00% 50.316us 16 3.1440us 1.8230us 14.272us cudaEventCreateWithFlags 0.00% 45.993us 3 15.331us 8.4900us 27.137us cuDeviceTotalMem 0.00% 42.086us 1 42.086us 42.086us 42.086us cudaGetDeviceProperties 0.00% 16.044us 11 1.4580us 938ns 5.2610us cudaDeviceGetAttribute 0.00% 14.064us 4 3.5160us 1.7190us 6.0940us cuInit 0.00% 13.751us 7 1.9640us 677ns 6.3030us cuDeviceGetCount 0.00% 10.417us 1 10.417us 10.417us 10.417us cudaSetDeviceFlags 0.00% 7.3440us 4 1.8360us 1.0420us 2.6040us cuDeviceGetName 0.00% 6.9280us 7 989ns 573ns 1.8230us cuDeviceGet 0.00% 6.3550us 1 6.3550us 6.3550us 6.3550us cudaGetDeviceCount 0.00% 5.5210us 3 1.8400us 885ns 3.3860us cuDriverGetVersion 0.00% 3.2810us 2 1.6400us 1.3020us 1.9790us cuCtxSetCurrent [/code] Here is 4 cameras streaming, only 1 detecting: [code] ==7615== Profiling application: ./detection -c0 -c1 -c2 -c3 ==7615== Profiling result: Time(%) Time Calls Avg Min Max Name 22.71% 42.5082s 2695 15.773ms 7.6831ms 18.547ms void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=7, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int) 13.19% 24.6859s 18865 1.3086ms 73.074us 5.0093ms maxwell_sgemm_128x64_raggedMn_nn 12.61% 23.6057s 24255 973.23us 59.064us 6.5400ms im2col_gpu_kernel(int, float const *, int, int, int, int, int, int, int, float*) 7.13% 13.3473s 67376 198.10us 1.9260us 4.4152ms fill_kernel(int, float, float*, int) 6.50% 12.1739s 24255 501.91us 2.3440us 5.0512ms copy_kernel(int, float*, int, int, float*, int, int) 6.44% 12.0468s 2695 4.4701ms 2.0662ms 7.6630ms void magma_lds128_sgemm_kernel<bool=0, bool=0, int=6, int=5, int=3, int=3, int=3>(int, int, int, float const *, int, float const *, int, float*, int, int, int, float const *, float const *, float, float, int) 6.25% 11.7044s 21560 542.88us 28.593us 5.5883ms normalize_kernel(int, float*, float*, float*, int, int, int) 6.09% 11.3933s 21560 528.45us 29.062us 4.7549ms scale_bias_kernel(float*, float*, int, int) 6.03% 11.2827s 51205 220.34us 1.6660us 4.8346ms activate_array_kernel(float*, int, ACTIVATION) 5.95% 11.1430s 24255 459.41us 5.4690us 4.8695ms add_bias_kernel(float*, float*, int, int, int) 4.89% 9.16334s 16170 566.69us 58.492us 3.7779ms forward_maxpool_layer_kernel(int, int, int, int, int, int, int, float*, float*, int*) 1.30% 2.43517s 2695 903.59us 353.23us 3.2210ms convertIntToFloatKernelShow(unsigned __int64, int, int, void*, int, char*, int) 0.90% 1.67766s 8295 202.25us 208ns 39.597ms [CUDA memcpy HtoH] 0.02% 30.554ms 2695 11.337us 7.6560us 17.345us softmax_kernel(float*, int, int, int, int, int, int, float, float*) 0.00% 1.3550us 1 1.3550us 1.3550us 1.3550us [CUDA memcpy HtoD] ==7615== API calls: Time(%) Time Calls Avg Min Max Name 74.58% 149.874s 8296 18.066ms 37.501us 240.87ms cudaMemcpy 18.24% 36.6611s 280281 130.80us 30.209us 33.917ms cudaLaunch 2.29% 4.59812s 2695 1.7062ms 58.699us 15.252ms cudaStreamSynchronize 1.14% 2.29234s 2695 850.59us 155.37us 24.756ms cuGraphicsEGLRegisterImage 0.94% 1.88761s 1854164 1.0180us 417ns 9.6395ms cudaSetupArgument 0.85% 1.71233s 2696 635.14us 12.188us 1.54340s cudaFree 0.39% 776.94ms 2695 288.29us 58.439us 13.236ms cuGraphicsUnregisterResource 0.28% 555.74ms 280281 1.9820us 469ns 9.6958ms cudaConfigureCall 0.22% 444.38ms 312989 1.4190us 468ns 5.6845ms cudaGetLastError 0.21% 421.82ms 229076 1.8410us 572ns 5.0321ms cudaPeekAtLastError 0.21% 417.74ms 2 208.87ms 157.93ms 259.80ms cuCtxCreate 0.14% 271.40ms 5390 50.352us 8.6460us 9.5489ms cudaBindTexture 0.11% 217.05ms 24256 8.9480us 1.7180us 4.3617ms cudaGetDevice 0.09% 173.35ms 157 1.1042ms 53.074us 64.627ms cudaMallocManaged 0.08% 152.08ms 8 19.010ms 70.574us 73.692ms cudaMallocHost 0.07% 131.61ms 2695 48.833us 11.979us 8.6015ms cudaStreamCreate 0.06% 119.68ms 2695 44.407us 8.4890us 6.9904ms cudaStreamDestroy 0.06% 115.08ms 1 115.08ms 115.08ms 115.08ms cuCtxDestroy 0.03% 58.050ms 5390 10.770us 1.7180us 4.1501ms cudaUnbindTexture 0.03% 52.572ms 2697 19.492us 4.1150us 5.7802ms cudaSetDevice 0.01% 15.497ms 2695 5.7500us 1.6660us 2.8814ms cuEGLStreamProducerPresentDevicePtr 0.00% 7.6157ms 8 951.96us 317.24us 1.7291ms cudaFreeHost 0.00% 7.4092ms 1 7.4092ms 7.4092ms 7.4092ms cudaDeviceSynchronize 0.00% 918.92us 3 306.31us 42.293us 481.52us cudaMalloc 0.00% 297.92us 261 1.1410us 364ns 50.886us cuDeviceGetAttribute 0.00% 49.060us 16 3.0660us 1.8750us 14.791us cudaEventCreateWithFlags 0.00% 43.021us 3 14.340us 8.9580us 24.323us cuDeviceTotalMem 0.00% 40.887us 1 40.887us 40.887us 40.887us cudaGetDeviceProperties 0.00% 16.249us 11 1.4770us 937ns 5.4160us cudaDeviceGetAttribute 0.00% 14.323us 4 3.5800us 2.5520us 6.1980us cuInit 0.00% 12.448us 7 1.7780us 677ns 5.4690us cuDeviceGetCount 0.00% 8.0720us 4 2.0180us 1.4580us 2.3440us cuDeviceGetName 0.00% 7.4480us 1 7.4480us 7.4480us 7.4480us cudaSetDeviceFlags 0.00% 7.1360us 1 7.1360us 7.1360us 7.1360us cudaGetDeviceCount 0.00% 6.3030us 7 900ns 573ns 1.3030us cuDeviceGet 0.00% 5.8850us 3 1.9610us 937ns 3.3850us cuDriverGetVersion 0.00% 2.7610us 2 1.3800us 1.3550us 1.4060us cuCtxSetCurrent [/code] We can see the Avg cudaMemcpy with 1 cam activate is 10ms and 4cam activate is 18ms. I thought maybe it cause by initial phase, so I use visual profile to check cost of each loop. 1 cam streaming and detecting: https://pasteboard.co/GZP5oXG.png 4 cam streaming and only 1 detecting: https://pasteboard.co/GZOS2au.png From the result we can see the cost of cudaMemcpy in each loop increase from 33ms to 53ms. My question is Why multi streaming cameras cause API slow?
We built a object detection system of 4 cameras with TX1.

If we run application with only 1 camera active and detection. the whole loop of detection can be run 21 times in 1 second.

When four cameras are streaming and only one of them are used for detection algorithm, the detection can only be done 11 times in 1 second.

I tried nvprof. Below is result of 1 camera streaming:

==7317== Profiling application: ./detection -c1
==7317== Profiling result:
Time(%) Time Calls Avg Min Max Name
21.90% 33.1770s 3747 8.8543ms 7.4923ms 14.441ms void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=7, int=4>(cublasGemmSmallNParams<float, float, float>,
float const *, float const *, float, float, int)
15.23% 23.0770s 26229 879.83us 70.940us 2.1807ms maxwell_sgemm_128x64_raggedMn_nn
12.04% 18.2380s 33723 540.82us 59.430us 3.5146ms im2col_gpu_kernel(int, float const *, int, int, int, int, int, int, int, float*)
7.26% 10.9959s 93676 117.38us 1.8750us 1.9100ms fill_kernel(int, float, float*, int)
7.07% 10.7105s 29976 357.30us 28.282us 3.2339ms normalize_kernel(int, float*, float*, float*, int, int, int)
6.70% 10.1537s 71193 142.62us 1.6660us 2.0321ms activate_array_kernel(float*, int, ACTIVATION)
6.08% 9.21664s 33723 273.30us 5.0520us 2.0859ms add_bias_kernel(float*, float*, int, int, int)
5.77% 8.74433s 3747 2.3337ms 2.0663ms 3.9664ms void magma_lds128_sgemm_kernel<bool=0, bool=0, int=6, int=5, int=3, int=3, int=3>(int, int, int, float const *, int, float const *,
int, float*, int, int, int, float const *, float const *, float, float, int)
5.39% 8.16327s 33723 242.07us 2.5000us 2.1818ms copy_kernel(int, float*, int, int, float*, int, int)
5.36% 8.11565s 29976 270.74us 28.908us 2.4257ms scale_bias_kernel(float*, float*, int, int)
4.91% 7.43280s 22482 330.61us 57.034us 1.7648ms forward_maxpool_layer_kernel(int, int, int, int, int, int, int, float*, float*, int*)
1.25% 1.89259s 3747 505.09us 353.56us 1.0201ms convertIntToFloatKernelShow(unsigned __int64, int, int, void*, int, char*, int)
1.04% 1.57767s 11451 137.78us 207ns 39.722ms [CUDA memcpy HtoH]
0.02% 29.780ms 3747 7.9470us 7.2930us 12.344us softmax_kernel(float*, int, int, int, int, int, int, float, float*)
0.00% 1.1460us 1 1.1460us 1.1460us 1.1460us [CUDA memcpy HtoD]

==7317== API calls:
Time(%) Time Calls Avg Min Max Name
77.96% 124.254s 11452 10.850ms 37.033us 241.49ms cudaMemcpy
15.53% 24.7526s 389689 63.518us 29.950us 14.282ms cudaLaunch
1.66% 2.64968s 3747 707.15us 47.606us 7.9432ms cudaStreamSynchronize
1.15% 1.83160s 2577940 710ns 416ns 2.7011ms cudaSetupArgument
1.00% 1.59649s 3748 425.96us 12.032us 1.46191s cudaFree
0.87% 1.38763s 3747 370.33us 156.93us 5.0750ms cuGraphicsEGLRegisterImage
0.30% 482.49ms 3747 128.77us 56.618us 2.9200ms cuGraphicsUnregisterResource
0.27% 423.85ms 389689 1.0870us 469ns 3.0148ms cudaConfigureCall
0.24% 374.82ms 435021 861ns 468ns 2.6683ms cudaGetLastError
0.22% 357.68ms 2 178.84ms 160.33ms 197.35ms cuCtxCreate
0.21% 341.01ms 318496 1.0700us 572ns 2.1004ms cudaPeekAtLastError
0.13% 211.42ms 7494 28.211us 7.7090us 2.1873ms cudaBindTexture
0.12% 188.53ms 157 1.2008ms 54.743us 78.482ms cudaMallocManaged
0.09% 144.28ms 33724 4.2780us 1.6150us 2.7024ms cudaGetDevice
0.07% 111.71ms 1 111.71ms 111.71ms 111.71ms cuCtxDestroy
0.05% 82.356ms 3747 21.979us 11.511us 1.5091ms cudaStreamCreate
0.04% 67.692ms 3747 18.065us 8.3340us 2.3707ms cudaStreamDestroy
0.03% 45.516ms 7494 6.0730us 1.7190us 1.7276ms cudaUnbindTexture
0.02% 31.411ms 3749 8.3780us 4.2710us 1.0104ms cudaSetDevice
0.01% 23.511ms 2 11.755ms 562.53us 22.948ms cudaMallocHost
0.01% 12.465ms 3747 3.3260us 1.6660us 560.76us cuEGLStreamProducerPresentDevicePtr
0.00% 7.5535ms 1 7.5535ms 7.5535ms 7.5535ms cudaDeviceSynchronize
0.00% 1.5153ms 2 757.64us 595.29us 919.99us cudaFreeHost
0.00% 1.1524ms 3 384.14us 41.565us 681.65us cudaMalloc
0.00% 266.52us 261 1.0210us 364ns 53.649us cuDeviceGetAttribute
0.00% 50.316us 16 3.1440us 1.8230us 14.272us cudaEventCreateWithFlags
0.00% 45.993us 3 15.331us 8.4900us 27.137us cuDeviceTotalMem
0.00% 42.086us 1 42.086us 42.086us 42.086us cudaGetDeviceProperties
0.00% 16.044us 11 1.4580us 938ns 5.2610us cudaDeviceGetAttribute
0.00% 14.064us 4 3.5160us 1.7190us 6.0940us cuInit
0.00% 13.751us 7 1.9640us 677ns 6.3030us cuDeviceGetCount
0.00% 10.417us 1 10.417us 10.417us 10.417us cudaSetDeviceFlags
0.00% 7.3440us 4 1.8360us 1.0420us 2.6040us cuDeviceGetName
0.00% 6.9280us 7 989ns 573ns 1.8230us cuDeviceGet
0.00% 6.3550us 1 6.3550us 6.3550us 6.3550us cudaGetDeviceCount
0.00% 5.5210us 3 1.8400us 885ns 3.3860us cuDriverGetVersion
0.00% 3.2810us 2 1.6400us 1.3020us 1.9790us cuCtxSetCurrent


Here is 4 cameras streaming, only 1 detecting:

==7615== Profiling application: ./detection -c0 -c1 -c2 -c3
==7615== Profiling result:
Time(%) Time Calls Avg Min Max Name
22.71% 42.5082s 2695 15.773ms 7.6831ms 18.547ms void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=7, int=4>(cublasGemmSmallNParams<float, float, float>,
float const *, float const *, float, float, int)
13.19% 24.6859s 18865 1.3086ms 73.074us 5.0093ms maxwell_sgemm_128x64_raggedMn_nn
12.61% 23.6057s 24255 973.23us 59.064us 6.5400ms im2col_gpu_kernel(int, float const *, int, int, int, int, int, int, int, float*)
7.13% 13.3473s 67376 198.10us 1.9260us 4.4152ms fill_kernel(int, float, float*, int)
6.50% 12.1739s 24255 501.91us 2.3440us 5.0512ms copy_kernel(int, float*, int, int, float*, int, int)
6.44% 12.0468s 2695 4.4701ms 2.0662ms 7.6630ms void magma_lds128_sgemm_kernel<bool=0, bool=0, int=6, int=5, int=3, int=3, int=3>(int, int, int, float const *, int, float const *,
int, float*, int, int, int, float const *, float const *, float, float, int)
6.25% 11.7044s 21560 542.88us 28.593us 5.5883ms normalize_kernel(int, float*, float*, float*, int, int, int)
6.09% 11.3933s 21560 528.45us 29.062us 4.7549ms scale_bias_kernel(float*, float*, int, int)
6.03% 11.2827s 51205 220.34us 1.6660us 4.8346ms activate_array_kernel(float*, int, ACTIVATION)
5.95% 11.1430s 24255 459.41us 5.4690us 4.8695ms add_bias_kernel(float*, float*, int, int, int)
4.89% 9.16334s 16170 566.69us 58.492us 3.7779ms forward_maxpool_layer_kernel(int, int, int, int, int, int, int, float*, float*, int*)
1.30% 2.43517s 2695 903.59us 353.23us 3.2210ms convertIntToFloatKernelShow(unsigned __int64, int, int, void*, int, char*, int)
0.90% 1.67766s 8295 202.25us 208ns 39.597ms [CUDA memcpy HtoH]
0.02% 30.554ms 2695 11.337us 7.6560us 17.345us softmax_kernel(float*, int, int, int, int, int, int, float, float*)
0.00% 1.3550us 1 1.3550us 1.3550us 1.3550us [CUDA memcpy HtoD]

==7615== API calls:
Time(%) Time Calls Avg Min Max Name
74.58% 149.874s 8296 18.066ms 37.501us 240.87ms cudaMemcpy
18.24% 36.6611s 280281 130.80us 30.209us 33.917ms cudaLaunch
2.29% 4.59812s 2695 1.7062ms 58.699us 15.252ms cudaStreamSynchronize
1.14% 2.29234s 2695 850.59us 155.37us 24.756ms cuGraphicsEGLRegisterImage
0.94% 1.88761s 1854164 1.0180us 417ns 9.6395ms cudaSetupArgument
0.85% 1.71233s 2696 635.14us 12.188us 1.54340s cudaFree
0.39% 776.94ms 2695 288.29us 58.439us 13.236ms cuGraphicsUnregisterResource
0.28% 555.74ms 280281 1.9820us 469ns 9.6958ms cudaConfigureCall
0.22% 444.38ms 312989 1.4190us 468ns 5.6845ms cudaGetLastError
0.21% 421.82ms 229076 1.8410us 572ns 5.0321ms cudaPeekAtLastError
0.21% 417.74ms 2 208.87ms 157.93ms 259.80ms cuCtxCreate
0.14% 271.40ms 5390 50.352us 8.6460us 9.5489ms cudaBindTexture
0.11% 217.05ms 24256 8.9480us 1.7180us 4.3617ms cudaGetDevice
0.09% 173.35ms 157 1.1042ms 53.074us 64.627ms cudaMallocManaged
0.08% 152.08ms 8 19.010ms 70.574us 73.692ms cudaMallocHost
0.07% 131.61ms 2695 48.833us 11.979us 8.6015ms cudaStreamCreate
0.06% 119.68ms 2695 44.407us 8.4890us 6.9904ms cudaStreamDestroy
0.06% 115.08ms 1 115.08ms 115.08ms 115.08ms cuCtxDestroy
0.03% 58.050ms 5390 10.770us 1.7180us 4.1501ms cudaUnbindTexture
0.03% 52.572ms 2697 19.492us 4.1150us 5.7802ms cudaSetDevice
0.01% 15.497ms 2695 5.7500us 1.6660us 2.8814ms cuEGLStreamProducerPresentDevicePtr
0.00% 7.6157ms 8 951.96us 317.24us 1.7291ms cudaFreeHost
0.00% 7.4092ms 1 7.4092ms 7.4092ms 7.4092ms cudaDeviceSynchronize
0.00% 918.92us 3 306.31us 42.293us 481.52us cudaMalloc
0.00% 297.92us 261 1.1410us 364ns 50.886us cuDeviceGetAttribute
0.00% 49.060us 16 3.0660us 1.8750us 14.791us cudaEventCreateWithFlags
0.00% 43.021us 3 14.340us 8.9580us 24.323us cuDeviceTotalMem
0.00% 40.887us 1 40.887us 40.887us 40.887us cudaGetDeviceProperties
0.00% 16.249us 11 1.4770us 937ns 5.4160us cudaDeviceGetAttribute
0.00% 14.323us 4 3.5800us 2.5520us 6.1980us cuInit
0.00% 12.448us 7 1.7780us 677ns 5.4690us cuDeviceGetCount
0.00% 8.0720us 4 2.0180us 1.4580us 2.3440us cuDeviceGetName
0.00% 7.4480us 1 7.4480us 7.4480us 7.4480us cudaSetDeviceFlags
0.00% 7.1360us 1 7.1360us 7.1360us 7.1360us cudaGetDeviceCount
0.00% 6.3030us 7 900ns 573ns 1.3030us cuDeviceGet
0.00% 5.8850us 3 1.9610us 937ns 3.3850us cuDriverGetVersion
0.00% 2.7610us 2 1.3800us 1.3550us 1.4060us cuCtxSetCurrent


We can see the Avg cudaMemcpy with 1 cam activate is 10ms and 4cam activate is 18ms.

I thought maybe it cause by initial phase, so I use visual profile to check cost of each loop.

1 cam streaming and detecting:


https://pasteboard.co/GZP5oXG.png


4 cam streaming and only 1 detecting:


https://pasteboard.co/GZOS2au.png


From the result we can see the cost of cudaMemcpy in each loop increase from 33ms to 53ms.

My question is Why multi streaming cameras cause API slow?

#1
Posted 12/25/2017 09:47 AM   
Hi xxxss, Have you tried to run in max performance? By running 'sudo ./jetson_clocks.sh', the CPU/GPU/EMC frequency keeps at max clocks. Should help to get better performance. Also could you share the full pipeline of your case? You will get YUV420 from Argus. Do you convert it to RGBA via NvVideoConverter or CUDA? For the other three cameras not running object detection, cudaMemcpy is performed? Or you just get frames and do nothing?
Hi xxxss,
Have you tried to run in max performance? By running 'sudo ./jetson_clocks.sh', the CPU/GPU/EMC frequency keeps at max clocks. Should help to get better performance.

Also could you share the full pipeline of your case? You will get YUV420 from Argus. Do you convert it to RGBA via NvVideoConverter or CUDA? For the other three cameras not running object detection, cudaMemcpy is performed? Or you just get frames and do nothing?

#2
Posted 12/26/2017 01:23 AM   
Yes, I have tried jetson_clocks.sh. The result does not change much. The cost difference between 1 cam and 4 cam remain still. For the other three cameras not running object detection, cudaMemcpy is not performed. [code] bool ConsumerThread::converterCapturePlaneDqCallback( struct v4l2_buffer *v4l2_buf, NvBuffer * buffer, NvBuffer * shared_buffer, void *arg) { ConsumerThread *thiz = (ConsumerThread*)arg; if (!v4l2_buf) { REPORT_ERROR("Failed to dequeue buffer from conv capture plane"); thiz->abort(); return false; } if (v4l2_buf->m.planes[0].bytesused == 0) { return false; } pthread_mutex_lock(&ctx.net_Lock); thiz->egl_image = NvEGLImageFromFd(egl_display, buffer->planes[0].fd); if (thiz->egl_image == NULL) printf("egl_image NULL %s%d\n", __func__, __LINE__); mapEGLImage2Float(&thiz->egl_image, p_ctx->net_w, p_ctx->net_h, (char *)thiz->fbuf, thiz->showImg->imageData); NvDestroyEGLImage(egl_display, thiz->egl_image); object_detect_run(mdata, thiz->fbuf); pthread_mutex_unlock(&ctx.net_Lock); e = thiz->m_ImageConverter->capture_plane.qBuffer(*v4l2_buf, NULL); if (e < 0) ORIGINATE_ERROR("qBuffer failed"); return true; } bool ConsumerThread::converterOutputPlaneDqCallback( struct v4l2_buffer *v4l2_buf, NvBuffer * buffer, NvBuffer * shared_buffer, void *arg) { ConsumerThread *thiz = (ConsumerThread*)arg; if (!v4l2_buf) { REPORT_ERROR("Failed to dequeue buffer from conv capture plane"); thiz->abort(); return false; } if (v4l2_buf->m.planes[0].bytesused == 0) { return false; } NvBufferDestroy(shared_buffer->planes[0].fd); //CONSUMER_PRINT("releaseFd %d (%d frames)\n", shared_buffer->planes[0].fd, thiz->m_numPendingFrames); pthread_mutex_lock(&thiz->m_queueLock); thiz->m_numPendingFrames--; thiz->m_ConvOutputPlaneBufQueue->push(buffer); pthread_cond_broadcast(&thiz->m_queueCond); pthread_mutex_unlock(&thiz->m_queueLock); return true; } bool ConsumerThread::createImageConverter() { int ret = 0; char cname[10]; sprintf(cname, "conv%d", cam_idx); // YUV420 --> RGB32 converter m_ImageConverter = NvVideoConverter::createVideoConverter(cname); if (!m_ImageConverter) ORIGINATE_ERROR("Could not create m_ImageConverteroder"); if (DO_STAT) m_ImageConverter->enableProfiling(); m_ImageConverter->capture_plane. setDQThreadCallback(converterCapturePlaneDqCallback); m_ImageConverter->output_plane. setDQThreadCallback(converterOutputPlaneDqCallback); ret = m_ImageConverter->setOutputPlaneFormat(V4L2_PIX_FMT_YUV420M, m_pContext->width, m_pContext->height, V4L2_NV_BUFFER_LAYOUT_BLOCKLINEAR); if (ret < 0) ORIGINATE_ERROR("Could not set output plane format"); ret = m_ImageConverter->setCapturePlaneFormat(V4L2_PIX_FMT_ABGR32, m_pContext->net_w, m_pContext->net_h, V4L2_NV_BUFFER_LAYOUT_PITCH); if (ret < 0) ORIGINATE_ERROR("Could not set capture plane format"); ret = m_ImageConverter->setCropRect(962, 159, 726, 726); if (ret < 0) ORIGINATE_ERROR("Could not set crop rect"); // Query, Export and Map the output plane buffers so that we can read // raw data into the buffers ret = m_ImageConverter->output_plane.setupPlane(V4L2_MEMORY_DMABUF, conv_buf_num, false, false); if (ret < 0) ORIGINATE_ERROR("Could not setup output plane"); // Query, Export and Map the output plane buffers so that we can write // m_ImageConverteroded data from the buffers ret = m_ImageConverter->capture_plane.setupPlane(V4L2_MEMORY_MMAP, conv_buf_num, true, false); if (ret < 0) ORIGINATE_ERROR("Could not setup capture plane"); // Add all empty conv output plane buffers to m_ConvOutputPlaneBufQueue for (uint32_t i = 0; i < m_ImageConverter->output_plane.getNumBuffers(); i++) { m_ConvOutputPlaneBufQueue->push( m_ImageConverter->output_plane.getNthBuffer(i)); } // conv output plane STREAMON ret = m_ImageConverter->output_plane.setStreamStatus(true); if (ret < 0) ORIGINATE_ERROR("fail to set conv output stream on"); // conv capture plane STREAMON ret = m_ImageConverter->capture_plane.setStreamStatus(true); if (ret < 0) ORIGINATE_ERROR("fail to set conv capture stream on"); // Start threads to dequeue buffers on conv capture plane, // conv output plane and capture plane m_ImageConverter->capture_plane.startDQThread(this); m_ImageConverter->output_plane.startDQThread(this); // Enqueue all empty conv capture plane buffers for (uint32_t i = 0; i < m_ImageConverter->capture_plane.getNumBuffers(); i++) { struct v4l2_buffer v4l2_buf; struct v4l2_plane planes[MAX_PLANES]; memset(&v4l2_buf, 0, sizeof(v4l2_buf)); memset(planes, 0, MAX_PLANES * sizeof(struct v4l2_plane)); v4l2_buf.index = i; v4l2_buf.m.planes = planes; ret = m_ImageConverter->capture_plane.qBuffer(v4l2_buf, NULL); if (ret < 0) { abort(); ORIGINATE_ERROR("Error queueing buffer at conv capture plane"); } printf(" i: %d\n", i); } printf("create vidoe converter return true\n"); return true; } /******************************************************************************* * Argus Producer thread: * Opens the Argus camera driver, creates an OutputStream to output to a * FrameConsumer, then performs repeating capture requests for CAPTURE_TIME * seconds before closing the producer and Argus driver. ******************************************************************************/ static void *capture_thread_3(void *arg) { - UniqueObj<CaptureSession> captureSession( iCameraProvider->createCaptureSession(cameraDevices[cam_idx])); ICaptureSession *iCaptureSession = interface_cast<ICaptureSession>(captureSession); if (!iCaptureSession) ORIGINATE_ERROR("Failed to get ICaptureSession interface"); printf("%s, %s, %d\n", __FILE__, __func__, __LINE__); // Create the OutputStream. PRODUCER_PRINT("Creating output stream\n"); UniqueObj<OutputStreamSettings> streamSettings(iCaptureSession->createOutputStreamSettings()); IOutputStreamSettings *iStreamSettings = interface_cast<IOutputStreamSettings>(streamSettings); if (iStreamSettings) { iStreamSettings->setPixelFormat(PIXEL_FMT_YCbCr_420_888); iStreamSettings->setResolution(Size(ctx.width, ctx.height)); // iStreamSettings->setResolution(Size(p_ctx->, 1080)); #ifdef RENDER iStreamSettings->setEGLDisplay(m_renderer->getEGLDisplay()); #endif printf("%s, %s, %d\n", __FILE__, __func__, __LINE__); } else { ORIGINATE_ERROR("NULL for output stream settings!"); } UniqueObj<OutputStream> outputStream(iCaptureSession->createOutputStream(streamSettings.get())); // Launch the FrameConsumer thread to consume frames from the OutputStream. PRODUCER_PRINT("Launching consumer thread\n"); ConsumerThread frameConsumerThread(outputStream.get(), cam_idx); PROPAGATE_ERROR(frameConsumerThread.initialize()); // Wait until the consumer is connected to the stream. PROPAGATE_ERROR(frameConsumerThread.waitRunning()); // Create capture request and enable output stream. UniqueObj<Request> request(iCaptureSession->createRequest()); IRequest *iRequest = interface_cast<IRequest>(request); if (!iRequest) ORIGINATE_ERROR("Failed to create Request"); iRequest->enableOutputStream(outputStream.get()); ISourceSettings *iSourceSettings = interface_cast<ISourceSettings>(iRequest->getSourceSettings()); if (!iSourceSettings) ORIGINATE_ERROR("Failed to get ISourceSettings interface"); iSourceSettings->setFrameDurationRange(Range<uint64_t>(1e9/DEFAULT_FPS)); // Submit capture requests. PRODUCER_PRINT("Starting repeat capture requests.\n"); if (iCaptureSession->repeat(request.get()) != STATUS_OK) ORIGINATE_ERROR("Failed to start repeat capture request"); for (int i = 0; i < CAPTURE_TIME && !frameConsumerThread.isInError(); i++ ) sleep(1); // Stop the repeating request and wait for idle. iCaptureSession->stopRepeat(); iCaptureSession->waitForIdle(); // Destroy the output stream to end the consumer thread. outputStream.reset(); // Wait for the consumer thread to complete. PROPAGATE_ERROR(frameConsumerThread.shutdown()); PRODUCER_PRINT("Done -- exiting.\n"); } int main(int argc, const char *argv[]) { // Create the CameraProvider object and get the core interface. UniqueObj<CameraProvider> cameraProvider = UniqueObj<CameraProvider>(CameraProvider::create()); iCameraProvider = interface_cast<ICameraProvider>(cameraProvider); if (!iCameraProvider) ORIGINATE_ERROR("Failed to create CameraProvider"); // Get the camera devices. iCameraProvider->getCameraDevices(&cameraDevices); if (cameraDevices.size() == 0) ORIGINATE_ERROR("No cameras available"); for (int i = 0; i < 4; i++) { printf("i: %d, cam:%d\n", i, p_ctx->cam[i]); if (cameraDevices[i] && p_ctx->cam[i]) { if (i == 0) pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_0, (void *)&p_ctx->cam[i]); if (i == 1) pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_1, (void *)&p_ctx->cam[i]); if (i == 2) pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_2, (void *)&p_ctx->cam[i]); if (i == 3) pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_3, (void *)&p_ctx->cam[i]); sleep(3); } #if 0 if (!ArgusSamples::execute(p_ctx)) return EXIT_FAILURE; #endif } return ret; } [/code]
Yes, I have tried jetson_clocks.sh. The result does not change much. The cost difference between 1 cam and 4 cam remain still.

For the other three cameras not running object detection, cudaMemcpy is not performed.


bool ConsumerThread::converterCapturePlaneDqCallback(
struct v4l2_buffer *v4l2_buf,
NvBuffer * buffer,
NvBuffer * shared_buffer,
void *arg)
{
ConsumerThread *thiz = (ConsumerThread*)arg;


if (!v4l2_buf)
{
REPORT_ERROR("Failed to dequeue buffer from conv capture plane");
thiz->abort();
return false;
}

if (v4l2_buf->m.planes[0].bytesused == 0)
{
return false;
}

pthread_mutex_lock(&ctx.net_Lock);

thiz->egl_image = NvEGLImageFromFd(egl_display, buffer->planes[0].fd);
if (thiz->egl_image == NULL)
printf("egl_image NULL %s%d\n", __func__, __LINE__);
mapEGLImage2Float(&thiz->egl_image, p_ctx->net_w, p_ctx->net_h, (char *)thiz->fbuf, thiz->showImg->imageData);
NvDestroyEGLImage(egl_display, thiz->egl_image);

object_detect_run(mdata, thiz->fbuf);

pthread_mutex_unlock(&ctx.net_Lock);

e = thiz->m_ImageConverter->capture_plane.qBuffer(*v4l2_buf, NULL);
if (e < 0)
ORIGINATE_ERROR("qBuffer failed");

return true;
}

bool ConsumerThread::converterOutputPlaneDqCallback(
struct v4l2_buffer *v4l2_buf,
NvBuffer * buffer,
NvBuffer * shared_buffer,
void *arg)
{
ConsumerThread *thiz = (ConsumerThread*)arg;

if (!v4l2_buf)
{
REPORT_ERROR("Failed to dequeue buffer from conv capture plane");
thiz->abort();
return false;
}

if (v4l2_buf->m.planes[0].bytesused == 0)
{
return false;
}

NvBufferDestroy(shared_buffer->planes[0].fd);

//CONSUMER_PRINT("releaseFd %d (%d frames)\n", shared_buffer->planes[0].fd, thiz->m_numPendingFrames);
pthread_mutex_lock(&thiz->m_queueLock);
thiz->m_numPendingFrames--;
thiz->m_ConvOutputPlaneBufQueue->push(buffer);
pthread_cond_broadcast(&thiz->m_queueCond);
pthread_mutex_unlock(&thiz->m_queueLock);

return true;
}

bool ConsumerThread::createImageConverter()
{
int ret = 0;
char cname[10];

sprintf(cname, "conv%d", cam_idx);
// YUV420 --> RGB32 converter
m_ImageConverter = NvVideoConverter::createVideoConverter(cname);
if (!m_ImageConverter)
ORIGINATE_ERROR("Could not create m_ImageConverteroder");

if (DO_STAT)
m_ImageConverter->enableProfiling();


m_ImageConverter->capture_plane.
setDQThreadCallback(converterCapturePlaneDqCallback);
m_ImageConverter->output_plane.
setDQThreadCallback(converterOutputPlaneDqCallback);


ret = m_ImageConverter->setOutputPlaneFormat(V4L2_PIX_FMT_YUV420M, m_pContext->width,
m_pContext->height, V4L2_NV_BUFFER_LAYOUT_BLOCKLINEAR);
if (ret < 0)
ORIGINATE_ERROR("Could not set output plane format");

ret = m_ImageConverter->setCapturePlaneFormat(V4L2_PIX_FMT_ABGR32, m_pContext->net_w,
m_pContext->net_h, V4L2_NV_BUFFER_LAYOUT_PITCH);
if (ret < 0)
ORIGINATE_ERROR("Could not set capture plane format");

ret = m_ImageConverter->setCropRect(962, 159, 726, 726);
if (ret < 0)
ORIGINATE_ERROR("Could not set crop rect");

// Query, Export and Map the output plane buffers so that we can read
// raw data into the buffers
ret = m_ImageConverter->output_plane.setupPlane(V4L2_MEMORY_DMABUF, conv_buf_num, false, false);
if (ret < 0)
ORIGINATE_ERROR("Could not setup output plane");

// Query, Export and Map the output plane buffers so that we can write
// m_ImageConverteroded data from the buffers
ret = m_ImageConverter->capture_plane.setupPlane(V4L2_MEMORY_MMAP, conv_buf_num, true, false);
if (ret < 0)
ORIGINATE_ERROR("Could not setup capture plane");

// Add all empty conv output plane buffers to m_ConvOutputPlaneBufQueue
for (uint32_t i = 0; i < m_ImageConverter->output_plane.getNumBuffers(); i++)
{
m_ConvOutputPlaneBufQueue->push(
m_ImageConverter->output_plane.getNthBuffer(i));
}

// conv output plane STREAMON
ret = m_ImageConverter->output_plane.setStreamStatus(true);
if (ret < 0)
ORIGINATE_ERROR("fail to set conv output stream on");

// conv capture plane STREAMON
ret = m_ImageConverter->capture_plane.setStreamStatus(true);
if (ret < 0)
ORIGINATE_ERROR("fail to set conv capture stream on");

// Start threads to dequeue buffers on conv capture plane,
// conv output plane and capture plane
m_ImageConverter->capture_plane.startDQThread(this);
m_ImageConverter->output_plane.startDQThread(this);

// Enqueue all empty conv capture plane buffers
for (uint32_t i = 0; i < m_ImageConverter->capture_plane.getNumBuffers(); i++)
{
struct v4l2_buffer v4l2_buf;
struct v4l2_plane planes[MAX_PLANES];

memset(&v4l2_buf, 0, sizeof(v4l2_buf));
memset(planes, 0, MAX_PLANES * sizeof(struct v4l2_plane));

v4l2_buf.index = i;
v4l2_buf.m.planes = planes;

ret = m_ImageConverter->capture_plane.qBuffer(v4l2_buf, NULL);
if (ret < 0) {
abort();
ORIGINATE_ERROR("Error queueing buffer at conv capture plane");
}
printf(" i: %d\n", i);
}

printf("create vidoe converter return true\n");
return true;
}

/*******************************************************************************
* Argus Producer thread:
* Opens the Argus camera driver, creates an OutputStream to output to a
* FrameConsumer, then performs repeating capture requests for CAPTURE_TIME
* seconds before closing the producer and Argus driver.
******************************************************************************/
static void *capture_thread_3(void *arg)
{
-
UniqueObj<CaptureSession> captureSession(
iCameraProvider->createCaptureSession(cameraDevices[cam_idx]));
ICaptureSession *iCaptureSession = interface_cast<ICaptureSession>(captureSession);
if (!iCaptureSession)
ORIGINATE_ERROR("Failed to get ICaptureSession interface");

printf("%s, %s, %d\n", __FILE__, __func__, __LINE__);
// Create the OutputStream.
PRODUCER_PRINT("Creating output stream\n");
UniqueObj<OutputStreamSettings> streamSettings(iCaptureSession->createOutputStreamSettings());
IOutputStreamSettings *iStreamSettings = interface_cast<IOutputStreamSettings>(streamSettings);
if (iStreamSettings)
{
iStreamSettings->setPixelFormat(PIXEL_FMT_YCbCr_420_888);
iStreamSettings->setResolution(Size(ctx.width, ctx.height));
// iStreamSettings->setResolution(Size(p_ctx->, 1080));
#ifdef RENDER
iStreamSettings->setEGLDisplay(m_renderer->getEGLDisplay());
#endif
printf("%s, %s, %d\n", __FILE__, __func__, __LINE__);
}
else
{
ORIGINATE_ERROR("NULL for output stream settings!");
}
UniqueObj<OutputStream> outputStream(iCaptureSession->createOutputStream(streamSettings.get()));

// Launch the FrameConsumer thread to consume frames from the OutputStream.
PRODUCER_PRINT("Launching consumer thread\n");
ConsumerThread frameConsumerThread(outputStream.get(), cam_idx);
PROPAGATE_ERROR(frameConsumerThread.initialize());

// Wait until the consumer is connected to the stream.
PROPAGATE_ERROR(frameConsumerThread.waitRunning());

// Create capture request and enable output stream.
UniqueObj<Request> request(iCaptureSession->createRequest());
IRequest *iRequest = interface_cast<IRequest>(request);
if (!iRequest)
ORIGINATE_ERROR("Failed to create Request");
iRequest->enableOutputStream(outputStream.get());

ISourceSettings *iSourceSettings = interface_cast<ISourceSettings>(iRequest->getSourceSettings());
if (!iSourceSettings)
ORIGINATE_ERROR("Failed to get ISourceSettings interface");

iSourceSettings->setFrameDurationRange(Range<uint64_t>(1e9/DEFAULT_FPS));

// Submit capture requests.
PRODUCER_PRINT("Starting repeat capture requests.\n");
if (iCaptureSession->repeat(request.get()) != STATUS_OK)
ORIGINATE_ERROR("Failed to start repeat capture request");

for (int i = 0; i < CAPTURE_TIME && !frameConsumerThread.isInError(); i++ )
sleep(1);

// Stop the repeating request and wait for idle.
iCaptureSession->stopRepeat();
iCaptureSession->waitForIdle();

// Destroy the output stream to end the consumer thread.
outputStream.reset();

// Wait for the consumer thread to complete.
PROPAGATE_ERROR(frameConsumerThread.shutdown());


PRODUCER_PRINT("Done -- exiting.\n");

}


int main(int argc, const char *argv[])
{

// Create the CameraProvider object and get the core interface.
UniqueObj<CameraProvider> cameraProvider = UniqueObj<CameraProvider>(CameraProvider::create());
iCameraProvider = interface_cast<ICameraProvider>(cameraProvider);
if (!iCameraProvider)
ORIGINATE_ERROR("Failed to create CameraProvider");

// Get the camera devices.
iCameraProvider->getCameraDevices(&cameraDevices);
if (cameraDevices.size() == 0)
ORIGINATE_ERROR("No cameras available");

for (int i = 0; i < 4; i++) {
printf("i: %d, cam:%d\n", i, p_ctx->cam[i]);
if (cameraDevices[i] && p_ctx->cam[i]) {
if (i == 0)
pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_0, (void *)&p_ctx->cam[i]);

if (i == 1)
pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_1, (void *)&p_ctx->cam[i]);

if (i == 2)
pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_2, (void *)&p_ctx->cam[i]);

if (i == 3)
pthread_create(&p_ctx->camera_feed_handle[i], NULL, ArgusSamples::capture_thread_3, (void *)&p_ctx->cam[i]);
sleep(3);
}
#if 0
if (!ArgusSamples::execute(p_ctx))
return EXIT_FAILURE;
#endif
}
return ret;
}

#3
Posted 12/26/2017 06:55 AM   
Hi xxxss, Can you check tegrastats of both cases?
Hi xxxss,
Can you check tegrastats of both cases?

#4
Posted 12/27/2017 01:57 AM   
1 cam: [code] RAM 2488/3995MB (lfb 19x4MB) cpu [60%,25%,60%,29%]@1734 GR3D 99%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [20%,35%,57%,71%]@1734 GR3D 65%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [20%,28%,28%,99%]@1734 GR3D 99%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [26%,30%,23%,99%]@1734 GR3D 77%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [28%,29%,78%,47%]@1734 GR3D 99%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [27%,29%,99%,30%]@1734 GR3D 88%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [29%,28%,97%,31%]@1734 GR3D 99%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [32%,32%,73%,47%]@1734 GR3D 99%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [34%,37%,41%,61%]@1734 GR3D 99%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [34%,50%,64%,30%]@1734 GR3D 99%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [87%,37%,32%,29%]@1734 GR3D 99%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [68%,35%,43%,34%]@1734 GR3D 65%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [26%,30%,97%,30%]@1734 GR3D 99%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [36%,33%,73%,31%]@1734 GR3D 99%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [27%,55%,40%,51%]@1734 GR3D 94%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [26%,30%,24%,97%]@1734 GR3D 99%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [44%,27%,32%,77%]@1734 GR3D 86%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [34%,27%,28%,99%]@1734 GR3D 66%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [34%,31%,31%,86%]@1734 GR3D 42%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [32%,34%,34%,83%]@1734 GR3D 41%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [37%,43%,24%,75%]@1734 GR3D 53%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [49%,35%,34%,65%]@1734 GR3D 82%@998 EDP limit 0 RAM 2489/3995MB (lfb 19x4MB) cpu [54%,44%,38%,36%]@1734 GR3D 99%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [30%,38%,27%,89%]@1734 GR3D 99%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [30%,33%,27%,95%]@1734 GR3D 99%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [31%,36%,56%,62%]@1734 GR3D 60%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [71%,43%,38%,24%]@1734 GR3D 64%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [97%,32%,23%,20%]@1734 GR3D 86%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [59%,44%,41%,42%]@1734 GR3D 78%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [27%,35%,43%,73%]@1734 GR3D 99%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [31%,30%,69%,53%]@1734 GR3D 99%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [38%,29%,92%,32%]@1734 GR3D 99%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [37%,33%,98%,21%]@1734 GR3D 99%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [35%,37%,92%,23%]@1734 GR3D 99%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [34%,40%,86%,24%]@1734 GR3D 99%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [42%,25%,82%,33%]@1734 GR3D 57%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [33%,27%,23%,86%]@1734 GR3D 29%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [40%,46%,31%,69%]@1734 GR3D 47%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [51%,36%,54%,38%]@1734 GR3D 65%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [28%,34%,27%,91%]@1734 GR3D 59%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [35%,35%,49%,71%]@1734 GR3D 87%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [42%,47%,47%,41%]@1734 GR3D 99%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [29%,30%,89%,35%]@1734 GR3D 75%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [47%,50%,52%,35%]@1734 GR3D 67%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [35%,33%,94%,19%]@1734 GR3D 99%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [30%,25%,96%,28%]@1734 GR3D 99%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [27%,22%,97%,28%]@1734 GR3D 94%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [39%,64%,50%,26%]@1734 GR3D 75%@998 EDP limit 0 RAM 2487/3995MB (lfb 19x4MB) cpu [29%,37%,84%,32%]@1734 GR3D 99%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [27%,34%,97%,23%]@1734 GR3D 99%@998 EDP limit 0 RAM 2488/3995MB (lfb 19x4MB) cpu [82%,28%,33%,27%]@1734 GR3D 77%@998 EDP limit 0 [/code] 4 cam but only 1 of them detection: [code] RAM 3329/3995MB (lfb 18x4MB) cpu [69%,74%,69%,86%]@1734 GR3D 81%@998 EDP limit 0 RAM 3329/3995MB (lfb 18x4MB) cpu [76%,81%,69%,74%]@1734 GR3D 99%@998 EDP limit 0 RAM 3329/3995MB (lfb 18x4MB) cpu [68%,73%,73%,86%]@1734 GR3D 99%@998 EDP limit 0 RAM 3328/3995MB (lfb 18x4MB) cpu [84%,75%,69%,76%]@1734 GR3D 5%@998 EDP limit 0 RAM 3329/3995MB (lfb 18x4MB) cpu [84%,81%,71%,64%]@1734 GR3D 99%@998 EDP limit 0 RAM 3329/3995MB (lfb 18x4MB) cpu [74%,76%,73%,78%]@1734 GR3D 99%@998 EDP limit 0 RAM 3329/3995MB (lfb 18x4MB) cpu [79%,79%,86%,66%]@1734 GR3D 95%@998 EDP limit 0 RAM 3332/3995MB (lfb 18x4MB) cpu [75%,80%,75%,78%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [76%,76%,81%,71%]@1734 GR3D 30%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [76%,79%,82%,68%]@1734 GR3D 97%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [77%,71%,73%,70%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [85%,78%,70%,75%]@1734 GR3D 96%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [76%,74%,72%,79%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [77%,70%,73%,81%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [72%,77%,74%,81%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [74%,77%,71%,79%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [82%,78%,77%,68%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [68%,74%,71%,78%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [72%,71%,74%,87%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [62%,82%,74%,79%]@1734 GR3D 8%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [82%,77%,70%,78%]@1734 GR3D 41%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [70%,85%,74%,75%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [73%,70%,77%,76%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [77%,72%,78%,75%]@1734 GR3D 0%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [85%,76%,72%,67%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [84%,74%,73%,66%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [90%,68%,76%,71%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [86%,80%,74%,66%]@1734 GR3D 87%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [73%,86%,78%,76%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [75%,88%,73%,68%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [67%,76%,76%,86%]@1734 GR3D 51%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [69%,82%,72%,77%]@1734 GR3D 3%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [65%,80%,78%,72%]@1734 GR3D 40%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [77%,76%,80%,73%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [86%,70%,79%,74%]@1734 GR3D 90%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [67%,84%,77%,74%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [77%,77%,76%,76%]@1734 GR3D 3%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [84%,79%,78%,67%]@1734 GR3D 3%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [66%,64%,75%,95%]@1734 GR3D 19%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [73%,76%,72%,78%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [77%,75%,76%,74%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [67%,70%,83%,83%]@1734 GR3D 3%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [70%,72%,78%,82%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [70%,75%,79%,74%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [74%,73%,89%,71%]@1734 GR3D 11%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [69%,80%,74%,81%]@1734 GR3D 38%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [72%,71%,76%,80%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [78%,77%,70%,74%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [80%,77%,80%,67%]@1734 GR3D 8%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [84%,74%,75%,72%]@1734 GR3D 99%@998 EDP limit 0 RAM 3331/3995MB (lfb 18x4MB) cpu [74%,78%,78%,75%]@1734 GR3D 99%@998 EDP limit 0 [/code]
1 cam:

RAM 2488/3995MB (lfb 19x4MB) cpu [60%,25%,60%,29%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [20%,35%,57%,71%]@1734 GR3D 65%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [20%,28%,28%,99%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [26%,30%,23%,99%]@1734 GR3D 77%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [28%,29%,78%,47%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [27%,29%,99%,30%]@1734 GR3D 88%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [29%,28%,97%,31%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [32%,32%,73%,47%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [34%,37%,41%,61%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [34%,50%,64%,30%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [87%,37%,32%,29%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [68%,35%,43%,34%]@1734 GR3D 65%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [26%,30%,97%,30%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [36%,33%,73%,31%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [27%,55%,40%,51%]@1734 GR3D 94%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [26%,30%,24%,97%]@1734 GR3D 99%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [44%,27%,32%,77%]@1734 GR3D 86%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [34%,27%,28%,99%]@1734 GR3D 66%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [34%,31%,31%,86%]@1734 GR3D 42%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [32%,34%,34%,83%]@1734 GR3D 41%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [37%,43%,24%,75%]@1734 GR3D 53%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [49%,35%,34%,65%]@1734 GR3D 82%@998 EDP limit 0
RAM 2489/3995MB (lfb 19x4MB) cpu [54%,44%,38%,36%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [30%,38%,27%,89%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [30%,33%,27%,95%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [31%,36%,56%,62%]@1734 GR3D 60%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [71%,43%,38%,24%]@1734 GR3D 64%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [97%,32%,23%,20%]@1734 GR3D 86%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [59%,44%,41%,42%]@1734 GR3D 78%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [27%,35%,43%,73%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [31%,30%,69%,53%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [38%,29%,92%,32%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [37%,33%,98%,21%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [35%,37%,92%,23%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [34%,40%,86%,24%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [42%,25%,82%,33%]@1734 GR3D 57%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [33%,27%,23%,86%]@1734 GR3D 29%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [40%,46%,31%,69%]@1734 GR3D 47%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [51%,36%,54%,38%]@1734 GR3D 65%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [28%,34%,27%,91%]@1734 GR3D 59%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [35%,35%,49%,71%]@1734 GR3D 87%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [42%,47%,47%,41%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [29%,30%,89%,35%]@1734 GR3D 75%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [47%,50%,52%,35%]@1734 GR3D 67%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [35%,33%,94%,19%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [30%,25%,96%,28%]@1734 GR3D 99%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [27%,22%,97%,28%]@1734 GR3D 94%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [39%,64%,50%,26%]@1734 GR3D 75%@998 EDP limit 0
RAM 2487/3995MB (lfb 19x4MB) cpu [29%,37%,84%,32%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [27%,34%,97%,23%]@1734 GR3D 99%@998 EDP limit 0
RAM 2488/3995MB (lfb 19x4MB) cpu [82%,28%,33%,27%]@1734 GR3D 77%@998 EDP limit 0


4 cam but only 1 of them detection:
RAM 3329/3995MB (lfb 18x4MB) cpu [69%,74%,69%,86%]@1734 GR3D 81%@998 EDP limit 0
RAM 3329/3995MB (lfb 18x4MB) cpu [76%,81%,69%,74%]@1734 GR3D 99%@998 EDP limit 0
RAM 3329/3995MB (lfb 18x4MB) cpu [68%,73%,73%,86%]@1734 GR3D 99%@998 EDP limit 0
RAM 3328/3995MB (lfb 18x4MB) cpu [84%,75%,69%,76%]@1734 GR3D 5%@998 EDP limit 0
RAM 3329/3995MB (lfb 18x4MB) cpu [84%,81%,71%,64%]@1734 GR3D 99%@998 EDP limit 0
RAM 3329/3995MB (lfb 18x4MB) cpu [74%,76%,73%,78%]@1734 GR3D 99%@998 EDP limit 0
RAM 3329/3995MB (lfb 18x4MB) cpu [79%,79%,86%,66%]@1734 GR3D 95%@998 EDP limit 0
RAM 3332/3995MB (lfb 18x4MB) cpu [75%,80%,75%,78%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [76%,76%,81%,71%]@1734 GR3D 30%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [76%,79%,82%,68%]@1734 GR3D 97%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,71%,73%,70%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [85%,78%,70%,75%]@1734 GR3D 96%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [76%,74%,72%,79%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,70%,73%,81%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [72%,77%,74%,81%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [74%,77%,71%,79%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [82%,78%,77%,68%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [68%,74%,71%,78%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [72%,71%,74%,87%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [62%,82%,74%,79%]@1734 GR3D 8%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [82%,77%,70%,78%]@1734 GR3D 41%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [70%,85%,74%,75%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [73%,70%,77%,76%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,72%,78%,75%]@1734 GR3D 0%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [85%,76%,72%,67%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [84%,74%,73%,66%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [90%,68%,76%,71%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [86%,80%,74%,66%]@1734 GR3D 87%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [73%,86%,78%,76%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [75%,88%,73%,68%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [67%,76%,76%,86%]@1734 GR3D 51%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [69%,82%,72%,77%]@1734 GR3D 3%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [65%,80%,78%,72%]@1734 GR3D 40%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,76%,80%,73%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [86%,70%,79%,74%]@1734 GR3D 90%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [67%,84%,77%,74%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,77%,76%,76%]@1734 GR3D 3%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [84%,79%,78%,67%]@1734 GR3D 3%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [66%,64%,75%,95%]@1734 GR3D 19%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [73%,76%,72%,78%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [77%,75%,76%,74%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [67%,70%,83%,83%]@1734 GR3D 3%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [70%,72%,78%,82%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [70%,75%,79%,74%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [74%,73%,89%,71%]@1734 GR3D 11%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [69%,80%,74%,81%]@1734 GR3D 38%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [72%,71%,76%,80%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [78%,77%,70%,74%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [80%,77%,80%,67%]@1734 GR3D 8%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [84%,74%,75%,72%]@1734 GR3D 99%@998 EDP limit 0
RAM 3331/3995MB (lfb 18x4MB) cpu [74%,78%,78%,75%]@1734 GR3D 99%@998 EDP limit 0

#5
Posted 12/27/2017 02:13 AM   
Hi xxxss, Please try not to map buffers to CPU: [code] ret = m_ImageConverter->capture_plane.setupPlane(V4L2_MEMORY_MMAP, conv_buf_num, [b]false[/b], false);[/code]
Hi xxxss,
Please try not to map buffers to CPU:
ret = m_ImageConverter->capture_plane.setupPlane(V4L2_MEMORY_MMAP, conv_buf_num, false, false);

#6
Posted 12/27/2017 03:10 AM   
Could you tell me how to fix this? I don't know how and why.
Could you tell me how to fix this? I don't know how and why.

#7
Posted 12/27/2017 05:30 AM   
Hi xxxsss, We don't have test app to reproduce the issue, but from the code you attached, it looks like you can disable mmap in capture_plane for a try. Could you give it a try?
Hi xxxsss,
We don't have test app to reproduce the issue, but from the code you attached, it looks like you can disable mmap in capture_plane for a try. Could you give it a try?

#8
Posted 12/27/2017 05:42 AM   
I don't know how to disable mmap in capture_plane, so I tried remove this line. It's not work. Could you give me an example how to disable it ?
I don't know how to disable mmap in capture_plane, so I tried remove this line. It's not work.
Could you give me an example how to disable it ?

#9
Posted 12/28/2017 07:29 AM   
Hi xxxss, [code]/** * Helper method that encapsulates all the method calls required to * set up the plane for streaming. * * Calls reqbuf internally. Then, for each of the buffers, calls #queryBuffer, * #exportBuffer and maps the buffer/allocates the buffer memory depending * on the memory type. * * @sa deinitPlane * * @param[in] mem_type V4L2 Memory to use on the buffer. * @param[in] num_buffers Number of buffer to request on the plane. * @param[in] map boolean value indicating if the buffers should be mapped to memory (Only for V4L2_MEMORY_MMAP). * @param[in] allocate boolean valued indicating whether the buffers should be allocated memory (Only for V4L2_MEMORY_USERPTR). * @return 0 for success, -1 otherwise. */ int setupPlane(enum v4l2_memory mem_type, uint32_t num_buffers, bool map, bool allocate);[/code] The following line sets bool map=false: [code]ret = m_ImageConverter->capture_plane.setupPlane(V4L2_MEMORY_MMAP, conv_buf_num, [b]false[/b], false);[/code]
Hi xxxss,
/**
* Helper method that encapsulates all the method calls required to
* set up the plane for streaming.
*
* Calls reqbuf internally. Then, for each of the buffers, calls #queryBuffer,
* #exportBuffer and maps the buffer/allocates the buffer memory depending
* on the memory type.
*
* @sa deinitPlane
*
* @param[in] mem_type V4L2 Memory to use on the buffer.
* @param[in] num_buffers Number of buffer to request on the plane.
* @param[in] map boolean value indicating if the buffers should be mapped to
memory (Only for V4L2_MEMORY_MMAP).
* @param[in] allocate boolean valued indicating whether the buffers should be
allocated memory (Only for V4L2_MEMORY_USERPTR).
* @return 0 for success, -1 otherwise.
*/
int setupPlane(enum v4l2_memory mem_type, uint32_t num_buffers, bool map, bool allocate);


The following line sets bool map=false:
ret = m_ImageConverter->capture_plane.setupPlane(V4L2_MEMORY_MMAP, conv_buf_num, false, false);

#10
Posted 12/28/2017 07:41 AM   
Sorry for that. I tried this. The result of nvprofile doesn't change much.
Sorry for that. I tried this. The result of nvprofile doesn't change much.

#11
Posted 12/28/2017 08:40 AM   

#12
Posted 12/28/2017 08:41 AM   
Hi xxxss, We will try to simulate and reproduce the issue, but we don't have the test app. It will take some time. So you see cudaMemcpy() is slow in running 4 cameras like: Argus camera 1 -> NvVideoConverter -> cudaMemcpy() Argus camera 2 -> NvVideoConverter -> do nothing Argus camera 3 -> NvVideoConverter -> do nothing Argus camera 4 -> NvVideoConverter -> do nothing We don't have your detection algorithm, so it will be skipped in the simulation.
Hi xxxss,
We will try to simulate and reproduce the issue, but we don't have the test app. It will take some time.

So you see cudaMemcpy() is slow in running 4 cameras like:
Argus camera 1 -> NvVideoConverter -> cudaMemcpy()
Argus camera 2 -> NvVideoConverter -> do nothing
Argus camera 3 -> NvVideoConverter -> do nothing
Argus camera 4 -> NvVideoConverter -> do nothing

We don't have your detection algorithm, so it will be skipped in the simulation.

#13
Posted 12/28/2017 09:12 AM   
We create a sample base on our project. You can try this. If start with 4 cameras, the second cudaMemcpy cost 700us~1200us. (profile_4cam_201712291135) If run with only 1 camera, the second cudaMemcpy cost 400us~500us. (profile_1cam_201712291137) I run the nvprof by this command: [code] nvprof --print-api-trace -o profile_4cam_201712291135 --log-file 201712291135.log ./detection -c0 -c1 -c2 -c3 [/code] or [code] nvprof --print-api-trace -o profile_4cam_201712291137 --log-file 201712291137.log ./detection -c1 [/code]
We create a sample base on our project. You can try this.
If start with 4 cameras, the second cudaMemcpy cost 700us~1200us. (profile_4cam_201712291135)
If run with only 1 camera, the second cudaMemcpy cost 400us~500us. (profile_1cam_201712291137)

I run the nvprof by this command:
nvprof --print-api-trace -o profile_4cam_201712291135 --log-file 201712291135.log ./detection -c0 -c1 -c2 -c3


or
nvprof --print-api-trace -o profile_4cam_201712291137 --log-file 201712291137.log ./detection -c1
Attachments

xxxss_sample.tar.gz

#14
Posted 12/29/2017 07:47 AM   
Hi xxxss, What is the purpose of copying data from GPU to CPU( cudaMemcpy(output, x_gpu, size, cudaMemcpyDeviceToHost) )? Your detection algorithm is running on CPU?
Hi xxxss,
What is the purpose of copying data from GPU to CPU( cudaMemcpy(output, x_gpu, size, cudaMemcpyDeviceToHost) )? Your detection algorithm is running on CPU?

#15
Posted 01/02/2018 04:46 AM   
Scroll To Top

Add Reply