We built an object detection system with 4 cameras on a TX1.
If we run the application with only 1 camera active (streaming and detecting), the whole detection loop runs 21 times per second.
When four cameras are streaming and only one of them is used for the detection algorithm, the detection loop runs only 11 times per second.
I tried nvprof. Below is the result with 1 camera streaming:
==7317== Profiling application: ./detection -c1
==7317== Profiling result:
Time(%) Time Calls Avg Min Max Name
21.90% 33.1770s 3747 8.8543ms 7.4923ms 14.441ms void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=7, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
15.23% 23.0770s 26229 879.83us 70.940us 2.1807ms maxwell_sgemm_128x64_raggedMn_nn
12.04% 18.2380s 33723 540.82us 59.430us 3.5146ms im2col_gpu_kernel(int, float const *, int, int, int, int, int, int, int, float*)
7.26% 10.9959s 93676 117.38us 1.8750us 1.9100ms fill_kernel(int, float, float*, int)
7.07% 10.7105s 29976 357.30us 28.282us 3.2339ms normalize_kernel(int, float*, float*, float*, int, int, int)
6.70% 10.1537s 71193 142.62us 1.6660us 2.0321ms activate_array_kernel(float*, int, ACTIVATION)
6.08% 9.21664s 33723 273.30us 5.0520us 2.0859ms add_bias_kernel(float*, float*, int, int, int)
5.77% 8.74433s 3747 2.3337ms 2.0663ms 3.9664ms void magma_lds128_sgemm_kernel<bool=0, bool=0, int=6, int=5, int=3, int=3, int=3>(int, int, int, float const *, int, float const *, int, float*, int, int, int, float const *, float const *, float, float, int)
5.39% 8.16327s 33723 242.07us 2.5000us 2.1818ms copy_kernel(int, float*, int, int, float*, int, int)
5.36% 8.11565s 29976 270.74us 28.908us 2.4257ms scale_bias_kernel(float*, float*, int, int)
4.91% 7.43280s 22482 330.61us 57.034us 1.7648ms forward_maxpool_layer_kernel(int, int, int, int, int, int, int, float*, float*, int*)
1.25% 1.89259s 3747 505.09us 353.56us 1.0201ms convertIntToFloatKernelShow(unsigned __int64, int, int, void*, int, char*, int)
1.04% 1.57767s 11451 137.78us 207ns 39.722ms [CUDA memcpy HtoH]
0.02% 29.780ms 3747 7.9470us 7.2930us 12.344us softmax_kernel(float*, int, int, int, int, int, int, float, float*)
0.00% 1.1460us 1 1.1460us 1.1460us 1.1460us [CUDA memcpy HtoD]
==7317== API calls:
Time(%) Time Calls Avg Min Max Name
77.96% 124.254s 11452 10.850ms 37.033us 241.49ms cudaMemcpy
15.53% 24.7526s 389689 63.518us 29.950us 14.282ms cudaLaunch
1.66% 2.64968s 3747 707.15us 47.606us 7.9432ms cudaStreamSynchronize
1.15% 1.83160s 2577940 710ns 416ns 2.7011ms cudaSetupArgument
1.00% 1.59649s 3748 425.96us 12.032us 1.46191s cudaFree
0.87% 1.38763s 3747 370.33us 156.93us 5.0750ms cuGraphicsEGLRegisterImage
0.30% 482.49ms 3747 128.77us 56.618us 2.9200ms cuGraphicsUnregisterResource
0.27% 423.85ms 389689 1.0870us 469ns 3.0148ms cudaConfigureCall
0.24% 374.82ms 435021 861ns 468ns 2.6683ms cudaGetLastError
0.22% 357.68ms 2 178.84ms 160.33ms 197.35ms cuCtxCreate
0.21% 341.01ms 318496 1.0700us 572ns 2.1004ms cudaPeekAtLastError
0.13% 211.42ms 7494 28.211us 7.7090us 2.1873ms cudaBindTexture
0.12% 188.53ms 157 1.2008ms 54.743us 78.482ms cudaMallocManaged
0.09% 144.28ms 33724 4.2780us 1.6150us 2.7024ms cudaGetDevice
0.07% 111.71ms 1 111.71ms 111.71ms 111.71ms cuCtxDestroy
0.05% 82.356ms 3747 21.979us 11.511us 1.5091ms cudaStreamCreate
0.04% 67.692ms 3747 18.065us 8.3340us 2.3707ms cudaStreamDestroy
0.03% 45.516ms 7494 6.0730us 1.7190us 1.7276ms cudaUnbindTexture
0.02% 31.411ms 3749 8.3780us 4.2710us 1.0104ms cudaSetDevice
0.01% 23.511ms 2 11.755ms 562.53us 22.948ms cudaMallocHost
0.01% 12.465ms 3747 3.3260us 1.6660us 560.76us cuEGLStreamProducerPresentDevicePtr
0.00% 7.5535ms 1 7.5535ms 7.5535ms 7.5535ms cudaDeviceSynchronize
0.00% 1.5153ms 2 757.64us 595.29us 919.99us cudaFreeHost
0.00% 1.1524ms 3 384.14us 41.565us 681.65us cudaMalloc
0.00% 266.52us 261 1.0210us 364ns 53.649us cuDeviceGetAttribute
0.00% 50.316us 16 3.1440us 1.8230us 14.272us cudaEventCreateWithFlags
0.00% 45.993us 3 15.331us 8.4900us 27.137us cuDeviceTotalMem
0.00% 42.086us 1 42.086us 42.086us 42.086us cudaGetDeviceProperties
0.00% 16.044us 11 1.4580us 938ns 5.2610us cudaDeviceGetAttribute
0.00% 14.064us 4 3.5160us 1.7190us 6.0940us cuInit
0.00% 13.751us 7 1.9640us 677ns 6.3030us cuDeviceGetCount
0.00% 10.417us 1 10.417us 10.417us 10.417us cudaSetDeviceFlags
0.00% 7.3440us 4 1.8360us 1.0420us 2.6040us cuDeviceGetName
0.00% 6.9280us 7 989ns 573ns 1.8230us cuDeviceGet
0.00% 6.3550us 1 6.3550us 6.3550us 6.3550us cudaGetDeviceCount
0.00% 5.5210us 3 1.8400us 885ns 3.3860us cuDriverGetVersion
0.00% 3.2810us 2 1.6400us 1.3020us 1.9790us cuCtxSetCurrent
Here is the result with 4 cameras streaming and only 1 detecting:
==7615== Profiling application: ./detection -c0 -c1 -c2 -c3
==7615== Profiling result:
Time(%) Time Calls Avg Min Max Name
22.71% 42.5082s 2695 15.773ms 7.6831ms 18.547ms void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=7, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
13.19% 24.6859s 18865 1.3086ms 73.074us 5.0093ms maxwell_sgemm_128x64_raggedMn_nn
12.61% 23.6057s 24255 973.23us 59.064us 6.5400ms im2col_gpu_kernel(int, float const *, int, int, int, int, int, int, int, float*)
7.13% 13.3473s 67376 198.10us 1.9260us 4.4152ms fill_kernel(int, float, float*, int)
6.50% 12.1739s 24255 501.91us 2.3440us 5.0512ms copy_kernel(int, float*, int, int, float*, int, int)
6.44% 12.0468s 2695 4.4701ms 2.0662ms 7.6630ms void magma_lds128_sgemm_kernel<bool=0, bool=0, int=6, int=5, int=3, int=3, int=3>(int, int, int, float const *, int, float const *, int, float*, int, int, int, float const *, float const *, float, float, int)
6.25% 11.7044s 21560 542.88us 28.593us 5.5883ms normalize_kernel(int, float*, float*, float*, int, int, int)
6.09% 11.3933s 21560 528.45us 29.062us 4.7549ms scale_bias_kernel(float*, float*, int, int)
6.03% 11.2827s 51205 220.34us 1.6660us 4.8346ms activate_array_kernel(float*, int, ACTIVATION)
5.95% 11.1430s 24255 459.41us 5.4690us 4.8695ms add_bias_kernel(float*, float*, int, int, int)
4.89% 9.16334s 16170 566.69us 58.492us 3.7779ms forward_maxpool_layer_kernel(int, int, int, int, int, int, int, float*, float*, int*)
1.30% 2.43517s 2695 903.59us 353.23us 3.2210ms convertIntToFloatKernelShow(unsigned __int64, int, int, void*, int, char*, int)
0.90% 1.67766s 8295 202.25us 208ns 39.597ms [CUDA memcpy HtoH]
0.02% 30.554ms 2695 11.337us 7.6560us 17.345us softmax_kernel(float*, int, int, int, int, int, int, float, float*)
0.00% 1.3550us 1 1.3550us 1.3550us 1.3550us [CUDA memcpy HtoD]
==7615== API calls:
Time(%) Time Calls Avg Min Max Name
74.58% 149.874s 8296 18.066ms 37.501us 240.87ms cudaMemcpy
18.24% 36.6611s 280281 130.80us 30.209us 33.917ms cudaLaunch
2.29% 4.59812s 2695 1.7062ms 58.699us 15.252ms cudaStreamSynchronize
1.14% 2.29234s 2695 850.59us 155.37us 24.756ms cuGraphicsEGLRegisterImage
0.94% 1.88761s 1854164 1.0180us 417ns 9.6395ms cudaSetupArgument
0.85% 1.71233s 2696 635.14us 12.188us 1.54340s cudaFree
0.39% 776.94ms 2695 288.29us 58.439us 13.236ms cuGraphicsUnregisterResource
0.28% 555.74ms 280281 1.9820us 469ns 9.6958ms cudaConfigureCall
0.22% 444.38ms 312989 1.4190us 468ns 5.6845ms cudaGetLastError
0.21% 421.82ms 229076 1.8410us 572ns 5.0321ms cudaPeekAtLastError
0.21% 417.74ms 2 208.87ms 157.93ms 259.80ms cuCtxCreate
0.14% 271.40ms 5390 50.352us 8.6460us 9.5489ms cudaBindTexture
0.11% 217.05ms 24256 8.9480us 1.7180us 4.3617ms cudaGetDevice
0.09% 173.35ms 157 1.1042ms 53.074us 64.627ms cudaMallocManaged
0.08% 152.08ms 8 19.010ms 70.574us 73.692ms cudaMallocHost
0.07% 131.61ms 2695 48.833us 11.979us 8.6015ms cudaStreamCreate
0.06% 119.68ms 2695 44.407us 8.4890us 6.9904ms cudaStreamDestroy
0.06% 115.08ms 1 115.08ms 115.08ms 115.08ms cuCtxDestroy
0.03% 58.050ms 5390 10.770us 1.7180us 4.1501ms cudaUnbindTexture
0.03% 52.572ms 2697 19.492us 4.1150us 5.7802ms cudaSetDevice
0.01% 15.497ms 2695 5.7500us 1.6660us 2.8814ms cuEGLStreamProducerPresentDevicePtr
0.00% 7.6157ms 8 951.96us 317.24us 1.7291ms cudaFreeHost
0.00% 7.4092ms 1 7.4092ms 7.4092ms 7.4092ms cudaDeviceSynchronize
0.00% 918.92us 3 306.31us 42.293us 481.52us cudaMalloc
0.00% 297.92us 261 1.1410us 364ns 50.886us cuDeviceGetAttribute
0.00% 49.060us 16 3.0660us 1.8750us 14.791us cudaEventCreateWithFlags
0.00% 43.021us 3 14.340us 8.9580us 24.323us cuDeviceTotalMem
0.00% 40.887us 1 40.887us 40.887us 40.887us cudaGetDeviceProperties
0.00% 16.249us 11 1.4770us 937ns 5.4160us cudaDeviceGetAttribute
0.00% 14.323us 4 3.5800us 2.5520us 6.1980us cuInit
0.00% 12.448us 7 1.7780us 677ns 5.4690us cuDeviceGetCount
0.00% 8.0720us 4 2.0180us 1.4580us 2.3440us cuDeviceGetName
0.00% 7.4480us 1 7.4480us 7.4480us 7.4480us cudaSetDeviceFlags
0.00% 7.1360us 1 7.1360us 7.1360us 7.1360us cudaGetDeviceCount
0.00% 6.3030us 7 900ns 573ns 1.3030us cuDeviceGet
0.00% 5.8850us 3 1.9610us 937ns 3.3850us cuDriverGetVersion
0.00% 2.7610us 2 1.3800us 1.3550us 1.4060us cuCtxSetCurrent
We can see that the average cudaMemcpy time is about 10 ms (10.850 ms) with 1 camera active and about 18 ms (18.066 ms) with 4 cameras active.
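To check whether only cudaMemcpy is affected, the averages from the two runs can be put side by side. Below is a minimal Python sketch that divides the 4-camera average by the 1-camera average. The names `avg_ms` and `slowdown` are made up for this post and are not part of nvprof; the numbers are copied from the "Avg" columns of the tables above.

```python
# Average durations in milliseconds, copied from the two nvprof runs above.
# Each entry is (1 camera streaming, 4 cameras streaming).
avg_ms = {
    "cudaMemcpy (API)": (10.850, 18.066),
    "gemmSN_NN_kernel": (8.8543, 15.773),
    "maxwell_sgemm_128x64_raggedMn_nn": (0.87983, 1.3086),
    "im2col_gpu_kernel": (0.54082, 0.97323),
    "magma_lds128_sgemm_kernel": (2.3337, 4.4701),
}


def slowdown(one_cam_ms, four_cam_ms):
    """Return how many times slower the 4-camera run is than the 1-camera run."""
    return four_cam_ms / one_cam_ms


# Slowdown factor for every entry in the table.
ratios = {name: slowdown(*pair) for name, pair in avg_ms.items()}

for name, ratio in ratios.items():
    print(f"{name:35s} {ratio:.2f}x")
```

With these figures the GPU kernels slow down by roughly 1.5x to 1.9x, which is in the same range as the cudaMemcpy slowdown. So part of the extra cudaMemcpy time may simply be the call waiting for slower kernels to finish, but that is only a guess from the profile and not something I have confirmed.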
I thought this might be caused by the initialization phase, so I used the Visual Profiler to check the cost of each loop.
1 camera streaming and detecting:
4 cameras streaming and only 1 detecting:
From the results we can see that the cost of cudaMemcpy in each loop increases from 33 ms to 53 ms.
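The per-loop figures can be cross-checked against the nvprof totals. The Python sketch below divides the total cudaMemcpy time by the number of loops. It assumes one detection loop per cudaStreamSynchronize call (3747 calls in the 1-camera run, 2695 in the 4-camera run); that mapping is my assumption, and the helper names `per_loop_ms` and `calls_per_loop` are hypothetical.

```python
# Totals copied from the nvprof "API calls" tables above.
# Assumption: one detection loop per cudaStreamSynchronize call
# (3747 calls in the 1-camera run, 2695 calls in the 4-camera run).
runs = {
    "1 camera": {"memcpy_total_s": 124.254, "memcpy_calls": 11452, "loops": 3747},
    "4 cameras": {"memcpy_total_s": 149.874, "memcpy_calls": 8296, "loops": 2695},
}


def per_loop_ms(total_s, loops):
    """Total API time in seconds spread over the loops, in milliseconds."""
    return total_s / loops * 1000.0


def calls_per_loop(calls, loops):
    """Average number of API calls issued in one loop."""
    return calls / loops


for name, run in runs.items():
    cost = per_loop_ms(run["memcpy_total_s"], run["loops"])
    count = calls_per_loop(run["memcpy_calls"], run["loops"])
    print(f"{name}: {cost:.1f} ms of cudaMemcpy per loop, {count:.2f} calls per loop")
```

Under that assumption the totals give about 33 ms per loop for 1 camera and about 56 ms per loop for 4 cameras, with roughly 3 cudaMemcpy calls per loop in both runs. This is close to what the Visual Profiler shows, so the increase does not look like an effect of the initialization phase alone.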
My question is: why do multiple streaming cameras make the CUDA API calls slower?