Speed of SSD on TX1.

Hi,

In various posts, AastaLL has benchmarked the speed of SSD on the TX2 at around 8-9 frames per second. Could you benchmark SSD on the Jetson TX1 for our reference? I get only 5 frames per second on the TX1.

Thank you.

Hi,

1. Please remember to maximize the CPU/GPU clock via this command:

sudo ~/jetson_clocks.sh

2. The 8-9 fps result is from ssd_pascal_video.py.
If you are using the ssd_pascal_webcam.py script, around 5 fps is expected due to camera overhead.

Thanks.

Hi,

Is that after you build a TensorRT engine, or just from running ~/jetson_clocks.sh and then the Caffe script?

Thank you

Run “~ubuntu/jetson_clocks.sh” with an option such as “--show” to see what it can do. Normally, running it sets the clocks to maximum, and they revert on reboot. You can use “--store <file>” to memorize a particular setting and then “--restore <file>” to put the Jetson back at that setting, so you might run this right before you start testing anything. It is a general clock-boost utility and isn’t specific to TensorRT or Caffe (if the cooling fan wasn’t on before, it probably will be after).

Don’t forget to use sudo.

Hi,

Before running jetson_clocks.sh I got 3.2 fps, and after running ./jetson_clocks.sh the speed stayed the same at 3.2 fps, although the fan started spinning and two additional cores were activated. To explain my process briefly:

  1. I load the image using OpenCV. It loads at 35 fps on average.
  2. I pass the image to a Caffe net object and call its Forward() method for inference.
  3. The above two steps run in a while loop in main.cpp.
  4. The whole pipeline is in C++, unlike the ssd_pascal_video.py script used for benchmarking (see the sketch below).
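
A minimal, simplified sketch of this loop (the model file names and the preprocessing step are placeholders, not my exact code):

#include <caffe/caffe.hpp>
#include <opencv2/opencv.hpp>

int main() {
  caffe::Caffe::set_mode(caffe::Caffe::GPU);  // run the net on the GPU
  caffe::Net<float> net("deploy.prototxt", caffe::TEST);  // placeholder paths
  net.CopyTrainedLayersFrom("ssd.caffemodel");

  cv::VideoCapture cap(0);    // step 1: load images with OpenCV
  cv::Mat frame;
  while (cap.read(frame)) {   // step 3: while loop in main.cpp
    // step 2: resize, mean-subtract, and copy `frame` into the
    // net's input blob here, then run the forward pass
    net.Forward();
  }
  return 0;
}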

What might be the reason for no improvement in performance? Is it because the pipeline runs on a single thread and Caffe is not able to spawn the required number of threads?

Thank you

I have insufficient knowledge to give you a good answer. Usually, though, this works out to how the data is arranged. CUDA is highly dependent on being able to feed a large number of kernels simultaneously, not unlike creating one thread per CPU core: having the right number of threads maximizes compute power and depends on the core count, except that here you are doing it with CUDA cores instead. Someone else may be able to help if you describe things like image size and layout.

Hi,

Have you set Caffe to GPU mode?
I suspect there are some memcpy operations (host/device copies) in your workflow.

To avoid this, it’s recommended to read the camera into a GPU-accessible buffer and feed it into Caffe’s GPU buffer directly:

Caffe::set_mode(Caffe::GPU);  // run the net on the GPU
...
// Write frames straight into the input blob's GPU memory so they
// don't take an extra round trip through host memory:
Blob<float>* input_layer = classifier->net_->input_blobs()[0];
float* input = input_layer->mutable_gpu_data();
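
For example, here is a rough sketch of feeding the net from a buffer that already lives in GPU memory. The helper function and the assumption that the image is already preprocessed into a planar CHW float layout are illustrative, not from your code:

#include <caffe/caffe.hpp>
#include <cuda_runtime.h>

// Copy a preprocessed CHW float image that is already resident on the
// GPU straight into the net's input blob, then run inference.
// No host round trip occurs.
void ForwardFromGpuBuffer(caffe::Net<float>& net, const float* gpu_image) {
  caffe::Blob<float>* input_layer = net.input_blobs()[0];
  float* input = input_layer->mutable_gpu_data();
  cudaMemcpy(input, gpu_image,
             input_layer->count() * sizeof(float),
             cudaMemcpyDeviceToDevice);
  net.Forward();
}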

Thanks.