Latency increases proportionally with batch size

Hi all,

I'm running into the following issue: increasing the batch size leads to a proportional increase in latency.

I’m using TRT 5.1.5.0, C++ API, and converted the network from UFF.

Inference times:
Batch size 1: 12.7ms
Batch size 2: 25.2ms
Batch size 3: 37.5ms

However, the SDK documentation implies that increasing the batch size should not have a large impact on latency. The documentation states: "Often the time taken to compute results for batch size N=1 is almost identical to batch sizes up to N=16 or N=32." (https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#batching)

Is the documentation wrong or am I missing something?
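
For reference, below is a simplified sketch of how I take these measurements (function and buffer names are placeholders, the device bindings are pre-allocated for the maximum batch size, and error handling is stripped out):

```cpp
// Simplified sketch of my timing loop (TRT 5.x, implicit-batch engine built from UFF).
// The engine is built elsewhere with builder->setMaxBatchSize(...) covering the
// batch sizes above; deviceBindings point to pre-allocated GPU buffers.
#include <chrono>
#include <iostream>
#include <NvInfer.h>

double timeInference(nvinfer1::IExecutionContext& context,
                     void** deviceBindings, int batchSize, int iterations = 100)
{
    // Warm-up so one-time initialization cost is not measured.
    for (int i = 0; i < 10; ++i)
        context.execute(batchSize, deviceBindings);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        context.execute(batchSize, deviceBindings);  // synchronous execution
    auto end = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(end - start).count() / iterations;
    std::cout << "batch " << batchSize << ": " << ms << " ms per inference\n";
    return ms;
}
```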

Same issue here. I'm using TRT 7.2.3.4 with a model converted via torch2trt, running on Triton Inference Server 21.05-py3-sdk. These are the results from perf_analyzer:

batch size 1: 49.4 infer/sec, latency 20488 usec
batch size 2: 49.2 infer/sec, latency 41419 usec
batch size 3: 48.6 infer/sec, latency 62209 usec

Interestingly, infer/sec stays about the same while latency grows with batch size.

The model is batch-size agnostic when run as a TRTModule via torch2trt. I set max_workspace_size to 8 GB, which didn't help.
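
In case it's useful, here is a small standalone check (a sketch; the plan path is a placeholder) that prints what the serialized engine in the Triton model repository reports: whether it has an implicit batch dimension, its max batch size, and the binding dimensions:

```cpp
// Sketch: inspect the batch handling of a serialized TensorRT engine (TRT 7.x).
// The path below is a placeholder for the plan file in the Triton model repository.
#include <fstream>
#include <iostream>
#include <vector>
#include <NvInfer.h>

class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // Read the serialized plan into memory.
    std::ifstream file("model_repository/mymodel/1/model.plan", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    auto* runtime = nvinfer1::createInferRuntime(gLogger);
    auto* engine  = runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);

    std::cout << "implicit batch dimension: " << engine->hasImplicitBatchDimension() << "\n"
              << "max batch size: " << engine->getMaxBatchSize() << "\n";

    // Print the dimensions of every binding (inputs and outputs).
    for (int i = 0; i < engine->getNbBindings(); ++i)
    {
        auto dims = engine->getBindingDimensions(i);
        std::cout << engine->getBindingName(i) << ":";
        for (int j = 0; j < dims.nbDims; ++j) std::cout << " " << dims.d[j];
        std::cout << "\n";
    }

    engine->destroy();
    runtime->destroy();
    return 0;
}
```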

Hi,
Could you please share the model, script, profiler output, and performance numbers, if not shared already, so that we can help you better?
Alternatively, you can try running your model with the trtexec command:
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

While measuring model performance, make sure you consider only the latency and throughput of the network inference, excluding the data pre- and post-processing overhead (see the sketch after the links below).
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy
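
As an illustration only (not your exact code; the bindings and CUDA stream are assumed to be set up already), the idea is to time just the enqueued network execution with CUDA events and keep host-side pre-/post-processing and host-device copies outside the measured interval:

```cpp
// Illustrative only: measure GPU execution time of the network alone,
// excluding pre/post-processing and host<->device copies.
#include <cuda_runtime_api.h>
#include <NvInfer.h>

float timedEnqueue(nvinfer1::IExecutionContext& context, void** bindings,
                   cudaStream_t stream, int iterations = 100)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // (Input copies / pre-processing would happen before this point.)
    cudaEventRecord(start, stream);
    for (int i = 0; i < iterations; ++i)
        context.enqueueV2(bindings, stream, nullptr);  // explicit-batch API (TRT 7)
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    // (Output copies / post-processing would happen after this point.)

    float totalMs = 0.f;
    cudaEventElapsedTime(&totalMs, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return totalMs / iterations;  // average ms per inference
}
```

trtexec and perf_analyzer already report these numbers for you; the snippet is only meant to show which part of the pipeline the documentation's latency figures refer to.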

Thanks!