TF-TRT 5: How to run TensorFlow-TensorRT inference with multiple GPUs

Hi all, I am using the TF-TRT 5 integration via the container image nvcr.io/nvidia/tensorflow:18.09-py3.

I want to run the Python sample inference.py with multiple GPUs. I have made the modification below, but it is not working.

Modified code

# Attempt to replicate model_fn across the available GPUs via replicate_model_fn.
estimator = tf.estimator.Estimator(
    model_fn=tf.contrib.estimator.replicate_model_fn(model_fn),
    config=tf.estimator.RunConfig(session_config=tf_config),
    params=dict(batch_size=batch_size))

Could you please provide a sample that implements TensorFlow-TensorRT inference with multiple GPUs?

Thanks,

Vilmara

Hello,

Would the TRT Inference Server be appropriate?

Hi, thanks for your prompt reply.

I have explored the TRT Inference Server container, but it doesn't show the code where the Inference Server distributes inferencing across all of the system's GPUs.

Any other suggestions? Thanks

Hello,

Can you describe your use case?

TRTIS (TensorRT Inference Server) supports multiple GPUs but it does not support running a single inference distributed across multiple GPUs. TRTIS can run multiple models (and/or multiple instances of the same model) on multiple GPUs to increase throughput.

Hi,

I am using the TF-TRT 5 integration via the container image nvcr.io/nvidia/tensorflow:18.09-py3 to run the sample inference.py, which reports inference performance (throughput and latency) using TF-TRT. I realized the script only uses one GPU for the computation. I need an example of running a single inference distributed across multiple GPUs to maximize performance and system capabilities (with 3 GPUs).

Here is the environment I am using:
container image: nvcr.io/nvidia/tensorflow:18.09-py3
top-level directory: /workspace/nvidia-examples/tftrt/scripts
GPU model and memory: 3x Tesla P4 - 7GB
Command to reproduce: python3 inference.py --model resnet_v1_50 --batch_size 1 --use_trt --precision fp32

Thanks

Hello,

Can you elaborate on “to maximize the performance and system capabilities”? Is it because your model takes more than 7GB and can't fit in one P4?

FYI, if your goal is to have the TRT engine do GPU-to-GPU communication, that is not possible, at least for now.

If you want to run the same graph on multiple GPUs to increase throughput, then you'll need native TensorFlow.
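For example, here is a minimal sketch (not an official sample) of one way to do that: launch one copy of inference.py per GPU, pinning each process to a single device via CUDA_VISIBLE_DEVICES. The flags are the ones from your command above; num_gpus is an assumption for your 3x P4 box.

import os
import subprocess

num_gpus = 3  # assumption: one worker process per Tesla P4
procs = []
for gpu_id in range(num_gpus):
    env = os.environ.copy()
    # Pin this worker to a single GPU; each process builds its own TF-TRT engine.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    procs.append(subprocess.Popen(
        ["python3", "inference.py",
         "--model", "resnet_v1_50",
         "--batch_size", "1",
         "--use_trt",
         "--precision", "fp32"],
        env=env))

for p in procs:
    p.wait()

Each process handles its own share of requests, so aggregate throughput scales with the number of GPUs, but the latency of any single inference does not improve.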

Right, I want to test the server at its maximum capability in terms of GPU utilization. For instance, when I ran the test with batch size 128, it hit 100% GPU utilization, so with batch sizes over 128 the script should use the rest of the GPUs available in the system and increase throughput.

Got it. Is there sample code that shows how TRTIS runs multiple models (and/or multiple instances of the same model) on multiple GPUs to increase throughput? I would like to add this functionality to my code. Thanks

Hello,

The open-source repo of client code and examples, https://github.com/NVIDIA/dl-inference-server, has everything you need, I believe. The instructions there tell you how to build perf_client, and there is a sample model that configures 4 model instances. The instructions also explain how to use the example model store. The actual config file is https://github.com/NVIDIA/dl-inference-server/blob/18.09/examples/models/resnet50_netdef/config.pbtxt; you can modify it to use a different number of model instances.
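As an illustration, the part of the model configuration that controls the instance count looks roughly like the snippet below. The field names follow the TRTIS model-configuration schema; check the config.pbtxt linked above for the exact syntax used in the 18.09 example.

instance_group [
  {
    count: 4  # assumption for illustration: load 4 instances of the model
  }
]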

The current user guide doesn't have the section you would be interested in, but the 18.10 user guide will. In the meantime you could look at this blog: https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/; the Performance section discusses model instances.

Hello,

Currently I am using TensorRT for model inference on one GPU and would like multi-GPU utilization. Is it not possible to run one model instance on multiple GPUs? For example, instead of spending 15 ms on one frame using one GPU, is it possible to use two GPUs to get below 15 ms per frame?

In my case, the batch_size is 1. If not, what is the best way of utilizing multiple GPUs for a single real-time video stream to reduce the inference time per frame?

Thanks.