context->setBindingDimensions causing GPU memory leak

context->setBindingDimensions

Would cause a GPU memory leak.

@NVES_R

for(int i=0; i < 1000000; ++i) {
    context->setBindingDimensions(inpIndex, inp_dims);  // same input shape every iteration
    context->enqueueV2(buffers, stream, nullptr);
}

Would cause a GPU memory leak.


Hi,

Can you provide the full .cpp script for this small example?

Thanks,
NVIDIA Enterprise Support

You can just use any *.engine file with a dynamic or fixed input size and run the following code; you will see that the GPU memory usage keeps rising.

class MyArray {
    public:
        int inp_size{0};          // number of input elements
        int out_size{0};          // number of output elements
        int inp_bytes{0};         // input size in bytes
        int out_bytes{0};         // output size in bytes
        int time_step{0};
        Dims inp_dims{};
        Dims out_dims{};
        vector<float> inp;        // host-side input buffer
        vector<float> out;        // host-side output buffer

        // Size the host buffers for either the dynamic maximum or the fixed shape.
        void set_is_dynamic(bool is_dynamic) {
            if(is_dynamic) {
                inp = vector<float>(MAX_INP_SIZE);
                out = vector<float>(MAX_OUT_SIZE);
            }
            else {
                inp = vector<float>(FIX_INP_SIZE);
                out = vector<float>(FIX_OUT_SIZE);
            }
        }

        // Record the input dims and recompute element and byte counts.
        void setInpDims(const Dims &dims) {
            inp_size = 1;
            for(int i=0; i < dims.nbDims; ++i) {
                inp_size *= dims.d[i];
            }
            inp_dims = dims;
            inp_bytes = inp_size * sizeof(float);
        }

        // Record the output dims and recompute element and byte counts.
        void setOutDims(const Dims &dims) {
            out_size = 1;
            for(int i=0; i < dims.nbDims; ++i) {
                out_size *= dims.d[i];
            }
            out_dims = dims;
            out_bytes = out_size * sizeof(float);
        }
};

////////////////////////////////////////////////////////////////

class TrtHelper {
    public:
        TrtHelper(string engine_path, bool is_dynamic) {
            runtime = createInferRuntime(gLogger);
            assert(runtime != nullptr);

            // Read the serialized engine file into host memory and deserialize it.
            ifstream cache(engine_path, std::ios::binary);
            stringstream ss;
            ss << cache.rdbuf();
            cache.close();
            ss.seekg(0, std::ios::end);
            const size_t size = ss.tellg();
            ss.seekg(0, std::ios::beg);
            void* memory = malloc(size);
            ss.read((char*)memory, size);
            engine = runtime->deserializeCudaEngine(memory, size, nullptr);
            free(memory);
            assert(engine != nullptr);

            context = engine->createExecutionContext();
            assert(context != nullptr);
            if(is_dynamic) { context->setOptimizationProfile(0); }

            CHECK(cudaStreamCreate(&stream));
            cout << "Loaded " << engine_path << " ..." << endl;

            assert(engine->getNbBindings() == 2);
            inpIndex = engine->getBindingIndex(INP_NAME.c_str());
            outIndex = engine->getBindingIndex(OUT_NAME.c_str());

            auto sf = sizeof(float);
            if(is_dynamic) {
                CHECK(cudaMalloc(&buffers[inpIndex], MAX_INP_SIZE*sf));
                CHECK(cudaMalloc(&buffers[outIndex], MAX_OUT_SIZE*sf));
            }
            else {
                CHECK(cudaMalloc(&buffers[inpIndex], FIX_INP_SIZE*sf));
                CHECK(cudaMalloc(&buffers[outIndex], FIX_OUT_SIZE*sf));
            }

            cout << "Created cuda buffers ..." << endl;
        }

        ~TrtHelper() {
            if(context) { context->destroy(); }
            if(engine) { engine->destroy(); }
            if(runtime) { runtime->destroy(); }
            if(stream) { cudaStreamDestroy(stream); }

            // Free the device buffers with the same indices they were allocated with.
            if(buffers[inpIndex]) { CHECK(cudaFree(buffers[inpIndex])); buffers[inpIndex] = nullptr; }
            if(buffers[outIndex]) { CHECK(cudaFree(buffers[outIndex])); buffers[outIndex] = nullptr; }
        }

        void inference(MyArray &myArray) {
            context->setBindingDimensions(inpIndex, myArray.inp_dims);
            myArray.setOutDims(context->getBindingDimensions(outIndex));

            CHECK(cudaMemcpyAsync(buffers[inpIndex], myArray.inp.data(),
                        myArray.inp_bytes, cudaMemcpyHostToDevice, stream));
cout << "Memory debug beg ..." << endl;
for(int i=0; i < 1000000000; ++i) {
context->setBindingDimensions(inpIndex, myArray.inp_dims);
            context->enqueueV2(buffers, stream, nullptr);
cudaStreamSynchronize(stream);
}
cout << "Memory debug end ..." << endl;
            CHECK(cudaMemcpyAsync(myArray.out.data(), buffers[outIndex],
                        myArray.out_bytes, cudaMemcpyDeviceToHost, stream));
            cudaStreamSynchronize(stream);
        }

    private:
        int inpIndex{0};
        int outIndex{0};
        void* buffers[2]{};

        cudaStream_t stream{nullptr};
        IRuntime *runtime{nullptr};
        ICudaEngine *engine{nullptr};
        IExecutionContext *context{nullptr};
};
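As a side note on measuring: a small helper (an addition for illustration, not part of the original code) that polls cudaMemGetInfo can make the growth visible from inside the process; calling it every few thousand iterations of the debug loop in inference() shows the free device memory shrinking.

// Helper for watching device memory from inside the process (not part of the
// original example). cudaMemGetInfo reports the free and total bytes on the
// current device.
void printFreeDeviceMemory(const char* tag) {
    size_t freeBytes = 0, totalBytes = 0;
    CHECK(cudaMemGetInfo(&freeBytes, &totalBytes));
    cout << tag << ": free " << (freeBytes >> 20) << " MiB / "
         << (totalBytes >> 20) << " MiB total" << endl;
}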

        ImgHelper img(BLOCK_W, max_inp_w, FIX_INP_H, PAD_VALUE);
        for(auto fip: load_files(input_path)) {
            // cout << fip << endl;
            if(!img.load_img(fip, myArray, is_dynamic)) {
                continue;
            }

            trt.inference(myArray);
            string text = ctc.greedy_ctc(myArray);
            cout << "The text of file '" << fip << "' is: " << text << endl;
        }

I could not provide the whole code, but I think the code above is enough as an example. You can simply treat img.load_img(fip, myArray, is_dynamic) as filling myArray.inp with image data.
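If it helps to make the example self-contained, here is a minimal sketch of a stand-in for img.load_img that just fills myArray.inp with dummy values and records an input shape. The NCHW layout, the hard-coded widths, and the 0.5f fill value are assumptions for illustration only, not the real preprocessing.

// Hypothetical stand-in for ImgHelper::load_img (illustration only).
// Fills myArray.inp with a constant value and records an assumed NCHW shape;
// the widths below are made up and only need to stay within the allocated buffer.
bool fake_load_img(MyArray &myArray, bool is_dynamic) {
    int width = is_dynamic ? 512 : 128;        // assumed input widths
    Dims4 dims{1, 1, FIX_INP_H, width};        // assumed NCHW input shape
    myArray.setInpDims(dims);
    fill(myArray.inp.begin(), myArray.inp.begin() + myArray.inp_size, 0.5f);
    return true;
}

With that stand-in, every iteration still goes through setBindingDimensions followed by enqueueV2, which is the combination that shows the growth.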

Case 1:

for(int i=0; i < 1000000; ++i) {
    context->setBindingDimensions(inpIndex, myArray.inp_dims);
    context->enqueueV2(buffers, stream, nullptr);
}

Case 2:

for(int i=0; i < 1000000; ++i) {
    context->enqueueV2(buffers, stream, nullptr);
}

Case 3:

for(int i=0; i < 1000000; ++i) {
    context->setBindingDimensions(inpIndex, myArray.inp_dims);
}

Only case 1 causes GPU memory usage to rise; cases 2 and 3 are fine.
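Until the fix is available, one possible mitigation that follows from the three cases above (not an official workaround, just an idea based on the observation) is to skip setBindingDimensions whenever the input shape has not actually changed, so the leaking combination of case 1 only runs when the shape really changes. A minimal sketch, reusing context, buffers, stream, inpIndex, and myArray from the code above:

// Possible mitigation sketch (assumption, not NVIDIA's fix): only call
// setBindingDimensions when the input shape actually changes, so the
// set-dims + enqueue combination of case 1 is not executed on every iteration.
auto dimsEqual = [](const Dims &a, const Dims &b) {
    if(a.nbDims != b.nbDims) { return false; }
    for(int i=0; i < a.nbDims; ++i) {
        if(a.d[i] != b.d[i]) { return false; }
    }
    return true;
};

Dims lastDims{};  // nbDims == 0, so the first comparison always fails

for(int i=0; i < 1000000; ++i) {
    if(!dimsEqual(lastDims, myArray.inp_dims)) {
        context->setBindingDimensions(inpIndex, myArray.inp_dims);
        lastDims = myArray.inp_dims;
    }
    context->enqueueV2(buffers, stream, nullptr);
}
cudaStreamSynchronize(stream);

This only helps workloads where the shape is often unchanged between calls; when every call uses a new shape, it degenerates back to case 1.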

Just got an answer: they caught this in a similar bug report and have fixed it for the next release. Thanks for pointing this out.

Is the fixed version released yet?

Hi yfjiaren,

Not yet. Sorry, but I can't share a timeline.

The GPU memory leak is fixed in TensorRT 7, but the slow performance is the same as before.

Hi yfjiaren,

TRT7 has been released and should fix the memory leak issue.

I found that TensorRT 7 also has this problem.

I use the Python API, and every time I call context.set_binding_shape and context.execute_async_v2, the GPU memory grows until it runs out of memory.

But it did not happen with another, simpler network of mine.

The GPU memory leak is fixed in TensorRT 7, but the slow performance is the same as before.
I am using a P100.

@NVES_R, @yfjiaren, @nilshinn

Indeed, heavy ResNet-like networks built with the TRT Python API do leak GPU memory each time I call context.set_binding_shape and context.execute_async_v2, so TensorRT 7.0 still has this problem. The environment is as follows:

  • Tesla P4
  • CUDA 10.0.130
  • TensorRT 7.0.0.11

Update

The Tesla P4 leaks memory; the RTX 2080 Ti does not.

CUDA 10.0.130
TensorRT 7.0.0.11
GTX 1060
RetinaNet-ResNet50

Update:
The V100 does not leak, but it is slow (same speed as PyTorch).

Thanks for the updates everyone, looking into this.

This issue has been fixed upstream and should be included in the next release.

Could you release a hotfix? This is blocking production code for us. Thank you


@NVES_R
When will this fix be released? Can a TRT 7 hotfix be released prior to TRT 8?

We are a big customer that is unable to upgrade from an older TensorRT in released products, because this issue causes application instability.


@NVES_R I would love to see that fix released very soon too; we're having this problem and it's blocking us from upgrading to TRT 7.
