NPP row and column filters

Hi everyone!

I am developing a code that has three CPU threads executing the same code:

nppSetStream(stream);
const Npp32s nStep = sizeof(float)*L_pitch;
    NppStatus sp;
    {
        NppiSize szROI = {m_w-nMaskSize, m_h};
        int offset = r;
        sp = nppiFilterRow_32f_C1R(inputImage+offset, nStep, outputImage+offset , nStep, szROI, d_Kernel, nMaskSize, r);
    }
    if (sp != NPP_NO_ERROR)  return;

Of course, each thread has its own variables (stream, nMaskSize…). When the three threads have the same nMaskSize and the same pKernel, they all produce a correct filtered image. However, when at least one thread has a different nMaskSize value and a different pKernel, output_Image is sometimes correct, but other times it is a darker version of the expected output. The result is a “flashing” output video stream.

The most probable hypothesis is that different nppiFilterRow_32f_C1R method calls from different CPU threads and executed on different GPU streams, use the same GPU memory positions to store the filtering kernel and the update process of these memory position is not correctly synchronized between calls. As I have not access to the code, this is just an hypothesis. It would be great is someone with a deeper knowledge of NPP could confirm if this hypothesis is true and if there is any solution to this issue. Thanks in advance!

npp from CUDA 6.5 or newer should be thread safe if you have met some conditions.
You may wish to review pages 1 and 2 (pdf pages 181, 182) of the current documentation:

[url]http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf[/url]

If npp is properly thread safe, and you have structured your code to meet the requirements, your hypothesis as to explanation should not be correct. Defects are always possible, of course.

If you want to provide a short, complete code that someone else could copy, compile, and run directly, and see the issue, it may help.

Hi again. Thanks for your answer txbob. In fact, my previous explanation was not completely correct. After having discussed with other developers in my team, there is only one CPU thread, which is continuously executing image filtering on three different video feeds using three GPU streams.

I have been able to reproduce the problem with a piece of code that reads one image from my computer, and performs filtering of this image 100 times on 3 different GPU streams. Each stream uses a different filter kernel. If you take a look at the output images produced by my code, they are not always similar images despite the fact that the input image is always the same one.

This issue is fixed by setting the same value for radiusFilter1, radiusFilter2 and radiusFilter3. Moreover, the larger the difference between filters radius used, the bigger the intensity difference between output images. This is what makes me conclude that there is a synchronization issue when updating the NPP internal filter kernel between calls with different kernels. In particular, my hypothesis is that sometimes, stream2 is filtering using dKernel1 and radiusFilter2 or dKernel2 and radiusFilter1, instead of the expected dKernel2 and radiusFilter2. This would explain the intensity changes, as the sum of weights of these two combinations is not 1.

This is the main.cpp file that I wrote to reproduce the issue (you should only have to change the input image path and the output images path):

#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/stream_accessor.hpp>
#include <iostream>

#include <cstdio>

#include <cuda_runtime.h>

#include <vector_types.h>
#include <driver_types.h>

#include <npp.h>

float* buildGaussKernel(double radius) {
    int r = (int)floor(radius + 0.5);
    int l = 2 * r + 1;
    float* h = new float[l];
    double sigma = radius;
    double sumh = 0;

    for (int i = 0; i < l; ++i) {
        double x = i - r;
        h[i] = (float)(exp (-(x * x) / (2 * sigma * sigma)));
        sumh += h[i];
    }

    for (int i = 0; i < l; ++i) h[i]  = float(h[i] /sumh);

    return h;
}

void filterImageGauss(cv::gpu::GpuMat* d_rgba, cv::gpu::GpuMat* d_dest_rgba, float* dKernel, int radius, cudaStream_t stream) {
    nppSetStream(stream);

    const Npp32s nStep = static_cast<Npp32s>(d_rgba->step);
    NppStatus sp;
    {
        NppiSize szROI = {d_rgba->size().width-(2*radius+1), d_rgba->size().height};
        int offset = radius;
        sp = nppiFilterRow_32f_C1R((float*)d_rgba->data+offset, nStep, (float*)d_dest_rgba->data+offset , nStep, szROI, dKernel, 2*radius+1, radius);
    }
    if (sp != NPP_NO_ERROR)  return;
}


int main( int argc, char** argv )
{
    cv::Mat hImageInput;
    hImageInput = cv::imread("F:\\testImage.bmp", CV_LOAD_IMAGE_GRAYSCALE);   // Read the file

    if(!hImageInput.data) // Check for invalid input
    {
        std::cout <<  "Could not open or find the image" << std::endl ;
        return -1;
    }

    cv::Mat hImage;
    hImageInput.convertTo(hImage, CV_32FC1);

    const int radiusFilter1 = 6;
    const int radiusFilter2 = 2;
    const int radiusFilter3 = 6;
    
    float* hKernel1 = buildGaussKernel(radiusFilter1);
    float* hKernel2 = buildGaussKernel(radiusFilter2);
    float* hKernel3 = buildGaussKernel(radiusFilter3);

    float* dKernel1;
    float* dKernel2;
    float* dKernel3;

    cudaMalloc(&dKernel1, sizeof(Npp32f)*(2*radiusFilter1+1));
    cudaMalloc(&dKernel2, sizeof(Npp32f)*(2*radiusFilter2+1));
    cudaMalloc(&dKernel3, sizeof(Npp32f)*(2*radiusFilter3+1));
    
    cudaMemcpy(dKernel1, hKernel1, sizeof(float)*(2*radiusFilter1+1), cudaMemcpyHostToDevice);
    cudaMemcpy(dKernel2, hKernel2, sizeof(float)*(2*radiusFilter2+1), cudaMemcpyHostToDevice);
    cudaMemcpy(dKernel3, hKernel3, sizeof(float)*(2*radiusFilter3+1), cudaMemcpyHostToDevice);

    cv::gpu::GpuMat dImage1;
    cv::gpu::GpuMat dImage2;
    cv::gpu::GpuMat dImage3;

    dImage1.upload(hImage);
    dImage2.upload(hImage);
    dImage3.upload(hImage);

    cv::gpu::Stream stream1;
    cv::gpu::Stream stream2;
    cv::gpu::Stream stream3;

    cv::gpu::GpuMat dFilteredImage1(dImage1.size().height, dImage1.size().width, CV_32FC1);
    cv::gpu::GpuMat dFilteredImage2(dImage2.size().height, dImage2.size().width, CV_32FC1);
    cv::gpu::GpuMat dFilteredImage3(dImage3.size().height, dImage3.size().width, CV_32FC1);

    cv::gpu::GpuMat dFilteredImageUchar1;
    cv::gpu::GpuMat dFilteredImageUchar2;
    cv::gpu::GpuMat dFilteredImageUchar3;

    cv::Mat hFilteredImage1[100];
    cv::Mat hFilteredImage2[100];
    cv::Mat hFilteredImage3[100];

    for (int i = 0; i<100; i++) { 
        hFilteredImage1[i].create(dImage1.size().height, dImage1.size().width, CV_8UC1);
        hFilteredImage2[i].create(dImage1.size().height, dImage1.size().width, CV_8UC1);
        hFilteredImage3[i].create(dImage1.size().height, dImage1.size().width, CV_8UC1);
    }

    cv::gpu::GpuMat diffImage1;
    cv::gpu::GpuMat diffImage2;
    cv::gpu::GpuMat diffImage3;

    for (int i = 0; i<100; i++) { 
        filterImageGauss(&dImage1, &dFilteredImage1, dKernel1, radiusFilter1, cv::gpu::StreamAccessor::getStream(stream1));
        filterImageGauss(&dImage2, &dFilteredImage2, dKernel2, radiusFilter2, cv::gpu::StreamAccessor::getStream(stream2));
        filterImageGauss(&dImage3, &dFilteredImage3, dKernel3, radiusFilter3, cv::gpu::StreamAccessor::getStream(stream3));

        cv::gpu::absdiff(dImage1, dFilteredImage1, diffImage1, stream1);
        cv::gpu::absdiff(dImage2, dFilteredImage2, diffImage2, stream2);
        cv::gpu::absdiff(dImage3, dFilteredImage3, diffImage3, stream3);
        
        stream1.enqueueConvert(dFilteredImage1, dFilteredImageUchar1, CV_8UC1);
        stream2.enqueueConvert(dFilteredImage2, dFilteredImageUchar2, CV_8UC1);
        stream3.enqueueConvert(dFilteredImage3, dFilteredImageUchar3, CV_8UC1);

        stream1.enqueueDownload(dFilteredImageUchar1, hFilteredImage1[i]);
        stream2.enqueueDownload(dFilteredImageUchar2, hFilteredImage2[i]);
        stream3.enqueueDownload(dFilteredImageUchar3, hFilteredImage3[i]);
    }

    stream1.waitForCompletion();
    stream2.waitForCompletion();
    stream3.waitForCompletion();

    for (int i = 0; i<100-1; i++) {
        char num[350]="F:\\test\\image";  
        char str[10];      
        char pmp[10]=".bmp";    
        itoa(i, str, 10);
        strcat(num,str);  
        strcat(num,pmp);

        cv::imwrite(num, hFilteredImage2[i]);
    }

    return 0;
}

I also include the CMakeLists.txt file that I used for this test in case it is useful for anyone trying to reproduce the issue:

cmake_minimum_required(VERSION 2.8)

SET( THIS_PROJECT_NAME "nppFilterTest")
project( ${THIS_PROJECT_NAME} )

#########################
# OpenCV
#########################
FIND_PACKAGE( OpenCV REQUIRED )

INCLUDE_DIRECTORIES( ${OpenCV_INCLUDE_DIRS})

#########################
# CUDA
#########################
set(CUDA_TOOLKIT_ROOT_DIR "$ENV{CUDA_PATH_V7_0}") #if we have CUDA installed CUDA_PATH_Vx is defined
find_package(CUDA REQUIRED)

option(CUDA_MAXWELL_COMPATIBILITY "Build engine with maxwell support (CC 5.2)" OFF)
if(CUDA_MAXWELL_COMPATIBILITY)
   set(CUDA_NVCC_FLAGS  -gencode=arch=compute_30,code=sm_30  -gencode=arch=compute_52,code=sm_52  -gencode=arch=compute_52,code=compute_52)
else()
   set(CUDA_NVCC_FLAGS  -gencode=arch=compute_30,code=sm_30)
endif()

set(CUDA_64_BIT_DEVICE_CODE ON)
include_directories("${CUDA_INCLUDE_DIRS}"
                    "$ENV{NVTOOLSEXT_PATH}/include")
link_directories("$ENV{NVTOOLSEXT_PATH}/lib/x64")
set(LIBS ${LIBS}  "${CUDA_CUDA_LIBRARY}"
                  "${CUDA_LIBRARIES}"
                  "${CUDA_nppc_LIBRARY}"
                  "${CUDA_nppi_LIBRARY}"
                  nvToolsExt64_1)

SET (SOURCE_FILES
main.cpp
)

SET (HEADER_FILES
)

ADD_EXECUTABLE( ${THIS_PROJECT_NAME}  ${SOURCE_FILES}  ${HEADER_FILES} )
TARGET_LINK_LIBRARIES( ${THIS_PROJECT_NAME}  ${OpenCV_LIBS} ${LIBS} )

If you can provide an example that doesn’t depend on OpenCV I can take a look.

Hi again txbob!

Thanks for your interest. Here you have a version of the code without OpenCV dependencies. This is the main.cpp file:

#include <iostream>

#include "lodepng.h"

#include <cstdio>

#include <cuda_runtime.h>

#include <vector_types.h>
#include <driver_types.h>

#include <npp.h>

float* buildGaussKernel(double radius) {
    int r = (int)floor(radius + 0.5);
    int l = 2 * r + 1;
    float* h = new float[l];
    double sigma = radius;
    double sumh = 0;

    for (int i = 0; i < l; ++i) {
        double x = i - r;
        h[i] = (float)(exp (-(x * x) / (2 * sigma * sigma)));
        sumh += h[i];
    }

    for (int i = 0; i < l; ++i) h[i]  = float(h[i] /sumh);

    return h;
}

void convertGreenUcharToGrayFloat(std::vector<unsigned char> colorImg, float* grayImg, int width, int height) {
    for (int i=0; i<width; i++) {
        for (int j=0; j<height; j++) {
            grayImg[i + j*width] = static_cast<float>(colorImg[4*(i + j*width)+1]);
        }
    }
}

void convertGrayFloatToColorUchar(float* grayImg, unsigned char* colorImg, int width, int height) {
    for (int i=0; i<width; i++) {
        for (int j=0; j<height; j++) {
            colorImg[4*(i + j*width)] = static_cast<unsigned char>(grayImg[i + j*width]);
            colorImg[4*(i + j*width) + 1] = static_cast<unsigned char>(grayImg[i + j*width]);
            colorImg[4*(i + j*width) + 2] = static_cast<unsigned char>(grayImg[i + j*width]);
        }
    }
}

void filterImageGauss(float* d_rgba, float* d_dest_rgba, float* dKernel, int radius, int width, int height, cudaStream_t stream) {
    nppSetStream(stream);

    const Npp32s nStep = static_cast<Npp32s>(width*sizeof(float));
    NppStatus sp;
    {
        NppiSize szROI = {width-(2*radius+1), height};
        int offset = radius;
        sp = nppiFilterRow_32f_C1R(d_rgba+offset, nStep, d_dest_rgba+offset , nStep, szROI, dKernel, 2*radius+1, radius);
    }
    if (sp != NPP_NO_ERROR)  return;
}


int main( int argc, char** argv )
{
    std::vector<unsigned char> in_image;
    lodepng::State state;
    std::vector<unsigned char> input_file;
    unsigned width = 1920;
    unsigned height = 1080;
    lodepng::load_file(input_file, "F:\testImage.png"); //load the image file with given filename
    unsigned error = lodepng::decode(in_image, width, height, state, input_file);

    if(in_image.size() == 0)
    {
        std::cout <<  "Could not open or find the image" << std::endl ;
        return -1;
    }

    float* hImage=0;
    hImage = new float[width*height];
    convertGreenUcharToGrayFloat(in_image, hImage, width, height);
    
    const int radiusFilter1 = 6;
    const int radiusFilter2 = 2;
    const int radiusFilter3 = 6;
    
    float* hKernel1 = buildGaussKernel(radiusFilter1);
    float* hKernel2 = buildGaussKernel(radiusFilter2);
    float* hKernel3 = buildGaussKernel(radiusFilter3);

    float* dKernel1;
    float* dKernel2;
    float* dKernel3;

    cudaMalloc(&dKernel1, sizeof(Npp32f)*(2*radiusFilter1+1));
    cudaMalloc(&dKernel2, sizeof(Npp32f)*(2*radiusFilter2+1));
    cudaMalloc(&dKernel3, sizeof(Npp32f)*(2*radiusFilter3+1));
    
    cudaMemcpy(dKernel1, hKernel1, sizeof(float)*(2*radiusFilter1+1), cudaMemcpyHostToDevice);
    cudaMemcpy(dKernel2, hKernel2, sizeof(float)*(2*radiusFilter2+1), cudaMemcpyHostToDevice);
    cudaMemcpy(dKernel3, hKernel3, sizeof(float)*(2*radiusFilter3+1), cudaMemcpyHostToDevice);

    float* dImage1;
    float* dImage2;
    float* dImage3;

    cudaMalloc(&dImage1, sizeof(float)*(width*height));
    cudaMalloc(&dImage2, sizeof(float)*(width*height));
    cudaMalloc(&dImage3, sizeof(float)*(width*height));

    cudaMemcpy(dImage1, hImage, width*height*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dImage2, hImage, width*height*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dImage3, hImage, width*height*sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t stream1;
    cudaStream_t stream2;
    cudaStream_t stream3;

    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaStreamCreate(&stream3);

    float* dFilteredImage1;
    float* dFilteredImage2;
    float* dFilteredImage3;

    cudaMalloc(&dFilteredImage1, sizeof(float)*(width*height));
    cudaMalloc(&dFilteredImage2, sizeof(float)*(width*height));
    cudaMalloc(&dFilteredImage3, sizeof(float)*(width*height));

    float* hFilteredImage1[100];
    float* hFilteredImage2[100];
    float* hFilteredImage3[100];

    for (int i = 0; i<100; i++) { 
        hFilteredImage1[i] = new float[width*height];
        hFilteredImage2[i] = new float[width*height];
        hFilteredImage3[i] = new float[width*height];
    }

    for (int i = 0; i<100; i++) { 
        filterImageGauss(dImage1, dFilteredImage1, dKernel1, radiusFilter1, width, height, stream1);
        filterImageGauss(dImage2, dFilteredImage2, dKernel2, radiusFilter2, width, height, stream2);
        filterImageGauss(dImage3, dFilteredImage3, dKernel3, radiusFilter3, width, height, stream3);

        cudaMemcpyAsync(hFilteredImage1[i], dFilteredImage1, width*height*sizeof(float), cudaMemcpyDeviceToHost, stream1);
        cudaMemcpyAsync(hFilteredImage2[i], dFilteredImage2, width*height*sizeof(float), cudaMemcpyDeviceToHost, stream2);
        cudaMemcpyAsync(hFilteredImage3[i], dFilteredImage3, width*height*sizeof(float), cudaMemcpyDeviceToHost, stream3);
    }
    cudaDeviceSynchronize();

    unsigned char* hFilteredImageU1 = new unsigned char[4*width*height];
    unsigned char* hFilteredImageU2 = new unsigned char[4*width*height];
    unsigned char* hFilteredImageU3 = new unsigned char[4*width*height];

    for (int i = 0; i<100; i++) { 
        convertGrayFloatToColorUchar(hFilteredImage1[i], hFilteredImageU1, width, height);
        convertGrayFloatToColorUchar(hFilteredImage2[i], hFilteredImageU2, width, height);
        convertGrayFloatToColorUchar(hFilteredImage3[i], hFilteredImageU3, width, height);

        char num[350]="F:\test\image";  
        char str[10];      
        char pmp[10]=".png";    
        itoa(i, str, 10);
        strcat(num,str);  
        strcat(num,pmp);

        error = lodepng::encode(num, hFilteredImageU2, width, height);
    }

    return 0;
}

The CMakeLists.txt file:

cmake_minimum_required(VERSION 2.8)

SET( THIS_PROJECT_NAME "nppFilterTest")
project( ${THIS_PROJECT_NAME} )

#########################
# CUDA
#########################
set(CUDA_TOOLKIT_ROOT_DIR "$ENV{CUDA_PATH_V7_0}") #if we have CUDA installed CUDA_PATH_Vx is defined
find_package(CUDA REQUIRED)

option(CUDA_MAXWELL_COMPATIBILITY "Build engine with maxwell support (CC 5.2)" OFF)
if(CUDA_MAXWELL_COMPATIBILITY)
   set(CUDA_NVCC_FLAGS  -gencode=arch=compute_30,code=sm_30  -gencode=arch=compute_52,code=sm_52  -gencode=arch=compute_52,code=compute_52)
else()
   set(CUDA_NVCC_FLAGS  -gencode=arch=compute_30,code=sm_30)
endif()

set(CUDA_64_BIT_DEVICE_CODE ON)
include_directories("${CUDA_INCLUDE_DIRS}"
                    "$ENV{NVTOOLSEXT_PATH}/include")
link_directories("$ENV{NVTOOLSEXT_PATH}/lib/x64")
set(LIBS ${LIBS}  "${CUDA_CUDA_LIBRARY}"
                  "${CUDA_LIBRARIES}"
                  "${CUDA_nppc_LIBRARY}"
                  "${CUDA_nppi_LIBRARY}"
                  nvToolsExt64_1)

SET (SOURCE_FILES
main.cpp
lodepng.cpp
)

SET (HEADER_FILES
lodepng.h
)

ADD_EXECUTABLE( ${THIS_PROJECT_NAME}  ${SOURCE_FILES}  ${HEADER_FILES} )
TARGET_LINK_LIBRARIES( ${THIS_PROJECT_NAME} ${LIBS} )

To load and save images from disk without OpenCV, I have included the lodepng.cpp and lodepng.h files in the project. You can download them here http://lodev.org/lodepng/.

I tried the code with this image, that I downloaded and renamed to “testImage.png”:

https://www.google.es/search?q=images&espv=2&biw=1920&bih=935&tbm=isch&source=lnt&tbs=isz:ex,iszw:1920,iszh:1080#imgrc=Ao8iqkc_YpUEjM%3A

I hope this helps to reproduce the issue.

I have reproduced the issue and have passed it off to the NPP development teams to take a look.

There may be a bug, it’s not clear to me yet. I don’t have any further information at this time. If further information develops, I will report it back here.

Thanks so much txbob!

For the moment we will be using custom code to avoid this “flashing issue”. However, if NPP development team is able to locate and solve the bug (in case this is a bug) we will use NPP filtering methods again.

It seems that any time you issue a nppSetStream() call that in fact changes the underlying stream, you will need to issue a cudaDeviceSynchronize() call (first), before issuing the nppSetStream() call.

With that change, I believe your code will run correctly.

NPP can return control to the host thread before underlying CUDA calls are complete (it is effectively, in some cases, an asynchronous API). The nppSetStream() call currently takes immediate effect, and so can impact the behavior of previously queued underlying CUDA calls. By issuing the cudaDeviceSynchronize() call, this possibility is eliminated.

At the moment, this appears to be an oversight in the NPP documentation, so I am expecting that there will be (at least) a documentation update to identify this requirement at the next major CUDA release.

I acknowledge this may have a variety of implications for effective use of streams with NPP and performance, but I won’t be able to address any of those topics at this time. The above suggestion should restore correctness of behavior.