Problem when using NPP libirary, nppiMinIndx_32f_C1R()

564810049 · July 27, 2018, 9:22am

Hello, I’m using NPP for image processing. Today when I use nppiMinIndx_32f_C1R(). I get the min value from a row and let the whole row subtract the min value. The min value is invalid which leads to the error in nppiSubC_32f_C1R. I work on that for copule of hours but still cannot solve it. Can anyone help me?

Here is my code:

Update: Sorry for the error in the code, the code has been updated now.

#include “device_launch_parameters.h”

#include <stdio.h>

#include <windows.h>
#include<cuda_runtime.h>
#include<nppi.h>
#include<npp.h>

#include
//#include <Exceptions.h>
#include <nppi_morphological_operations.h>

#include <string.h>
#include
#include <helper_string.h>
#include<helper_cuda.h>

#include<ImageIO.h>
#include<ImagesNPP.h>
#include<ImagesCPU.h>
void main()
{
try
{

	//cudaDeviceInit(argc, (const char **)argv);
	std::string file_src(".\\test3_corr.jpg");
	npp::ImageCPU_8u_C1 Host_Src;

	//read image from disk to Cpu
	npp::loadImage(file_src, Host_Src);
	npp::ImageCPU_8u_C1 Host_Dst(Host_Src.size());

	//Copy image from Cpu to Gpu
	npp::ImageNPP_8u_C1 Device_Src_8u(Host_Src);
	npp::ImageNPP_8u_C1 Device_Dst_8u(Device_Src_8u.size());
	npp::ImageNPP_32f_C1 Device_Src_32f(Device_Src_8u.size());
	npp::ImageNPP_32f_C1 Device_test_32f(Device_Src_8u.size());
	npp::ImageNPP_16u_C1 Device_test_16u(Device_Src_8u.size());
	npp::ImageNPP_8u_C1 Device_test_8u(Device_Src_8u.size());
	//-----------------------------Nonuniformity_Correction-----------------------------------//
	npp::ImageNPP_32f_C1 Device_column_mean(Device_Src_8u.width(), 1);
	npp::ImageNPP_32f_C1 Device_column_sub(Device_Src_8u.width(), 1);
	NppiSize column_range_size = { (int)Device_column_mean.width(), 1 };
	int min_buffer_size;
	Npp8u* min_buffer = 0;
	Npp32f min_val = 0;
	int minx = 0;
	int miny = 0;
	NppiSize Src_size = { (int)Device_Src_8u.width(), (int)Device_Src_8u.height() };
	NppiSize roi_size = { (int)Device_Src_8u.width(), (int)Device_Src_8u.height() };

	//-----------------------------Nonuniformity_Correction-----------------------------------//



	//------------------------------Declare Over-------------------------//


	//-----------------------------Nonuniformity_Correction-----------------------------------//
	nppiSumWindowColumn_8u32f_C1R(
		Device_Src_8u.data(), Device_Src_8u.pitch(),
		Device_column_mean.data(), Device_column_mean.pitch(),
		column_range_size, Device_Src_8u.height(), 0);
	nppiDivC_32f_C1IR(
		Device_Src_8u.height(),
		Device_column_mean.data(), Device_column_mean.pitch(),
		column_range_size
	);
	nppiMinIndxGetBufferHostSize_32f_C1R(column_range_size, &min_buffer_size);
	cudaMalloc((void**)(&min_buffer), min_buffer_size);
	nppiMinIndx_32f_C1R(Device_column_mean.data(), Device_column_mean.pitch(),
		column_range_size, min_buffer, &min_val, &minx, &miny);
	printf("%.2f", min_val);
	printf("%.2f", minx);
	printf("%.2f", miny);
	printf("%d", sizeof(column_range_size));
	system("pause");
	auto kkk = nppiSubC_32f_C1R(
		Device_column_mean.data(), Device_column_mean.pitch(),
		min_val,
		Device_column_sub.data(), Device_column_sub.pitch(),
		{ (int)Device_column_mean.width(), 1 }
	);
	system("pause");
	Device_column_mean.copyTo(Device_test_32f.data(), Device_test_32f.pitch());

	// For test
	nppiConvert_32f8u_C1R(
		Device_test_32f.data(), Device_test_32f.pitch(),
		Device_test_8u.data(), Device_test_8u.pitch(),
		roi_size, NPP_RND_FINANCIAL);
	Device_test_8u.copyTo(Host_Dst.data(), Host_Dst.pitch());
	//Save image
	npp::saveImage(".\\test3_test.pgm", Host_Dst);
	exit(EXIT_SUCCESS);
	// Helper function for using CUDA to add vectors in parallel.
}
catch (npp::Exception &rException)
{
	std::cerr << "Program error! The following exception occurred: \n";
	std::cerr << rException << std::endl;
	std::cerr << "Aborting." << std::endl;
	system("pause");
	exit(EXIT_FAILURE);
}
catch (...)
{
	std::cerr << "Program error! An unknow type of exception occurred. \n";
	std::cerr << "Aborting." << std::endl;
	system("pause");
	exit(EXIT_FAILURE);
	//return -1;
}

}

Robert_Crovella · July 28, 2018, 4:40am

The code you have posted won’t compile.

564810049 · July 28, 2018, 1:56pm

Sorry for the error in the code, it has been updated now.

Thanks for the help!

Robert_Crovella · July 28, 2018, 5:33pm

The main problem is that you are passing host pointers for min_val, minx, and miny when they should be device pointers.

Some evidence of this is discoverable if you run your code with cuda-memcheck

It’s always good practice to run codes with cuda-memcheck when you are having trouble.

When using NPP, be aware that some pointers may need to be host pointers, some may need to be device pointers, and it’s not always easy to determine which from the documentation. So if you believe you are doing everything correctly, and cuda-memcheck reports invalid reads or writes, it is possibly because you are passing host pointers when you should be passing device pointers.

Here’s a version of your code, full test case, with this particular issue resolved:

$ cat t6.cu
#include <stdio.h>
#include<nppi.h>
#include<npp.h>
#include<iostream>
#include <nppi_morphological_operations.h>
#include <string.h>
#include <fstream>
#include <helper_string.h>
#include <helper_cuda.h>

#include<ImageIO.h>
#include<ImagesNPP.h>
#include<ImagesCPU.h>
int main()
{
        try
        {

                //cudaDeviceInit(argc, (const char **)argv);
                //std::string file_src(".\test3_corr.jpg");
                unsigned int width =64;
                unsigned int height = 64;
                npp::ImageCPU_8u_C1 Host_Src(width, height);
                Npp8u *base = Host_Src.data();
                for (int i = 0; i < Host_Src.height(); i++){
                  for (int j = 0; j < Host_Src.width(); j++)
                          base[j] = (j==0)?7:j + 5;
                  base += Host_Src.pitch();}

                //read image from disk to Cpu
                //npp::loadImage(file_src, Host_Src);
                npp::ImageCPU_8u_C1 Host_Dst(Host_Src.size());

                //Copy image from Cpu to Gpu
                npp::ImageNPP_8u_C1 Device_Src_8u(Host_Src);
                npp::ImageNPP_8u_C1 Device_Dst_8u(Device_Src_8u.size());
                npp::ImageNPP_32f_C1 Device_Src_32f(Device_Src_8u.size());
                npp::ImageNPP_32f_C1 Device_test_32f(Device_Src_8u.size());
                npp::ImageNPP_16u_C1 Device_test_16u(Device_Src_8u.size());
                npp::ImageNPP_8u_C1 Device_test_8u(Device_Src_8u.size());
                //-----------------------------Nonuniformity_Correction-----------------------------------//
                npp::ImageNPP_32f_C1 Device_column_mean(Device_Src_8u.width(), 1);
                npp::ImageNPP_32f_C1 Device_column_sub(Device_Src_8u.width(), 1);
                NppiSize column_range_size = { (int)Device_column_mean.width(), 1 };
                NppStatus err;
                int min_buffer_size;
                Npp8u* min_buffer = 0;
                Npp32f min_val = 0;
                int minx = 0;
                int miny = 0;
                NppiSize Src_size = { (int)Device_Src_8u.width(), (int)Device_Src_8u.height() };
                NppiSize roi_size = { (int)Device_Src_8u.width(), (int)Device_Src_8u.height() };

                //-----------------------------Nonuniformity_Correction-----------------------------------//

//------------------------------Declare Over-------------------------//

//-----------------------------Nonuniformity_Correction-----------------------------------//
                err = nppiSumWindowColumn_8u32f_C1R(
                                Device_Src_8u.data(), Device_Src_8u.pitch(),
                                Device_column_mean.data(), Device_column_mean.pitch(),
                                column_range_size, Device_Src_8u.height(), 0);
                if (err != NPP_NO_ERROR) printf("nppiSumWindowColumn error: %d\n", (int) err);

                err = nppiDivC_32f_C1IR(
                                Device_Src_8u.height(),
                                Device_column_mean.data(), Device_column_mean.pitch(),
                                column_range_size
                                );
                if (err != NPP_NO_ERROR) printf("nppiDivC error: %d\n", (int) err);
                err = nppiMinIndxGetBufferHostSize_32f_C1R(column_range_size, &min_buffer_size);
                if (err != NPP_NO_ERROR) printf("nppiMinIndxGetBuffer error: %d\n", (int) err);
printf("min_buffer_size = %d\n", min_buffer_size);
                cudaMalloc((void**)(&min_buffer), min_buffer_size);
                float *dmin;
                int *dminx, *dminy;
                cudaMalloc(&dmin, sizeof(float));
                cudaMalloc(&dminx, sizeof(int));
                cudaMalloc(&dminy, sizeof(int));
                err = nppiMinIndx_32f_C1R(Device_column_mean.data(), Device_column_mean.pitch(),
                                column_range_size, min_buffer, dmin, dminx, dminy);
                if (err != NPP_NO_ERROR) printf("nppiMinIndx error: %d\n", (int) err);
                cudaDeviceSynchronize();
                cudaMemcpy(&min_val, dmin, sizeof(float), cudaMemcpyDeviceToHost);
                cudaMemcpy(&minx, dminx, sizeof(int), cudaMemcpyDeviceToHost);
                cudaMemcpy(&miny, dminy, sizeof(int), cudaMemcpyDeviceToHost);
                printf("min_val: %.2f\n", min_val);
                printf("minx: %d\n", minx);
                printf("miny: %d\n", miny);
                printf("sizeof column_range_size %d\n", sizeof(column_range_size));
                NppiSize sroi =  { (int)Device_column_mean.width(), (int)Device_column_mean.height() };
                printf(" %d, %d, %d, %d, %d, %d\n", (int)Device_column_mean.width(), (int)Device_column_mean.height(), (int)Device_column_mean.pitch(), (int)Device_column_sub.width(), Device_column_sub.height(), Device_column_sub.pitch());
                err = nppiSubC_32f_C1R(
                                Device_column_mean.data(), Device_column_mean.pitch(),
                                min_val,
                                Device_column_sub.data(), Device_column_sub.pitch(),
                                sroi);
                if (err != NPP_NO_ERROR) printf("nppiSubC_32f_C1R error: %d\n", (int) err);
                Device_column_mean.copyTo(Device_test_32f.data(), Device_test_32f.pitch());

                // For test
                nppiConvert_32f8u_C1R(
                                Device_test_32f.data(), Device_test_32f.pitch(),
                                Device_test_8u.data(), Device_test_8u.pitch(),
                                roi_size, NPP_RND_FINANCIAL);
                Device_test_8u.copyTo(Host_Dst.data(), Host_Dst.pitch());
                //Save image
                //npp::saveImage(".\test3_test.pgm", Host_Dst);
                exit(EXIT_SUCCESS);
                // Helper function for using CUDA to add vectors in parallel.
        }
        catch (npp::Exception &rException)
        {
                std::cerr << "Program error! The following exception occurred: \n";
                std::cerr << rException << std::endl;
                std::cerr << "Aborting." << std::endl;
                exit(EXIT_FAILURE);
        }
        catch (...)
        {
                std::cerr << "Program error! An unknow type of exception occurred. \n";
                std::cerr << "Aborting." << std::endl;
                exit(EXIT_FAILURE);
                //return -1;
        }
}
$ nvcc -o t6 t6.cu -I/usr/local/cuda/samples/common/inc -I/usr/local/cuda/samples/7_CUDALibraries/common/UtilNPP -I/usr/local/cuda/samples/7_CUDALibraries/common/FreeImage/include -L /usr/local/cuda/samples/7_CUDALibraries/common/FreeImage/lib/linux/x86_64 -lfreeimage -lnppidei -lnppim -lnppisu -lnppial -lnppif -lnppist
/usr/local/cuda/samples/7_CUDALibraries/common/FreeImage/lib/linux/x86_64/libfreeimage.a(strenc.o): In function `StrIOEncInit':
strenc.c:(.text+0x1b25): warning: the use of `tmpnam' is dangerous, better use `mkstemp'
$ cuda-memcheck ./t6
========= CUDA-MEMCHECK
min_buffer_size = 24
min_val: 6.00
minx: 1
miny: 0
sizeof column_range_size 8
 64, 1, 512, 64, 1, 512
========= ERROR SUMMARY: 0 errors
$

564810049 · July 29, 2018, 2:34pm

Thanks! It helps me a lot!
One more little question. I use Npp32f or Npp8u for the pointer, I thought it always represents a Device Pointer. So is it means when I use Device Pointer, I should always use cudaMemcpy with cudaMemcpyDeviceToHost or something else to distinguish whether it is a Device Pointer or Host Pointer?

Robert_Crovella · July 29, 2018, 8:23pm

In C or C++, an (ordinary) pointer is simply a number or address. It contains no information that explicitly distinguishes it as being a host pointer or device pointer.

The difference lies in how you use it. For example, how you allocate storage for it. If you allocate storage using a host allocator, such as malloc or new, then it is implicitly a host pointer. If you allocate storage for it using a device allocator, such as cudaMalloc, it is implicitly a device pointer.

This concept isn’t really specific to NPP, and conceptually is not really even specific to CUDA. It depends on a fundamental understanding of what a pointer is in C or C++.

564810049 · July 30, 2018, 8:03am

I had a wrong concept before, where I thought Npp32f or Npp8u is inherently a device data type.

Meanwhile, I find you said “some pointers may need to be host pointers, some may need to be device pointers”.But in the General API Conventions, it is said, “The design of all the NPP functions follows the same guidelines as other NVIDIA CUDA libraries like cuFFT and cuBLAS. That is that all pointer arguments in those APIs are device pointers.” So which is true?

The other thing is that you use a Npp32f-type Host data(“min_val”) in nppiSubC_32f_C1R() function. Obviously, the function requires a Host-type data. But is there a way that I can directly use a Device-type data(not a pointer) in the function which needs Host-type data, or the opposite, so as to decrease the communication between Host and Device?

Robert_Crovella · July 30, 2018, 2:30pm

I think the statement in the NPP manual is provably not correct.

Let’s take an example. A CUBLAS axpy call:

[url]https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-axpy[/url]

has pointer arguments for the vector data. These need to be device pointers. It also has pointer arguments for the alpha and beta parameters. These can be either host or device pointers (by default, they are host pointers).

But the NPP manual is clearly making the point that all NPP pointer arguments are device pointer arguments, and that is the case (and indeed the defect) in your code example.

So, if you like, the manual is correct and my previous statement was incorrect.

564810049 · July 31, 2018, 5:17am

Agreed. The document has a lot of defects。

Thanks again for all these help!