Use NPP library for JPEG DCT

Hi,

I am using the NPP library to do JPEG encoding on a YCbCr image (the resolution is 1024 x 768, with separate Y, Cb, and Cr planes).

I call the nppiDCTQuantFwd8x8LS_JPEG_8u16s_C1R(pSrc, srcStep, pDst, dstStep, pQuantFwdTable, oSizeROI) function.

The first parameter pSrc points to one 8x8 source block, and I set oSizeROI.width = 8 and oSizeROI.height = 8.

So the DCT function is called (1024/8) * (768/8) = 12288 times, which is too slow.

To speed up the DCT, I tried passing a pointer to the whole 1024 x 768 buffer as the first parameter and set oSizeROI.width = 1024, oSizeROI.height = 768.

But the resulting JPEG image is wrong. How should I specify the first parameter pSrc and oSizeROI?
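For reference, my current per-block loop looks roughly like this (variable names are just placeholders):

// One 8x8 DCT/quant call per block: (1024/8) * (768/8) = 12288 calls in total.
// d_src is the device pointer to one 1024 x 768 plane (pitch = srcStep bytes),
// d_dst holds 64 Npp16s coefficients per block, d_quantFwd is the forward quant table.
NppiSize blockROI = {8, 8};
for (int by = 0; by < 768 / 8; by++) {
    for (int bx = 0; bx < 1024 / 8; bx++) {
        const Npp8u *pSrcBlock = d_src + by * 8 * srcStep + bx * 8;
        Npp16s *pDstBlock = d_dst + (by * (1024 / 8) + bx) * 64;
        nppiDCTQuantFwd8x8LS_JPEG_8u16s_C1R(pSrcBlock, srcStep,
                                            pDstBlock, 64 * sizeof(Npp16s),
                                            d_quantFwd, blockROI);
    }
}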

Eric Wu

Hi Eric,

Sorry, this is not the answer you are expecting; instead I have a question of my own.
How do you use the NPP library with CUDA?
Where should the NPP functions be called from? Can they be called inside a kernel definition?
My requirement is to perform an affine transformation on an image.

Thanks in advance

Hi Eric,

Did you find out how to process multiple MCU rows at a time?

I set up the ROI in the same way that you have, except that I was only trying to perform the DCT for 4 MCU rows at a time, i.e. the ROI width was set to 1280 and the height to 4 * 8 (assuming 4:4:4 sampling).

To verify that it wasn't working, I did a cudaMemset of the memory to 0 before calling the DCT function. The result is that only every 4th MCU row down the image appears correct when I view the output JPEG file.
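In code terms, my setup is roughly the following (variable names are placeholders):

// Zero the destination first so any blocks the DCT call does not write stand out.
cudaMemset(d_dct, 0, dctBufferSizeBytes);

NppiSize roi;
roi.width  = 1280;   // a full row of MCUs (4:4:4)
roi.height = 4 * 8;  // 4 MCU rows

nppiDCTQuantFwd8x8LS_JPEG_8u16s_C1R(d_srcPlane, srcStep,
                                    d_dct, dstStep,
                                    d_fwdQuantTable, roi);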

I am using CUDA 4.1 RC2 on Ubuntu 10.04.2 with a GTX 550 Ti graphics card and a Q9650 CPU.

If I set up the ROI to encode just one MCU row at a time then I get a correct image out; however, this does not give me enough performance (nvvp shows the graphics card mostly idle, with a lot of time spent in the API).

Is this a bug in the NPP JPEG forward DCT, or is there something different I need to specify in the source step, etc.?

Thanks in advance,
Simon Davidson.

I’m not familiar with your problem, but I have been able to use the forward and inverse JPEG DCT functions successfully using CUDA 4.0. I posted some proof-of-concept code in another post, but it was for CUDA 3.2: http://forums.nvidia.com/index.php?showtopic=191896&st=0&p=1185259&

If you’re still having trouble, I can post my functional CUDA 4.0 code, but if I recall, it’s the same.

The issue I was having was just around the ROI for the forward DCT. I can process an ROI with a size of 8 x 8 or 1280 x 8 fine; it is only when I specify 1280 (width) x 32 (height) that things do not work correctly.

When the ROI is set to 1280 x 32 I only get valid DCTs for the first quarter (1280 x 8) of the ROI; the rest are not calculated, but there is no error. The source image I am using is 1280 x 960, so I can only process one MCU row at a time, which does not give me enough performance.

This problem may be specific to the 4.1 RC or to Ubuntu 10.04 64-bit.

Unless your functional CUDA 4.0 code uses an ROI with a height > 8, I don't think it will help me. I am surprised that you get enough performance from the card when processing each 8 x 8 region separately.

P.S. The quant table setups seem to be fixed in 4.1; I checked the output values. Your proof-of-concept sample code was very useful as a starting point when I first started looking at this.

For my application, I'm converting a C++ program to use NPP and some custom CUDA; it operates on grayscale images. I load a JPEG as coefficients, inverse DCT it into pixels, then forward DCT the pixels cropped by 4 on each side (which is the reason for the offset in the d_Pixels_row array) back into coefficients:

// ROI for the inverse DCT is expressed in coefficient layout:
// 64 Npp16s values per 8x8 block, one row of blocks per line.
DCTROI.width  = NumberOfDCTBlocks.width * 64;
DCTROI.height = NumberOfDCTBlocks.height;

// ROI for the forward DCT is expressed in pixels.
CroppedPixROI.width  = NumberOfCroppedDCTBlocks.width * 8;
CroppedPixROI.height = NumberOfCroppedDCTBlocks.height * 8;

// Inverse DCT: coefficients -> pixels. The source step spans a full row of
// blocks; the destination step spans a full row of pixels.
NPP_CHECK_NPP(nppiDCTQuantInv8x8LS_JPEG_16s8u_C1R(
    d_DCTs, NumberOfDCTBlocks.width * 64 * sizeof(Npp16s),
    d_Pixels_row, NumberOfDCTBlocks.width * 8 * sizeof(Npp8u),
    d_InvQuantTable, DCTROI));

// Forward DCT: pixels (offset by 4 rows and 4 columns for the crop) -> coefficients.
// The destination step spans a full row of cropped blocks.
NPP_CHECK_NPP(nppiDCTQuantFwd8x8LS_JPEG_8u16s_C1R(
    &d_Pixels_row[4 * NumberOfDCTBlocks.width * 8 + 4],
    NumberOfDCTBlocks.width * 8 * sizeof(Npp8u),
    d_CroppedDCTs, NumberOfCroppedDCTBlocks.width * 64 * sizeof(Npp16s),
    d_FwdQuantTable, CroppedPixROI));

Thank you for posting this code. I can now see where I was going wrong.

Where you set the destination step size to ‘NumberOfCroppedDCTBlocks.width * 64 * sizeof(Npp16s)’, I was only setting a value of ‘64 * sizeof(Npp16s)’.

I now set a value of ‘hsampling * m_numxMCU * 64 * sizeof(Npp16s)’ and it all works just fine.
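For anyone else hitting this, the working setup for a full 1280 x 960 plane looks roughly like the following (names are placeholders, 4:4:4 assumed):

// The destination step must span a full row of 8x8 blocks
// (64 coefficients per block), not just a single block.
int blocksPerRow = 1280 / 8;                          // hsampling * m_numxMCU in my code
int dstStep      = blocksPerRow * 64 * sizeof(Npp16s);

NppiSize roi = {1280, 960};                           // the whole plane, in pixels

nppiDCTQuantFwd8x8LS_JPEG_8u16s_C1R(d_srcPlane, srcPitchBytes,
                                    d_dct, dstStep,
                                    d_fwdQuantTable, roi);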

Is it possible for me to do JPEG encoding using NPP in C?

Yes, and there is a CUDA sample that shows how.

CUDA Samples :: CUDA Toolkit Documentation
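Just as a rough sketch of the host-side setup in plain C, the forward quantization table for the DCT calls discussed above can be prepared along these lines. The sample is the reference here; in particular, check against it whether the raw table needs to be in zigzag order, and note that NPP only covers the DCT/quantization step, so Huffman coding and writing the JPEG bitstream still have to be handled separately.

/* Standard JPEG luminance quantization table (Annex K). */
static const Npp8u lumQuant[64] = {
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99
};

Npp8u  rawTable[64];
Npp16u fwdTable[64];
Npp16u *d_fwdQuantTable;

memcpy(rawTable, lumQuant, sizeof(rawTable));
nppiQuantFwdRawTableInit_JPEG_8u(rawTable, 75);       /* scale the raw table for quality 75 */
nppiQuantFwdTableInit_JPEG_8u16u(rawTable, fwdTable); /* build the 16-bit forward quant table */

cudaMalloc((void **)&d_fwdQuantTable, sizeof(fwdTable));
cudaMemcpy(d_fwdQuantTable, fwdTable, sizeof(fwdTable), cudaMemcpyHostToDevice);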