[Performance] I cannot get better performance with the OpenCV GPU-accelerated API.

[Problem]
I cannot get better performance with the OpenCV GPU-accelerated API than with the normal OpenCV API.
For example, when detecting corners with the FAST algorithm, the GPU-accelerated API is
around 10 times slower than the normal API.
Of course, I skipped the first GPU-accelerated API call because it takes a very long time.
Is there any mistake in my measurement?
Could you please help me?

[Condition]
Target board : NVIDIA Jetson TX1
JetPack : 2.2.1 for L4T
OS : Ubuntu 14.04 LTS
OpenCV : libopencv4tegra-repo_2.4.13_arm64_l4t-r24.deb
Input : logo.png http://opencv.org/wp-content/themes/opencv/images/logo.png

[Measurement steps]

(1) Execute the script below to set the CPU and GPU clocks to their maximum.
https://devtalk.nvidia.com/default/topic/901337/jetson-tx1/cuda-7-0-jetson-tx1-performance-and-benchmarks/post/4747186/#4747186
(2) Execute the test program.

[Test source code]

#include <stdio.h>
#include <time.h>      // clock(), CLOCKS_PER_SEC
#include <iostream>
#include <vector>
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
 
using namespace std;

#define TEST_LOOP     (1000)

int main (int argc, char* argv[])
{
 
    int gpu_count = cv::gpu::getCudaEnabledDeviceCount();
    if (0 < gpu_count)
    {
 
        // preparation: load the test image and convert it to grayscale
        cv::Mat src_host = cv::imread("logo.png");
        cv::Mat gray_host;
        cv::cvtColor(src_host, gray_host, CV_BGR2GRAY);  // imread loads BGR, not RGB
 
        // warm-up: skip the 1st GPU API call, since it can take a long time (CUDA initialization)
        cv::gpu::GpuMat src, dst;
        cv::gpu::FAST_GPU fastGpu(20, true);
        cv::gpu::GpuMat keypoints_gpu;
        src.upload(gray_host);
        fastGpu(src, cv::gpu::GpuMat(), keypoints_gpu);

        // measure cv::FAST (CPU) start
        clock_t cpu_time_used;
        cpu_time_used = clock();
 
        vector<cv::KeyPoint> keypoints;
        for (int j = 0; j < TEST_LOOP; j++)
        {
            cv::FAST(gray_host, keypoints, 20, true);
        }
 
        // measure cv::FAST end
        cpu_time_used = clock() - cpu_time_used;
        std::cout << "cv::FAST          : " << ((double) cpu_time_used) / CLOCKS_PER_SEC << " sec" << endl;

        // measure cv::gpu::FAST_GPU start
        cpu_time_used = clock();
 
        for (int i = 0; i < TEST_LOOP; i++)
        {
            src.upload(gray_host);
 
            fastGpu(src, cv::gpu::GpuMat(), keypoints_gpu);
        }
 
        // measure cv::gpu::FAST_GPU end
        cpu_time_used = clock() - cpu_time_used;
        std::cout << "cv::gpu::FAST_GPU : " << ((double) cpu_time_used) / CLOCKS_PER_SEC << " sec" << endl;
    }
    else
    {
        std::cout << "no gpu" << endl;
    }
    return 0;
}
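
[Timing note]

clock() measures the CPU time used by the process, not wall-clock time. As a cross-check, the same two loops can also be timed with cv::getTickCount() / cv::getTickFrequency(); a minimal sketch of that variant, reusing gray_host, src, fastGpu, keypoints and keypoints_gpu from the code above:

    // wall-clock variant of the same measurement (sketch only, not part of the results below)
    int64 t0 = cv::getTickCount();
    for (int j = 0; j < TEST_LOOP; j++)
        cv::FAST(gray_host, keypoints, 20, true);
    double cpu_sec = (cv::getTickCount() - t0) / cv::getTickFrequency();

    t0 = cv::getTickCount();
    for (int i = 0; i < TEST_LOOP; i++)
    {
        src.upload(gray_host);
        fastGpu(src, cv::gpu::GpuMat(), keypoints_gpu);
    }
    double gpu_sec = (cv::getTickCount() - t0) / cv::getTickFrequency();
    std::cout << "CPU: " << cpu_sec << " sec, GPU: " << gpu_sec << " sec" << endl;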

[Result of Test program]

ubuntu@tegra-ubuntu:~/hoge/sample_opencv_app/bin$ ./sample
	cv::FAST          : 0.138623 sec
	cv::gpu::FAST_GPU : 1.36099 sec

Hello,
Would you please test this code with a large image?
I tried a 16-megapixel image, and the GPU took only 1/4 of the CPU time.
GPU processing has more overhead, so it works better on large data.
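
If you do not have a large image at hand, you can also just upscale logo.png before the benchmark; a rough sketch (the scale factor is arbitrary):

    // enlarge the small test image to simulate a large input
    cv::Mat big_gray;
    cv::resize(gray_host, big_gray, cv::Size(), 8.0, 8.0, cv::INTER_LINEAR);
    // then run the same CPU and GPU loops on big_gray instead of gray_host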

br
Chenjian

Hi Chenjian-san,

Thank you for your reply.

Your information is very helpful.
I confirmed that the GPU takes less time than the CPU when the input image is large.

[Input]
https://upload.wikimedia.org/wikipedia/commons/4/45/Cliparts_%28examples%29.png

[Result]

ubuntu@tegra-ubuntu:~/hoge/sample_opencv_app/bin$ ./sample
cv::FAST          : 25.5326 sec
cv::gpu::FAST_GPU : 3.62653 sec
ubuntu@tegra-ubuntu:~/hoge/sample_opencv_app/bin$
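
One note for anyone reusing the test code: the GPU loop leaves the detected keypoints on the device, while cv::FAST already returns a host-side vector. To actually use the GPU result on the host it still has to be downloaded, which adds a little extra transfer time; a minimal sketch:

    // copy the GPU keypoints back to a host vector
    std::vector<cv::KeyPoint> keypoints_host;
    fastGpu.downloadKeypoints(keypoints_gpu, keypoints_host);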

Thanks a lot,
makotoqnb

Hi,

I tried your code on my computer. When I build the project in Nsight, it reports:

make all -C /home/liang/cuda_workspace/fbflow_gpu/Debug 
make: Entering directory `/home/liang/cuda_workspace/fbflow_gpu/Debug'
Building target: fbflow_gpu
Invoking: NVCC Linker
/usr/local/cuda-6.5/bin/nvcc --cudart static -L/opt/opencv/2.4.9/armv7l/lib -ccbin /usr/bin/arm-linux-gnueabihf-g++-4.8 --relocatable-device-code=false -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_20 --target-cpu-architecture ARM -m32 -link -o  "fbflow_gpu"  ./info.o ./main.o ./test.o  ./source/rgb2gray/rgb2gray.o ./source/rgb2gray/rgb2gray_caller.o  ./source/mean/mean_n.o  ./source/Negate/negate.o  ./source/FSAT_CPU/fast.o ./source/FSAT_CPU/fast_9.o ./source/FSAT_CPU/nonmaxt.o  ./source/FAST/FAST.o ./source/FAST/FAST_9_caller.o   -lopencv_highgui -lopencv_features2d -lopencv_core -lopencv_imgproc
./main.o: In function `main':
/home/liang/cuda_workspace/fbflow_gpu/Debug/../main.cpp:122: undefined reference to `cv::gpu::FAST_GPU::FAST_GPU(int, bool, double)'
make: Leaving directory `/home/liang/cuda_workspace/fbflow_gpu/Debug'
/home/liang/cuda_workspace/fbflow_gpu/Debug/../main.cpp:125: undefined reference to `cv::gpu::FAST_GPU::operator()(cv::gpu::GpuMat const&, cv::gpu::GpuMat const&, cv::gpu::GpuMat&)'
/home/liang/cuda_workspace/fbflow_gpu/Debug/../main.cpp:140: undefined reference to `cv::gpu::FAST_GPU::operator()(cv::gpu::GpuMat const&, cv::gpu::GpuMat const&, cv::gpu::GpuMat&)'
collect2: error: ld returned 1 exit status
make: *** [fbflow_gpu] Error 1
> Shell Completed (exit code = 2)

Did you run into this problem too?

liang

Oops, I have solved this problem. I added opencv_gpu to the linker libraries, and it worked!
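
For anyone hitting the same undefined-reference errors: the linker's library list needs opencv_gpu in addition to the others, e.g. the end of the link command becomes something like:

    -lopencv_highgui -lopencv_features2d -lopencv_gpu -lopencv_core -lopencv_imgproc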