I profiled (see pictures below) a program with code like the one below. There is a time gap between tudo1 and cuFFT that I cannot explain. Any clue? Is this a CUDA FFT bug?
I am getting good results. Only the runtime is too high.
it seems you are still puzzled by the pesky cudaFree that you cannot explain or place
you should have posted an update under your prior post, such that one can more easily interpret the latest results and make new suggestions
i think this is what is known thus far:
a) the cudaFree call succeeds; otherwise you would expect the program to crash or report an error; if you feel this is too presumptuous, you could always step through the program with the debugger; the debugger should at least make note if the cudaFree fails
b) there is the proposition (hypothesis) that the cudaFree is called indirectly by your program, rather than by you directly: you are calling some API that requires a device memory allocation, and that subsequently needs to clean up after itself
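for point a), the debugger is not the only way to find out whether a cudaFree that you issue yourself succeeds: the runtime reports failure through the call's return value. a minimal sketch follows; `cuda_ok` and `alloc_and_free` are hypothetical helper names, not anything from your program, and this only covers calls you make directly, not a cudaFree issued inside a library

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original program): prints a message
// and returns false if a CUDA runtime call failed, true otherwise.
static bool cuda_ok(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Allocates and frees a device buffer, checking both return codes.
// Returns true only if cudaMalloc and cudaFree both succeed.
static bool alloc_and_free(size_t bytes)
{
    void* d_buf = nullptr;
    if (!cuda_ok(cudaMalloc(&d_buf, bytes), "cudaMalloc")) {
        return false;
    }
    return cuda_ok(cudaFree(d_buf), "cudaFree");
}
```

wrapping your own cudaMalloc/cudaFree calls like this makes a failure show up in the program output instead of having to be inferred from the profile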
what now catches my eye is that the cudaFree call occurs in a single thread, as opposed to in each thread
this might imply that it originates from an API call made before you launch your numerous OpenMP threads, rather than from an API call within the OpenMP threads; it may very well relate to your usage of OpenMP
therefore, these would be my suggestions now:
as a test case, launch only a single thread, as opposed to the numerous threads you launch now, and make it an ordinary thread rather than an OpenMP thread; profile and see if the cudaFree is still present
in other words, launch only a single task, so that you do not need mechanisms like OpenMP or streams, and use it as a test case
alternatively, instead of using OpenMP, use a single thread, create a number of streams, and issue all your work in those streams, as opposed to issuing the tasks/work within OpenMP threads; profile and see if the cudaFree is still present
or, use the debugger and step through your program; attempt to step into each API call you make prior to launching the OpenMP threads, and note whether it contains a memory allocation; i have not yet attempted to step into CUDA APIs, so i do not know if it is indeed possible
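the single-thread, multi-stream suggestion could look roughly like the sketch below. it is only a sketch under assumptions: i do not know what your tudo1 kernel does, so `scale_kernel` is a placeholder for it, and `run_in_streams`, the 1D C2C plan and the buffer sizes are made up for illustration. all setup (streams, buffers, plans) is done before any work is issued, and all cleanup after, so that a cudaFree showing up between the kernel and the FFT in the profile would stand out

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Placeholder for the poster's "tudo1" step (its real contents are unknown).
__global__ void scale_kernel(cufftComplex* data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i].x *= s;
        data[i].y *= s;
    }
}

// Issues one kernel plus one in-place forward FFT per stream, all from a
// single host thread. Returns the number of failed API calls (0 = all ok).
int run_in_streams(int n_streams, int nx)
{
    int failures = 0;
    const size_t bytes = sizeof(cufftComplex) * nx;
    std::vector<cudaStream_t> streams(n_streams);
    std::vector<cufftHandle> plans(n_streams);
    std::vector<cufftComplex*> bufs(n_streams, nullptr);

    // Setup phase: every allocation and plan creation happens here.
    for (int s = 0; s < n_streams; ++s) {
        failures += (cudaStreamCreate(&streams[s]) != cudaSuccess);
        failures += (cudaMalloc((void**)&bufs[s], bytes) != cudaSuccess);
        failures += (cudaMemset(bufs[s], 0, bytes) != cudaSuccess);
        failures += (cufftPlan1d(&plans[s], nx, CUFFT_C2C, 1) != CUFFT_SUCCESS);
        failures += (cufftSetStream(plans[s], streams[s]) != CUFFT_SUCCESS);
    }

    // Work phase: only issued if setup succeeded completely.
    if (failures == 0) {
        const int threads = 256;
        const int blocks = (nx + threads - 1) / threads;
        for (int s = 0; s < n_streams; ++s) {
            scale_kernel<<<blocks, threads, 0, streams[s]>>>(bufs[s], nx, 1.0f);
            failures += (cudaGetLastError() != cudaSuccess);
            failures += (cufftExecC2C(plans[s], bufs[s], bufs[s], CUFFT_FORWARD)
                         != CUFFT_SUCCESS);
        }
        for (int s = 0; s < n_streams; ++s) {
            failures += (cudaStreamSynchronize(streams[s]) != cudaSuccess);
        }
    }

    // Cleanup phase (best effort): the explicit cudaFree calls are all here.
    for (int s = 0; s < n_streams; ++s) {
        failures += (cufftDestroy(plans[s]) != CUFFT_SUCCESS);
        failures += (cudaFree(bufs[s]) != cudaSuccess);
        failures += (cudaStreamDestroy(streams[s]) != cudaSuccess);
    }
    return failures;
}
```

calling `run_in_streams(1, nx)` gives the single-task test case; calling it with more streams replaces the OpenMP threads. link against cuFFT (for example `nvcc file.cu -lcufft`) and profile both variants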