GPU performance: how many GFLOPS?

Hi All!

The description of the GPU (a GeForce 9500 GT, for example) says it has a speed of ~130 GFLOPS.

I have tried a few functions with CUDA, but the maximum performance I got was ~8 GFLOPS. The FFT from the CUDA library gives me an even worse result compared to a DSP: against a TI C6747 (~3 GFLOPS), the CUDA FFT on the 9500 GT reaches only ~1 GFLOPS. On my 1.8 GHz Intel dual core I get ~10 GFLOPS on filtering without any problems (using IPP).
What is the real performance of a GPU, and of a GPU with CUDA, and how do I get the maximum performance?
Maybe it is much more efficient to use shaders directly, without CUDA? And how is the quoted performance of a GPU calculated?

If you have experience with this, please share!

Thanks,
Dmitry

Ah yes, FLOPS again. This has been done so many times on this forum it’s not even funny.

Bottom line is: what you’re given in the specs is peak performance flops, a very optimistic estimation based around the assumption of absolutely perfect instruction scheduling, using the “right” arithmetic instructions (a MAD and something that can be dual issued with it, commonly a MUL), and absolutely no memory bandwidth limitation (the biggest factor).
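
For reference, that headline number is basically shader cores * shader clock * FLOPs per clock. Assuming the commonly quoted figures for a 9500 GT (32 stream processors at roughly 1.4 GHz, counting 3 FLOPs per clock for the dual-issued MAD+MUL), that's 32 * 1.4 GHz * 3 ≈ 134 GFLOPS, which is where a "~130 GFLOPS" spec figure comes from.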

You are likely never going to see this performance in your code, unless you're talking about double precision arithmetic (not on this GPU). You can get 80-90% of double precision peak much more easily, but only because DP is so much slower than single precision that it's less likely to be limited by bandwidth.

CUDA FFTs are limited by bandwidth (especially if you count the HtD/DtH memcpys). They only get faster than optimized CPU libraries for HUGE transform sizes and datasets.
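
If you want to see how much of your measured FFT rate is eaten by the PCIe transfers, time the transform with and without the memcpys. A minimal sketch using cuFFT and CUDA events (the 1M-point size and the variable names are just placeholders):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int N = 1 << 20;                          // placeholder transform size
    size_t bytes = sizeof(cufftComplex) * N;

    cufftComplex* h = (cufftComplex*)malloc(bytes);
    for (int i = 0; i < N; ++i) { h[i].x = 1.0f; h[i].y = 0.0f; }

    cufftComplex* d;
    cudaMalloc((void**)&d, bytes);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);            // single 1D complex-to-complex transform

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);    // HtD transfer
    cudaEventRecord(t1, 0);
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);            // the FFT itself
    cudaEventRecord(t2, 0);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);    // DtH transfer
    cudaEventRecord(t3, 0);
    cudaEventSynchronize(t3);

    float msFFT = 0.0f, msTotal = 0.0f;
    cudaEventElapsedTime(&msFFT, t1, t2);
    cudaEventElapsedTime(&msTotal, t0, t3);
    printf("FFT only: %.3f ms, FFT + transfers: %.3f ms\n", msFFT, msTotal);

    cufftDestroy(plan);
    cudaFree(d);
    free(h);
    return 0;
}

The gap between the two numbers is the part that no amount of kernel optimization will remove.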

Yes! But this is the most important question, because there is no other reason to use a GPU; raw compute power is the only thing a GPU offers. If the GPU has no advantage over the CPU, everybody will just use the CPU, because it's easier. And if I need parallel processing I can use FPGAs.

No problem. I am ready to do it. But how? :">

Reducing memory transfers is not a problem for me, but in that case I need to use shared memory (the L1 cache) not just for one call but across many. The normal structure of the program would then look like this:

=========

memcopy(host->device),

memcopy(device->device_shared),

@@@@ begin measure GFlops@@@@

host_call(kernel_1_<<<x,x>>>(mem_shared_in, mem_shared_out)), <---- as parameters I could pass indices into shared memory, not pointers.

host_call(kernel_2_<<<x,x>>>(mem_shared_in, mem_shared_out))

host_call(kernel_N_<<<x,x>>>(mem_shared_in, mem_shared_out))

@@@@ end measure GFlops@@@@

memcopy(device_shared->device),

memcopy(device->host)

=========

And now the question: how many GFLOPS would I get in this case with CUDA on my GPU (as a percentage of the 100% declared in the specs)? :unsure:

And another question.

Is it possible to somehow share memory between blocks? I know that every multiprocessor has 64 KB (float[16384]) of shared memory inside. I want to use this memory as a buffer between calls.

Is it possible?

Thanks,

Dmitry

PS. Anyway, is there any example of CUDA code where I can actually see the declared GFLOPS? :-)

Just take Volkov's implementation of BLAS-3, for example sgemm (from the CUDA SDK), and you will get roughly 40% of peak performance. Another quite impressive example: take an algorithm that is memory bound, i.e. it needs to read large data arrays, and you will see something like a 20-50x improvement compared to a CPU with its slow memory. One free example can be found on our corporate site, a double/quad precision CG solver for GPUs and CPUs.

Sincerely,

Ilgis

Elegant Mathematics Ltd.

Shared memory only has the lifetime of a block - you can’t assign to it (or read from it) from anywhere except kernel code - and even then, only for that block.
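
If you want a result to survive from one kernel call to the next, the place for it is global (device) memory, which does persist between launches. A minimal sketch of that pattern (the kernel names and the toy arithmetic are made up for illustration):

// Stage 1 writes its result to a device buffer; stage 2 reads it in a later launch.
__global__ void stage1(const float* in, float* tmp)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tmp[i] = in[i] * 2.0f;                         // placeholder computation
}

__global__ void stage2(const float* tmp, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = tmp[i] + 1.0f;                        // placeholder computation
}

void run(const float* h_in, float* h_out, int n)   // n assumed to be a multiple of 256
{
    float *d_in, *d_tmp, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_tmp, n * sizeof(float)); // intermediate buffer lives in global memory
    cudaMalloc((void**)&d_out, n * sizeof(float));

    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;
    int grid = n / block;
    stage1<<<grid, block>>>(d_in, d_tmp);          // any shared memory used inside dies with the blocks
    stage2<<<grid, block>>>(d_tmp, d_out);         // d_tmp is still valid: global memory persists

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_tmp); cudaFree(d_out);
}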

As for the GFLOPs question - it depends on the problem, and your algorithm. The GPU can be a lot less forgiving of seemingly minor flaws, since it doesn’t have large caches to hide memory latencies, and that can lead to substantial slow downs. For the most part, I don’t bother working out the FLOP count of what the GPU is doing - I look to the speed up relative to the CPU code.

I have measured performance in a few test scenarios, and now I have a conclusion.
I tested FIR filtering, FFT, simple MAC operations and so on… The idea was to measure GFLOPS, and the result is very bad. :-(

The GPU is designed for streaming (real-time) processing. That means the user can configure the processing structure once and then execute it forever; the user can change the parameters of the processing, but not its structure.
CUDA, by calling functions from the host, effectively reconfigures the core every time, and that costs a lot of performance. In my view the reason is not memory access problems (the memory runs at 500 MHz * 128 bit) or the cost of access.

That means CUDA is a great idea (SIMT especially), but it is the wrong way to use the GPU. That's why people will never get more than 50 GFLOPS out of a GT200. CUDA can be used only if there is no way to use a normal CPU (basically performance for free if a GPU is installed but otherwise not used at 100%).

I would put it this way: CUDA 2.x is a good idea with a wrong implementation. If someone needs performance, they still have to program the GPU with shaders.

  1. What I would like to see from NVIDIA.

I would like to see a solution that mixes languages: Verilog (or VHDL) and C (the C from CUDA). That is, you write functions in C, and each function executes on one shader processor, but then you can link these functions into an application (or processing block) with a language like Verilog (VHDL). This block would be loaded onto the GPU once, and the host would only trigger the execution of the block (or application). You must be able to access GPU memory in real time to change the parameters of the functions, but not the structure of the application.
That's it. This is exactly the way the GPU should be used.

Thanks for answers,
Regards,
Dmitry

I’m not sure I understand.

Last app I wrote (something with Lyapunov fractals) achieved a 150x speed-up compared to a CPU version. A lowly GeForce 8800 GTS vs both cores of an E5200 @ 3 GHz (with SSE2). IIRC the CUDA version did around 130 GFLOPS. Even though that's far from the ~500 GFLOPS peak my card is rated at, it's still 150x faster than whatever my CPU was able to pull with my implementation. Remember, CPUs don't reach their peak performance easily either.

And it wasn't even difficult to code. In fact, the compute functions were identical: I actually defined them as __device__ __host__ so the same code was compiled for both implementations. I'm pretty sure it could be optimised more if I really wanted to.

Your problem is that you’re fixated on GFLOPS while this is NOT a good indication of performance (not only for CUDA but in general). You should compare running times. For many applications, FLOPS are completely irrelevant and, if anything, bandwidth should be measured instead. By the way, a modern GPU’s bandwidth is about the same as the CPU’s L1 cache bandwidth (and an order of magnitude more than CPU RAM bandwidth). Peak bandwidth isn’t always the best thing to measure so the bottom line - compare running times of reasonably optimized implementations for GPU and CPU and then decide.
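
For what it's worth, the simplest way to get an honest GPU-side running time (which is what I'd compare against the CPU version) is a pair of CUDA events around the kernel. A rough sketch, with myKernel standing in for whatever you're benchmarking:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_data, n);   // the kernel under test (placeholder name and arguments)
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // kernel launches are asynchronous, so wait here

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);   // compare against the CPU implementation's wall-clock time

cudaEventDestroy(start);
cudaEventDestroy(stop);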

PS: Shaders aren’t generally faster than CUDA. In many cases they will be much slower. A shader is limited by the same hardware and will not reach peak GFLOPS either. Calling a kernel doesn’t cost much (microseconds) and is often slightly faster than calling a shader through a graphics API (I remember Volkov’s paper mentioning this) and you get more configuration options during a kernel launch than when you render a shader.

And you get direct access to shared memory. :)

I have stopped my investigations with CUDA because…

  1. The maximum I got was ~25% of the GPU's declared performance (and not only me; I think nobody got more).

  2. With ATI Stream I got 97% of the declared performance (~970 GFLOPS) without any problems, on a card that is not even a dedicated compute card. It is more flexible and developer friendly.

I think the reason is the structure of the NVIDIA GPU. It makes no sense to use an NVIDIA GPU as a compute device.

Thanks & best regards,
Dmitry

You have no idea what you’re talking about.

Can you show me a CUDA application that reaches 97% of peak performance?

No?

You will never be able to. (And I know why. :-))

A highly optimized FFT (hardcoded and so on) gives at most 20% of the peak GFLOPS (see other topics on this forum).

And after that you say I have no idea about the subject?

Kid, I have been in REAL development since 1996. Believe me, I know what I am talking about… :-)

No you don’t.

And since you want to hear performance numbers; Multiple parallel convolutions on the C1060, part of a larger application: 557 GFLOP/s.

I don't know where your supposed knowledge and conclusions about GPUs come from, but it is obvious that you consider yourself very clever. Don't you think it is a bit strange that, although a large number of people using GPUs with CUDA do reach high performance, you conclude that it is the wrong way to use a GPU? Also, please enlighten us about why you can't reach high performance with CUDA while you can with AMD's Stream SDK (should be easy, as you state above that you know why).

Yes, I could show you plenty of such applications. I’ve written some myself. Damn it, there’s an nbody example included in the SDK which does 250 GFLOPS on my card, what are you trying to prove?

You haven’t even shown us what code you ran on your GPU or how you measured performance. Judging by your second post in this thread, you’ve no idea how CUDA works. Hint: there’s no memcopy from device memory to shared memory. And shared memory is not persistent through different kernels.
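
For the record, shared memory only ever gets filled by the kernel itself, typically as a per-block staging area, something like this generic sketch (not your code, just the usual pattern, assuming a block size of 256):

__global__ void sumTiles(const float* in, float* out)
{
    __shared__ float tile[256];                    // per-block buffer, lives only for this launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];                     // each thread stages one element from global memory
    __syncthreads();                               // make the whole tile visible to every thread in the block

    if (threadIdx.x == 0) {                        // trivial use of the shared data: not efficient, just illustrative
        float s = 0.0f;
        for (int k = 0; k < blockDim.x; ++k)
            s += tile[k];
        out[blockIdx.x] = s;
    }
}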

You are outright ignoring what has been written in this and other threads. Honestly, are you a troll? You also refuse to accept there’s a thing called memory bandwidth (which is actually LOWER in AMD cards) which is a limit that kicks in much sooner than ALU throughput, or that there are performance metrics different than FLOPS. AMD suffers the same. Perhaps your REAL development never included high performance parallel computing? Or perhaps you don’t know how many memory accesses per arithmetic instruction there are in FFT or *GEMM?

Why do you think CUDA has such a huge community compared to AMD Stream? Is this a conspiracy, are we being paid by NVIDIA or are we all just stupid and can’t tell a well performing implementation from a bad one?

OK!

Let's stop the "holy war"! (Sorry that I started it.)

I chose ATI Stream because it fits my requirements.

What I have in ATI Stream:

  1. DMA access.

  2. Common shader code, meaning I can use it together with DirectX.

  3. Direct access to HW.

  4. Assembler, linker and high-level language in one package.

  5. Fast memory access in shader (cache).

  6. Event driven execution queue.

  7. The shader core is more powerful (more instructions, a clearer structure and so on… It is like a small DSP or a microcell in an FPGA that can be programmed in assembler, not in VHDL (which I hate)).

  8. The GPU is faster and cheaper.

Why is ATI Stream not as popular? Marketing.

  1. For a long time there was no high-level language in Stream.

  2. The API is not as comfortable as CUDA's.

  3. Not all developers understand the hardware details.

But… it is really faster. Maybe the reason is that I wrote my blocks for ATI in assembler (CAL), but in any case on CUDA I was not able to get the performance even with assembler.

Code example for CUDA:

// Here N in my tests was N=1024*1024
__global__ void test(float* in, int N)
{
    __shared__ float data[512];
    data[threadId] = in[threadId];
    float sum = 0; // you can use float4
    for (int i = 0; i < N; i++)
    {
        sum += data[threadId]*data[threadId]; // simple MAD operation in a loop. You can try sin/cos/div and something else. Be careful with the optimizer. :-)
    }
    in[threadId] = sum;
}

extern "C" void runtest(int N, int K, int M)
{
    test<<<K,M>>>(N);
}

=======

call of kernel:

for (int i = 0; i < 512; i++)
{
    for (int j = 0; j < 512; j++)
    {
        starttime();
        runtest(1024*1024, i, j);
        checktime();
    }
}

=====

For M=1 and K=1 I got 58 ms!!!

People who have spent a lot of time on FFT got only 25% of peak performance! That is not acceptable for me. :-(

CUDA started from the API level and provides "ease of use"; Stream started from the hardware level and provides freedom for development. I like the second more.

That's it.

The problem, I think, is that NVIDIA chips are not designed for asynchronous execution.

Finally.

CUDA simply has no features that fit my requirements. If I could really get the performance, then there would be no problem, but this is the real world. :-)

Regards,

Dmitry.

PS.

I do not need a black box that just calculates. In normal DSPs and GPUs, DMA is used. In an FFT there are not that many accesses to global memory.

Is an FPGA not high performance parallel computing?

What is the performance of the C1060?

PS.PS. Forgot to add: ATI Stream provides a real hardware assembler (ISA) that matches the hardware. What we have in CUDA (PTX) is something else.

deleted my post. need to think about it more.

You have a race condition in that kernel. Either that, or out of bounds shared memory accesses (depending whether threadId is simply threadIdx.x or threadIdx.x + blockDim.x*blockIdx.x).

Your timing is wrong because kernels are launched asynchronously. With that many kernels launched, it may be irrelevant because the queue fills up anyway.

You're launching blocks whose sizes aren't multiples of the warp size. That's a performance loss out of the box.

You haven’t understood what shared memory is for. It’s not necessary in a calculation like this at all.

You just can’t code in CUDA. Don’t blame it on the hardware or the environment.

By the way, this can’t be the full code, since you’re not supplying any device pointer to test in runtest, only N.

Both NVIDIA and AMD access memory in a similar fashion - using coordinated (coalesced) reads/writes of blocks of memory. It’s absolutely the fastest way to make SIMD throughput machines.

There are plenty. Let’s do a little math.

To reach 1000 GFLOPS on an NVIDIA GPU, one basically has to pump out MAD and MUL instructions, which may be executed concurrently (dual issue).

A MAD instruction is:

a = b + c*d

And a MUL

e = f * g;

A MAD counts as 2 FLOPs and a MUL as 1, and between them they read five input operands. Now, if you wanted to sustain 1000 GFLOP/s with every operand coming from memory, you'd need to pull roughly 1000/3 * 5 ≈ 1,700 billion floats per second, which at 4 bytes per float is close to 7,000 GB/s.

A GTX 285 has a memory bandwidth of 155 GB/s. An AMD HD 4890 has 122 GB/s. See the problem? And this assumes your program consists entirely of interleaved MADs and MULs and that everything else is perfect (no latencies, no register dependencies, perfect scheduling, you don't write the results back, etc.).

This is why real apps, be it on AMD or NVIDIA GPUs, using any programming model, shaders, assembly, whatever, usually don’t reach peak GFLOPS. FFT is an example. Unless AMD’s FFT does 1TFLOP?

Of course you could get more GFLOPS per GB/s by only working on registers/shared memory/cached data and not fetching much global memory, but real apps generally want to go through huge datasets, not small ones repeatedly. And caches don’t help here if that’s a throughput issue. Neither does DMA, it’s just a way of fetching data, it doesn’t go around the physical width of the memory bus and frequency of the memory, only allows some asynchronous access. NVIDIA cards do memory accesses asynchronously as well, automatically. Proof being that it’s relatively easy to get 80-90% of peak physical bandwidth.

In a real life situation, you’d most likely need around 200+ arithmetic instructions per global memory access to even be close to saturating the ALUs, otherwise you’ll only hit big GB/s (which is still good and might get you 100x+ speed-ups compared to CPU). This is the same for AMD cards and it’s one of the reasons why GFLOPS are neither the only nor the best performance metric there is.
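
To put a back-of-the-envelope number on that for a GTX 285: roughly 1063 GFLOP/s of peak arithmetic against 155 GB/s of bandwidth works out to about 27 FLOPs per float of traffic just to break even on bandwidth, and once you count writes, imperfect coalescing and latency you end up needing the few-hundred-FLOPs-per-access regime mentioned above before the ALUs, rather than the memory bus, become the limit.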

Does FFT do 200+ arithmetic operations on every data element it loads? If so, for what sizes of datasets and what N?

BTW: I've modified your code so that it makes sense. I reach 412 GFLOPS out of the max 640, effectively maxing out the hardware at 2 FLOPs per clock (the MAD), save for the concurrent MUL that could have been added for a third. I'm reaching 97% of the theoretical FLOPS possible without dual-issuing MAD+MUL. Here's the code:

__global__ void test(float* in, int N)
{
	int threadId = threadIdx.x + blockDim.x*blockIdx.x;
	float data = in[threadId];
	float sum = 0;
#pragma unroll 200 //pretty important for performance here
	for (int i = 0; i < N; ++i)
	{
		sum = sum + data*data;
	}
	in[threadId] = sum;
}

When I add a mul there, just for fun, I get 520 GFLOPS, reaching about 80% of peak GFLOPS. Only ~80% because dual issue doesn't work perfectly on compute capability 1.1 cards. I hear that's been improved in 1.2+?
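
For completeness, the "added mul" variant just puts an independent multiply next to the MAD in the loop body, roughly like the sketch below. This is a reconstruction rather than the exact code I ran, and the aux output only exists so the compiler doesn't throw the MUL away:

__global__ void test_madmul(float* in, float* aux, int N)
{
	int threadId = threadIdx.x + blockDim.x*blockIdx.x;
	float data = in[threadId];
	float sum = 0.0f;
	float prod = 1.0f;
#pragma unroll 200
	for (int i = 0; i < N; ++i)
	{
		sum = sum + data*data;    // MAD: 2 FLOPs
		prod = prod * data;       // independent MUL: 1 FLOP, a candidate for dual issue
	}
	in[threadId] = sum;
	aux[threadId] = prod;             // keep prod live so the MUL isn't optimized away
}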

Thanks for the code. I will try it today on the GTX 260 and the 9500 GT.

About memory access.

I understand that the memory bandwidth is not enough; that is why I asked whether it is possible to allocate shared memory across calls.

DMA is very helpful in a pipeline.

The normal procedure would be like this.

Define the application structure as with VHDL, i.e. connect the inputs and outputs of functions. You execute a few functions sequentially, and at the end you have a result. In parallel you run a DMA transfer that copies the previous result to global memory, and global memory uses a different DMA to copy the data to the host. But for this I need low-level access, and CUDA does not support that.
Basically, if the memory runs at 1000 MHz * 128 bit, you must be able to reach 1000 GFLOP/s if you spend 200 FLOPs on every sample. That is not so much; it should be possible to reach these limits. :-)

About your code: I don't like that in between you are not accessing shared memory. My idea was that shared memory works like a level 2 cache, not a level 1.

Anyway, I will report my results. :-)

Thanks and regards,

Dmitry

Why would I need shared memory there? I use registers. All variables like threadId, data, sum end up in registers by default. It’s exactly as fast as shared memory, sometimes faster.

Shared memory only makes sense if there is indeed memory to be shared among the threads of a block. Those are independent per-thread calculations. Allocating shared memory there could even lower the performance due to higher resource requirements per block.

Code I used to launch the kernel

int main(int argc, char* argv[])
{
	if(!InitCUDA()) {
		return 0;
	}

	int N = 102400;
	int size = 512*512;

	float* d_in;
	CUDA_SAFE_CALL( cudaMalloc((void**) &d_in, sizeof(float) * size));  // allocate size floats
	cudaMemset(d_in, 0, sizeof(float) * size);                          // clear the whole buffer

	unsigned int timer = 0;
	CUT_SAFE_CALL( cutCreateTimer( &timer));
	CUT_SAFE_CALL( cutStartTimer( timer));

	int blockDim = 64;
	int gridDim = size / blockDim;
	test<<<gridDim, blockDim, 0>>>(d_in, N);
	CUT_CHECK_ERROR("Kernel execution failed\n");
	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	CUT_SAFE_CALL( cutStopTimer( timer));
	printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
	CUT_SAFE_CALL( cutDeleteTimer( timer));

	CUDA_SAFE_CALL( cudaFree(d_in));
	CUT_EXIT(argc, argv);
	return 0;
}

I’ve chosen a smaller N to ensure I don’t hit the watchdog timer. It would run 5 seconds for 1024x1024. There are 512*512 threads, each doing N iterations of MADs.

InitCUDA is a non-standard helper function that’s generated by CUDA VS wizard

#if __DEVICE_EMULATION__
bool InitCUDA(void){return true;}
#else
bool InitCUDA(void)
{
	int count = 0;
	int i = 0;

	cudaGetDeviceCount(&count);
	if(count == 0) {
		fprintf(stderr, "There is no device.\n");
		return false;
	}

	for(i = 0; i < count; i++) {
		cudaDeviceProp prop;
		if(cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
			if(prop.major >= 1) {
				break;
			}
		}
	}

	if(i == count) {
		fprintf(stderr, "There is no device supporting CUDA.\n");
		return false;
	}

	cudaSetDevice(i);
	printf("CUDA initialized.\n");
	return true;
}
#endif

A final note about DMA/VHDL etc. - this is a question of selecting a programming model that suits one’s taste. It has little to do with speed. And will most definitely not go around hardware limitations that are roughly the same for AMD and NVIDIA GPUs.

What you describe with DMA in global memory between kernels and to/from host makes me think of CUDA’s Stream API (see Programming Guide). Device (global) memory is persistent between kernel launches and is conveniently addressable as in C (arrays). Under the hood, global memory accesses will be performed asynchronously and the hardware will hide the latency with computations if other threads are ready for processing.
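
To make that concrete, here is a rough sketch of the copy/compute overlap the stream API gives you (the process kernel, the two-stream split and the chunk sizes are all placeholders, and the host buffers must be page-locked with cudaMallocHost or the async copies silently fall back to synchronous ones):

__global__ void process(const float* in, float* out)       // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

void pipeline(const float* h_in, float* h_out, float* d_in, float* d_out, int n)
{
    const int nStreams = 2;
    const int chunk = n / nStreams;                         // assume n divides evenly and chunk % 256 == 0
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);    // copy in, overlapped across streams
        process<<<chunk / 256, 256, 0, streams[s]>>>(d_in + off, d_out + off);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);    // copy out, overlapped across streams
    }
    cudaThreadSynchronize();                                // wait for all streams (CUDA 2.x-era call)

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
}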

Dmitry,

I've looked at AMD's APIs and found them harder to use than NVIDIA's CUDA. This is the primary reason why I'm using CUDA and buying NVIDIA hardware, even though I've heard ATI's boards are 20% or so faster (I can offer no evidence for this last statement). I am under the impression that I'm in the majority (NOT just on this website, but among GPGPU developers in general) regarding the ease-of-use assessment.

As far as my application is concerned, memory bandwidth is not an issue, but efficient AND EASY TO CODE access to the registers and shared memory is: contrary to your FFT example, my code performs very many integer and floating point operations per global memory read/write. I don't find Brook to be a convenient tool for writing complex massively multi-threaded code against shared memory.

I would be very interested in seeing a piece of complex numerical code (e.g. a shared memory-based solver of a VERY LARGE number of different SMALL linear systems of equations, each of which fully fits into shared memory) that works much faster on ATI than on NVIDIA hardware of the same generation. I would be particularly impressed if this ATI code turned out to be easier to implement than the CUDA code.

It was not easy for me to get on the GPGPU bandwagon, even though I have written assembly code (6800, Z80/8080, 8086, 286, 386, Pentium, Pentium 4, Core microarchitecture), used many languages, and done SSE optimizations (essentially on the Core microarchitecture). I needed good and precise documentation, good tools, a simple language to learn (or in this case re-learn), and good community support.

And, moreover, a GPU that is designed from the ground up to run general-purpose applications, not one that is totally shader-oriented.

That's why I think ATI's GPUs may be better for some problems, but won't be efficient at general-purpose computing.

And the idea of using assembly code leads to problems when optimizing for new GPU architectures: the same way Pentium 4-optimized assembly probably won't use the Core microarchitecture (Core 2 Duo/Quad) correctly, this will be a limiting factor when ATI introduces architectural changes.

Maybe AMD/ATI will prove me wrong when they introduce their OpenCL implementation for ATI GPUs, but as far as I can see, introducing OpenCL for x86 CPUs instead of GPUs seems to point in the other direction.

Anyway, I plan to test that when Snow Leopard is delivered with OpenCL support for both NVIDIA and ATI GPUs (I doubt ATI will be supported at launch time!).