Barra, a CUDA-capable GPU simulator
[W]e study how architectural features such as core complexity, cache/buffer design, and fixed function units impact throughput computing workloads.
In addition, this paper also presents a fair comparison between performance on CPUs and GPUs and dispels the myth that GPUs are 100x-1000x faster than CPUs for throughput computing kernels.
EDIT: The claimed 100x speedup is really a comparison between a multi-threaded SIMD implementation with regular memory accesses and a highly tuned single-threaded SISD implementation with unstructured memory accesses. Put differently, it is an implementation in a language like CUDA versus an implementation in a language like C without SSE, without threads, and with an abundance of pointer chasing.
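To make the "pointer chasing" versus "regular memory accesses" distinction concrete, here is a minimal C sketch. It is not taken from the paper; the names `struct node`, `sum_list`, `sum_array`, `list_from_array`, and `free_list` are hypothetical and chosen only for illustration. Both functions compute the same sum, but only the array version has the unit-stride, independent loads that SSE or a CUDA kernel can exploit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node type: every element sits in its own heap allocation. */
struct node {
    int value;
    struct node *next;
};

/* Pointer-chasing sum: the address of each node is only known after the
   previous node has been loaded, so the loads form a serial dependency
   chain that SIMD units cannot process in bulk. */
long sum_list(const struct node *head) {
    long total = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        total += p->value;
    return total;
}

/* Contiguous sum: unit-stride, independent loads. A vectorizing compiler,
   hand-written SSE, or a GPU kernel can handle several elements at once. */
long sum_array(const int *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

/* Build a list holding data[0..n-1] in order (illustrative helper). */
struct node *list_from_array(const int *data, size_t n) {
    struct node *head = NULL;
    for (size_t i = n; i > 0; --i) {
        struct node *fresh = malloc(sizeof *fresh);
        assert(fresh != NULL); /* sketch only: abort on allocation failure */
        fresh->value = data[i - 1];
        fresh->next = head;
        head = fresh;
    }
    return head;
}

/* Release every node of the list. */
void free_list(struct node *head) {
    while (head != NULL) {
        struct node *next = head->next;
        free(head);
        head = next;
    }
}
```

Timing a GPU port of `sum_array` against `sum_list` on one CPU core would mix the hardware difference with the difference in data layout and parallelism, which is the kind of comparison the EDIT above is warning about.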
Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark