AMD Radeon 3x faster than NVIDIA at Bitcoin mining (SHA-256 hashing)
Bitcoin mining is essentially SHA-256 hashing.

According to the table at [url="http://bitminer.info/"]http://bitminer.info/[/url], the Radeon 6970 ($330) is able to run Bitcoin mining at 323 MHash/s while the GTX 570 ($330) runs it at 105 MHash/s. The Radeon is 3x faster.

An explanation for this is provided at [url="https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU#Why_are_AMD_GPUs_faster_than_Nvidia_GPUs?"]https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU#Why_are_AMD_GPUs_faster_than_Nvidia_GPUs?[/url]

The explanation states that the Radeon is faster not only on SHA-256 hashing, but on "all ALU-bound GPGPU workloads". Further, it explains that the Radeon has a particular advantage on SHA-256 hashing because it has an instruction for 32-bit integer right rotation, which NVIDIA GPUs do not have.
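
For concreteness, here is a minimal sketch (in plain CUDA C, not taken from any actual miner) of why that rotate instruction matters: SHA-256's round functions lean heavily on 32-bit right rotations, and without a native rotate each one is typically compiled to two shifts plus an OR.

[code]
// Minimal sketch of the rotation-heavy core of SHA-256 in CUDA C.
// Without a hardware rotate instruction, rotr32 compiles to two shifts and an OR.
__device__ __forceinline__ unsigned int rotr32(unsigned int x, unsigned int n)
{
    return (x >> n) | (x << (32u - n));
}

// The "big sigma" functions used in every SHA-256 round: three rotations
// and two XORs each, so the cost of a rotate is paid many times per hash.
__device__ __forceinline__ unsigned int Sigma0(unsigned int x)
{
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}

__device__ __forceinline__ unsigned int Sigma1(unsigned int x)
{
    return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25);
}
[/code]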

I'm wondering what the CUDA community's take on this is. Is the Radeon really faster on "all ALU-bound GPGPU workloads" at a given price point? If so, what is NVIDIA faster on?

#1
Posted 06/15/2011 08:54 PM   
[quote]Is the Radeon really faster on "all ALU-bound GPGPU workloads" at a given price point? [/quote]

I would say that the Radeon is asymptotically faster on ALU-bound workloads in the limit of infinite programmer time. So if you have a particularly simple, vectorizable task where the AMD compiler can do a good job, you might see a 2-3x advantage right away. If you have a complicated, poorly vectorizable task, making the program faster on AMD may take substantial effort.

NVIDIA is faster on non-vectorized tasks, especially ones that involve memory accesses, and especially if those accesses are shorter than 4 bytes. For example, performing an operation with a 1-byte operand in L1 cache has no overhead on NVIDIA (as far as I know). To do the same on AMD, the compiler has to generate a complicated explicit sequence of instructions, effectively making that single access take as long as 20-30 ALU instructions.
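
To make the sub-4-byte point concrete, here is a minimal, hypothetical CUDA kernel of the kind I mean: each thread operates on a single 1-byte operand. It is only an illustration of the access pattern, not a benchmark.

[code]
// Hypothetical example: each thread does one 1-byte load and one 1-byte store.
// On NVIDIA's scalar cores the byte access is a single load/store instruction;
// the claim above is that VLIW hardware needs an explicit multi-instruction
// sequence to extract and reinsert the byte.
__global__ void add_one_byte(const unsigned char *in, unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + 1;   // 1-byte operand, trivial ALU op
}
[/code]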
#3
Posted 06/15/2011 10:29 PM   
Yes, I think the theoretical difference is somewhere around ~2.7 TFLOPS vs ~1.5 TFLOPS for AMD and NVIDIA respectively, so if you have an extremely compute-bound problem you might approach this limit. The memory bandwidth difference is more or less negligible, and bandwidth is usually the limiting factor. But as hamster points out, the AMD 4-wide VLIW architecture is often harder to utilize efficiently; their ALUs are not as general-purpose as the NVIDIA FPUs.
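
(For reference, and assuming the usual 2 FLOPs per ALU per clock, those peaks roughly follow from the published specs: the HD 6970 has 1536 stream processors at 880 MHz, about 2.7 TFLOPS, while the GTX 570 has 480 CUDA cores at a 1464 MHz shader clock, about 1.4 TFLOPS.)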

#5
Posted 06/16/2011 02:02 PM   
[quote name='Jimmy Pettersson' date='16 June 2011 - 11:02 AM' timestamp='1308232967' post='1252584']
Yes, I think the theoretical difference is somewhere around ~2.7 TFLOPS vs ~1.5 TFLOPS for AMD and NVIDIA respectively, so if you have an extremely compute-bound problem you might approach this limit. The memory bandwidth difference is more or less negligible, and bandwidth is usually the limiting factor. But as hamster points out, the AMD 4-wide VLIW architecture is often harder to utilize efficiently; their ALUs are not as general-purpose as the NVIDIA FPUs.
[/quote]

Is it hence possible to make a faster SHA engine with ALU operations only?
#7
Posted 07/09/2011 10:33 PM   
Most HPC-related problems are data-parallel and hence vectorizable. However, GPGPU is also moving non-traditional HPC problems to the GPU, and in those cases achieving performance with AMD is a bit challenging. It totally depends on the problem in question.

As far as memory bandwidth goes, AMD can manage well even if there is a lot of non-coalesced access in your program.

It isn't too bad. AMD cards can give as much bang for the buck as NVIDIA does. And OpenCL is a standard anyway...

But OpenCL does not really mitigate the portability issues. Separate kernels are sometimes needed to address the AMD and NVIDIA platforms separately.
You may want to check Dr. Dongarra's paper on writing high-performance BLAS kernels in OpenCL. Google for it; you should be able to find it. Dr. Dongarra is a pioneer in the BLAS and LAPACK world, and still one of the best in the field.

Ignorance Rules; Knowledge Liberates!

#8
Posted 07/10/2011 03:16 PM   
[quote name='Sarnath' date='10 July 2011 - 10:16 AM' timestamp='1310311013' post='1262563']
Most HPC-related problems are data-parallel and hence vectorizable. However, GPGPU is also moving non-traditional HPC problems to the GPU, and in those cases achieving performance with AMD is a bit challenging. It totally depends on the problem in question.

As far as memory bandwidth goes, AMD can manage well even if there is a lot of non-coalesced access in your program.

It isn't too bad. AMD cards can give as much bang for the buck as NVIDIA does. And OpenCL is a standard anyway...

But OpenCL does not really mitigate the portability issues. Separate kernels are sometimes needed to address the AMD and NVIDIA platforms separately.
You may want to check Dr. Dongarra's paper on writing high-performance BLAS kernels in OpenCL. Google for it; you should be able to find it. Dr. Dongarra is a pioneer in the BLAS and LAPACK world, and still one of the best in the field.
[/quote]

Hopefully in another GPU generation or two, we'll see some architectural convergence that will make OpenCL work better across platforms. It already looks like AMD is moving in the direction of NVIDIA for their next architecture:

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute

#9
Posted 07/10/2011 05:21 PM   
Thanks for the link. Waiting to see how this whole new Fusion thing will pan out...

Ignorance Rules; Knowledge Liberates!

#10
Posted 07/11/2011 05:15 AM   
AMD 5970 - 530 MHash/s

A previous-generation card producing such stunning numbers? (I suspect a typo.)

Ignorance Rules; Knowledge Liberates!

#11
Posted 07/11/2011 07:04 AM   
[quote name='Sarnath' date='10 July 2011 - 11:04 PM' timestamp='1310367847' post='1262767']
AMD 5970 - 530 MHash/s

A previous-generation card producing such stunning numbers? (I suspect a typo.)
[/quote]

No typo. The 5xxx and 6xxx series use the same fabrication process (40 nm), and the design differences are minor. In fact, in terms of hashes per watt, the 5970 may be faster than the 6990. The table on bitminer.info shows the 5970 at 530 MHash/s and 294 W and the 6990 at 670 MHash/s and 346 W, but all news sources say the 6990 _really_ draws 375 W at full load.
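
(Working that out from the figures above: 530 / 294 ≈ 1.80 MHash/s per watt for the 5970, versus 670 / 375 ≈ 1.79 MHash/s per watt for the 6990 at its real full-load draw, so the newer card is at best a wash in efficiency.)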

#12
Posted 07/12/2011 12:19 AM   
Wait, isn't the 6970 a dual GPU? The 570 is a single GPU.

A fair comparison would be the GTX 590 vs the Radeon 6970, wouldn't it?

#13
Posted 07/12/2011 06:57 AM   
[quote name='Jimmy Pettersson' date='12 July 2011 - 09:57 AM' timestamp='1310453844' post='1263373']
Wait, isn't the 6970 a dual GPU?
[/quote]
No, that is the 6990.

#14
Posted 07/12/2011 07:15 AM   
This is unfortunate, since I am very interested in Bitcoin mining AND I'd like to keep being an NVIDIA guy, since my rig is intended for gaming first.

It's great that NVIDIA cards are better at Folding@home :)

#15
Posted 07/14/2011 06:22 PM   