MD5 hash on a GPU: Implementation and benchmarks
Hi all,
We've recently had an interesting discussion at a local HPC Users Group meeting, on how (in theory) GPUs should be perfect for a range of cryptanalysis/cryptography applications. So last weekend I wrote a simple CUDA MD5 hash computation routine (based on RSA's Md5.c source) to [url="http://majuric.org/software/cudamd5/"]test it out in practice[/url].

The results were pretty good, with speedups (when run on 8800 Ultra) of ~36x over a single core of a Q6700 Core Duo (or ~9x over all four cores, assuming linear scaling). This is a pretty nice gain for an unoptimized piece of code written in 1.5 days by a curious astrophysicist :-).

In absolute terms, the GPU was able to compute ~110 million MD5 hashes per second, while it was even faster (~160 Mhash/sec) when the results weren't stored to global memory but instead compared against a "target" hash and discarded if no match was found ("search mode"). In laymen terms, this means a single 8800 Ultra can brute-force break an MD5 hashed password of eight or less alphanumerics+numbers (A-Z, a-z, 0-9) in about ~16 days. Kind of scary.

A short writeup, together with an exploration (and pretty plots!) of how the performance depends on the block size, number of blocks, threads, etc., can be found at [url="http://majuric.org/software/cudamd5/"]http://majuric.org/software/cudamd5/[/url]. The source code is there as well. Feel free to download/use/modify it.

PS: If you download the code, I'd be especially interested to hear how it fares on other GPUs (especially the G92 series). Just run 'make benchmark', and e-mail me the results!

Cheers,
Hi all,

We've recently had an interesting discussion at a local HPC Users Group meeting, on how (in theory) GPUs should be perfect for a range of cryptanalysis/cryptography applications. So last weekend I wrote a simple CUDA MD5 hash computation routine (based on RSA's Md5.c source) to test it out in practice.



The results were pretty good, with speedups (when run on 8800 Ultra) of ~36x over a single core of a Q6700 Core Duo (or ~9x over all four cores, assuming linear scaling). This is a pretty nice gain for an unoptimized piece of code written in 1.5 days by a curious astrophysicist :-).



In absolute terms, the GPU was able to compute ~110 million MD5 hashes per second, while it was even faster (~160 Mhash/sec) when the results weren't stored to global memory but instead compared against a "target" hash and discarded if no match was found ("search mode"). In laymen terms, this means a single 8800 Ultra can brute-force break an MD5 hashed password of eight or less alphanumerics+numbers (A-Z, a-z, 0-9) in about ~16 days. Kind of scary.



A short writeup, together with an exploration (and pretty plots!) of how the performance depends on the block size, number of blocks, threads, etc., can be found at http://majuric.org/software/cudamd5/. The source code is there as well. Feel free to download/use/modify it.



PS: If you download the code, I'd be especially interested to hear how it fares on other GPUs (especially the G92 series). Just run 'make benchmark', and e-mail me the results!



Cheers,

#1
Posted 03/15/2008 12:30 AM   
isn't the 8800GTS a G92 core?

Very interesting btw, that 63 threads per block can be optimal.

I am afraid I have quite some code that is coded for a specific number of threads per block, seeing this I think I will have try to make some of my kernels that use shared memory based on the number of threads more generic in this respect and do some extra benchmarking.
isn't the 8800GTS a G92 core?



Very interesting btw, that 63 threads per block can be optimal.



I am afraid I have quite some code that is coded for a specific number of threads per block, seeing this I think I will have try to make some of my kernels that use shared memory based on the number of threads more generic in this respect and do some extra benchmarking.

#2
Posted 03/15/2008 06:40 AM   
[b]mjuric[/b],
Nice job!

However, it's all been done before, check [url="http://www.elcomsoft.com/edpr.html"]this[/url] for little more information and performance figures.
mjuric,

Nice job!



However, it's all been done before, check this for little more information and performance figures.

// everything is reversible

#3
Posted 03/15/2008 09:48 AM   
[quote name='DenisR' date='Mar 15 2008, 01:40 AM']isn't the 8800GTS a G92 core?
[/quote]

Not if it is the 8800 GTS with 320 or 640 MB of memory. Those cards came out shortly after the 8800 GTX and use the G80 core. Someone at NVIDIA decided it would be OK to reuse the model name for the revised card. You have to either write "8800 GTS (G92)" or "8800 GTS 512 MB" (or 1 GB) to be clear you are talking about the newer card.
[quote name='DenisR' date='Mar 15 2008, 01:40 AM']isn't the 8800GTS a G92 core?





Not if it is the 8800 GTS with 320 or 640 MB of memory. Those cards came out shortly after the 8800 GTX and use the G80 core. Someone at NVIDIA decided it would be OK to reuse the model name for the revised card. You have to either write "8800 GTS (G92)" or "8800 GTS 512 MB" (or 1 GB) to be clear you are talking about the newer card.

#4
Posted 03/15/2008 03:40 PM   
AndreiB: Yes, I became aware of Elcomsoft's work halfway through coding mine. But I finished it out of academic curiosity & for tinkering value (I couldn't any actual CUDA MD5 sourcecode that I could compile and play with myself).
seibert: My GTS is the old one (it's actually a QuadroFX 4600).

PS: I made [url="http://majuric.org/software/cudamd5/#_hash_computation"]performance vs. threads per block plots for 8800 GTS 512[/url] (a G92 card; data courtesy of Ales Koval). I'm surprised by substantial the change in 'calc' test results, compared to the old 8800 GTS (and little difference in the 'search' test). Given that the main difference between the two modes is reading/writing to global memory, the G92s seem to be much better at memory access management than the old G80ies.
AndreiB: Yes, I became aware of Elcomsoft's work halfway through coding mine. But I finished it out of academic curiosity & for tinkering value (I couldn't any actual CUDA MD5 sourcecode that I could compile and play with myself).

seibert: My GTS is the old one (it's actually a QuadroFX 4600).



PS: I made performance vs. threads per block plots for 8800 GTS 512 (a G92 card; data courtesy of Ales Koval). I'm surprised by substantial the change in 'calc' test results, compared to the old 8800 GTS (and little difference in the 'search' test). Given that the main difference between the two modes is reading/writing to global memory, the G92s seem to be much better at memory access management than the old G80ies.

#5
Posted 03/21/2008 06:55 PM   
Hey mjuric, thanks for sharing your code!

haven't tested yet but looking forward to it!

btw, i have written multi-threaded SIMD version of MD5 for CORE2.

On a Q6600 unclocked 2.4ghz, my MD5 program with salts crunches 88,000,000 keys a second, which is 2x faster than MDCrack.

Hopefully the algorithm will be ported to GPU at some point, when i get hang of programming them of course.
Hey mjuric, thanks for sharing your code!



haven't tested yet but looking forward to it!



btw, i have written multi-threaded SIMD version of MD5 for CORE2.



On a Q6600 unclocked 2.4ghz, my MD5 program with salts crunches 88,000,000 keys a second, which is 2x faster than MDCrack.



Hopefully the algorithm will be ported to GPU at some point, when i get hang of programming them of course.

#6
Posted 04/14/2008 09:09 AM   
I have made some benchmarks of both CUDA-based and CPU-based md5 bruteforcers, you may find it here: [url="http://3.14.by/en/read/md5_benchmark"]http://3.14.by/en/read/md5_benchmark[/url]
I have made some benchmarks of both CUDA-based and CPU-based md5 bruteforcers, you may find it here: http://3.14.by/en/read/md5_benchmark

#7
Posted 07/20/2008 11:32 PM   
Hi BarsMonster,
Your work are great!
Why don"t you add cudamd5-v1.2.1 (http://majuric.org/software/cudamd5/) in the benchmark?
Could you please provide BarsWF in 32-bit version because other competitors are in 32-bit version?

With best regards,
Deguang


[quote name='BarsMonster' date='Jul 20 2008, 03:32 PM']I have made some benchmarks of both CUDA-based and CPU-based md5 bruteforcers, you may find it here: [url="http://3.14.by/en/read/md5_benchmark"]http://3.14.by/en/read/md5_benchmark[/url]
[right][snapback]413585[/snapback][/right]
[/quote]
Hi BarsMonster,

Your work are great!

Why don"t you add cudamd5-v1.2.1 (http://majuric.org/software/cudamd5/) in the benchmark?

Could you please provide BarsWF in 32-bit version because other competitors are in 32-bit version?



With best regards,

Deguang





[quote name='BarsMonster' date='Jul 20 2008, 03:32 PM']I have made some benchmarks of both CUDA-based and CPU-based md5 bruteforcers, you may find it here: http://3.14.by/en/read/md5_benchmark

[snapback]413585[/snapback]


#8
Posted 07/21/2008 02:11 AM   
32-bit versions for both CUDA based and SSE based versions are available at [url="http://3.14.by/en/md5"]http://3.14.by/en/md5[/url] .

Unfortunately cudamd5 is Linux only, and I do not feel that I can recompile it on windows quickly.
Based on speed benchmark in the beginning of the thread (110 million on 8800 ultra) it is going to be twice slower than CPU (9600GT C2D 3Ghz).
32-bit versions for both CUDA based and SSE based versions are available at http://3.14.by/en/md5 .



Unfortunately cudamd5 is Linux only, and I do not feel that I can recompile it on windows quickly.

Based on speed benchmark in the beginning of the thread (110 million on 8800 ultra) it is going to be twice slower than CPU (9600GT C2D 3Ghz).

#9
Posted 07/27/2008 08:07 AM   
[quote name='mjuric' post='342889' date='Mar 14 2008, 05:30 PM']Hi all,
We've recently had an interesting discussion at a local HPC Users Group meeting, on how (in theory) GPUs should be perfect for a range of cryptanalysis/cryptography applications. So last weekend I wrote a simple CUDA MD5 hash computation routine (based on RSA's Md5.c source) to [url="http://majuric.org/software/cudamd5/"]test it out in practice[/url].

The results were pretty good, with speedups (when run on 8800 Ultra) of ~36x over a single core of a Q6700 Core Duo (or ~9x over all four cores, assuming linear scaling). This is a pretty nice gain for an unoptimized piece of code written in 1.5 days by a curious astrophysicist :-).

In absolute terms, the GPU was able to compute ~110 million MD5 hashes per second, while it was even faster (~160 Mhash/sec) when the results weren't stored to global memory but instead compared against a "target" hash and discarded if no match was found ("search mode"). In laymen terms, this means a single 8800 Ultra can brute-force break an MD5 hashed password of eight or less alphanumerics+numbers (A-Z, a-z, 0-9) in about ~16 days. Kind of scary.

A short writeup, together with an exploration (and pretty plots!) of how the performance depends on the block size, number of blocks, threads, etc., can be found at [url="http://majuric.org/software/cudamd5/"]http://majuric.org/software/cudamd5/[/url]. The source code is there as well. Feel free to download/use/modify it.

PS: If you download the code, I'd be especially interested to hear how it fares on other GPUs (especially the G92 series). Just run 'make benchmark', and e-mail me the results!

Cheers,[/quote]

Interesting thread. I started looking at using a NVS 295 card in my workstation for some password auditing and came across this thread. Given a hash that is known to come from a maximum of 8-chars, it might practically take longer than 16 days. Here's why, please correct me if I am wrong.

The possible number of characters that can be used are 26 uppercase, 26 lowercase, 10 digits, and 32 special characters. Total 94. An 8-char password would yield 94^8 permutations. But the password you are looking to crack could be less than 8 characters unless you have enforced some password policy that requires no less than 8-chars. So the number of permutations is 94^1 + 94^2 ..... + 94^8 = 6161234432565770. At 110 Mhash/sec, it would take ~648 days to scan all possible permutations.
[quote name='mjuric' post='342889' date='Mar 14 2008, 05:30 PM']Hi all,

We've recently had an interesting discussion at a local HPC Users Group meeting, on how (in theory) GPUs should be perfect for a range of cryptanalysis/cryptography applications. So last weekend I wrote a simple CUDA MD5 hash computation routine (based on RSA's Md5.c source) to test it out in practice.



The results were pretty good, with speedups (when run on 8800 Ultra) of ~36x over a single core of a Q6700 Core Duo (or ~9x over all four cores, assuming linear scaling). This is a pretty nice gain for an unoptimized piece of code written in 1.5 days by a curious astrophysicist :-).



In absolute terms, the GPU was able to compute ~110 million MD5 hashes per second, while it was even faster (~160 Mhash/sec) when the results weren't stored to global memory but instead compared against a "target" hash and discarded if no match was found ("search mode"). In laymen terms, this means a single 8800 Ultra can brute-force break an MD5 hashed password of eight or less alphanumerics+numbers (A-Z, a-z, 0-9) in about ~16 days. Kind of scary.



A short writeup, together with an exploration (and pretty plots!) of how the performance depends on the block size, number of blocks, threads, etc., can be found at http://majuric.org/software/cudamd5/. The source code is there as well. Feel free to download/use/modify it.



PS: If you download the code, I'd be especially interested to hear how it fares on other GPUs (especially the G92 series). Just run 'make benchmark', and e-mail me the results!



Cheers,



Interesting thread. I started looking at using a NVS 295 card in my workstation for some password auditing and came across this thread. Given a hash that is known to come from a maximum of 8-chars, it might practically take longer than 16 days. Here's why, please correct me if I am wrong.



The possible number of characters that can be used are 26 uppercase, 26 lowercase, 10 digits, and 32 special characters. Total 94. An 8-char password would yield 94^8 permutations. But the password you are looking to crack could be less than 8 characters unless you have enforced some password policy that requires no less than 8-chars. So the number of permutations is 94^1 + 94^2 ..... + 94^8 = 6161234432565770. At 110 Mhash/sec, it would take ~648 days to scan all possible permutations.

#10
Posted 09/24/2009 06:34 AM   
[quote name='losttoy' post='591540' date='Sep 24 2009, 10:34 AM']So the number of permutations is 94^1 + 94^2 ..... + 94^8 = 6161234432565770. At 110 Mhash/sec, it would take ~648 days to scan all possible permutations.[/quote]
This thread is year+ old. Since then new GPUs arrived, some optimizations made, one GTX295 able to do 1400M MD5 hashes/sec, one 4870x2 even more -- 2300M, fresh 5870 -- (my estimations) is around 2665M. So cracking 8 symbols long MD5 password containing even all printable characters can be done in less than 14 days with system costs less than $1500.

If you're interested in MD5 hashing speed on different systems (including estimations for GT300 & HD5870) you can read this:
[url="http://golubev.com/about_cpu_and_gpu_2_en.htm"]http://golubev.com/about_cpu_and_gpu_2_en.htm[/url]
[quote name='losttoy' post='591540' date='Sep 24 2009, 10:34 AM']So the number of permutations is 94^1 + 94^2 ..... + 94^8 = 6161234432565770. At 110 Mhash/sec, it would take ~648 days to scan all possible permutations.

This thread is year+ old. Since then new GPUs arrived, some optimizations made, one GTX295 able to do 1400M MD5 hashes/sec, one 4870x2 even more -- 2300M, fresh 5870 -- (my estimations) is around 2665M. So cracking 8 symbols long MD5 password containing even all printable characters can be done in less than 14 days with system costs less than $1500.



If you're interested in MD5 hashing speed on different systems (including estimations for GT300 & HD5870) you can read this:

http://golubev.com/about_cpu_and_gpu_2_en.htm

#11
Posted 09/24/2009 08:37 AM   
Scroll To Top