CUDA SHA-256 calculation improvements
I'm new to CUDA development. I wrote a program that illustrates the Bitcoin mining difficulty.

You enter a string, and the program finds a nonce which, prepended to the input string, yields a SHA-256 hash with a given number of zeros at the beginning.

For example, with a difficulty of 6 and "moffa13" as the input string, the program returns

6253010moffa13, which has this associated hash:

0000002dece0c0f5791305f53bfd5116966ea97a9604984cbb50891f243e5641 (6 leading zeros)

My program can currently do 2-4 MH/s.
I wrote the same program for the CPU, and it can do over 6 million hashes per second.

I'm pretty sure this can be optimized, but I don't know how.

Here is what I'm doing: I run a for loop that launches a kernel to process hashes. One thread computes one hash, since a single hash can't be parallelised. If a thread finds the right nonce, it sets a global variable to 1, and the other threads return. After the kernel call, I run cudaDeviceSynchronize and check whether the global variable is set to 1. If not, I relaunch the kernel with an updated nonce offset.
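The relaunch loop described above can be sketched as follows. This is a hedged, illustrative sketch, not the actual repo code: the kernel body, the flag name `g_found`, and the placeholder difficulty check are all made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

__device__ int g_found = 0;   // set to 1 by the thread that finds the nonce

// Illustrative kernel: each thread tries exactly one nonce.
__global__ void sha256_kernel(uint64_t nonce_offset) {
    uint64_t nonce = nonce_offset + blockIdx.x * blockDim.x + threadIdx.x;
    // ... compute SHA-256 of (nonce prepended to the input) here ...
    if (nonce == 123456789ULL)          // placeholder for the difficulty check
        atomicExch(&g_found, 1);
}

int main() {
    const int blocks = 26, threads = 1024;  // grid sized to keep the GPU busy
    uint64_t nonce_offset = 0;
    int found = 0;
    while (!found) {
        sha256_kernel<<<blocks, threads>>>(nonce_offset);
        cudaDeviceSynchronize();            // wait, then check the flag on the host
        cudaMemcpyFromSymbol(&found, g_found, sizeof(found));
        nonce_offset += (uint64_t)blocks * threads; // each thread tried one nonce
    }
    printf("found after offset %llu\n", (unsigned long long)nonce_offset);
    return 0;
}
```

The per-iteration `cudaDeviceSynchronize` plus symbol copy is the simple version of this design; the questions below are about whether that structure, and the launch configuration, are right.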

Is calling the kernel multiple times the right approach?
Should I change my code's design?

Please tell me what should be changed in order to optimize this :)

Here's the GitHub repo: https://github.com/moffa13/SHA256CUDA

Here's the CPU version: https://github.com/moffa13/SHA256Speed


Thanks !

#1
Posted 01/04/2018 11:18 PM   
Don't do in-kernel malloc. Allocate outside of your kernel calling loop.
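What "allocate outside the loop" might look like, as a hedged sketch (buffer names and sizes are illustrative, not from the repo):

```cuda
#include <cuda_runtime.h>

// Illustrative only: device buffers created once, before the relaunch loop,
// instead of calling malloc() inside the kernel for every hash.
int main() {
    const size_t input_size = 64;
    unsigned char *d_in = nullptr, *d_hash_out = nullptr;
    cudaMalloc(&d_in, input_size);      // input string, copied to the device once
    cudaMalloc(&d_hash_out, 32);        // one SHA-256 digest (32 bytes)

    bool found = false;
    while (!found) {
        // sha256_kernel<<<blocks, threads>>>(d_in, d_hash_out, ...);
        // ... synchronize, check the found flag, bump the nonce offset ...
        found = true;                   // placeholder so this sketch terminates
    }

    cudaFree(d_in);
    cudaFree(d_hash_out);
    return 0;
}
```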

You're only running 4 blocks? That's going to be a perf limiter on nearly all GPUs.

What GPU are you running on?

#2
Posted 01/04/2018 11:30 PM   
Are you talking about the sha256.cuh file?

I have a GTX 970 GPU.

#3
Posted 01/04/2018 11:35 PM   
I did what you said, and I can now compute up to 45 million hashes/s (I can hear my GPU whistling), but my program crashes after ~30 seconds.

How many blocks should I run in parallel in a kernel? I set 1024 threads per block but don't know how many blocks.

Also, what's the difference between cudaSetDevice(0) and cudaSetDevice(1)?
Can you give me a quick explanation of what can be achieved with CUDA streams?

#4
Posted 01/04/2018 11:59 PM   
A reasonable minimum target is to launch a total number of threads of at least # of SM * 2048.

These should usually be split between blocks of 128, 256, or 512 threads per block. It might be that 1024 threads per block is "OK"; it just requires some analysis to confirm.

Your GTX970 has 13 SMs, so target 13*2048 = 26K threads, ballpark, minimum. If you put 1024 threads per block, that would be 26 blocks. I'm not saying I know how to transform your code from 4 blocks to 26, but that is a reasonable performance goal, to maximize throughput.

I'm not going to try and teach you CUDA here on this forum. Use your google search to answer basic questions. There is documentation that will define what cudaSetDevice does:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb

There is a programming guide that discusses streams:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams

and of course there are a bazillion additional writeups and resources all over the web.

#5
Posted 01/05/2018 12:07 AM   
Can you just start with existing SHA-256 GPU calculation code? There are lots of mining algorithm sources on GitHub.

And in any case, you need to learn CUDA to get an idea of how the GPU works and how code should be optimized. Otherwise, you can easily end up with worse performance than on a CPU.

#6
Posted 01/05/2018 12:23 AM   
Correct me if I'm wrong.

I don't understand why you are talking about 4 blocks because I run 10 blocks of 1024 threads.

If I run a single kernel, is there only one SM working? Then the maximum efficiency would be 2048 threads, meaning 2 blocks of 1024 threads each.

If I use 13 streams, will the hashing power be 13x faster?

#7
Posted 01/05/2018 12:25 AM   
> Correct me if I'm wrong. I don't understand why you are talking about 4 blocks because I run 10 blocks of 1024 threads.


I looked at this:

#define BLOCK_SIZE 1024
#define SHA_PER_ITERATIONS 3200
#define NUMBLOCKS (SHA_PER_ITERATIONS + BLOCK_SIZE - 1) / BLOCK_SIZE


sha256_kernel << < NUMBLOCKS, BLOCK_SIZE >> > (g_out, g_hash_out, g_found, d_in, input_size, difficulty, nonce);


(3200 + 1024 - 1)/1024 = 4

if you don't believe me, print out those quantities.

> If I run a single kernel, there is only one SM working?


No, not correct. A kernel with enough blocks will fill all the SMs on a GPU, and this is generally desirable. Please avail yourself of an organized introduction to CUDA. I know you'd like to learn it in 5 minutes, but in my experience it takes longer than that. Piecemeal Q+A "Socratic" method might seem "efficient" but is actually quite inefficient for learning a body of knowledge like this, IMO.

A basic intro sequence to CUDA might be:

part1: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0624-Monday-Introduction-to-CUDA-C.pdf

part2: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0514-GTC2012-GPU-Performance-Analysis.pdf

If you want to learn what streams are for, google "gtc cuda streams" and take the first hit.

> If I use 13 streams, will the hashing power be 13x faster?



It's amazing how prevalent this line of thinking is when it comes to streams. If that were true, I'd advise you not to stop at 13.

#8
Posted 01/05/2018 12:43 AM   
No, each successive thread block goes to the next SM.

This just shows how deep your misunderstanding of the GPU execution model is. Do we need to read the entire CUDA manual aloud to you, or can you read it yourself?

#9
Posted 01/05/2018 12:44 AM   
@txbob Oh sorry, this is because I changed this value (3200) to 10240 between the posts.

I already read the whole part 1, but it's a really basic introduction ^^

Thanks for the second one, I'll read it.

#10
Posted 01/05/2018 12:50 AM   
> this is because I changed this value (3200) to 10240 between the posts.


the github repo is the only thing I have to look at. It still says 3200 as of this posting, at this moment. The code I excerpted was pulled from your github repo, and still reflects that.

Obviously I can't see what code you are actually running.

Roughly speaking, the "part1" I posted teaches you how to write syntactically correct CUDA, that will give you the correct answer (arguably the first step/requirement for any programmer). The "part2" teaches the basics of how to write CUDA code that runs "fast". Your questions here mostly revolve around the latter, not the former.

With a lack of understanding of the latter, you can easily write code that gives you the correct answer but runs slow. You may then arrive at the conclusion that GPUs are worthless. Presumably they are not.

#11
Posted 01/05/2018 12:55 AM   
Hello, I still have a question.

Since any thread can find a correct answer to return to the host, how do I make only one thread pass through the if statement that copies the data back to the host, and avoid race conditions?

Example:

bool isOk = check(someVar); // Many different someVar can work
if(isOk){
// I want this only accessed by one thread, if possible the one with the lowest id
memcpy(h_someVar, someVar, x);
}


Edit: Never mind, I did this:

bool isOk = check(someVar); // Many different someVar can work
if(isOk && atomicExch(g_found, 1) == 0){
	// atomicExch returns the old value, so only the first thread to set
	// g_found sees 0 here (the first winner, not necessarily the lowest id)
	memcpy(h_someVar, someVar, x);
}

#12
Posted 01/06/2018 12:07 PM   