Slow performance when calculating SHA256
Calculating a SHA-256 hash of 10 MB of data on a GTX 1070 takes just under 1 minute to complete.
Is this normal? I mean, on the CPU (using the sha256sum command) I get the result almost instantly on the same 10 MB input file.

I understand that the SHA-256 algorithm itself cannot be parallelized; only the number of concurrent hashes (threads) can be increased. And I've tested that running the program on the GPU with two 10 MB files takes the same time as running it with one file.

What I am not sure about is the very big time difference (almost 1 minute).

If you're interested, here are my code, nvprof output and specs. Oh, and I'm running Ubuntu 16.04 x64 with the latest CUDA.

https://github.com/Horkyze/CudaSHA256


[code]
==22686== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 57.8597s 1 57.8597s 57.8597s 57.8597s sha256_cuda(JOB**, int)
0.00% 736ns 1 736ns 736ns 736ns [CUDA memcpy HtoD]
API calls: 99.54% 57.8597s 1 57.8597s 57.8597s 57.8597s cudaDeviceSynchronize
0.31% 180.73ms 10 18.073ms 45.257us 179.20ms cudaMallocManaged
0.15% 87.775ms 1 87.775ms 87.775ms 87.775ms cudaDeviceReset
0.00% 428.18us 94 4.5550us 628ns 161.54us cuDeviceGetAttribute
0.00% 118.66us 1 118.66us 118.66us 118.66us cuDeviceTotalMem
0.00% 86.322us 1 86.322us 86.322us 86.322us cudaLaunch
0.00% 50.983us 1 50.983us 50.983us 50.983us cudaMemcpyToSymbol
0.00% 40.856us 1 40.856us 40.856us 40.856us cuDeviceGetName
0.00% 34.007us 22 1.5450us 768ns 5.0980us cudaGetLastError
0.00% 5.5870us 2 2.7930us 838ns 4.7490us cudaSetupArgument
0.00% 3.8410us 3 1.2800us 768ns 2.0250us cuDeviceGetCount
0.00% 2.0950us 2 1.0470us 908ns 1.1870us cuDeviceGet
0.00% 1.8160us 1 1.8160us 1.8160us 1.8160us cudaConfigureCall
[/code]


My specs
[code]
./deviceQuery/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1070"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8114 MBytes (8507752448 bytes)
(15) Multiprocessors, (128) CUDA Cores/MP: 1920 CUDA Cores
GPU Max Clock rate: 1785 MHz (1.78 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
[/code]

#1
Posted 11/01/2017 03:45 PM   
You'll want to compute thousands of SHA256 hashes in parallel. That is going to be fast.

If you're working on a single 10 MB data stream where the input state of each message block depends on the previous block's output, there's really nothing a GPU can do to speed up your processing. This is inherently sequential.

If your test case consisted of running the entire 10 MB hash single-threaded, on one thread block in a grid consisting of 1 block, then congratulations: you've used 1/1920th of your GPU's processing resources. Does this explain the runtime difference to you?
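
Something along these lines (an untested sketch; sha256_many and sha256_device are placeholder names, not functions from your repo) is the shape that actually keeps the GPU busy:

[code]
#include <cstddef>

// Each thread hashes its own independent message; the GPU only pays off when
// there are thousands of such messages in flight.
__global__ void sha256_many(const unsigned char *data, const size_t *offsets,
                            const size_t *lengths, unsigned char *digests, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Per-thread device hash of message i, e.g. adapted from B-Con's sha256.c:
    // sha256_device(data + offsets[i], lengths[i], digests + 32 * i);
}

// A single 10 MB stream hashed by one thread leaves ~1919 of the 1920 cores idle:
//   sha256_many<<<1, 1>>>(data, offsets, lengths, digests, 1);
// Thousands of independent messages are what fills the SMs:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   sha256_many<<<blocks, threads>>>(data, offsets, lengths, digests, n);
[/code]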

#2
Posted 11/01/2017 04:33 PM   
I understand that hashing one block is a ridiculous use case for a GPU.

I'm just wondering why it takes so long -- I'm still new to CUDA, so I don't yet know the performance baseline.
However, my feeling is that 1 minute for hashing a 10 MB file is just too long, and the problem may be in my algorithm implementation.

Or if you guys on this forum say that 1 minute is perfectly OK for such a task, I can accept that :)

#3
Posted 11/01/2017 05:05 PM   
It may very well be that your implementation runs slowly on the GPU. But without you posting a complete, compilable piece of code there's no way for us to tell. ;)

But you're talking to the right person here. I've done a fair share of hash algorithms on GPU and optimized them for speed.

cbuchner1 <-- cudaminer, ccminer developer (until early 2015)

#4
Posted 11/01/2017 05:25 PM   
Good to hear! I was actually inspired by cudaminer -- well written :)

I've made some cosmetic changes to the code and repo - it should be clearer now:

https://github.com/Horkyze/CudaSHA256


In the meantime I will try to rewrite the code for CPU only (no CUDA) and see if there is any increase in speed. I think it should hash as fast as the sha256sum utility - if it does not, then the problem is the algorithm implementation, which is from here: https://github.com/B-Con/crypto-algorithms/blob/master/sha256.c
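
For the CPU-only comparison, a minimal timing harness along these lines should do (a sketch, assuming the sha256_init/sha256_update/sha256_final API and the BYTE/SHA256_BLOCK_SIZE definitions from the B-Con sha256.h):

[code]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "sha256.h"   /* B-Con crypto-algorithms header */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    BYTE *buf = (BYTE *)malloc(size);
    fread(buf, 1, size, f);
    fclose(f);

    BYTE digest[SHA256_BLOCK_SIZE];   /* 32 bytes */
    SHA256_CTX ctx;

    clock_t t0 = clock();
    sha256_init(&ctx);
    sha256_update(&ctx, buf, size);
    sha256_final(&ctx, digest);
    clock_t t1 = clock();

    printf("hashed %ld bytes in %.3f s\n", size, (double)(t1 - t0) / CLOCKS_PER_SEC);
    free(buf);
    return 0;
}
[/code]

If this single-threaded CPU build is roughly as fast as sha256sum, then the hash implementation itself is fine and the slowdown is in how the kernel runs on the GPU.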

#5
Posted 11/01/2017 06:59 PM   
From quickly looking at your code I see that you have plenty of operations that operate on arrays of words or bytes.

CUDA generally puts local array variables into local memory (which is like global memory in terms of access speed and latency). There are a lot of for loops in your code that iterate over such array elements. The only way to get your arrays to reside in registers is to fully unroll the for loops that iterate over them. Registers will be an order of magnitude faster than local memory (depending a bit on how effectively the L1 cache handles your specific data access patterns).
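
For illustration only (this is a sketch, not code from your repo; ROTR/SIG0/SIG1 are redefined here just to keep it self-contained): forcing full unrolling lets the compiler resolve every array index at compile time, which is what allows m[] to be promoted to registers.

[code]
#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define SIG0(x) (ROTR(x, 7) ^ ROTR(x, 18) ^ ((x) >> 3))
#define SIG1(x) (ROTR(x, 17) ^ ROTR(x, 19) ^ ((x) >> 10))

__device__ void sha256_schedule_unrolled(unsigned int m[64], const unsigned char data[64])
{
    // With #pragma unroll every index into m[] becomes a compile-time constant,
    // so the array can live in registers instead of spilling to local memory.
    #pragma unroll
    for (int i = 0, j = 0; i < 16; ++i, j += 4)
        m[i] = (data[j] << 24) | (data[j + 1] << 16) | (data[j + 2] << 8) | data[j + 3];

    #pragma unroll
    for (int i = 16; i < 64; ++i)
        m[i] = SIG1(m[i - 2]) + m[i - 7] + SIG0(m[i - 15]) + m[i - 16];

    // The 64 compression rounds should be wrapped in #pragma unroll as well so
    // the working variables a..h stay in registers too.
}
[/code]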

It is also problematic to use byte arrays (size 64) for your message blocks. If possible, declare such arrays in the native register width of CUDA (32 bits, i.e. uint32_t), even if it makes appending single message bytes more cumbersome to perform in code.
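
A rough sketch of what that word-oriented layout could look like (the names are made up for the example):

[code]
// One 512-bit message block held as 16 native 32-bit words instead of 64 bytes.
typedef struct {
    unsigned int w[16];
} msg_block_t;

// Appending a single byte becomes a shift-and-mask, which is more cumbersome
// but keeps the block in register-friendly 32-bit words.
__device__ void block_set_byte(msg_block_t *blk, int pos, unsigned char b)
{
    int word  = pos >> 2;               // which 32-bit word the byte falls into
    int shift = (3 - (pos & 3)) * 8;    // big-endian byte order within the word
    blk->w[word] = (blk->w[word] & ~(0xFFu << shift)) | ((unsigned int)b << shift);
}
[/code]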

In cudaminer/ccminer we'd generally know the length of the data to hash at compile time. Your code, however, is written to operate on streams of arbitrary length. It might be useful to have a fully optimized and unrolled sha256 round function for dealing with complete 64-byte message blocks, and one non-unrolled, non-optimized sha256 round function that deals with the few remaining bytes of the stream (plus any required padding).
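
The per-thread driver loop would then look roughly like this (a sketch; the two transform functions are placeholders for an unrolled fast path and a generic padding path, not functions from your repo):

[code]
__device__ void sha256_transform_fast(unsigned int state[8], const unsigned char block[64]);
__device__ void sha256_final_generic(unsigned int state[8], const unsigned char *tail,
                                     unsigned int tail_len, unsigned long long total_bits);

__device__ void sha256_stream(unsigned int state[8], const unsigned char *data, unsigned long len)
{
    unsigned long off = 0;

    // Fast path: every complete 64-byte block goes through the fully unrolled transform.
    while (len - off >= 64) {
        sha256_transform_fast(state, data + off);
        off += 64;
    }

    // Slow path: the remaining < 64 bytes plus the 0x80 marker, zero padding and
    // the 64-bit bit length run once, so this part does not need to be unrolled.
    sha256_final_generic(state, data + off, (unsigned int)(len - off),
                         (unsigned long long)len * 8);
}
[/code]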

Also, there's a pretty solid reason why a GPU-based shasum utility might not be much faster than a CPU-based one when operating on short (megabyte-sized) streams: the required PCIe transfer of the data to the GPU might be slower than just hashing it locally on the CPU (which may already hold most of the stream in its caches!). And then there's also the issue of not being able to parallelize a single stream's hash computation due to data dependencies.

BTW cudaminer is not so well written. It's written to get the job done and not to look pretty ;-)

#6
Posted 11/01/2017 08:19 PM   
Thanks for the response, I find it very informative :)

> CUDA generally puts local array variables into local memory (which is like global memory in terms of access speed and latency).

By local memory do you mean the GPU's RAM, and by global memory the actual RAM that the CPU uses?
For example, consider these C calls:
[code]
void * p;
p = malloc(10);             // this allocates in global memory - RAM
cudaMalloc(&p, 10);         // this allocates in local (device) memory
cudaMallocManaged(&p, 10);  // this allocates unified memory, which is where? (on the device or in RAM)
[/code]

Is there any speed difference when accessing data allocated with cudaMalloc and cudaMallocManaged?

Also, can it be argued that, given my CPU frequency of 3.7 GHz and GPU frequency of 1.78 GHz, the same (single-threaded) code will run approximately 2x faster on the CPU, neglecting the time for data transfer (host to device and back)?

#7
Posted 11/06/2017 04:05 PM   
Global memory, local memory, shared memory, host memory and device memory are all well-established CUDA terms with very specific definitions. Please refer to the CUDA programming guide for the definitions.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

The best practices guide also has more information about these memory spaces on the GPU (in particular check the subsection about local memory)

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#device-memory-spaces

You cannot really compare per-thread performance of a CPU and a GPU based on the clock speed ratio alone. Individual instruction throughput plays a role, as do memory latencies and bandwidth, overall utilization of the compute cores...

#8
Posted 11/06/2017 04:10 PM   
@matej.bellus Your code contains some errors.

In your JOB struct (and likewise in JOB_init), you declared a 64-byte digest, which should be 32 bytes (256 bits).

Also, in your hash_to_string function, you do a malloc(70) (why not 65?) but you never free the memory.
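
Something like this would fix both points (a sketch; the struct fields and function signature are assumed, not copied from your repo):

[code]
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    unsigned char digest[32];   /* 256 bits = 32 bytes, not 64 */
    /* ... the other JOB fields ... */
} JOB;

char *hash_to_string(const unsigned char digest[32])
{
    char *s = (char *)malloc(65);            /* 64 hex chars + '\0' */
    for (int i = 0; i < 32; ++i)
        sprintf(s + 2 * i, "%02x", digest[i]);
    return s;                                 /* the caller must free() this */
}
[/code]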

#9
Posted 01/03/2018 10:16 PM   