Hello!!
I just started CUDA programming with a GTX 960, whose warp size is 32.
How do I understand what the warp size means?
I wrote a very small sample program.
Does this program show the limit of concurrency?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define N (64)

__global__ void kernel(long long int *dA)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    dA[i] = clock64();
}

int main(void)
{
    int i;
    long long int A[N];   /* long long */
    long long int *dA;    /* long long pointer */

    for (i = 0; i < N; i++) {
        A[i] = 0;
    }
    size_t size = N * sizeof(long long int);
    cudaMalloc((void **)&dA, size);
    cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
    kernel<<<1, N>>>(dA);
    cudaMemcpy(A, dA, size, cudaMemcpyDeviceToHost);
    for (i = 0; i < N; i++) {
        printf(" %3d %lld", i, A[i]);   /* long long */
        if ((i % 8) == 7) printf("\n");
    }
    cudaFree(dA);
    exit(0);
}
The size of a warp is 32 on current hardware. The nature of the clock function is that it returns the same value to every member of the warp that is participating in the read of the clock value.
The clock function is further described in the programming manual.
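In the spirit of the manual's usage, clock64() is typically used to bracket a section of device code and count how many GPU cycles it takes per thread. A rough sketch (the kernel name and the squaring "work" here are purely illustrative):

```cuda
__global__ void timedKernel(float *out, long long int *cycles)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    long long int start = clock64();
    out[i] = out[i] * out[i];          /* the work being timed */
    long long int stop = clock64();

    /* Per-thread cycle count; meaningful only relative to other
       measurements on the same SM, not as wall-clock time. */
    cycles[i] = stop - start;
}
```

Because every thread of a warp reads the same clock value at a given instruction, the deltas come out identical within a warp, which is another way to make the warp visible.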
I just wanted to see or feel the warp.
I am not sure about the causality dilemma, known as "chicken or the egg".
It might be similar to "clock value or the size of the warp".
I am worried about whether the value was really fresh, i.e. NOT cached, recycled, or reused.
I modified my program and now I can feel the barrier of the warp.
It might look like the sound barrier or the light barrier.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define M (33)
#define N (8)

__device__ long long int dA[M][N];

__global__ void kernel(void)
{
    int k;
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    for (k = 0; k < N; k++) {
        dA[i][k] = clock64();
    }
}

int main(void)
{
    int i, j;
    long long int A[M][N];
    for (j = 0; j < M; j++) {
        for (i = 0; i < N; i++) {
            A[j][i] = 0LL;   /* long long, not clock_t: sizeof(clock_t) is only 4 on some platforms */
        }
    }
    size_t size = M * N * sizeof(long long int);
    cudaMemcpyToSymbol(dA, A, size, 0, cudaMemcpyHostToDevice);
    kernel<<<1, M>>>();
    cudaMemcpyFromSymbol(A, dA, size, 0, cudaMemcpyDeviceToHost);
    for (j = 0; j < M; j++) {
        printf("%02d: ", j);
        for (i = 0; i < N; i++) {
            printf(" %lld", A[j][i]);   /* long long */
        }
        printf("\n");
    }
    exit(0);
}
To understand warps, start from the fact that the GPU executes the threads of a kernel in fixed groups of 32, called warps, in lockstep. All 32 threads of a warp issue the same instruction at the same time; a warp is really 32 copies of one instruction executing at once. There is no invisible wall or anything such as that. It's just the way GPUs are made right now.
I'm sure people would love it if warps were done away with. I know I would. It'd be awesome to get the same speed without having to execute instructions in such rigid lockstep, where a single if-statement can make everything go awry.
I think both theory and practice are very important.
Sometimes theory comes first and sometimes practice comes first.
Let me know if you already have code that shows warps.
I am afraid no one would believe the warp size just by reading it back from cudaGetDeviceProperties().
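For the record, cudaGetDeviceProperties() does report the warp size directly; warpSize and multiProcessorCount are standard fields of cudaDeviceProp. A minimal query program (device 0 assumed, no error checking for brevity):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* query device 0 */

    printf("name:                %s\n", prop.name);
    printf("warpSize:            %d\n", prop.warpSize);   /* 32 on current GPUs */
    printf("multiProcessorCount: %d\n", prop.multiProcessorCount);
    return 0;
}
```

Of course, this only tells you the number; your clock64() experiment is the way to see the warp actually behaving as a unit.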
__global__
void kernel(float *x, const int array_size)
{
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= array_size)
        return;
    x[tid] = sqrtf(tid * x[tid]);
}
A set of threads containing these instructions is generated. The GPU executes instructions from these threads 32 at a time; it's just how they work. The only caveat is that for every set of 32 threads being executed, it has to be the same instruction. So if tid is larger than the size of the array for some threads, the control flow diverges: the threads that want to return take a different path, and the warp needs a separate pass to execute it.
This is where concepts like occupancy and divergence come into play which you should definitely research on your own.