Hello!!
I just started CUDA programming with a GTX 960, whose warp size is 32.
How do I understand what the warp size means?
I wrote a very small sample program.
Does this program show the limit of concurrency?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define N (64)

__global__ void kernel(long long int *dA)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    dA[i] = clock64();
}

int main(void)
{
    int i;
    long long int A[N];   /* long long */
    long long int *dA;    /* long long pointer */

    for (i = 0; i < N; i++) {
        A[i] = 0;
    }
    size_t size = N * sizeof(long long int);
    cudaMalloc((void **)&dA, size);
    cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
    kernel<<<1, N>>>(dA);
    cudaMemcpy(A, dA, size, cudaMemcpyDeviceToHost);
    for (i = 0; i < N; i++) {
        printf(" %3d %lld", i, A[i]);   /* long long */
        if ((i % 8) == 7) printf("\n");
    }
    cudaFree(dA);
    exit(0);
}
The size of a warp is 32 on current hardware. The nature of the clock function is that it returns the same value to every member of the warp that is participating in the read of the clock value.
The clock function is further described in the programming manual.
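In the spirit of the manual's usage, clock64() is typically used to bracket a section of device code and count how many GPU cycles it takes per thread. A rough sketch (the kernel name and the squaring "work" here are purely illustrative):

```cuda
__global__ void timedKernel(float *out, long long int *cycles)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    long long int start = clock64();
    out[i] = out[i] * out[i];          /* the work being timed */
    long long int stop = clock64();

    /* Per-thread cycle count; meaningful only relative to other
       measurements on the same SM, not as wall-clock time. */
    cycles[i] = stop - start;
}
```

Because every thread of a warp reads the same clock value at a given instruction, the deltas come out identical within a warp, which is another way to make the warp visible.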
I just wanted to see or feel the warp.
I am not sure about the causality dilemma, known as "chicken or the egg".
It might be similar to "clock value or the size of the warp".
I am worried about whether the value was really fresh, i.e. NOT cached, recycled, or reused.
I modified my program and now I can feel the barrier of the warp.
It might look like the sound barrier or the light barrier.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define M (33)
#define N (8)

__device__ long long int dA[M][N];

__global__ void kernel(void)
{
    int k;
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    for (k = 0; k < N; k++) {
        dA[i][k] = clock64();
    }
}

int main(void)
{
    int i, j;
    long long int A[M][N];
    for (j = 0; j < M; j++) {
        for (i = 0; i < N; i++) {
            A[j][i] = 0LL;   /* long long, not clock_t: sizeof(clock_t) is only 4 on some platforms */
        }
    }
    size_t size = M * N * sizeof(long long int);
    cudaMemcpyToSymbol(dA, A, size, 0, cudaMemcpyHostToDevice);
    kernel<<<1, M>>>();
    cudaMemcpyFromSymbol(A, dA, size, 0, cudaMemcpyDeviceToHost);
    for (j = 0; j < M; j++) {
        printf("%02d: ", j);
        for (i = 0; i < N; i++) {
            printf(" %lld", A[j][i]);   /* long long */
        }
        printf("\n");
    }
    exit(0);
}
To understand warps, start from the fact that the GPU executes the threads of a kernel in fixed groups of 32, called warps, in lockstep. All 32 threads of a warp issue the same instruction at the same time; a warp is really 32 copies of one instruction executing at once. There is no invisible wall or anything such as that. It's just the way GPUs are made right now.
I'm sure people would love it if warps were done away with. I know I would. It'd be awesome to get the same speed without having to execute instructions in such rigid lockstep, where a single if-statement can make everything go awry.
I think both theory and practice are very important.
Sometimes theory comes first and sometimes practice comes first.
Let me know if you already have code that shows warps.
I am afraid no one would believe the warp size just by reading it back from cudaGetDeviceProperties().
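For the record, cudaGetDeviceProperties() does report the warp size directly; warpSize and multiProcessorCount are standard fields of cudaDeviceProp. A minimal query program (device 0 assumed, no error checking for brevity):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* query device 0 */

    printf("name:                %s\n", prop.name);
    printf("warpSize:            %d\n", prop.warpSize);   /* 32 on current GPUs */
    printf("multiProcessorCount: %d\n", prop.multiProcessorCount);
    return 0;
}
```

Of course, this only tells you the number; your clock64() experiment is the way to see the warp actually behaving as a unit.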
__global__
void kernel(float *x, const int array_size)
{
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= array_size)
        return;
    x[tid] = sqrtf(tid * x[tid]);
}
A set of threads containing these instructions is generated. The GPU executes instructions from these threads 32 at a time; it's just how they work. The only caveat is that for every set of 32 threads being executed, it has to be the same instruction. So if tid is larger than the size of the array for some threads, the control flow diverges: the threads that want to return take a different path, and the warp needs a separate pass to execute it.
This is where concepts like occupancy and divergence come into play which you should definitely research on your own.