Hello World in CUDA

bgalbraith · February 19, 2009, 10:01pm

Since CUDA introduces extensions to C and is not its own language, the typical Hello World application would be identical to C's but wouldn't provide any insight into using CUDA.

Here is my attempt to produce Hello World while actually showcasing the basic common features of a CUDA kernel. Enjoy

[codebox]/*
** Hello World using CUDA
**
** The string "Hello World!" is mangled then restored using a common CUDA idiom
**
** Byron Galbraith
** 2009-02-18
*/

#include <cuda.h>
#include <stdio.h>

// Prototypes
__global__ void helloWorld(char*);

// Host function
int
main(int argc, char** argv)
{
  int i;

  // desired output
  char str[] = "Hello World!";

  // mangle contents of output
  // the null character is left intact for simplicity
  for(i = 0; i < 12; i++)
    str[i] -= i;

  // allocate memory on the device
  char *d_str;
  size_t size = sizeof(str);
  cudaMalloc((void**)&d_str, size);

  // copy the string to the device
  cudaMemcpy(d_str, str, size, cudaMemcpyHostToDevice);

  // set the grid and block sizes
  dim3 dimGrid(2);   // one block per word
  dim3 dimBlock(6);  // one thread per character

  // invoke the kernel
  helloWorld<<< dimGrid, dimBlock >>>(d_str);

  // retrieve the results from the device
  cudaMemcpy(str, d_str, size, cudaMemcpyDeviceToHost);

  // free up the allocated memory on the device
  cudaFree(d_str);

  // everyone's favorite part
  printf("%s\n", str);

  return 0;
}

// Device kernel
__global__ void
helloWorld(char* str)
{
  // determine where in the thread grid we are
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  // unmangle output
  str[idx] += idx;
}[/codebox]
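For anyone adapting this, here is a rough sketch of the same round trip with return codes checked and a bounds guard in the kernel, so the grid no longer has to match the string length exactly. It is not part of the original program: the `CUDA_CHECK` macro and the `unmangle` and `roundTrip` names are made up for illustration, and `cudaDeviceSynchronize` assumes a CUDA 4.0 or later toolkit.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): print the CUDA error
// string and exit if a runtime call fails.
#define CUDA_CHECK(call)                                         \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Each thread restores one character. The guard keeps surplus threads in
// the last block from writing past the end of the string.
__global__ void unmangle(char *str, int len)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < len)
        str[idx] += idx;
}

// Mangles str on the host, then restores it in place on the GPU.
void roundTrip(char *str)
{
    int len = (int)strlen(str);
    if (len == 0)
        return;                      // nothing to do, and a 0-block launch is invalid

    for (int i = 0; i < len; i++)
        str[i] -= i;                 // mangle on the host

    char *d_str = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_str, len));
    CUDA_CHECK(cudaMemcpy(d_str, str, len, cudaMemcpyHostToDevice));

    int threads = 6;
    int blocks = (len + threads - 1) / threads;   // round up
    unmangle<<<blocks, threads>>>(d_str, len);
    CUDA_CHECK(cudaGetLastError());               // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());          // errors while the kernel ran

    CUDA_CHECK(cudaMemcpy(str, d_str, len, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_str));
}

int main(void)
{
    char str[] = "Hello World!";
    roundTrip(str);
    printf("%s\n", str);
    return 0;
}
```

The guard costs one comparison per thread. In exchange the block size can be picked freely instead of having to divide the string length evenly.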

kristleifur · February 19, 2009, 11:18pm

:D

Very cool initiative

Tandy · February 20, 2009, 3:22am

Very nice. Thank you.

ushopfast.com · February 9, 2010, 3:53pm

Thanks!

This is a great start. It is the only example I have seen that uses strings. It would be nice to know if anyone has done anything with text processing on GPUs; I haven't even seen a simple example of string concatenation.

heshsham_India · February 10, 2010, 1:31pm

One of my students wrote the following code three months back, though I am looking at it for the first time today after seeing the code above:

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void print(char *a,int N)
{
    char p[11]="Hello CUDA";
    int idx=blockIdx.x*blockDim.x + threadIdx.x;
    if(idx<N)
    {
        a[idx]=p[idx];
    }
}

int main(void)
{
    char *a_h,*a_d;
    const int N=11;
    size_t size=N*sizeof(char);
    a_h=(char *)malloc(size);
    cudaMalloc((void **)&a_d,size);
    for(int i=0;i<N;i++)
    {
        a_h[i]=0;
    }
    cudaMemcpy(a_d,a_h,size,cudaMemcpyHostToDevice);
    int blocksize=4;
    int nblock=N/blocksize+(N%blocksize==0?0:1);
    print<<<nblock,blocksize>>>(a_d,N);
    cudaMemcpy(a_h,a_d,sizeof(char)*N,cudaMemcpyDeviceToHost);
    for(int i=0;i<N;i++)
    {
        printf("%c",a_h[i]);
    }
    free(a_h);
    cudaFree(a_d);
}[/codebox]

Cantagalo · April 5, 2010, 8:17pm

Hello guys!
I'm new to CUDA programming.
I tried to compile the posted Hello World in CUDA code and received the following error report:

Hello World.cpp:13: error: expected constructor, destructor, or type conversion before "void"
Hello World.cpp:13: error: expected `,' or `;' before "void"
Hello World.cpp: In function `int main(int, char**)':
Hello World.cpp:32: error: `cudaMalloc' undeclared (first use this function)
Hello World.cpp:32: error: (Each undeclared identifier is reported only once for each function it appears in.)
Hello World.cpp:35: error: `cudaMemcpyHostToDevice' undeclared (first use this function)
Hello World.cpp:35: error: `cudaMemcpy' undeclared (first use this function)
Hello World.cpp:38: error: `dim3' undeclared (first use this function)
Hello World.cpp:38: error: expected `;' before "dimGrid"
Hello World.cpp:39: error: expected `;' before "dimBlock"
Hello World.cpp:42: error: `helloWorld' undeclared (first use this function)
Hello World.cpp:42: error: expected primary-expression before '<' token
Hello World.cpp:42: error: `dimGrid' undeclared (first use this function)
Hello World.cpp:42: error: `dimBlock' undeclared (first use this function)
Hello World.cpp:42: error: expected primary-expression before '>' token
Hello World.cpp:45: error: `cudaMemcpyDeviceToHost' undeclared (first use this function)
Hello World.cpp:48: error: `cudaFree' undeclared (first use this function)
Hello World.cpp: At global scope:
Hello World.cpp:57: error: expected constructor, destructor, or type conversion before "void"
Hello World.cpp:57: error: expected `,' or `;' before "void"

Can someone help me eliminate these errors?
Thanks!
Michel.

avidday · April 5, 2010, 8:43pm

Rename the file so that it ends in .cu. Otherwise nvcc will try to compile it as plain C++, and as you can see, the C++ compiler doesn't like the CUDA-specific syntax.
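A minimal sketch of what that looks like, for anyone following along. The file name `hello_world.cu` and the kernel name `emptyKernel` are just placeholders, the build lines assume nvcc is on the PATH, and `cudaDeviceSynchronize` assumes CUDA 4.0 or later. As far as I know, `-x cu` is the nvcc option that treats an input as CUDA source regardless of its extension.

```cuda
// hello_world.cu  (hypothetical file name)
//
// Build and run:
//   nvcc hello_world.cu -o hello_world
//   ./hello_world
//
// If the file has to keep a .cpp extension, nvcc can be told to treat it
// as CUDA source anyway:
//   nvcc -x cu "Hello World.cpp" -o hello_world

#include <stdio.h>

// __global__ and the <<< >>> launch syntax are CUDA extensions. A plain
// C++ compiler rejects both, which is where errors like the ones above
// come from.
__global__ void emptyKernel(void)
{
}

int main(void)
{
    emptyKernel<<<1, 1>>>();                    // one block, one thread
    cudaError_t err = cudaDeviceSynchronize();  // wait for the kernel to finish
    printf("kernel finished: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Note that the CUDA runtime declarations (`cudaMalloc`, `dim3` and so on) are pulled in automatically only when nvcc compiles the file as CUDA, so the "undeclared" errors go away together with the syntax errors.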

Ingemar · April 23, 2010, 9:24am

Great to see that there is someone else out there who knows what "Hello World!" is supposed to do! Both of your "hello" programs are a lot better than all the examples everybody calls "Hello World!" but that are really just arbitrary simple demos.

Here is my own version, which I made a few months back.

[codebox]// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it to CUDA with an array
// of offsets. Then the offsets are added in parallel to produce the string "World!"
// By Ingemar Ragnemalm 2010

#include <stdio.h>
#include <stdlib.h>  // for EXIT_SUCCESS

const int N = 16;
const int blocksize = 16;

__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

    char *ad;
    int *bd;
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );  // the offsets buffer also needs to be released

    printf("%s\n", a);
    return EXIT_SUCCESS;
}[/codebox]

So now I see three versions here:

Mangle and mangle back the whole string.
Copy char by char from a string constant.
Produce latter half of the string by offsets from the first.

tertl3 · April 10, 2011, 5:07pm

Yeah, definitely cool programs for a beginner like me.

I’ll definitely be back here to learn some CUDA.

Thanks

tertl3 · April 12, 2011, 10:58pm

Is there any way you guys could post some follow-ups in a thread of "beginner programs"?

swaroop933 · May 19, 2011, 9:46pm

Hi everyone, I am a newbie to CUDA as well as Visual Studio. I am trying to write a simple CUDA program in Visual Studio. I followed the link to set up Visual Studio 2010 and added a new file, helloworld.cu, and the syntax highlighting works.

[codebox]#include <cuda.h>
#include <stdio.h>

__global__ void kernel(void) {
}

int main(void) {
    kernel<<<1,1>>>();
    printf("HelloWorld");
    return 0;
}[/codebox]

I get the following errors:

#include <stdio.h> – I get a red mark, i.e. the header is not recognized
__global__ – I get a red mark saying this is not a valid declaration
printf – red mark
When I debug, I get the error "Cannot launch debugger. The required property 'VSInstallDir' is missing or empty"

Can anyone tell me what mistake I have made?

biaspoint · January 12, 2012, 3:39am

Found a good tutorial for getting started with Visual Studio 2010 and CUDA 4.0:

http://www.stevenmarkford.com/installing-nvidia-cuda-with-visual-studio-2010/

Fumblelsc · March 15, 2012, 4:23pm

Hello everyone,
I just tried to make a Hello World, but I get an error:

// invoke the kernel
helloWorld<<< dimGrid, dimBlock >>>(d_str);

The third < is underlined as an error.

I don't know what I need to do to fix it; I just pasted the code into VS 2010.

Thanks for your help.