endian mode of the device

Is the device big- or little-endian? Thanks.

I second that question. ;-)

When I look at the PTX (generated with the -keep flag), it mentions the endianness,
but it is not immediately clear whether this is the host or the device endianness.

    //-----------------------------------------------------------
    // Compiling test.cpp3.i (/var/folders/z0/z0ghFUySH68jPYxqgayNZ++++TI/-Tmp-/ccBI#.3xcUxK)
    //-----------------------------------------------------------

    //-----------------------------------------------------------
    // Options:
    //-----------------------------------------------------------
    //  Target:ptx, ISA:compute_10, Endian:little, Pointer Size:32
    //  -O3 (Optimization level)
    //  -g0 (Debug level)
    //  -m2 (Report advisories)
    //-----------------------------------------------------------

All of the supported CUDA platforms use little-endian CPUs, and cudaMemcpy() copies data structures to the device without knowing the data format, so I would assume the GPU is also little-endian. It is possible that the GPU supports both big- and little-endian execution (as some CPUs do) as a hedge against future CUDA platforms being big-endian.

My guess is the answer has to be either “little-endian” or “both”.
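The cudaMemcpy() argument above can be checked directly. The following is only a sketch, assuming a working CUDA runtime and a single device; the names sameValueKernel and hostAndDeviceAgree are made up for illustration. The host writes a known 32-bit pattern, copies the raw bytes over, and the device compares what it reads against the same literal.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device side: interpret the copied bytes as a 32-bit integer and compare
// against the same literal the host wrote. The flag is a single byte so that
// the result itself does not depend on byte order.
__global__ void sameValueKernel(const unsigned int *in, unsigned char *same)
{
    *same = (*in == 0x03020100u) ? 1 : 0;
}

// Hypothetical helper: 1 if host and device read the same value,
// 0 if they do not, -1 on any CUDA error (for example, no device present).
int hostAndDeviceAgree()
{
    const unsigned int hValue = 0x03020100u;
    unsigned int *dValue = nullptr;
    unsigned char *dSame = nullptr;
    unsigned char hSame = 0;

    if (cudaMalloc(&dValue, sizeof(hValue)) != cudaSuccess) return -1;
    if (cudaMalloc(&dSame, sizeof(hSame)) != cudaSuccess) {
        cudaFree(dValue);
        return -1;
    }

    // cudaMemcpy() moves raw bytes and does no format conversion.
    cudaError_t err = cudaMemcpy(dValue, &hValue, sizeof(hValue),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        sameValueKernel<<<1, 1>>>(dValue, dSame);
        err = cudaGetLastError();  // catches launch failures
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(&hSame, dSame, sizeof(hSame), cudaMemcpyDeviceToHost);
    }

    cudaFree(dValue);
    cudaFree(dSame);
    return (err == cudaSuccess) ? static_cast<int>(hSame) : -1;
}

int main()
{
    int r = hostAndDeviceAgree();
    if (r < 0)       printf("CUDA error, no answer\n");
    else if (r == 1) printf("host and device agree on byte order\n");
    else             printf("host and device disagree on byte order\n");
    return 0;
}
```

If the two agree and the host is little-endian, the device is reading memory as little-endian too. This says nothing about whether the hardware could also run in a big-endian mode.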

In general, an endianness test is easy with code like this:

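    __shared__ unsigned int value;   // known 32-bit pattern in shared memory
    value = 0x03020100;
    // read the byte stored at the lowest address of value
    unsigned char TestByte = *((unsigned char*) &value);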

After execution, if TestByte contains 0 the device is little-endian; if it contains 3 it is big-endian. The test works because the byte is read from the memory location where the int value starts, and the two byte orders store a different byte there.
Things can be more complicated with some compilers (not nvcc) that have built-in endianness emulation, depending on the platform they are installed on.
With such an emulation switch turned on, the code above would return 0 in TestByte on a big-endian platform too.

In that situation, the only reliable check is to look at the assembler file generated by the compiler. If it contains something like mov reg,[SomeAddress+3] at the point where it loads the byte value from memory, then the lowest byte is stored at the higher address (addr+3). If the load uses the same address as the one where the integer value is stored, then the lowest byte is stored at the lower address.
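For anyone who wants to run the byte-read test from this post on an actual device, here is a minimal sketch of a complete program. It assumes a working CUDA runtime; firstByteKernel and deviceFirstByte are hypothetical names chosen for the example. The compiler may fold the read into a constant, so the result reflects what nvcc assumes about the target as much as what the hardware does.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Store a known 32-bit pattern in shared memory and report the byte that
// sits at its lowest address.
__global__ void firstByteKernel(unsigned char *out)
{
    __shared__ unsigned int value;
    if (threadIdx.x == 0) {
        value = 0x03020100u;
        *out = *reinterpret_cast<unsigned char *>(&value);
    }
}

// Hypothetical helper: returns the first byte as seen on the device
// (0x00 means little-endian, 0x03 means big-endian), or -1 on a CUDA error.
int deviceFirstByte()
{
    unsigned char *dOut = nullptr;
    unsigned char hOut = 0xFF;

    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) return -1;

    firstByteKernel<<<1, 1>>>(dOut);
    cudaError_t err = cudaGetLastError();  // catches launch failures
    if (err == cudaSuccess) {
        // A single byte is copied back, so host byte order cannot affect it.
        err = cudaMemcpy(&hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    }

    cudaFree(dOut);
    return (err == cudaSuccess) ? static_cast<int>(hOut) : -1;
}

int main()
{
    int b = deviceFirstByte();
    if (b < 0)          printf("CUDA error, no answer\n");
    else if (b == 0x00) printf("device is little-endian\n");
    else if (b == 0x03) printf("device is big-endian\n");
    else                printf("unexpected first byte: 0x%02x\n", b);
    return 0;
}
```

Only thread 0 touches the shared variable, so there is no race even if the launch configuration is changed to use more threads per block.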