Use of Texture and Constant memory in Fermi Architecture
Now that the Fermi architecture has L1 and L2 caches, do texture and constant memory still matter?
Since loads and stores to global memory are now cached, is there still any performance gain from using constant and texture memory?
Please correct me if my understanding is wrong.

Thanks,
Sai

#1
Posted 10/30/2010 06:27 AM   
[quote name='Sai@NCSU' post='1139349' date='Oct 30 2010, 06:27 AM']With the advent of Fermi architecture that has L2 and L1 cache, does the significance of Texture and Constant memory still hold?

Thanks,
Sai[/quote]
Constant memory access with a known offset is essentially free. Fermi is a RISC processor: instruction operands come from registers, immediate constants, or a constant buffer. Anything else needs a separate load instruction.
[code]immediate constant
set $p0 ne u32 $r4 -0x1

constant buffer
add b32 $r12 shl $r13 0x2 c2[0xc8]

global memory
ld b32 $r4 ca g[$r12(null)+0]

constant buffer with unknown offset
ld b32 $r17 c2[$r17(null)+0x20][/code]

A texture load can be faster than a normal global memory load. Global memory load granularity is 128 bytes: if you load 4 bytes, the hardware fetches the 124 neighboring bytes too. Texture load granularity is 128 bytes at the L1 level, but only 32 bytes at the L2 and memory level.
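
To make the granularity argument concrete, here is a rough CUDA sketch (not from the original post). It assumes the worst case, where each of the 32 threads in a warp reads one 4-byte float from a different 128-byte line, and it takes the 128-byte and 32-byte figures quoted above at face value. The names stridedRead, worstCaseBytesPerWarp and runStridedRead are made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Each thread reads one 4-byte float, `stride` elements apart.
// With stride = 32 (128 bytes) every thread of a warp touches a different 128-byte line.
__global__ void stridedRead(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[static_cast<size_t>(i) * stride];
}

// Bytes moved for one warp-wide load when every thread hits a different
// segment, for a given fetch granularity in bytes (worst case, assumed).
constexpr int worstCaseBytesPerWarp(int granularityBytes)
{
    return kWarpSize * granularityBytes;
}

// Runs the strided kernel once and checks the values that come back.
bool runStridedRead(int n, int stride)
{
    std::vector<float> hostIn(static_cast<size_t>(n) * stride);
    for (size_t k = 0; k < hostIn.size(); ++k) hostIn[k] = static_cast<float>(k);

    float *devIn = nullptr, *devOut = nullptr;
    if (cudaMalloc(&devIn, hostIn.size() * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&devOut, n * sizeof(float)) != cudaSuccess) { cudaFree(devIn); return false; }
    cudaMemcpy(devIn, hostIn.data(), hostIn.size() * sizeof(float), cudaMemcpyHostToDevice);

    stridedRead<<<(n + 127) / 128, 128>>>(devIn, devOut, n, stride);

    std::vector<float> hostOut(n);
    cudaError_t err = cudaMemcpy(hostOut.data(), devOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (hostOut[i] != hostIn[static_cast<size_t>(i) * stride]) return false;
    return true;
}

int main()
{
    printf("useful bytes per warp:         %d\n", kWarpSize * 4);
    printf("fetched with 128-byte lines:   %d\n", worstCaseBytesPerWarp(128));
    printf("fetched with 32-byte segments: %d\n", worstCaseBytesPerWarp(32));
    printf("strided read %s\n", runStridedRead(1024, 32) ? "ok" : "failed");
    return 0;
}
```

Under those assumptions a warp needs 128 useful bytes but moves 4096 bytes with 128-byte lines and 1024 bytes with 32-byte segments, which is why scattered reads may do better through the texture path. Real gains depend on the access pattern, so it is worth measuring.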

#3
Posted 10/31/2010 10:38 PM   
Hi Alexander,
Thanks for the reply.
Your first reply says "Constant memory access with known offset is essentially free". Do you mean that the constant arguments in instructions are stored in constant memory?
Could you please elaborate on that point?

[quote name='AlexanderMalishev' post='1140085' date='Oct 31 2010, 06:38 PM']Constant memory access with known offset is essentially free. Fermi is a RISC processor, instruction arguments come from registers, immediate constant or constant buffer. Otherwise separate load instruction is needed.
[code]immediate constant
set $p0 ne u32 $r4 -0x1

constant buffer
add b32 $r12 shl $r13 0x2 c2[0xc8]

global memory
ld b32 $r4 ca g[$r12(null)+0]

constant buffer with unknown offset
ld b32 $r17 c2[$r17(null)+0x20][/code]

Texture load could be faster than normal global memory. Global memory load granularity is 128 bytes. If you load 4 bytes, hardware fetches 124 neighbors too. Texture load granularity is 128 bytes at L1 level, but only 32 bytes at L2 and memory level.[/quote]

#5
Posted 11/01/2010 01:35 AM   
[quote name='Sai@NCSU' post='1140164' date='Nov 1 2010, 01:35 AM']Hi Alexander,
Thanks for reply...
Your first reply says "Constant memory access with known offset is essentially free". You mean to say that the constant arguments in instructions are stored in constant memory.
Could you please elaborate on that point?[/quote]
There are two cases:
1. The constant argument is part of the instruction itself, so it is stored in the instruction cache. Example: mul.f32 r0 r1 3.1415
2. The constant argument is loaded from constant memory. Example: mul.f32 r0 r1 c0[10]

What I mean is that you don't need a separate instruction to load data from constant memory. Almost every instruction can take an operand from constant memory.

For example, to add a value from constant memory you need just one instruction:
add.f32 r0 r0 c2[0x35]

To add a value from global memory you need two instructions:
ld.u32 r1 g[0x35]
add r0 r0 r1
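
The one-instruction versus two-instruction difference can be sketched at the CUDA C level as follows. This is an illustration under assumptions, not guaranteed compiler behavior: kScale, addFromConstant, addFromGlobal, addFromConstantIndexed and runAddDemo are hypothetical names, and whether kScale[3] really ends up as a constant-buffer operand depends on the compiler and the target GPU.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__constant__ float kScale[64];  // hypothetical table placed in constant memory

// Fixed index: the offset into constant memory is known at compile time,
// so the compiler may use it directly as an operand of the add.
__global__ void addFromConstant(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += kScale[3];
}

// Same value read through a global-memory pointer: it has to be loaded
// into a register before the add can use it.
__global__ void addFromGlobal(float* data, const float* scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += scale[3];
}

// Runtime index: the offset is not known at compile time, which corresponds
// to the "constant buffer with unknown offset" case and needs its own load.
__global__ void addFromConstantIndexed(float* data, int n, int idx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += kScale[idx];
}

// Runs all three kernels on the same buffer and checks the result.
bool runAddDemo()
{
    const int n = 256;
    float hostScale[64];
    for (int k = 0; k < 64; ++k) hostScale[k] = 0.5f * k;  // kScale[3] == 1.5f

    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float *data = nullptr, *scale = nullptr;
    if (cudaMalloc(&data, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&scale, sizeof(hostScale)) != cudaSuccess) { cudaFree(data); return false; }
    cudaMemcpyToSymbol(kScale, hostScale, sizeof(hostScale));
    cudaMemcpy(scale, hostScale, sizeof(hostScale), cudaMemcpyHostToDevice);
    cudaMemcpy(data, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    addFromConstant<<<1, n>>>(data, n);
    addFromGlobal<<<1, n>>>(data, scale, n);
    addFromConstantIndexed<<<1, n>>>(data, n, 3);

    cudaError_t err = cudaMemcpy(host.data(), data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(data);
    cudaFree(scale);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != i + 4.5f) return false;  // three adds of 1.5f each
    return true;
}

int main()
{
    printf("constant vs global add demo %s\n", runAddDemo() ? "ok" : "failed");
    return 0;
}
```

All three kernels still load and store data[i] in global memory; only the source of the added value differs. Dumping the machine code with cuobjdump -sass is one way to check what was actually generated for each kernel.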

#7
Posted 11/01/2010 06:05 AM   
Got it, thank you.

#9
Posted 11/03/2010 03:34 AM   
[quote name='AlexanderMalishev' date='31 October 2010 - 05:38 PM' timestamp='1288564724' post='1140085']
Texture load could be faster than normal global memory. Global memory load granularity is 128 bytes. If you load 4 bytes, hardware fetches 124 neighbors too. Texture load granularity is 128 bytes at L1 level, but only 32 bytes at L2 and memory level.
[/quote]

Hi Malishev, do you know the load granularity of the constant cache?

Thanks,
Tuan

#11
Posted 11/22/2010 03:12 PM   
[quote name='minhtuan' date='22 November 2010 - 03:12 PM' timestamp='1290438761' post='1149984']
Hi Malishev, do you know load granularity for constant cache memory?

Thanks,
Tuan
[/quote]

I don't know.
A simple benchmark could reveal a lot of detail (see, for example, the paper "Demystifying GPU Microarchitecture through Microbenchmarking"), but I currently don't have enough time to study it.
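
A minimal sketch of what such a benchmark might look like, assuming a pointer-chasing approach with a single thread walking a chain stored in constant memory. This is not taken from the paper or from any measurement in this thread; kChain, chase and cyclesPerAccess are hypothetical names, and the 32 KB array size and the stride range are arbitrary choices.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kElems = 8192;               // 32 KB, half of the 64 KB constant space
__constant__ unsigned int kChain[kElems];  // each entry holds the index of the next one

// One thread follows the chain; every load depends on the previous one,
// so the elapsed cycles are dominated by access latency (plus loop overhead).
__global__ void chase(int iters, unsigned int* sink, long long* cycles)
{
    unsigned int j = 0;
    long long start = clock64();
    for (int k = 0; k < iters; ++k)
        j = kChain[j];
    long long stop = clock64();
    *sink = j;                // keeps the loop from being optimized away
    *cycles = stop - start;
}

// Builds a chain with the given stride (in 4-byte elements), runs the kernel,
// and returns average cycles per access, or a negative value on error.
double cyclesPerAccess(int strideElems, int iters)
{
    std::vector<unsigned int> chain(kElems);
    for (int i = 0; i < kElems; ++i)
        chain[i] = static_cast<unsigned int>((i + strideElems) % kElems);
    if (cudaMemcpyToSymbol(kChain, chain.data(), kElems * sizeof(unsigned int)) != cudaSuccess)
        return -1.0;

    unsigned int* sink = nullptr;
    long long* cycles = nullptr;
    if (cudaMalloc(&sink, sizeof(unsigned int)) != cudaSuccess) return -1.0;
    if (cudaMalloc(&cycles, sizeof(long long)) != cudaSuccess) { cudaFree(sink); return -1.0; }

    chase<<<1, 1>>>(iters, sink, cycles);

    long long elapsed = 0;
    cudaError_t err = cudaMemcpy(&elapsed, cycles, sizeof(elapsed), cudaMemcpyDeviceToHost);
    cudaFree(sink);
    cudaFree(cycles);
    return err == cudaSuccess ? static_cast<double>(elapsed) / iters : -1.0;
}

int main()
{
    // Sweep the stride from 4 to 512 bytes and print the average latency.
    for (int strideBytes = 4; strideBytes <= 512; strideBytes *= 2) {
        double c = cyclesPerAccess(strideBytes / 4, 4096);
        printf("stride %4d bytes: %.1f cycles/access\n", strideBytes, c);
    }
    return 0;
}
```

The idea is that the average latency should rise as the stride grows and then level off once every access lands in a new cache line, so the stride where the curve flattens hints at the line size. The numbers include loop overhead and will vary by GPU, so treat them as relative rather than absolute.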

#13
Posted 11/27/2010 07:27 AM   