my speedy SGEMM
I *think* that the two pairs of add.half instructions are packed into one instruction each, or something like that (I forget where I read that). I'm not sure whether they also issue somehow in parallel, or if just instruction fetch & decode is improved. That may be a source of the misprediction in our cycle estimates.

My latest code is 86 instructions too (not including hand assembly) -- turns out we both made effectively the same transformations but with very different-looking code...

The 86 can be cut down to 85 as follows, if I could just get an assembled cubin to link properly (I can't get the kernel to invoke properly using the cuLaunch stuff):

(1) Eliminate the use of $ofs4 in the second half of the loop. Instead, just directly add 0x80 to $ofs2. Then, at the end/start of the loop, subtract 0x80 from $ofs2. So we've removed an add and added a subtract (no change).

(2) But we now have $ofs4 free to store the base address of A or B, and can thus eliminate the load from $r30 at the top of the loop. Saves one instruction.

Anyway, enough micro-tweaking -- time to try other approaches to see if we can crack 200 Gflop/s. With the global load latency hiding/prefetch trick, it is less important to have multiple blocks in flight, which may make more register-intensive kernels that look at more data at once worthwhile.
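
For anyone who hasn't seen the prefetch trick spelled out, here is a minimal, self-contained sketch of the idea (a made-up kernel, not our SGEMM): issue the global load for the next iteration before doing the math on the current one, so that the memory latency overlaps with the arithmetic. Whether the compiler actually keeps the load ahead of the math has to be checked with decuda, of course.

[code]
// Illustration only: sum of squares with a one-iteration-ahead prefetch.
__global__ void prefetch_sketch( const float *in, float *out, int n )
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float cur = ( tid < n ) ? in[tid] : 0.0f;   // load for the first iteration
    float acc = 0.0f;

    for( int i = tid; i < n; i += stride )
    {
        int   j    = i + stride;
        float next = ( j < n ) ? in[j] : 0.0f;  // prefetch the NEXT element

        acc += cur * cur;   // math on the CURRENT element while 'next' is in flight

        cur = next;
    }
    if( tid < n ) out[tid] = acc;
}
[/code]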

- Paul

#46
Posted 01/16/2008 09:22 AM   
Hi!

I tried nearly the same approach some time ago to get better results than cublasCgemm, using CUDA/CUBLAS 1.0. Unfortunately, the resulting timings were much slower than cublasCgemm.
Now I am really surprised that this approach works this fast for float matrices with CUDA/CUBLAS 1.1. But the benchmarks of my similar approach for complex numbers did not change significantly, and it is still slower than cublasCgemm. I also had no luck adjusting your code to complex matrices so that it outperforms cublasCgemm. Have you ever tried it?

#47
Posted 01/19/2008 01:38 PM   
I think that achieving high performance in CGEMM should be easier than in SGEMM, since CGEMM has higher computational intensity (2x more data but 4x more flops). I didn't try it, but I guess it won't work well with a 32x32 block size in matrix C. Probably smaller blocks like 32x16 will give better performance. I should note that it took a couple of days of trying different block sizes and coding techniques and analyzing disassembler outputs until I made my SGEMM run fast.
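
To see where the 4x comes from: a complex multiply-accumulate is 8 flops, while a complex element is only twice the data of a float. A trivial device function, just for counting (float2 instead of cuComplex to keep it self-contained):

[code]
__device__ float2 cmad( float2 a, float2 b, float2 c )
{
    // complex c += a*b: 4 multiplies plus 4 adds/subtracts = 8 flops,
    // vs. 2 flops for a real MAD, on inputs that are only 2x larger
    c.x += a.x * b.x - a.y * b.y;
    c.y += a.x * b.y + a.y * b.x;
    return c;
}
[/code]

So per byte of data read, CGEMM does twice the arithmetic of SGEMM.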

Thinking of non-square blocks in matrix C, I tried some today and finally got 205 Gflop/s with a 64x16 block size. This block size allows keeping A's blocks entirely in registers, thereby avoiding the shared memory overheads (like moving data from registers to shared memory and back). It also reduced the consumption of shared memory, which made it possible to reach 206 Gflop/s in sgemm( 'N', 'N', ... ). (That in turn improved my LU with partial pivoting to 168 Gflop/s.)
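
To give an idea of the structure, here is a stripped-down sketch of the 64x16 layout (not my actual code: it assumes column-major storage, alpha = 1, beta = 0, dimensions that are exact multiples of the tile sizes, and it does no prefetching):

[code]
// 64 threads per block compute a 64x16 block of C = A*B (column-major).
// Each thread owns one row of the block: 16 accumulators stay in registers,
// each element of A is read into a register and reused 16 times, and only
// B goes through shared memory.  Launch with dim3(m/64, n/16) blocks of 64 threads.
__global__ void sgemm_64x16_sketch( const float *A, const float *B, float *C,
                                    int lda, int ldb, int ldc, int k )
{
    __shared__ float bs[16][17];               // 16x16 tile of B, padded against bank conflicts

    int row  = blockIdx.x * 64 + threadIdx.x;  // this thread's row of C
    int col0 = blockIdx.y * 16;                // first column of this block of C

    float c[16] = { 0.0f };                    // must stay in registers, hence the full unrolls

    for( int kk = 0; kk < k; kk += 16 )
    {
        // the 64 threads cooperatively load the 16x16 tile of B (4 elements each)
        for( int e = threadIdx.x; e < 256; e += 64 )
            bs[e % 16][e / 16] = B[ (kk + e % 16) + (col0 + e / 16) * ldb ];
        __syncthreads();

        // rank-16 update: each a value sits in a register for 16 MADs
        for( int kb = 0; kb < 16; kb++ )
        {
            float a = A[ row + (kk + kb) * lda ];
            #pragma unroll
            for( int j = 0; j < 16; j++ )
                c[j] += a * bs[kb][j];
        }
        __syncthreads();
    }

    #pragma unroll
    for( int j = 0; j < 16; j++ )
        C[ row + (col0 + j) * ldc ] = c[j];
}
[/code]

The real code also has to hide the global load latency and handle sgemm( 'N', 'T', ... ), which this sketch does not.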

#48
Posted 01/20/2008 10:02 AM   
Great work guys. Regarding your modeling of the instruction costs...

How many warps are active in your code? I ask because if you have fewer than 6 warps active (192 threads), you won't fully hide the arithmetic pipeline latency with back-to-back dependent arithmetic instructions.

The latency is approximately 22 clocks (this is the 1.35 GHz clock on 8800 GTX), and it takes 4 clocks to execute an arithmetic instruction (ADD, MUL, MAD, etc.) for a whole warp. If you have back-to-back MAD instructions with shared memory operands, then the hardware instruction set does, as you have surmised, need extra instructions for storing results to shared memory and for preparing shared memory addresses for input operands. So those instructions aren't likely to actually be back to back, which means you could possibly hide the latency with 3 warps minimum. However, if you have back-to-back dependent instructions with only register operands, they really can be back to back, for example:

MAD r3, r1, r1, r2
MAD r1, r3, r2, r4

Then you have to have enough warps to cover the full ~22-clock latency or you will stall (at 4 clocks per warp-instruction, that is the 6 warps mentioned above).

This is just one example to show you how a simplistic performance model like "6 cycles per shared memory MAD" can only ever be approximate.

Mark

#49
Posted 01/22/2008 02:09 PM   
Mark, thanks for the discussion.
[quote name='Mark Harris' date='Jan 22 2008, 06:09 AM']How many warps are active in your code?  I ask because if you have fewer than 6 warps active (192 threads), you won't fully hide the arithmetic pipeline latency with back-to-back dependent arithmetic instructions.[/quote]
In sgemm I have 2 warps per thread block and up to 4 thread blocks per multiprocessor, so it should be enough. Also, all MADs come in blocks of 16 independent instructions, which means there should be no back-to-back dependencies (only RARs, which are usually safe).

I assume that there is no penalty for outputting to the same register that is used as an input in the same instruction, like in MAD r1, r1, r2, r3.

Also, in the microbenchmark that I cited, I have the same dependencies in the code that uses the shared memory and in the code that does not. Both codes are run with 256 threads per thread block. Still, they get 229 vs. 338 Gflop/s, respectively.

[quote name='Mark Harris' date='Jan 22 2008, 06:09 AM']If you have back to back MAD instructions with shared memory operands, then the hardware instruction set does, as you have surmised, need extra instructions for storing results to shared memory and for preparing shared memory addresses for input operands.[/quote]
But I don't write to the shared memory in MADs. I have writes that store the data I fetch from the global memory, but there are only a few of them per dozens of MADs. Also, I count all of the extra instructions used for preparing the address (that I can see with decuda) when doing the estimates.

I'd like to see code that achieves >230 Gflop/s in shared-memory MADs. By "shared-memory MADs" I mean instructions like "mad.rn.f32 $r1, s[0x0010], $r2, $r3" in decuda notation.

I attached the code that runs at over 200 Gflop/s on the GeForce 8800 GTX in both sgemm( 'N', 'N', ... ) and sgemm( 'N', 'T', ... ).

#50
Posted 01/24/2008 11:21 AM   
[quote name='vvolkov' date='Jan 24 2008, 12:21 PM']But I don't write to the shared memory in MADs. I have writes that store the data I fetch from the global memory, but there are only a few of them per dozens of MADs. Also, I count all of the extra instructions used for preparing the address (that I can see with decuda) when doing the estimates.
[/quote]
Well, like I said: :)
[quote name='Mark Harris']If you have ... instructions with shared memory operands, then the hardware instruction set does, as you have surmised, need extra instructions for storing results to shared memory [b]and for preparing shared memory addresses for input operands[/b]. [/quote]
So I believe that in your code:
[code]c[0] += a*b[0];
c[1] += a*b[1];
c[2] += a*b[2];
c[3] += a*b[3];
c[4] += a*b[4];
c[5] += a*b[5];
c[6] += a*b[6];
c[7] += a*b[7];
c[8] += a*b[8];
// etc.[/code]
Each of those lines generates two instructions (ignoring a possible store to c[], depending on where the compiler has put it): the mad and an address update. You should be able to see this in the decuda output, right? Is that enough to explain why perf is lower with shared-memory operands?

Mark

#51
Posted 01/24/2008 04:34 PM   
Vasily and Paul,
great work on SGEMM!!!

Massimiliano

#52
Posted 01/24/2008 04:38 PM   
Hi Mark,

According to the disassembly output, there is only about *one* instruction per MAD from the saxpy code. The c[i] values are each in their own register, the a value is in a register, and the b[i] values are referenced in shared memory by a direct offset off an $ofs register. If the b[i] values were separated by more than 0x80 bytes, then the $ofs register would need to be updated between each MAD, which would introduce additional instructions -- but that's not the case here. The code has been very carefully structured by Vasily to ensure that (a) most operands are in registers and (b) nearly all references into shared memory do not require a load/update of the $ofs register.

The result is an inner kernel (for previous versions of the code) with 64 MAD instructions of the form:

[code]load r16 with a[0]
MAD r0, s[$ofs1 + 0x00], r16, r0
MAD r1, s[$ofs1 + 0x04], r16, r1
MAD r2, s[$ofs1 + 0x08], r16, r2
...
MAD r15, s[$ofs1 + 0x3C], r16, r15
load r17 with a[1]
MAD r0, s[$ofs1 + 0x40], r17, r0
MAD r1, s[$ofs1 + 0x44], r17, r1
...
increment $ofs1 by 0x80
load r18 with a[2]
MAD r0, s[$ofs1 + 0x00], r18, r0
etc.[/code]

Or something like that (I don't have the code handy right now). So there are no dependent reads/writes from one MAD to the next, nor are there any $ofs updates. It just seems like any MAD with a shared-memory operand, regardless of the number of active warps or blocks, has 33% lower throughput than one using only register operands.

If you'd like, I can post a simple kernel that demonstrates the reduction in throughput as a function of the number of threads. That kernel shows the 22-cycle latency and demonstrates that the processor can indeed fully hide it after ~192 threads, but that the shared-memory MAD is capped at 66% of the throughput of an all-register MAD.
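
Something along these lines (a sketch, not the exact kernel I ran -- and you have to verify with decuda that the compiler really keeps the s[...] operand inside the mad rather than hoisting it into a register):

[code]
// Four independent MAD chains per thread; register-only version.
__global__ void mad_regs( float *out, float a, float b )
{
    float c0 = a, c1 = b, c2 = a + b, c3 = a - b;
    for( int i = 0; i < 4096; i++ )
    {
        c0 = c0 * a + b;
        c1 = c1 * a + b;
        c2 = c2 * a + b;
        c3 = c3 * a + b;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = c0 + c1 + c2 + c3;
}

// Same dependency structure, but one operand lives in shared memory.
// Assumes <= 256 threads per block.
__global__ void mad_smem( float *out, float a, float b )
{
    __shared__ float s[256];
    s[threadIdx.x] = b;
    __syncthreads();

    float c0 = a, c1 = b, c2 = a + b, c3 = a - b;
    for( int i = 0; i < 4096; i++ )
    {
        c0 = c0 * a + s[threadIdx.x];
        c1 = c1 * a + s[threadIdx.x];
        c2 = c2 * a + s[threadIdx.x];
        c3 = c3 * a + s[threadIdx.x];
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = c0 + c1 + c2 + c3;
}
[/code]

Time both over a range of threads per block and compute Gflop/s as 2 flops x 4 MADs x 4096 iterations x total threads / elapsed time.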

Regards,

- Paul

#53
Posted 01/24/2008 06:09 PM   
I got 205 Gflop/s on my 8800GTX :w00t: !!
(vs 115 for CUBLAS)

Great work !

Maybe one of you can answer this question:
My application requires matrix-vector products: C=A*B with:
C[N,1], A[N,2N] and B[2N,1]

(!) important: A is constant, so host->device transfer is not that critical

My dream is to reach 50 microseconds with N=1000,
but cublas SGEMV is ~10 times too slow

Do you think it is possible to build an optimized sgemv that takes advantage of the fact that A is constant?

cheers,
J

#54
Posted 01/29/2008 02:38 PM   
[quote name='julien38' date='Jan 29 2008, 06:38 AM']My application requires matrix-vector products: C=A*B with:
  C[N,1], A[N,2N] and B[2N,1]

(!) important: A is constant, so host->device transfer is not that critical

My dream is to reach 50microsec with N=1000,
but cublas SGEMV is ~10 times too slow
[/quote]
I get about 15 microseconds of latency per memcpy to/from GPU memory, which gives you 30 us of overhead just for transferring B and C. Then you need to read A (8,000,000 bytes) from GPU memory at least once, which takes 114 us at 70 GB/s, the maximum sustained read bandwidth on this card. So 114 + 30 ~ 150 microseconds would be a more realistic goal.

However, getting even this number may be tricky. A naive implementation would create 1000 threads and lose, because that may not be enough to hide the global memory latency, giving you only a fraction of those 70 GB/s. The solution I'd think of is splitting A into block columns of size N x K. Then you do a few SGEMVs to get a matrix of partial results C[N, 2N/K], transfer it to the CPU and finish the reduction there to get C[N,1]. The point is that you can run those few SGEMVs at the same time, so that you have enough threads to hide the latency.
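
A sketch of what I mean (hypothetical kernel, one thread per row of A, K <= 256 and K dividing 2N; not tuned):

[code]
// A is N x 2N, column-major.  Each blockIdx.y handles one N x K column
// block of A and writes one column of partial sums; the final reduction
// over the 2N/K partial columns is done on the CPU.
__global__ void sgemv_partial( const float *A, const float *x, float *Cpart,
                               int n, int lda, int K )
{
    __shared__ float xs[256];                  // this block's piece of x

    int blk = blockIdx.y;                      // which column block of A
    for( int j = threadIdx.x; j < K; j += blockDim.x )
        xs[j] = x[blk * K + j];
    __syncthreads();

    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if( row >= n ) return;

    const float *Ablk = A + blk * K * lda;     // first column of this block of A
    float sum = 0.0f;
    for( int j = 0; j < K; j++ )
        sum += Ablk[row + j * lda] * xs[j];    // contiguous across threads

    Cpart[row + blk * n] = sum;                // partial result, reduced on the CPU
}
[/code]

Launch it with dim3( (N + 255) / 256, 2 * N / K ) blocks of 256 threads, copy the N x (2N/K) matrix of partials back and sum each row on the CPU. With N = 1000 and, say, K = 125, that is 64 thread blocks in flight instead of 4.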

#55
Posted 01/29/2008 07:05 PM   
Paul,

First of all, congrats on your speedup!!! It's good to know.

I know nothing about SGEMM. But still, I had a look at your code.

I have a few suggestions. Pardon me if my ideas are inappropriate for SGEMM.

1) When you have multi-dimensional blocks (16 x 4, for instance), the way threads are mapped into warps might NOT be straightforward. There have been some experiments that show this clearly.

See these two forum threads:
http://forums.nvidia.com/index.php?showtopic=58052
http://forums.nvidia.com/index.php?showtopic=57779 – a link mentioned in the thread above

2)
In my experience, I have found that having just 32 threads per block eliminates the need for __syncthreads(), double-buffering and so on. As long as you schedule enough blocks (6 blocks per multiprocessor), you should be cool. Usually it results in a 1.3x speedup.
Check out: http://forums.nvidia.com/index.php?showtopic=54875

But I am not sure if the second idea would help you. If your code depends on the block dimensions, it may not be possible to re-code.

Ignorance Rules; Knowledge Liberates!

#56
Posted 01/31/2008 08:02 AM   
Hi,

I downloaded sgemmN_012408.zip but only managed ~27 Gflop/s for your SGEMM, versus ~120 Gflop/s for the BLAS version. There is zero error/difference for each run. I'm using version 1.0 of the SDK and have an 8800GTX. Where have I gone wrong?

Kind Regards,
Paul

#57
Posted 04/03/2008 11:36 AM   
[quote name='pkeir' date='Apr 3 2008, 03:36 AM']I downloaded sgemmN_012408.zip but only managed ~27Gflops/sec for your SGEMM, versus ~120Gflops/sec for the BLAS version. I'm using version 1.0 of the SDK, and have an 8800GTX. Where have I gone wrong?
[/quote]

Try using SDK 1.1. Its performance is strongly compiler-dependent.

Also, I'd recommend compiling it as 32-bit if you are using a 64-bit system. Otherwise you might lose ~10% of the performance.

Vasily

#58
Posted 04/03/2008 11:43 AM   
Thanks, of course 1.1 kills all my cygwin bash scripts, but hey: 206 Gflops on my 8800GTX. Very impressive!

#59
Posted 04/07/2008 09:18 AM   
I found that ssyrk in CUBLAS 2.0 runs on the 8800GTX at only up to 127 Gflop/s. Apparently it is based on the old sgemm. Here is an example that runs as fast as the new sgemm, i.e. at 200+ Gflop/s.

#60
Posted 05/02/2008 01:44 PM   