GTX 460

Any CUDA or OpenCL information and benchmarks on this new product?

From
http://www.anandtech.com/show/3809/nvidias…-the-200-king/3

ECC is gone.
Double precision is 1/6th of the FP32 performance, which is better than the 1/8th rate on the GTX 470/480.

How will the superscalar execution affect the CUDA and OpenCL compilers?

[url=“http://en.wikipedia.org/wiki/GeForce_400_Series”]GeForce 400 series - Wikipedia[/url]

The wiki says FP32 runs at 1361 GFLOPS, which would put FP64 at 226.83 GFLOPS. That’s way higher than the 470’s 136 GFLOPS.

So as a 470 owner, I think I was royally screwed.

Looking good. Now I just need to find somewhere selling this 2GB variant from Sparkle:

http://www.sparkle.com.tw/News/SP460/news_SP460_en.html

http://www.anandtech.com/show/3809/nvidias…60-the-200-king

Hmm, on the first page it says 1/12th FP32, which doesn’t match

“but the effective execution rate of 1/6th FP32 performance will be enough to effectively program in FP64 and debug as necessary.” on the 3rd page.

From

http://www.legitreviews.com/article/1360/1/

L2 cache size cut to 384KB (768MB version) and 512KB (1GB version)

No CUDA benchmarks online yet, but we can predict from SP count and MHz pretty well.
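For rough prediction, peak single-precision throughput is just cores × shader clock × 2 (one FMA per core per cycle). A minimal sketch, assuming the stock GTX 460 figures of 336 CUDA cores at a 1350 MHz shader clock (adjust for factory-overclocked boards):

```cuda
#include <cstdio>

// Peak FP32 GFLOPS = cores * shader clock (GHz) * 2 ops per FMA.
// 336 cores / 1350 MHz are the stock GTX 460 specs.
int main()
{
    const int    cores       = 336;
    const double clock_ghz   = 1.35;
    const double peak_gflops = cores * clock_ghz * 2.0;  // ~907 GFLOPS
    std::printf("Peak FP32: %.1f GFLOPS\n", peak_gflops);
    return 0;
}
```

Real kernels land somewhere below that peak depending on memory traffic and, on GF104, on how much ILP the scheduler can find.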

What makes me have a BIG SMILE is that power use is only about 150 watts under load, and temps are roughly 65 degrees C, not 90 C.
This is very, very exciting because it means a two-GF104-chip card is within practical power and heat limits. I am now camping in line at the local NVIDIA store for the now-inevitable GTX 495!

Take a look at this article. [url=“http://www.anandtech.com/show/3809/nvidias...60-the-200-king”]NVIDIA’s GeForce GTX 460: The $200 King[/url]

Seems like they added the ability to issue multiple instructions per-cycle from the same warp, and also added in an extra set of functional units per SM.
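One way to play to that dual-issue scheduler is to give each thread independent instruction streams. A sketch (the kernel name and constants are my own, purely illustrative; measure before trusting it):

```cuda
// Each thread carries two independent accumulators, so the GF104
// scheduler can (in principle) dual-issue the two FMA streams from
// the same warp. Compile with nvcc -arch=sm_21.
__global__ void ilp2_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i + 1 < n) {
        float a = in[2 * i];
        float b = in[2 * i + 1];
        float acc0 = 0.0f, acc1 = 0.0f;
        #pragma unroll
        for (int k = 0; k < 64; ++k) {
            acc0 = acc0 * 1.0001f + a;  // stream 0
            acc1 = acc1 * 0.9999f + b;  // stream 1, independent of stream 0
        }
        out[i] = acc0 + acc1;
    }
}
```

On hardware that cannot dual-issue, the same kernel should still run correctly, just without the extra overlap.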

Tech Report, AnandTech, FiringSquad, Guru3D, Hardware Canucks, Hardware Heaven, [H]ard|OCP,
Hexus.net, HotHardware, and PC Perspective all have GTX460 reviews.

None of them do a single CUDA test, not even Folding@Home. Sigh.

We’ll just have to do our own experiments!

A prepackaged CUDA benchmark suite for Windows would be nice to push toward these reviewers. Any options?

So with the 50% extra cores, can we expect it to perform like a GF100 with 336 SPs when the ILP is good, and in the worst-case scenario like it had only 224 SPs?

I guess that’s hard to say, but it’s also noteworthy that they didn’t increase the on-chip memory resources accordingly either.

Plus the doubling of special function units, which will improve code that depends on those heavily.

I bet that’s mostly marketing; let’s see.

And another thing: double precision (64-bit floats) is only on one of each three blocks of cores. Does this impact programming (that is, do you have to include code to allow for this), or does the CUDA runtime automatically take care of it?
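As far as I know, no source changes are needed: you write the kernel with double as usual, compile for a DP-capable architecture, and the hardware routes the DP instructions to whichever units can execute them; the uneven unit count shows up only as lower throughput. A minimal sketch:

```cuda
// Double precision needs no special handling in source. Write the
// kernel with double and compile with, e.g., nvcc -arch=sm_21 (or
// sm_13/sm_20 on other DP-capable parts). The scheduler handles
// which units execute the DP instructions.
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```

Without a DP-capable -arch flag, the compiler demotes double to float with a warning, so the flag is the one thing you do have to get right.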

Together with the artificial crippling of the 64-bit engine, anything needing 64-bit floating-point math will probably run better on the CPU - or will have to be converted to fixed-point maths - assuming one doesn’t have a Tesla unit lurking out of earshot.

If anyone else in this thread has trouble getting a GTX 460 to work under Windows XP Prof. 64 bit, please let me know.

Isn’t every GF100 100+ GFlops peak in DP? And maybe GF104 is 100+ GFlops too? That’s higher than any CPU you can get, isn’t it?

Probably, since the imaginary Wikipedia specs for the as yet unreleased Sandy Bridge CPUs put the DP rate for all cores combined at 128 GFLOPS if you use Intel’s new AVX instructions.

That’s what I thought. I’m not trying to be a paid NVIDIA shill (paid NVIDIA, yes, shill, no), but between oodles of memory bandwidth versus CPUs (if you’re not able to just stream from the cache on a CPU) and higher peak DP performance it seems disingenuous to claim that no matter what you’ll be better off running your DP calculations on the CPU.

What’s new in CUDA compute capability 2.1?

Different instruction scheduling for GF104.

I’m having serious problems with my new GTX 460. It’s seriously underperforming - look at the device-to-device bandwidth figures below. Shouldn’t these be up around 115 GB/s? I’m using Windows XP Pro 64-bit and CUDA 2.3. Any suggestions?

[codebox]Running on…

  device 0: GeForce GTX 460

Shmoo Mode

Host to Device Bandwidth for Pinned memory

Transfer Size (Bytes)   Bandwidth (MB/s)
    1024     286.6
    2048     500.3
    3072     718.1
    4096     917.1
    5120    1092.3
    6144    1268.0
    7168    1433.7
    8192    1599.9
    9216    1720.9
   10240    1851.8
   11264    1968.9
   12288    2097.4
   13312    2379.9
   14336    2303.2
   15360    2396.7
   16384    2498.9
   17408    2567.0
   18432    2664.0
   19456    2717.1
   20480    2789.5
   22528    2944.7
   24576    3058.1
   26624    3101.9
   28672    3287.8
   30720    3380.8
   32768    3489.0
   34816    3542.5
   36864    3634.2
   38912    3697.9
   40960    3921.3
   43008    3827.2
   45056    3867.4
   47104    3922.3
   49152    3977.3
   51200    4029.3
   61440    4256.3
   71680    4410.1
   81920    4554.9
   92160    4630.9
  102400    4735.8
  204800    5183.5
  307200    5346.8
  409600    5445.9
  512000    5495.1
  614400    5539.8
  716800    5569.8
  819200    5596.2
  921600    5605.5
 1024000    5616.6
 1126400    5594.9
 2174976    5655.4
 3223552    5678.2
 4272128    5690.4
 5320704    5698.2
 6369280    5700.4
 7417856    5706.6
 8466432    5710.1
 9515008    5710.7
10563584    5713.6
11612160    5714.0
12660736    5715.0
13709312    5716.0
14757888    5718.5
15806464    5718.7
16855040    5720.6
18952192    5725.8
21049344    5727.6
23146496    5726.3
25243648    5726.1
27340800    5727.1
29437952    5727.8
31535104    5731.2
33632256    5728.5
37826560    5716.5
42020864    5729.5
46215168    5730.1
50409472    5726.0
54603776    5723.8
58798080    5704.6
62992384    5730.9
67186688    5731.0

Shmoo Mode

Device to Host Bandwidth for Pinned memory

Transfer Size (Bytes)   Bandwidth (MB/s)
    1024     271.3
    2048     547.9
    3072     788.6
    4096     996.5
    5120    1204.3
    6144    1367.5
    7168    1540.2
    8192    1674.5
    9216    1843.3
   10240    1967.6
   11264    2078.6
   12288    2196.8
   13312    2313.3
   14336    2392.2
   15360    2506.9
   16384    2579.4
   17408    2671.5
   18432    2746.6
   19456    2816.1
   20480    2883.1
   22528    3001.3
   24576    3138.1
   26624    3245.2
   28672    3339.2
   30720    3428.9
   32768    3512.8
   34816    3594.0
   36864    3662.1
   38912    3730.0
   40960    3813.5
   43008    3880.5
   45056    3932.0
   47104    3987.0
   49152    4029.8
   51200    4064.7
   61440    4268.2
   71680    4398.3
   81920    4511.9
   92160    4606.0
  102400    4706.6
  204800    5095.2
  307200    5242.7
  409600    5317.3
  512000    5351.9
  614400    5371.2
  716800    5397.8
  819200    5415.4
  921600    5432.4
 1024000    5447.8
 1126400    5417.5
 2174976    5486.0
 3223552    5510.8
 4272128    5516.7
 5320704    5522.0
 6369280    5538.7
 7417856    5546.1
 8466432    5544.8
 9515008    5546.2
10563584    5555.0
11612160    5552.0
12660736    5548.5
13709312    5555.5
14757888    5552.1
15806464    5542.7
16855040    5555.8
18952192    5548.9
21049344    5537.4
23146496    5557.3
25243648    5555.4
27340800    5558.6
29437952    5543.4
31535104    5562.5
33632256    5556.8
37826560    5559.2
42020864    5555.8
46215168    5559.7
50409472    5559.2
54603776    5556.7
58798080    5549.8
62992384    5553.7
67186688    5565.1

Shmoo Mode

Device to Device Bandwidth

Transfer Size (Bytes)   Bandwidth (MB/s)
    1024     426.8
    2048    1743.9
    3072    1111.1
    4096    3458.1
    5120    4256.3
    6144    5121.8
    7168    6000.6
    8192    6810.1
    9216    7640.0
   10240    8453.6
   11264    9109.7
   12288   10005.8
   13312   10751.5
   14336   11547.2
   15360   12322.0
   16384   12969.0
   17408   13779.5
   18432   14513.0
   19456   15339.5
   20480   15873.9
   22528   17438.6
   24576   18732.0
   26624   20189.7
   28672   21335.6
   30720   19604.4
   32768   20580.9
   34816   21684.4
   36864   23008.0
   38912   24034.6
   40960   24886.9
   43008   26078.1
   45056   27292.1
   47104   28217.3
   49152   29180.2
   51200   30155.7
   61440   32152.0
   71680   34179.7
   81920   37159.9
   92160   38657.0
  102400   40880.9
  204800   54133.2
  307200   46018.0
  409600   49451.2
  512000   50828.7
  614400   52100.0
  716800   52792.1
  819200   53621.9
  921600   54317.8
 1024000   54632.3
 1126400   55090.4
 2174976   56944.0
 3223552   57524.8
 4272128   58220.4
 5320704   58173.5
 6369280   58426.6
 7417856   58432.9
 8466432   58440.4
 9515008   58550.6
10563584   58572.1
11612160   58634.9
12660736   58747.3
13709312   58727.0
14757888   57540.1
15806464   58821.9
16855040   58838.2
18952192   58874.6
21049344   58949.4
23146496   58875.4
25243648   58887.4
27340800   58998.1
29437952   59125.1
31535104   59026.5
33632256   59396.5
37826560   59207.2
42020864   59300.1
46215168   59406.6
50409472   59277.7
54603776   59271.4
58798080   59314.8
62992384   59205.6
67186688   59121.2

&&&& Test PASSED

Press ENTER to exit…
[/codebox]
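For reference, the theoretical figure those device-to-device numbers get compared against comes straight from the memory clock and bus width. A sketch, assuming the 1GB card’s stock 900 MHz GDDR5 (3.6 Gbps effective per pin) on a 256-bit bus:

```cuda
#include <cstdio>

// Theoretical memory bandwidth = effective data rate * bus width.
// GTX 460 1GB stock: 900 MHz GDDR5 (quad-pumped -> 3.6 Gbps/pin),
// 256-bit bus. The 768MB card has a 192-bit bus instead.
int main()
{
    const double gbps_per_pin = 3.6;   // effective GDDR5 data rate
    const int    bus_bits     = 256;
    const double gbytes_per_s = gbps_per_pin * bus_bits / 8.0;  // 115.2
    std::printf("Theoretical peak: %.1f GB/s\n", gbytes_per_s);
    return 0;
}
```

Achieved copy bandwidth always sits below this peak, but a large, sustained gap like the one above usually points at clocks, drivers, or toolkit support rather than the calculation.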

That’s true of the GF100 range, but on GF104 only one in three of the cores can do double-precision floating point, so I can’t see how it could be better than a theoretical peak of around 25 GFLOPS of double precision.
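The thread quotes both 1/6th and 1/12th of FP32 for GF104, so any DP peak is a sketch either way. Taking the ~907 GFLOPS FP32 figure for a stock GTX 460, the two quoted rates work out as follows (which rate applies in practice depends on how the DP-capable cores are actually scheduled):

```cuda
#include <cstdio>

// GF104 DP peak under the two FP32 fractions quoted in this thread.
// 907.2 GFLOPS FP32 = 336 cores * 1.35 GHz * 2 (stock GTX 460).
int main()
{
    const double fp32_gflops = 907.2;
    std::printf("DP at 1/6 rate:  %.1f GFLOPS\n", fp32_gflops / 6.0);
    std::printf("DP at 1/12 rate: %.1f GFLOPS\n", fp32_gflops / 12.0);
    return 0;
}
```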