Tesla P4, P40 Accelerators Deliver 45x Faster AI

[rant]
So none of the new Titan, P6000, P40, or P4 cards do anything for me for AI. I would like fast FP16. How come NVIDIA keeps saying it supports AI when none of its cards have this?

NVIDIA, please drop FP64, drop FP32, and make an FP16 chip, and I will buy them. Your 1080 has bugs with GDDR5X.

If you are in the business of large-scale AI and you are not spending someone else's money, the GTX 1070 is actually the best card to buy and to build farms out of.

The average AI practitioner is stuck on interconnect speed because of backpropagation (BP), but modern post-DL designs (most people are still hung up on DL) are embarrassingly parallel. So NVLink is not needed either, thanks :/
[/rant]

Tesla P100 variants (SXM2, PCIe) have fast FP16. Those are the recommended processors for fast DL training in an FP16 setting.
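
To make that concrete, here is a minimal sketch (illustrative only; the kernel name and sizes are mine) of the packed-FP16 path that P100 accelerates. On compute capability 6.0 (GP100), the half2 intrinsics operate on two FP16 values per instruction, which is where the throughput advantage over FP32 comes from:

[code]
// compile with: nvcc -arch=sm_60
#include <cuda_fp16.h>

// y = a*x + y on packed half2 data; each __hfma2 is one fused
// multiply-add covering two FP16 lanes
__global__ void haxpy(int n2, __half2 a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        y[i] = __hfma2(a, x[i], y[i]);
    }
}
[/code]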

You’ll buy tens of thousands of them if they are $6K each?

Nearly all products today, except the P100, are supported by the graphics market. Dropping FP32 is not feasible in a graphics GPU.

So a processor like that could certainly be produced, but it wouldn't have the volume demand to support GTX pricing levels. A small amount (1/24 or 1/32) of FP64 capability doesn't meaningfully impact die area or cost, but is useful from a development/compatibility perspective.

Tesla P40 (and P4) have substantial INT8 throughput. This will be useful/meaningful as these processors attempt to add value in the DL inferencing space.

For DL training, especially where FP16 is involved, Tesla P100 is the recommended product.

Yes, you are correct, and I understand that it's a big cost to fab a line not subsidised by graphics. Perhaps, though, if I want it, there will be big companies out there that want it too.

I don't see how the P100 can be recommended.

for $60,000 upfront:
5 x P100 = 93.5 TFLOPS FP16
150 x GTX 1070 = 975.0 TFLOPS (FP16 data, computed at the FP32 rate)

ongoing costs:
P100 = 16 W per TFLOPS
GTX 1070 = 23 W per TFLOPS
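
If it helps, here is the same arithmetic spelled out. The per-board prices are simply what the $60,000 totals imply (roughly $12K per P100 and $400 per 1070), so treat them as my working assumptions rather than quotes:

[code]
// quick sanity check of the figures above (plain host C++, nothing GPU-specific)
#include <cstdio>

int main()
{
    const double budget     = 60000.0;                      // USD upfront
    const double p100_cost  = 12000.0, p100_tflops = 18.7;  // FP16 rate, PCIe board
    const double gtx_cost   = 400.0,   gtx_tflops  = 6.5;   // FP16 data at the FP32 rate

    double n_p100 = budget / p100_cost;   // ~5 boards
    double n_gtx  = budget / gtx_cost;    // ~150 boards

    printf("P100: %3.0f boards -> %6.1f TFLOPS\n", n_p100, n_p100 * p100_tflops);
    printf("1070: %3.0f boards -> %6.1f TFLOPS\n", n_gtx,  n_gtx  * gtx_tflops);
    return 0;
}
[/code]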

The reason people buy P100s and NVLink is that their AI algorithms don't scale, so they are trying to squeeze a camel through a hoop. But as we saw with CPUs, you can only fit so much processing density in a given space, and eventually you have to bite the bullet and change your algorithms.

Pascal GPUs don't have significant INT8 speed. Kepler/Maxwell had one single-cycle packed INT8 operation, SAD. Pascal added a second one, DP. That's all.

You're basically proposing scale-out instead of scale-up. Except that almost nobody prefers that route, and the reasons have to do with cost-benefit as much as anything else. People use scale-out when they have to, not when they have alternative (smaller) choices.

As we’ve seen with CPUs, scaling out (more compute processors) rather than scaling up (more power per compute processor) has various issues/costs/tradeoffs associated with it.

First of all, it’s not always a trivial matter to take an algorithm that requires 20TF of compute throughput to meet a particular performance level, and have that algorithm work equally well whether I deliver the 20TF via one chip or via 20-30 chips (using your ratio). In general, from a complexity standpoint, nearly all developers would rather try to write the algorithm using a single chip than the 20-30 chip approach.

Second, all distributed algorithms have communication costs. These vary by algorithm, but ultimately impose scaling limits on nearly all parallel algorithms. Eventually, the additional communication overhead going from, say, 10000 nodes to 20000 nodes results in far less than 2x perf improvement, and there are diminishing returns beyond that.
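
To put a toy number on that diminishing-returns point (an illustration of mine, not a measurement): if we treat the communication/serial portion like the serial fraction in Amdahl's law, then even with 99% of the work perfectly parallel, doubling the node count from 10,000 to 20,000 buys almost nothing:

[code]
// Amdahl-style strong-scaling estimate: S(N) = 1 / ((1 - p) + p / N)
#include <cstdio>

double speedup(double p, double nodes)   // p = parallel fraction
{
    return 1.0 / ((1.0 - p) + p / nodes);
}

int main()
{
    const double p = 0.99;                             // assumed parallel fraction
    printf("S(10000) = %.1f\n", speedup(p, 10000.0));  // ~99.0
    printf("S(20000) = %.1f\n", speedup(p, 20000.0));  // ~99.5, nowhere near 2x better
    return 0;
}
[/code]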

Third, you haven't really captured the cost scenario properly. It's a bit more involved than just the cost of the GPU (or just the cost of the CPU, in a traditional cluster). Neither a CPU nor a GPU lives by itself, and if we talk about real distributed systems, nearly all of them today use cluster computing to achieve performance at scale. So each time you add 2 or 4 CPUs, or 2-8 GPUs, you are adding the cost of a node to house them. For each node, you are also adding incremental costs for network hardware, rack space, floor space, and cooling.

In short, almost nobody is happy with a larger system when they can achieve their goal with a smaller one, and if the node ratio we're talking about here is ~10x as you suggest, the comparison generally favors the smaller system pretty compellingly once we scale out to systems of the size people are buying today.

That is one reason why people buy P100s, as opposed to clusters full of GTX cards, and it is related to the reason why people buy Haswell and bulldoze their Nehalem clusters out onto the scrap heap.

Almost without exception, developers would prefer it if Intel would simply give them a CPU that runs their code 10x as fast. This is hard to do, however, and silicon integration, albeit a powerful tool, hasn't yielded this very often for serial codes (not as often as Moore's law would predict, and less often recently). So the resistance is not that "you can only fit so much processing density in a space". That is simply an incorrect concept, if we take time into account. As time goes on, silicon integration gives us more processing density in a space, approximately consistent with Moore's law. Developers absolutely prefer more processing density in a given space, rather than having to distribute their system across a larger space. The problem for developers is that, lately, in order to reap the benefit of silicon integration, it's been necessary to parallelize your code. As time goes on, more parallelism is needed to take advantage of the additional benefits that silicon integration affords.

As an aside, the GPU/CPU wrinkle in this is that the GPU is in its best light as a parallel processor. This comes about not through any silicon advantage (NVIDIA, AMD, and others do not have silicon integration capabilities greater than Intel's), but through a significantly different design approach for the GPU vs. the CPU, i.e. in the allocation of the silicon transistor budget or silicon real estate. For massively parallel codes, the GPU becomes a more effective choice than current CPU architectures. Since we are being pushed toward parallelism anyway, for those who are more easily able to adapt to massive parallelism, the GPU becomes an interesting choice.

Pascal added single-instruction packed INT8 multiply-accumulate (vector dot product) operations, IDP2A and IDP4A.

With respect to that type of operation, which is of interest to the DL community, the Pascal GPUs (except P100) have significantly higher throughput than previous GPUs.
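
For anyone who wants to experiment with it, here is a minimal sketch (illustrative; the kernel name is mine). IDP4A is exposed in CUDA as the __dp4a() intrinsic on compute capability 6.1 parts, i.e. GP102/GP104/GP106 (P40, P4, GTX 10-series), not GP100/P100:

[code]
// compile with: nvcc -arch=sm_61
// each 32-bit int packs four signed 8-bit values; __dp4a computes their
// 4-way dot product and adds it to a 32-bit integer accumulator
__global__ void int8_dot(int n4, const int *a, const int *b, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        out[i] = __dp4a(a[i], b[i], 0);
    }
}
[/code]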

txbob provided an excellent summary, to which I want to add just a few additional thoughts:

[1] Communication represents data movement, and data movement is energy-intensive. The biggest challenge in many kinds of high-performance computing at present is the power wall, so it is advantageous to minimize communication and to restrict necessary communication to the shortest path possible.

[2] Most real-life workloads do not provide limitless parallelism, and in fact, after many years of focusing on adding cores, GPUs had reached the limits of feasible parallelization for some of them, providing a strong incentive for the cores themselves to become faster, which is where Pascal delivers relative to Maxwell. As txbob explained, there are multiple other reasons why many people subscribe to Seymour Cray's insight that it is easier to plow a field with a pair of oxen than with 1024 chickens.

[3] Processors for very specialized markets can always be made faster (and/or more efficient) than a general-purpose engine; this is why D. E. Shaw built Anton and its follow-ups, and why PEZY was created in Japan as a GRAPE follow-up. But the individual markets addressed by those parts are too small to make a real business out of them.

[4] Re "there will be big companies out there that want it": I think we can safely assume that those big companies are already in talks with NVIDIA (and from various articles in the trade press we can conclude that they are in fact gobbling up NVIDIA's current hardware). It is unrealistic to expect NVIDIA to provide one's "dream machine" unless one is willing to pony up the serious money it takes to make it worthwhile as a business. I don't know how many R&D dollars NVIDIA spent on Pascal, but already years ago the cost of creating a major new architecture, CPU or GPU, was $500 million and up. You have to be able to amortize those NRE expenses across the expected lifetime/volume of a product.

I agree with all you have said. If algorithms aren't embarrassingly parallel, they suffer diminishing returns: Amdahl's law, as you put it in other words.

Publicly known AI is not scalable; it is tied to fast interconnects. That is why you see the supposedly state-of-the-art OpenAI.com receiving a DGX-1 unit. I admit that in a way my post is more a subtle boast that I have invented scalable AI algorithms that eliminate the need for the dreaded interconnect.

[quote]
I admit that in a way my post is more a subtle boast that I have invented scalable AI algorithms that eliminate the need for the dreaded interconnect.
[/quote]

In that case, why not point to your relevant patent(s) or an article in a peer-reviewed publication in a well-regarded journal or conference?

[quote]
In that case, why not point to your relevant patent(s) or an article in a peer-reviewed publication in a well-regarded journal or conference?
[/quote]

Just assume I'm a dumb kid, sorry. These things are proprietary. When money is involved, things always get solved a lot quicker. And as with all discoveries, it's not something new being invented, just a combination of several existing but disparate pieces of knowledge. Like a suitcase with wheels: it took nearly 100 years for someone to put those two ideas together!

Combining existing elements in a novel way is certainly a valid and useful form of innovation. I also understand that not all information can be shared freely and that sometimes trade secrets are the most expedient way to protect intellectual property. I assume that if your idea takes off, we will learn about it sooner or later, one way or the other. Best of luck in your business ventures.