You’re basically proposing scale-out instead of scale-up. The trouble is that almost nobody prefers that route, and the reasons have as much to do with cost/benefit as anything else. People use scale-out when they have to, not when they have a smaller alternative.
As we’ve seen with CPUs, scaling out (more processors) rather than scaling up (more performance per processor) comes with its own set of issues, costs, and tradeoffs.
First of all, it’s not always a trivial matter to take an algorithm that requires 20TF of compute throughput to meet a particular performance level, and have that algorithm work equally well whether the 20TF is delivered by one chip or by 20-30 chips (using your ratio). In general, from a complexity standpoint, nearly all developers would rather write the algorithm for a single chip than take the 20-30 chip approach.
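To make the complexity point concrete, here’s a minimal Python sketch of the same reduction written both ways. The device count and the partition/combine logic are illustrative stand-ins, not a real multi-GPU API; in a real cluster the combine step would be a network collective (e.g. an MPI reduction), not a local loop:

```python
import numpy as np

data = np.random.rand(1_000_000)

# Single-chip version: one call, no partitioning, no communication.
total = data.sum()

# Multi-chip version: the developer now owns partitioning, per-device
# work, and a combine step. In a real cluster the combine is a network
# operation, not a local loop.
NUM_DEVICES = 20                                   # hypothetical
chunks = np.array_split(data, NUM_DEVICES)
partial_sums = [chunk.sum() for chunk in chunks]   # one per device
total_distributed = sum(partial_sums)              # the "communication" step

assert np.isclose(total, total_distributed)
```

Even in this toy form, the distributed version has more moving parts, and that gap only widens for algorithms with real data dependencies between partitions.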
Second, all distributed algorithms have communication costs. These vary by algorithm, but they ultimately impose scaling limits on nearly all parallel algorithms. Eventually, the additional communication overhead of going from, say, 10,000 nodes to 20,000 nodes yields far less than a 2x performance improvement, and the returns keep diminishing beyond that.
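Here’s a toy strong-scaling model of that effect, assuming compute time divides evenly across nodes while collective-communication time grows roughly with log2(node count). The coefficient is invented for illustration; real values depend entirely on the algorithm and the interconnect:

```python
import math

# Toy model: per-run time = compute spread over n nodes, plus a
# communication term that grows with node count. Constants are made up.
def run_time(n_nodes, compute=1.0, comm_coeff=1e-5):
    comm = comm_coeff * math.log2(n_nodes) if n_nodes > 1 else 0.0
    return compute / n_nodes + comm

for n in (1, 100, 1000, 10000, 20000):
    print(f"{n:>6} nodes -> speedup {run_time(1) / run_time(n):7.0f}x")
```

With these particular constants, doubling from 10,000 to 20,000 nodes buys only about 1.2x, which is the shape of the curve the argument is about.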
Third, you haven’t really captured the cost scenario properly. It’s a bit more involved than just the cost of the GPU (or, in a traditional cluster, just the cost of the CPU). Neither a CPU nor a GPU lives by itself, and nearly all real distributed systems today use cluster computing to achieve performance at scale. So every time you add 2 or 4 CPUs, or 2-8 GPUs, you are adding the cost of a node to house them, and with each node comes incremental cost in network hardware, rack space, floor space, and cooling.
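A back-of-envelope sketch of that cost math. All dollar figures below are placeholders, not quotes from any vendor; the point is only that the processor is one line item among several, and the per-node overhead gets multiplied by node count:

```python
# Hypothetical cost model: processor cost plus per-node overhead.
def cluster_cost(n_nodes, gpus_per_node,
                 gpu=5000, chassis=8000, nic_and_switch_share=2000,
                 rack_and_floor_share=1500, cooling_share=1000):
    per_node = (gpus_per_node * gpu + chassis + nic_and_switch_share
                + rack_and_floor_share + cooling_share)
    return n_nodes * per_node

# Same total GPU count (200), delivered densely vs. spread thin:
print(cluster_cost(n_nodes=25,  gpus_per_node=8))  # scale-up flavor
print(cluster_cost(n_nodes=200, gpus_per_node=1))  # scale-out flavor
```

The scale-out configuration pays the per-node overhead 200 times instead of 25, which is exactly why the GPU (or CPU) sticker price alone understates the comparison.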
In short, almost nobody is happy with a larger system when they can achieve their goal with a smaller one, and if the node ratio we’re talking about here is ~10x, as you suggest, the comparison is generally pretty compelling at the system sizes people are buying today.
That is one reason why people buy P100s rather than clusters full of GTX cards, and it’s related to the reason people buy Haswell and bulldoze their Nehalem clusters out onto the scrap heap.
Almost without exception, developers would prefer it if Intel would simply give them a CPU that runs their code 10x as fast. That is hard to do, however, and silicon integration, albeit a powerful tool, hasn’t delivered it very often for serial codes (not as often as Moore’s law would predict, and less often recently).

So the resistance is not that “you can only fit so much processing density in a space.” That premise is simply incorrect once we take time into account: as time goes on, silicon integration gives us more processing density in a given space, roughly in line with Moore’s law, and developers absolutely prefer more density in a given space over having to spread their system across a larger one. The problem for developers is that, lately, reaping the benefit of silicon integration has required parallelizing your code, and as time goes on, more parallelism is needed to capture the additional benefits that integration affords.
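The standard way to quantify “more parallelism is needed” is Amdahl’s law: on P cores, speedup = 1 / ((1 - f) + f/P), where f is the fraction of the code that actually runs in parallel. A quick sketch:

```python
# Amdahl's law: the serial fraction (1 - f) caps the achievable speedup
# no matter how many cores silicon integration hands you.
def amdahl(parallel_fraction, cores):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

for f in (0.0, 0.5, 0.95, 0.999):
    print(f"f={f:5.3f}: 64 cores -> {amdahl(f, 64):6.2f}x")
```

A purely serial code (f=0) gets nothing from 64 cores; only codes that are almost entirely parallel come anywhere near the full 64x.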
As an aside, the GPU/CPU wrinkle in this is that a parallel processor is exactly where the GPU shows in its best light. This comes about not through any silicon advantage (NVIDIA, AMD, and others do not have silicon integration capabilities greater than Intel’s), but through a significantly different design approach in the GPU vs. the CPU, i.e. in how the silicon transistor budget (the silicon real estate) is allocated. For massively parallel codes, the GPU becomes a more effective choice than current CPU architectures. Since we are being pushed toward parallelism anyway, for those who can adapt more easily to massive parallelism, the GPU becomes an interesting choice.
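A toy model of that design-space tradeoff, with invented core counts and per-item times (these are not real chip specs), just to show why many slow cores can beat a few fast ones on embarrassingly parallel work:

```python
# Latency-optimized design (few fast cores) vs. throughput-optimized
# design (many slower cores), applied to perfectly parallel work.
def time_to_finish(n_items, cores, per_item_time):
    return (n_items / cores) * per_item_time

work = 10_000_000
print("CPU-like:", time_to_finish(work, cores=16,   per_item_time=1.0))
print("GPU-like:", time_to_finish(work, cores=5000, per_item_time=10.0))
```

Even with each core 10x slower per item, the throughput-oriented design finishes this workload roughly 30x sooner, which is the transistor-budget argument in miniature. For serial or weakly parallel codes, of course, the comparison flips.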