As the last several years have shown, scaling up AI systems to train larger models with more parameters across more data is a very expensive proposition, and one that has made Nvidia fabulously rich.
But putting AI into production, whether at hyperscalers or regular enterprises, is quite possibly going to be even more expensive, particularly as we move away from batch systems and up to human-machine interaction with GenAI systems and all the way up to machine-to-machine, or agentic, AI inference.
The biggest bottlenecks in AI systems – compute, memory, and interconnect – are holding back both performance and profitability, and those limits only become more apparent as models grow larger and inference moves toward real-time, agentic use.
Estimates from a simulator built by Ayar Labs suggest that the next generation of the GPT foundation model from OpenAI will include 32 different models with a total of 14 trillion parameters. No expected configuration of future Nvidia iron based on “Rubin” GPU accelerators and improved versions of its existing copper-based NVSwitch interconnects will be able to lower the cost of AI inference for this platform sufficiently while also pushing the interactivity of that inference to speeds suitable for agentic AI.
This is obviously a problem. If GenAI is to take hold, then something has got to give. And that something is very likely going to be the electrical interconnects between AI accelerators, and quite possibly even between those accelerators and their stacked HBM memory.
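To see why the fabric, rather than raw compute, becomes the gating factor at this scale, a quick back-of-envelope calculation helps. The sketch below is not output from the Ayar Labs simulator; the FP8 precision and per-GPU HBM capacity are our own illustrative assumptions, not published specs.

```python
# Back-of-envelope sketch, not the Ayar Labs simulator. The FP8 precision and
# per-GPU HBM capacity below are illustrative assumptions.

PARAMS          = 14e12  # total parameters across the 32 models cited above
BYTES_PER_PARAM = 1      # assume FP8 weights; FP16 would double the footprint
HBM_PER_GPU_GB  = 288    # assumed HBM capacity for a "Rubin"-class GPU

weight_bytes = PARAMS * BYTES_PER_PARAM
weight_tb    = weight_bytes / 1e12
min_gpus     = weight_bytes / (HBM_PER_GPU_GB * 1e9)

print(f"Weights alone: {weight_tb:.0f} TB")                  # ~14 TB
print(f"GPUs needed just to hold the weights: {min_gpus:.0f}")  # ~49, before KV cache
```

Even before the KV cache and activations are counted, the weights alone span dozens of accelerators, so every token generated has to cross the GPU-to-GPU fabric – and it is the bandwidth and latency of that fabric, not the floating point muscle of any one package, that sets the interactivity ceiling.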
But how should AI accelerator architectures evolve to increase the performance of AI clusters while at the same time driving the cost of inference down to levels that make agentic AI economically – and therefore technically – feasible?