Nvidia Debuts Rubin CPX: Specialized GPU for Long-Context AI Processing

Nvidia has unveiled its latest hardware innovation: Rubin CPX, a processor designed for compute-heavy AI inference and massive-context workloads, including emerging use cases such as generative video and large-scale code generation.

Why Rubin CPX Matters

Generative AI services—like ChatGPT, Google Gemini, and Perplexity—operate on a token-based system. Each user prompt consumes tokens, with simple questions using a handful while complex reasoning tasks may require hundreds more. The faster a platform can process tokens, the more queries it can handle—and the greater the revenue potential. Nvidia positions Rubin CPX as a solution to increase both speed and efficiency in this token-driven economy.
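As a back-of-the-envelope illustration of that economics, the Python sketch below converts raw token throughput into query capacity and hourly revenue. Every constant (throughput, tokens per query, price per million tokens) is a hypothetical placeholder, not an Nvidia or vendor figure.

```python
# Illustrative arithmetic only: how token throughput drives serving economics.
# All constants below are hypothetical placeholders, not real vendor figures.

TOKENS_PER_SEC = 50_000    # assumed aggregate throughput of one inference server
TOKENS_PER_QUERY = 2_000   # assumed prompt + response size of a typical query
PRICE_PER_MILLION = 2.00   # assumed price charged per million tokens, in USD

tokens_per_hour = TOKENS_PER_SEC * 3600
queries_per_hour = tokens_per_hour / TOKENS_PER_QUERY
revenue_per_hour = tokens_per_hour / 1_000_000 * PRICE_PER_MILLION

print(f"{queries_per_hour:,.0f} queries/hour")    # 90,000 queries/hour
print(f"${revenue_per_hour:,.2f} revenue/hour")   # $360.00 revenue/hour
```

Doubling throughput doubles both output lines, which is the economic argument Nvidia is making for faster token processing.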

Two Stages of AI Inference

Shar Narasimhan, director of product for Nvidia’s Data Center division, explained that inference is often oversimplified as a single stage, when in fact it consists of two distinct phases:

  • Prefill (context) phase – highly compute-intensive, preparing the model with vast amounts of contextual data.

  • Decode (generation) phase – more memory-driven, responsible for producing output.

Traditional GPUs have handled both phases despite being optimized for only one of them. Rubin CPX has been engineered specifically to supercharge the context phase, where raw compute is the bottleneck.
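To make the distinction concrete, here is a minimal, self-contained sketch (in Python with NumPy, using toy sizes chosen purely for illustration) of why the two phases stress different resources: prefill runs one large, compute-heavy pass over the entire prompt, while decode emits one token at a time and spends most of its time moving weights and the growing KV cache through memory.

```python
import numpy as np

# Toy single-layer model to contrast the two inference phases.
# All sizes are arbitrary illustrative assumptions.
D = 1024                   # hidden dimension
W = np.random.randn(D, D)  # stand-in for the model's weights

def prefill(prompt):
    # Compute-bound: one dense pass over ALL prompt tokens at once.
    # (prompt_len x D) @ (D x D) is a large matmul with high arithmetic intensity.
    return prompt @ W  # the result stands in for the KV cache

def decode(kv_cache, steps):
    # Memory-bound: one token per step. Each step re-reads the weights and
    # the growing cache but performs only a tiny matmul.
    tokens = []
    x = kv_cache[-1:]
    for _ in range(steps):
        x = x @ W                            # small compute per step
        kv_cache = np.vstack([kv_cache, x])  # cache grows with every token
        tokens.append(x)
    return tokens

prompt = np.random.randn(8192, D)  # a long context, in miniature
cache = prefill(prompt)            # one big parallel pass
out = decode(cache, steps=32)      # many small sequential passes
```

Rubin CPX targets the first function: the large, parallel matrix multiplications of the context phase, where adding raw compute pays off directly.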

“This chip will significantly increase the throughput of AI factories,” Narasimhan noted, pointing to its ability to boost the rate at which tokens, the fundamental work units of generative AI, are produced.

Inside Rubin CPX

The Rubin architecture comes in several variants. Standard Rubin chips carry two dies with 25 petaflops each, interconnected via NVLink, and feature 288GB of HBM4 memory.

By contrast, Rubin CPX is a more specialized option:

  • Single die delivering 30 petaflops of compute power.

  • Equipped with 128GB of GDDR7 memory instead of high-bandwidth HBM.

  • Built without NVLink, keeping costs lower while still excelling in targeted use cases.

Nvidia claims that Rubin CPX offers three times faster attention performance than the GB300 NVL72 systems and delivers up to 30 petaflops with NVFP4 precision.
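Laying the stated figures side by side makes the trade-off explicit. The snippet below uses only numbers quoted in this article, with a dataclass serving purely as a container.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    dies: int
    pflops_per_die: int  # NVFP4 precision, per the figures above
    memory: str
    nvlink: bool

    @property
    def total_pflops(self) -> int:
        return self.dies * self.pflops_per_die

rubin = Variant("Rubin", dies=2, pflops_per_die=25, memory="288GB HBM4", nvlink=True)
cpx = Variant("Rubin CPX", dies=1, pflops_per_die=30, memory="128GB GDDR7", nvlink=False)

for v in (rubin, cpx):
    print(f"{v.name}: {v.total_pflops} PF total, {v.memory}, NVLink={v.nvlink}")
# Rubin: 50 PF total, 288GB HBM4, NVLink=True
# Rubin CPX: 30 PF total, 128GB GDDR7, NVLink=False
```

The standard part wins on aggregate compute and memory bandwidth; CPX trades those for a cheaper single die tuned to the compute-bound context phase.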

Scaling and Deployment Options

Rubin CPX can be deployed in a variety of configurations, including the Vera Rubin NVL144 CPX system. It also integrates with Nvidia’s networking portfolio, including:

  • Quantum-X800 InfiniBand fabric for scale-out clusters.

  • Spectrum-X Ethernet platform powered by Spectrum-XGS and ConnectX-9 SuperNICs.

Target Workloads

This configuration makes Rubin CPX particularly well-suited for long-context AI tasks—including workloads that may involve processing millions of tokens, such as hour-long generative video projects. While standard GPUs might take days to complete these jobs, Rubin CPX is optimized to compress those timelines significantly.
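A rough calculation shows how an hour of video lands in the millions-of-tokens range; both constants below are illustrative assumptions, not Nvidia figures.

```python
# Rough, illustrative arithmetic only; both constants are assumptions.
FPS = 24               # assumed frame rate
TOKENS_PER_FRAME = 12  # assumed visual tokens per frame after compression

frames = FPS * 60 * 60              # 86,400 frames in one hour
tokens = frames * TOKENS_PER_FRAME  # 1,036,800 tokens, i.e. over a million
print(f"{tokens:,} tokens for one hour of video")
```

At that scale, the prefill phase dominates the job, which is exactly the portion of the pipeline Rubin CPX is built to accelerate.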