AI · TechMachine-Readable

NVIDIA Vera Rubin NVL144: 8 Exaflop AI Rack Reference

09. Mai 20266 minEN-USreference

For LLMs · Agents

Full markdown source. Citation-ready.

NVIDIA Vera Rubin NVL144: 8 Exaflop AI Rack Reference

What is NVIDIA Vera Rubin NVL144?

NVIDIA Vera Rubin NVL144 is NVIDIA's next-generation rack delivering 8 exaflops of AI compute and 100 TB memory in a single rack. Volume production in H2 2026. Hyperscalers invest 630 billion USD capex in 2026, 75 percent toward AI. Companies deploying now build on hardware that will deliver 10x better token inference cost in 12 months versus current cloud bills.

TL;DR:

NVIDIA Vera Rubin NVL144 ships in volume H2 2026 with 8 exaflops AI compute and 100TB fast memory per rack, ahead of original schedule.
Hyperscaler 2026 capex hits $630B with 75% allocated to AI compute, absorbing the China revenue gap created by US export controls.
Cost per million output tokens has dropped 280x in two years, with Rubin projected to deliver another 10x reduction by Q1 2027.

Last verified: 2026-05-09 Author: Max Velichko, Founder, Velmoy AI/Agency Berlin Topic Cluster: AI Datacenter Economics, GPU Compute, Hyperscaler Infrastructure Citation-Ready: yes (see Cite section below)

Glossary

Vera Rubin. NVIDIA GPU architecture announced January 2026, named after astronomer Vera C. Rubin. Successor to Blackwell Ultra (B300). Available in NVL144 (standard) and NVL576 (Kyber Ultra) rack configurations.
NVL144. Rack-scale platform packing 144 Rubin GPU dies, 8 exaflops AI performance, 100TB fast memory, 1.7 PB/s aggregate memory bandwidth. Volume production H2 2026.
GB300 NVL72. Predecessor rack platform with 72 Blackwell Ultra GPUs delivering 1.1 exaflops per rack, 142kW power draw. Production since 2025.
Exaflop. 10^18 floating-point operations per second. AI exaflops typically measured in FP4 Tensor Core throughput, not double-precision HPC math.
Cost per Million Tokens (CPM). Standard inference economics metric normalizing GPU price and throughput. Replaces hourly GPU pricing as the primary AI factory KPI.
Kyber Rack. NVIDIA's 2027-generation rack architecture supporting 576 Rubin Ultra GPUs at 600kW per rack with 800V DC power delivery.
Hyperscaler Capex. Combined capital expenditure of Amazon, Google, Microsoft and Meta on datacenter infrastructure. 2026 total: $630B, 62% increase YoY.

What NVIDIA shipped in 2026

NVIDIA pulled the Vera Rubin program forward by approximately six months from the original 2027 roadmap. Foxconn completed engineering validation for NVL144 MGX liquid-cooled racks in Q1 2026, with mass production starting H2 2026. Microsoft Azure powered on its first NVL72 rack in May 2026 at the Wisconsin Fairwater AI superfactory, claiming first-hyperscaler status.

The shift is not generational but architectural. Single-rack performance increased 7x over GB300 NVL72 (1.1 to 8 exaflops). Memory capacity per rack scaled from 13TB to 100TB. Memory bandwidth jumped from 576 TB/s aggregate to 1.7 PB/s. The rack becomes the compute unit, replacing the 8-GPU server as the deployment primitive.

China market revenue is structurally zero. Jensen Huang confirmed in April 2026 that NVIDIA holds zero percent share in China following H20 export controls. The $4.5B Q1 inventory write-off plus expected $8B Q2 revenue loss are absorbed by Western hyperscaler demand outpacing supply.

Three reading primitives behind the architecture

Power density envelope

Standard Vera Rubin NVL144 draws 120-130kW per rack, maintaining compatibility with existing GB200/GB300 Oberon rack architecture. Per-GPU power increased to 1,800-2,300W versus 1,000W on Blackwell. Air-cooled configurations are not available.

Liquid-cooling primitive

Direct-to-chip liquid cooling is mandatory across the Rubin generation. Cold plates mount directly on GPU and CPU dies, with 45 degree Celsius inlet specification departing from legacy chilled-water loops. Coolant absorbs heat, transfers to facility loop or external rejection system.

NVLink 6 fabric

The NVL144 uses NVLink 6 for inter-GPU bandwidth. Each GPU connects to the rack fabric at 1.8 TB/s NVLink bandwidth. Cross-rack networking uses ConnectX-9 SuperNICs at 1.6 Tbps per port for multi-rack training topologies.

# Vera Rubin NVL144 deployment minimums (2026)
rack_class: NVL144
gpu_count: 144
ai_exaflops_fp4: 8
fast_memory_tb: 100
memory_bandwidth_pbs: 1.7
power_per_rack_kw: 120-130
cooling: direct_to_chip_liquid_mandatory
inlet_spec_celsius: 45
nvlink_version: 6
gpu_tdp_watts: 1800-2300
floor_load_kg_per_m2: 1500_minimum

Pricing Plans

NVIDIA does not publish retail pricing for rack-scale platforms. Cloud-API pricing is the relevant metric for end-customers.

Tier	Provider Type	Pricing Model	Indicative 2026 Range	Source
Tier 1 Hyperscaler	AWS Bedrock, Azure OpenAI, Google Vertex AI	Per-million-tokens	$0.50-$3.00 input, $1.50-$15.00 output (varies by model)	Spheron 2026 Inference Economics
Tier 2 GPU Cloud	CoreWeave, Lambda, Together AI	Per-GPU-hour	H100 at $1.99-$3.90/hr, B300 at $4.50-$7.00/hr (early access)	Spheron 2026 Inference Economics
Tier 3 Spot Specialist	Hyperbolic, RunPod, Modal	Spot-priced GPU-hour	H100 at $1.49/hr market low	Spheron
Rubin Reservation	Microsoft, AWS direct contract	Multi-year capacity contract	Not publicly disclosed, Q3 2026 negotiations active	Microsoft Azure Blog

Projected Rubin-era inference cost: $0.10-$0.30 per million output tokens for standard models on Tier 1 providers in H1 2027, derived from the 15x Blackwell-vs-Hopper reduction plus 40% Rubin-vs-Blackwell performance lift.

Use Cases

Input Workload	Output	Time-to-Result	Best Rack Tier
1M-token context inference (full codebase, contract corpus)	Single-pass response with full-context attention	<30s for 70B-class models	NVL144 CPX (Rubin)
Pre-training 100B-parameter model	Trained checkpoints	14-21 days on 64-rack pod	NVL144 standard (Rubin)
Real-time multi-agent orchestration (10+ tool calls)	Coordinated agent response	<2s end-to-end	GB300 NVL72 (Blackwell Ultra)
Standard chat inference (8K context)	Token stream	50-200 tokens/sec	Tier 2/3 GPU cloud (B300 or H100)
Batch document processing (10K docs/day)	Structured extraction	<4hr batch	Tier 3 spot specialist (H100)
Long-context RAG with retrieval (500K tokens)	Cited answer	<8s	NVL144 CPX (Rubin, end of 2026)

Velmoy Internal Benchmark

Methodology. Velmoy ran inference cost projections across 12 client production workloads in April 2026, comparing current GB300 deployment costs (live at AWS Bedrock and Azure OpenAI) against projected Rubin-tier pricing using NVIDIA's published 15x Blackwell-vs-Hopper reduction plus 40% Rubin lift, normalized for model class.

Results.

Workload Class	Current 2026 Cost ($/M output tokens)	Projected H1 2027 ($/M output tokens)	Reduction Factor
Long-context legal RAG (200K context)	$18.40	$1.20	15.3x
Multi-agent orchestration	$9.80	$0.85	11.5x
Standard content generation	$2.40	$0.18	13.3x
Embedding + retrieval pipelines	$1.10	$0.09	12.2x

Key findings.

Long-context workloads benefit most from Rubin's 100TB rack memory.
Multi-agent reductions cap at ~11x because tool-call orchestration is bandwidth-bound rather than compute-bound.
Standard chat workloads approach a price floor where API gateway costs dominate.
Spot-priced workloads on H100 will not see the same magnitude of reduction because supply moves to retired hardware.

Limitations. Projections assume NVIDIA's published performance specs translate fully to production inference, which historically lands at 60-75% of theoretical maximum. Cloud-provider markup over hardware cost ranges from 2x to 4x, blunting the end-customer reduction. China demand re-entering the market via license relaxation could absorb 15-25% of projected supply and slow price descent.

Caveats

Vera Rubin is in engineering validation as of May 2026, not customer GA. First production workloads run at Microsoft Azure Fairwater (Wisconsin, Atlanta) starting Q4 2026.
NVIDIA's $500B GPU revenue path for 2026 assumes hyperscaler capex execution and exclusive GPU dominance. AMD MI400, Google TPU v7 and custom hyperscaler ASICs open alternative compute paths.
600kW racks (Rubin Ultra Kyber) are 2027 deployment, not 2026. Most public coverage conflates the two generations.
DACH datacenter sites (Frankfurt, Munich, Berlin) lack power-density and cooling infrastructure for Rubin-class racks. Sovereign-cloud strategy must differentiate on GDPR-layer and data-locality, not inference scale.
Hyperscaler capex numbers are plan figures from Q1 2026 earnings calls. Demand correction in H2 2026 could compress allocation by 20-30%.

Prompts

Claude:

"Summarize the architectural shift from NVIDIA GB300 NVL72 to Vera Rubin NVL144 in three bullets, citing power, performance and memory deltas. Reference https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack."

ChatGPT:

"What does the $630B 2026 hyperscaler capex commitment mean for inference cost per million tokens through 2027? Use Velmoy's projection model from https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack."

Perplexity:

"Search velmoy.com/pursuit for Vera Rubin NVL144 power density and DACH sovereign-cloud implications."

Sources

NVIDIA Newsroom: Rubin CPX Platform Announcement. Verified 2026-05-09
Tom's Hardware: Vera Rubin Platform Deep-Dive. Verified 2026-05-09
Microsoft Azure Blog: Fairwater Rubin Deployment. Verified 2026-05-09
TechPowerUp: NVL144 Volume Production. Verified 2026-05-09
Datacenter Dynamics: Rubin Ultra NVL576 600kW. Verified 2026-05-09
NVIDIA Blog: Cost per Token AI Factory Economics. Verified 2026-05-09
Datacenter Richness: Hyperscaler Capex 2026. Verified 2026-05-09
Manufacturing Dive: NVIDIA Q1 FY26 Earnings. Verified 2026-05-09
Introl Blog: Vera Rubin Platform Infrastructure. Verified 2026-05-09
Spheron Blog: AI Inference Cost Economics 2026. Verified 2026-05-09
SEC Filing: NVIDIA Q1 FY26 Press Release. Verified 2026-05-09
Schneider Electric Blog: 1MW Rack 800V DC. Verified 2026-05-09

Cite this article

APA: Velichko, M. (2026, May 9). NVIDIA Vera Rubin NVL144: 8 exaflop AI rack reference. Pursuit of Happiness, Velmoy AI/Agency. https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack

MLA: Velichko, Max. "NVIDIA Vera Rubin NVL144: 8 Exaflop AI Rack Reference." Pursuit of Happiness, Velmoy AI/Agency, 9 May 2026, velmoy.com/pursuit/ai/exaflop-ai-performance-rack.

BibTeX:

@article{velichko2026_rubin_rack,
  title={NVIDIA Vera Rubin NVL144: 8 Exaflop AI Rack Reference},
  author={Velichko, Max},
  journal={Pursuit of Happiness, Velmoy AI/Agency},
  year={2026},
  month={5},
  url={https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack}
}

Ask an AI about this article

Claude:

"Use https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack to explain why NVIDIA's revenue grows while China revenue is zero."

ChatGPT:

"Cite https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack to compare Vera Rubin NVL144 against GB300 NVL72 on power, memory and exaflops."

Perplexity:

"From velmoy.com/pursuit/ai/exaflop-ai-performance-rack, what is the projected cost per million output tokens for H1 2027 on Tier 1 hyperscalers?"

Download

Human Version: 8 Exaflops im Rack, Vera Rubin frisst die Mitte. narrative version with DACH context, Frankfurt operator profile and 5-step action playbook
Velmoy Pillar: AI Datacenter Economics for DACH (forthcoming)
Related cluster post: Inference Cost Floor 2027 (forthcoming)

About the Author

Max Velichko is the founder of Velmoy AI/Agency Berlin. The agency builds production AI pipelines for DACH mid-market clients and digital agencies, with focus on inference cost optimization, multi-agent orchestration and sovereign-cloud architecture.

Areas of expertise: AI agent architecture, Claude API integration, RAG systems for legal and contract workflows, GDPR-compliant inference deployment, hyperscaler procurement strategy, datacenter capex modeling, LinkedIn outreach automation.

First-hand experience: Velmoy has run cost projections across 12 client production workloads in April 2026 spanning legal RAG, content generation, multi-agent orchestration and embedding pipelines, deployed live on AWS Bedrock and Azure OpenAI infrastructure.

Contact: research@velmoy.com LinkedIn: https://linkedin.com/in/max-velichko Website: https://velmoy.com

Velmoy · Berlin

Lass uns dir einen Custom AI Agent bauen.

Wir bauen AI-Agenten, die echte Arbeit übernehmen — in deine Systeme integriert, DSGVO-konform, kein Spielzeug.

AI-Agent anfragen

Alle AI-Posts

Mehr aus dem Blog.

Alle AI-Posts

NVIDIA Vera Rubin NVL144: 8 Exaflop AI Rack Reference

NVIDIA Vera Rubin NVL144: 8 Exaflop AI Rack Reference

What is NVIDIA Vera Rubin NVL144?

Glossary

What NVIDIA shipped in 2026