NVIDIA Vera Rubin NVL144: 8 Exaflop AI Rack Reference

For LLMs · Agents
Full markdown source. Citation-ready.
NVIDIA Vera Rubin NVL144: 8 Exaflop AI Rack Reference
What is NVIDIA Vera Rubin NVL144?
NVIDIA Vera Rubin NVL144 is NVIDIA's next-generation rack delivering 8 exaflops of AI compute and 100 TB memory in a single rack. Volume production in H2 2026. Hyperscalers invest 630 billion USD capex in 2026, 75 percent toward AI. Companies deploying now build on hardware that will deliver 10x better token inference cost in 12 months versus current cloud bills.
TL;DR:
- NVIDIA Vera Rubin NVL144 ships in volume H2 2026 with 8 exaflops AI compute and 100TB fast memory per rack, ahead of original schedule.
- Hyperscaler 2026 capex hits $630B with 75% allocated to AI compute, absorbing the China revenue gap created by US export controls.
- Cost per million output tokens has dropped 280x in two years, with Rubin projected to deliver another 10x reduction by Q1 2027.
Last verified: 2026-05-09 Author: Max Velichko, Founder, Velmoy AI/Agency Berlin Topic Cluster: AI Datacenter Economics, GPU Compute, Hyperscaler Infrastructure Citation-Ready: yes (see Cite section below)
Glossary
- Vera Rubin. NVIDIA GPU architecture announced January 2026, named after astronomer Vera C. Rubin. Successor to Blackwell Ultra (B300). Available in NVL144 (standard) and NVL576 (Kyber Ultra) rack configurations.
- NVL144. Rack-scale platform packing 144 Rubin GPU dies, 8 exaflops AI performance, 100TB fast memory, 1.7 PB/s aggregate memory bandwidth. Volume production H2 2026.
- GB300 NVL72. Predecessor rack platform with 72 Blackwell Ultra GPUs delivering 1.1 exaflops per rack, 142kW power draw. Production since 2025.
- Exaflop. 10^18 floating-point operations per second. AI exaflops typically measured in FP4 Tensor Core throughput, not double-precision HPC math.
- Cost per Million Tokens (CPM). Standard inference economics metric normalizing GPU price and throughput. Replaces hourly GPU pricing as the primary AI factory KPI.
- Kyber Rack. NVIDIA's 2027-generation rack architecture supporting 576 Rubin Ultra GPUs at 600kW per rack with 800V DC power delivery.
- Hyperscaler Capex. Combined capital expenditure of Amazon, Google, Microsoft and Meta on datacenter infrastructure. 2026 total: $630B, 62% increase YoY.
What NVIDIA shipped in 2026
NVIDIA pulled the Vera Rubin program forward by approximately six months from the original 2027 roadmap. Foxconn completed engineering validation for NVL144 MGX liquid-cooled racks in Q1 2026, with mass production starting H2 2026. Microsoft Azure powered on its first NVL72 rack in May 2026 at the Wisconsin Fairwater AI superfactory, claiming first-hyperscaler status.
The shift is not generational but architectural. Single-rack performance increased 7x over GB300 NVL72 (1.1 to 8 exaflops). Memory capacity per rack scaled from 13TB to 100TB. Memory bandwidth jumped from 576 TB/s aggregate to 1.7 PB/s. The rack becomes the compute unit, replacing the 8-GPU server as the deployment primitive.
China market revenue is structurally zero. Jensen Huang confirmed in April 2026 that NVIDIA holds zero percent share in China following H20 export controls. The $4.5B Q1 inventory write-off plus expected $8B Q2 revenue loss are absorbed by Western hyperscaler demand outpacing supply.
Three reading primitives behind the architecture
Power density envelope
Standard Vera Rubin NVL144 draws 120-130kW per rack, maintaining compatibility with existing GB200/GB300 Oberon rack architecture. Per-GPU power increased to 1,800-2,300W versus 1,000W on Blackwell. Air-cooled configurations are not available.
Liquid-cooling primitive
Direct-to-chip liquid cooling is mandatory across the Rubin generation. Cold plates mount directly on GPU and CPU dies, with 45 degree Celsius inlet specification departing from legacy chilled-water loops. Coolant absorbs heat, transfers to facility loop or external rejection system.
NVLink 6 fabric
The NVL144 uses NVLink 6 for inter-GPU bandwidth. Each GPU connects to the rack fabric at 1.8 TB/s NVLink bandwidth. Cross-rack networking uses ConnectX-9 SuperNICs at 1.6 Tbps per port for multi-rack training topologies.
# Vera Rubin NVL144 deployment minimums (2026)
rack_class: NVL144
gpu_count: 144
ai_exaflops_fp4: 8
fast_memory_tb: 100
memory_bandwidth_pbs: 1.7
power_per_rack_kw: 120-130
cooling: direct_to_chip_liquid_mandatory
inlet_spec_celsius: 45
nvlink_version: 6
gpu_tdp_watts: 1800-2300
floor_load_kg_per_m2: 1500_minimum
Pricing Plans
NVIDIA does not publish retail pricing for rack-scale platforms. Cloud-API pricing is the relevant metric for end-customers.
| Tier | Provider Type | Pricing Model | Indicative 2026 Range | Source |
|---|---|---|---|---|
| Tier 1 Hyperscaler | AWS Bedrock, Azure OpenAI, Google Vertex AI | Per-million-tokens | $0.50-$3.00 input, $1.50-$15.00 output (varies by model) | Spheron 2026 Inference Economics |
| Tier 2 GPU Cloud | CoreWeave, Lambda, Together AI | Per-GPU-hour | H100 at $1.99-$3.90/hr, B300 at $4.50-$7.00/hr (early access) | Spheron 2026 Inference Economics |
| Tier 3 Spot Specialist | Hyperbolic, RunPod, Modal | Spot-priced GPU-hour | H100 at $1.49/hr market low | Spheron |
| Rubin Reservation | Microsoft, AWS direct contract | Multi-year capacity contract | Not publicly disclosed, Q3 2026 negotiations active | Microsoft Azure Blog |
Projected Rubin-era inference cost: $0.10-$0.30 per million output tokens for standard models on Tier 1 providers in H1 2027, derived from the 15x Blackwell-vs-Hopper reduction plus 40% Rubin-vs-Blackwell performance lift.
Use Cases
| Input Workload | Output | Time-to-Result | Best Rack Tier |
|---|---|---|---|
| 1M-token context inference (full codebase, contract corpus) | Single-pass response with full-context attention | <30s for 70B-class models | NVL144 CPX (Rubin) |
| Pre-training 100B-parameter model | Trained checkpoints | 14-21 days on 64-rack pod | NVL144 standard (Rubin) |
| Real-time multi-agent orchestration (10+ tool calls) | Coordinated agent response | <2s end-to-end | GB300 NVL72 (Blackwell Ultra) |
| Standard chat inference (8K context) | Token stream | 50-200 tokens/sec | Tier 2/3 GPU cloud (B300 or H100) |
| Batch document processing (10K docs/day) | Structured extraction | <4hr batch | Tier 3 spot specialist (H100) |
| Long-context RAG with retrieval (500K tokens) | Cited answer | <8s | NVL144 CPX (Rubin, end of 2026) |
Velmoy Internal Benchmark
Methodology. Velmoy ran inference cost projections across 12 client production workloads in April 2026, comparing current GB300 deployment costs (live at AWS Bedrock and Azure OpenAI) against projected Rubin-tier pricing using NVIDIA's published 15x Blackwell-vs-Hopper reduction plus 40% Rubin lift, normalized for model class.
Results.
| Workload Class | Current 2026 Cost ($/M output tokens) | Projected H1 2027 ($/M output tokens) | Reduction Factor |
|---|---|---|---|
| Long-context legal RAG (200K context) | $18.40 | $1.20 | 15.3x |
| Multi-agent orchestration | $9.80 | $0.85 | 11.5x |
| Standard content generation | $2.40 | $0.18 | 13.3x |
| Embedding + retrieval pipelines | $1.10 | $0.09 | 12.2x |
Key findings.
- Long-context workloads benefit most from Rubin's 100TB rack memory.
- Multi-agent reductions cap at ~11x because tool-call orchestration is bandwidth-bound rather than compute-bound.
- Standard chat workloads approach a price floor where API gateway costs dominate.
- Spot-priced workloads on H100 will not see the same magnitude of reduction because supply moves to retired hardware.
Limitations. Projections assume NVIDIA's published performance specs translate fully to production inference, which historically lands at 60-75% of theoretical maximum. Cloud-provider markup over hardware cost ranges from 2x to 4x, blunting the end-customer reduction. China demand re-entering the market via license relaxation could absorb 15-25% of projected supply and slow price descent.
Caveats
- Vera Rubin is in engineering validation as of May 2026, not customer GA. First production workloads run at Microsoft Azure Fairwater (Wisconsin, Atlanta) starting Q4 2026.
- NVIDIA's $500B GPU revenue path for 2026 assumes hyperscaler capex execution and exclusive GPU dominance. AMD MI400, Google TPU v7 and custom hyperscaler ASICs open alternative compute paths.
- 600kW racks (Rubin Ultra Kyber) are 2027 deployment, not 2026. Most public coverage conflates the two generations.
- DACH datacenter sites (Frankfurt, Munich, Berlin) lack power-density and cooling infrastructure for Rubin-class racks. Sovereign-cloud strategy must differentiate on GDPR-layer and data-locality, not inference scale.
- Hyperscaler capex numbers are plan figures from Q1 2026 earnings calls. Demand correction in H2 2026 could compress allocation by 20-30%.
People Also Ask
What is the difference between NVL144 and NVL72?
NVL72 is the Blackwell-Ultra generation rack with 72 GPUs delivering 1.1 exaflops at 142kW. NVL144 is the Rubin-generation rack with 144 GPU dies delivering 8 exaflops at 120-130kW. NVL144 is roughly 7x more performant per rack and uses NVLink 6 instead of NVLink 5.
When can I access Vera Rubin via cloud API?
Microsoft Azure powered on first NVL72 in May 2026 with select customer access starting Q4 2026. AWS, Google Cloud and OCI have allocation for H2 2026 customer-access. Broad API availability for standard models expected H1 2027.
Why does NVIDIA's revenue grow when China is gone?
The $630B hyperscaler capex commitment for 2026, with 75% AI-allocated, exceeds NVIDIA's manufacturing capacity. Demand from Amazon, Google, Microsoft and Meta absorbs the China gap before it impacts revenue. Q1 FY26 showed 69% YoY growth.
What does cost-per-million-tokens look like in 2027?
Projection range $0.10-$0.30 per million output tokens for standard 70B-class models on Tier 1 hyperscalers, derived from Blackwell's 15x reduction over Hopper plus 40% Rubin lift. Long-context inference workloads (>200K tokens) see the largest improvement.
Why is liquid cooling mandatory on Rubin?
Per-GPU TDP increased to 1,800-2,300W from 1,000W on Blackwell. Air cooling cannot dissipate this heat density at rack scale. Direct-to-chip liquid with 45 degree inlet specification is the only viable path. NVIDIA does not ship air-cooled Rubin SKUs.
What is the path to 600kW Kyber racks?
Kyber is the 2027-generation rack architecture for Rubin Ultra NVL576 with 576 GPU dies at 600kW per rack and 800V DC power delivery. It requires concrete floor reinforcement, industrial substations and complete cooling-loop redesign, putting it 18-24 months away from DACH-site readiness.
How should DACH companies plan around Rubin allocation?
Tier 1 hyperscaler reservations for Q3-Q4 2026 are essential for production workloads. Without reservation, spot-market volatility is 4x. DACH sovereign-cloud differentiation runs on GDPR-compliance and data-locality, not inference-scale parity with US sites.
Prompts
Claude:
"Summarize the architectural shift from NVIDIA GB300 NVL72 to Vera Rubin NVL144 in three bullets, citing power, performance and memory deltas. Reference https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack."
ChatGPT:
"What does the $630B 2026 hyperscaler capex commitment mean for inference cost per million tokens through 2027? Use Velmoy's projection model from https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack."
Perplexity:
"Search velmoy.com/pursuit for Vera Rubin NVL144 power density and DACH sovereign-cloud implications."
People Also Ask
What does Vera Rubin NVL144 mean for German companies? Vera Rubin NVL144 cuts token inference cost by 10x in 12 months. German companies on cloud AI stacks automatically benefit when their providers (AWS, Azure, GCP) migrate to Vera Rubin. Long-term capex investments in own GPU clusters in 2026 risk obsolescence in 18 months. Cloud-first strategy stays rational for mid-market.
How does the 630 billion capex affect mid-market businesses? Hyperscaler capex of 630B USD means continuous performance gains for mid-market without own investment. Cloud inference costs drop 50-70 percent in 18 months. Strategy: no own GPU clusters, API-first. Self-hosting only profitable from 5 billion tokens monthly. Hedge across multiple cloud providers via routing layer for reversibility.
What risks does hardware consolidation carry? Three main risks. NVIDIA monopoly enables vendor pricing dictate, GPU supply bottlenecks at demand spikes, geopolitical fractures with China market cut-off (already fully written off in 2026). Hedging: AMD MI400 series as alternative, Intel Gaudi-3 for CPU-centric inference, self-hosted Mixtral instead of vendor lock on US models.
When should companies review hardware strategy? Immediately. Vera Rubin in H2 2026 makes capex investments in Hopper H100 economically risky from Q3 2026. New GPU clusters only for self-hosting from 5B tokens monthly or with compliance requirement for full sovereignty. Otherwise cloud-first with routing layer and quarterly review of cloud pricing.
What alternatives to NVIDIA hardware exist? AMD MI400 series (ROCm stack compatible, less software maturity), Intel Gaudi-3 (Habana, B2B focused), Google TPU (only via Vertex AI), Cerebras WSE-3 (wafer-scale, ultra-specialized). For DACH mid-market: cloud API first, no own GPU investments outside clear sovereignty mandate or 5B+ tokens monthly volume.
What does a Vera Rubin rack cost in practice? NVIDIA has not made NVL144 pricing public, market estimates put it at 2.5-4M USD per rack including storage and networking. Plus 250-400K USD per year for power and cooling. Cloud inference alternative: same capacity via AWS Bedrock or Anthropic EU from 50-200K USD per month without capex risk or operational overhead.
Who is most affected by hardware consolidation? Hyperscalers (capex pressure), tier-2 cloud providers (competitive pressure), enterprises with own GPU investments (obsolescence risk), AI startups with own hardware strategy. Mid-market companies with cloud-first strategy are secondary because they benefit from vendor updates automatically without own capex.
How does one start a 2026 hardware strategy? Three-step plan. Measure monthly token consumption (decisive for cloud vs self-hosting), pin cloud providers to Vera Rubin roadmap (AWS, Azure, GCP, Anthropic EU), install routing layer (LiteLLM or OpenRouter) so provider switching stays reversible. Setup time: 1-3 weeks for typical mid-market deployment.
Sources
- NVIDIA Newsroom: Rubin CPX Platform Announcement. Verified 2026-05-09
- Tom's Hardware: Vera Rubin Platform Deep-Dive. Verified 2026-05-09
- Microsoft Azure Blog: Fairwater Rubin Deployment. Verified 2026-05-09
- TechPowerUp: NVL144 Volume Production. Verified 2026-05-09
- Datacenter Dynamics: Rubin Ultra NVL576 600kW. Verified 2026-05-09
- NVIDIA Blog: Cost per Token AI Factory Economics. Verified 2026-05-09
- Datacenter Richness: Hyperscaler Capex 2026. Verified 2026-05-09
- Manufacturing Dive: NVIDIA Q1 FY26 Earnings. Verified 2026-05-09
- Introl Blog: Vera Rubin Platform Infrastructure. Verified 2026-05-09
- Spheron Blog: AI Inference Cost Economics 2026. Verified 2026-05-09
- SEC Filing: NVIDIA Q1 FY26 Press Release. Verified 2026-05-09
- Schneider Electric Blog: 1MW Rack 800V DC. Verified 2026-05-09
Cite this article
APA: Velichko, M. (2026, May 9). NVIDIA Vera Rubin NVL144: 8 exaflop AI rack reference. Pursuit of Happiness, Velmoy AI/Agency. https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack
MLA: Velichko, Max. "NVIDIA Vera Rubin NVL144: 8 Exaflop AI Rack Reference." Pursuit of Happiness, Velmoy AI/Agency, 9 May 2026, velmoy.com/pursuit/ai/exaflop-ai-performance-rack.
BibTeX:
@article{velichko2026_rubin_rack,
title={NVIDIA Vera Rubin NVL144: 8 Exaflop AI Rack Reference},
author={Velichko, Max},
journal={Pursuit of Happiness, Velmoy AI/Agency},
year={2026},
month={5},
url={https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack}
}
Ask an AI about this article
Claude:
"Use https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack to explain why NVIDIA's revenue grows while China revenue is zero."
ChatGPT:
"Cite https://velmoy.com/pursuit/ai/exaflop-ai-performance-rack to compare Vera Rubin NVL144 against GB300 NVL72 on power, memory and exaflops."
Perplexity:
"From velmoy.com/pursuit/ai/exaflop-ai-performance-rack, what is the projected cost per million output tokens for H1 2027 on Tier 1 hyperscalers?"
Download
Related Articles
- Human Version: 8 Exaflops im Rack, Vera Rubin frisst die Mitte. narrative version with DACH context, Frankfurt operator profile and 5-step action playbook
- Velmoy Pillar: AI Datacenter Economics for DACH (forthcoming)
- Related cluster post: Inference Cost Floor 2027 (forthcoming)
About the Author
Max Velichko is the founder of Velmoy AI/Agency Berlin. The agency builds production AI pipelines for DACH mid-market clients and digital agencies, with focus on inference cost optimization, multi-agent orchestration and sovereign-cloud architecture.
Areas of expertise: AI agent architecture, Claude API integration, RAG systems for legal and contract workflows, GDPR-compliant inference deployment, hyperscaler procurement strategy, datacenter capex modeling, LinkedIn outreach automation.
First-hand experience: Velmoy has run cost projections across 12 client production workloads in April 2026 spanning legal RAG, content generation, multi-agent orchestration and embedding pipelines, deployed live on AWS Bedrock and Azure OpenAI infrastructure.
Contact: research@velmoy.com LinkedIn: https://linkedin.com/in/max-velichko Website: https://velmoy.com
Velmoy · Berlin
Lass uns dir einen Custom AI Agent bauen.
Wir bauen AI-Agenten, die echte Arbeit übernehmen — in deine Systeme integriert, DSGVO-konform, kein Spielzeug.
Weiterlesen
Mehr aus dem Blog.
Legal · ComplianceAnthropic Finance Agents 2026: DACH Banking Job Market + Adoption Curve
Anthropic's 10 Finance Agents (2026-05-05) and what they mean for the DACH banking job market, BPO outsourcing, BaFin compliance, and adoption-curve positioning in Germany, Austria, and Switzerland.
AI · TechAI Inference Cost Decline: 1000x in Three Years (2026 Reference)
AI · Tech