Nvidia Blackwell Ultra B300 GPUs Now Available on AWS, Azure, and GCP

Nvidia's Blackwell Ultra B300 GPUs are now generally available through major cloud providers, delivering up to 2.5x the inference throughput of the B200 on large language model workloads. The rollout through AWS, Azure, and Google Cloud marks a step change in the AI compute density available through the cloud.

Nvidia has announced general availability of the Blackwell Ultra B300 GPU across its major cloud partner ecosystem, including AWS, Azure, and Google Cloud. The B300 succeeds the B200, which had itself been broadly available for less than a year, and targets the rapidly growing inference market rather than training workloads alone. Nvidia claims up to a 2.5x improvement in inference throughput over the B200 for large language model tasks, driven by increased HBM3e memory bandwidth and architectural refinements to the Transformer Engine.

The move to cloud-first GA, rather than on-premises first, reflects where inference demand is actually concentrated. Hyperscalers running frontier model APIs and fine-tuned deployments are the immediate buyers, with enterprise private cloud access expected to follow through OEM partners later this year. The B300 also ships with NVLink 5 interconnects, which matter for multi-GPU inference serving of models that don't fit on a single card. That constraint has quietly become the bottleneck for next-generation model deployments.
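
To make the single-card constraint concrete, here is a rough back-of-the-envelope sketch in Python. It counts model weights only and ignores KV cache and activation memory, so real deployments need more headroom than it suggests. The function names, the 100 GB card size, and the 80% usable-memory fraction are illustrative assumptions, not B300 specifications.

```python
import math


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB.

    1 billion parameters at 1 byte each is roughly 1 GB, so the
    product of the two arguments is already in GB.
    """
    return params_billion * bytes_per_param


def min_gpus_for_model(
    params_billion: float,
    bytes_per_param: float,
    gpu_memory_gb: float,
    usable_fraction: float = 0.8,  # assumption: leave ~20% for runtime overhead
) -> int:
    """Smallest number of cards whose combined usable memory holds the weights."""
    if gpu_memory_gb <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("gpu_memory_gb must be > 0 and usable_fraction in (0, 1]")
    needed = weight_memory_gb(params_billion, bytes_per_param)
    usable_per_gpu = gpu_memory_gb * usable_fraction
    return max(1, math.ceil(needed / usable_per_gpu))


# Hypothetical example: a 70B-parameter model on an assumed 100 GB card.
gpus_16bit = min_gpus_for_model(70, 2, gpu_memory_gb=100)  # 2 bytes per parameter
gpus_8bit = min_gpus_for_model(70, 1, gpu_memory_gb=100)   # 1 byte per parameter
print(gpus_16bit, gpus_8bit)
```

Under these assumed numbers the 16-bit weights spill onto a second card while the 8-bit weights fit on one, which is the point at which interconnect bandwidth between cards starts to matter for serving.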

The 2.5x throughput claim is specifically scoped to LLM inference and carries an implicit asterisk: gains depend heavily on batch size, model architecture, and serving framework. Workloads not optimized for TensorRT-LLM or similar Nvidia-native tooling will see smaller real-world gains. Cloud pricing has not been uniformly published, and each hyperscaler is expected to set its own instance pricing, which makes direct cost-per-token comparisons difficult at launch.
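
Once instance prices are published, the cost-per-token comparison is simple arithmetic: hourly price divided by tokens generated per hour. The sketch below shows the calculation in Python. The prices and throughput figures are placeholders chosen for illustration, not published B200 or B300 numbers, and the function name is hypothetical.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """USD to generate one million tokens at a sustained throughput."""
    if hourly_price_usd < 0 or tokens_per_second <= 0:
        raise ValueError("price must be >= 0 and throughput must be > 0")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Placeholder inputs (assumptions, not real pricing or benchmark data).
baseline = cost_per_million_tokens(hourly_price_usd=10.0, tokens_per_second=1000.0)

# Same workload at 2.5x throughput, on an instance assumed to cost 1.5x per hour.
faster = cost_per_million_tokens(hourly_price_usd=15.0, tokens_per_second=2500.0)

# Cost per token improves only while the price increase stays below the speedup.
savings_fraction = 1 - faster / baseline
print(f"baseline ${baseline:.2f}/M tokens, faster ${faster:.2f}/M tokens, "
      f"savings {savings_fraction:.0%}")
```

The takeaway is that a 2.5x throughput gain translates into cheaper tokens only if the hourly price rises by less than 2.5x, and the realized throughput on a given serving stack, not the headline figure, is what belongs in the denominator.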

For the AI infrastructure market, the cadence matters as much as the specs. Nvidia is now delivering successive GPU generations faster than most enterprise procurement cycles can absorb them, which creates a durable upgrade market in the cloud while making on-prem hardware bets riskier. AMD and the hyperscalers' own custom silicon remain the only credible alternatives, though neither has matched the depth of Nvidia's software ecosystem at scale.

Panel Takes

The Builder

Developer Perspective

“The 2.5x inference throughput number is scoped specifically to LLM workloads on TensorRT-LLM, which means if you're not already in the Nvidia serving stack, you're not getting 2.5x — you're getting whatever your existing framework negotiates. That's not a knock, but it's the DX bet: the performance is real if you buy into the toolchain, and neutral-to-marginal if you don't. The NVLink 5 interconnect for multi-GPU inference is actually the more interesting primitive here — serving 70B+ models without hitting single-card memory walls has been a real ops headache, and that's a problem worth solving.”

The Skeptic

Reality Check

“The 2.5x claim comes from Nvidia's own benchmarks on Nvidia's own stack against Nvidia's previous hardware — that's not a methodology, that's a press release. Real-world gains for teams not running TensorRT-LLM in a controlled batch-inference setup will be smaller, possibly much smaller, and we won't have independent numbers until the hyperscalers publish their own cost-per-token data. What actually kills the story in 12 months isn't AMD — it's that AWS, Azure, and Google are all shipping custom silicon that runs their own frontier models cheaper than any third-party GPU can, and Nvidia's moat shrinks every quarter those TPU and Trainium deployments scale.”

The Futurist

Big Picture

“The thesis Nvidia is betting on is falsifiable: inference compute demand will grow faster than efficiency gains from model compression and quantization, keeping GPU utilization high enough to justify successive hardware generations at cloud scale. The second-order effect that gets underreported is what this does to the economics of fine-tuned model serving — if inference gets 2.5x cheaper per token, the cost floor for running a domain-specific model drops enough that enterprises who punted on private deployments will revisit. The dependency that has to not happen for this bet to pay off is a step-change in speculative decoding or state-space model efficiency that makes transformer-class compute less central — that's not imminent, but it's the actual risk.”

The Founder

Business & Market

“The cloud-first GA is the right call commercially: hyperscalers move inventory and give Nvidia revenue certainty without the channel complexity of enterprise hardware sales cycles. But the real strategic question is what happens to Nvidia's pricing power when AWS Trainium, Google TPUs, and Microsoft's Maia reach cost parity on the specific workloads, primarily inference, that the B300 targets. Nvidia's moat has always been CUDA and the software stack, not the silicon, and that moat holds as long as the developer ecosystem stays captive; the moment a credible TensorRT alternative reaches production parity, the hardware margin conversation changes entirely.”
