DeepSeek Open-Sources DeepGEMM — Its H100 GEMM Library Behind the Cheapest Frontier Models

DeepSeek has published DeepGEMM, the FP8 GEMM kernel library that powers its training infrastructure. It reaches up to 1,550 TFLOPS on H800 and ships with a JIT compiler that removes the need to compile CUDA kernels at install time, which makes it an unusually practical open-source ML infrastructure release.

DeepSeek has released DeepGEMM, the low-level GPU kernel library that underpins its training and inference infrastructure — the same stack that produced some of 2026's most cost-efficient frontier models. The library provides highly optimized FP8 General Matrix Multiplication (GEMM) kernels for NVIDIA SM90 and SM100 GPUs (H100/H800 and Blackwell class), posting up to 1,550 TFLOPS on H800 hardware.
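To put the 1,550 TFLOPS figure in context, here is a minimal sketch of how GEMM throughput is usually computed: a multiply of an (m x k) matrix by a (k x n) matrix is counted as 2·m·n·k floating-point operations, divided by the elapsed time. The function names and the 4096-cubed problem size are illustrative assumptions, not part of DeepGEMM or its benchmark setup.

```python
def gemm_flops(m, n, k):
    """FLOPs for C = A @ B with A (m x k) and B (k x n).

    Each of the m*n outputs needs k multiplies and k adds,
    so the conventional count is 2 * m * n * k.
    """
    return 2 * m * n * k


def achieved_tflops(m, n, k, seconds):
    """Throughput in TFLOPS for one GEMM that took `seconds` to run."""
    return gemm_flops(m, n, k) / seconds / 1e12


def seconds_at_rate(m, n, k, tflops):
    """Time one GEMM would take if it sustained `tflops`."""
    return gemm_flops(m, n, k) / (tflops * 1e12)


# Hypothetical problem size, chosen only for illustration.
m = n = k = 4096
t = seconds_at_rate(m, n, k, 1550.0)
print(f"{gemm_flops(m, n, k):,} FLOPs in {t * 1e6:.1f} microseconds")
```

At the quoted peak, a single 4096-cubed multiply would finish in under 100 microseconds. Real workloads rarely sustain the peak across all shapes, so measuring with your own matrix sizes is the safer comparison.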

The engineering standout is a lightweight just-in-time compiler that eliminates the need to compile CUDA kernels at install time. For teams that have battled CUDA version mismatches, toolkit dependency hell, and slow build pipelines, this is a meaningful operational improvement. Beyond FP8 and FP4 dense GEMMs, the library includes grouped Mixture-of-Experts GEMMs that overlap computation with NVLink communication, as well as multi-query attention scoring kernels. Together these cover the critical paths for modern LLM training and inference.
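For readers unfamiliar with grouped GEMMs, the sketch below shows what such an operation computes in a Mixture-of-Experts layer: token rows are sorted by expert, and each contiguous group is multiplied by that expert's own weight matrix. This is a plain-Python reference of the math only. The function names and the contiguous-group layout are assumptions made for illustration; it is not DeepGEMM's API and reflects none of its FP8 or GPU optimizations.

```python
def matmul(a, b):
    """Reference matrix multiply on lists of rows: (m x k) @ (k x n)."""
    k = len(b)
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def grouped_gemm(tokens, group_sizes, expert_weights):
    """Multiply each contiguous group of token rows by its expert's weights.

    tokens:         list of rows, pre-sorted so each expert's rows are adjacent
                    (an assumed layout for this sketch)
    group_sizes:    number of rows routed to each expert, in expert order
    expert_weights: one (k x n) weight matrix per expert
    """
    if len(group_sizes) != len(expert_weights):
        raise ValueError("need one group size per expert")
    if sum(group_sizes) != len(tokens):
        raise ValueError("group sizes must cover every token row")

    out = []
    start = 0
    for size, weights in zip(group_sizes, expert_weights):
        # An expert that received no tokens contributes no output rows.
        out.extend(matmul(tokens[start:start + size], weights))
        start += size
    return out


tokens = [[1, 2], [3, 4], [5, 6]]
experts = [
    [[1, 0], [0, 1]],  # expert 0: identity
    [[2, 0], [0, 2]],  # expert 1: doubles every value
]
# First row goes to expert 0, the remaining two rows to expert 1.
print(grouped_gemm(tokens, [1, 2], experts))
```

A fused grouped kernel performs all of these per-expert multiplies in one launch instead of looping, which matters when group sizes are small and uneven.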

DeepSeek's pattern of open-sourcing its internal infrastructure has become one of the most consequential forcing functions in the AI industry. Each release raises the efficiency floor for the entire field. DeepGEMM is particularly noteworthy because it directly addresses GPU compute — the most expensive line item for any organization running LLM workloads. Teams using vLLM or SGLang can integrate DeepGEMM today for potential throughput gains on H100/H800 clusters.

The MIT license means commercial integration carries no royalty obligations. With GPU compute costs still elevated after the Blackwell supply crunch drove prices up 48% in Q1 2026, any tool that extracts more throughput from existing hardware has immediate financial relevance. The release is well timed.

Panel Takes

The Builder

Developer Perspective

“The JIT compiler is the killer feature — no more CUDA compilation at install time, no more version hell. If you're running inference on H100s, benchmark this against your current kernel library before the week is out. The MoE GEMM with NVLink overlap is particularly interesting for anyone running sparse models.”

The Skeptic

Reality Check

“This is excellent infrastructure for a narrow slice of the industry — teams with H100/H800 clusters and the engineering bandwidth to integrate low-level GPU libraries. Most ML teams run on managed infrastructure where they never touch kernel libraries. Real impact is confined to a few dozen hyperscale operators.”

The Futurist

Big Picture

“DeepSeek's open-sourcing philosophy is quietly one of the most powerful forces in AI democratization. Every kernel library, every architectural insight, every training recipe they publish accelerates the entire field. DeepGEMM means every organization's H100 fleet just got a bit more powerful without spending another dollar.”
