GOOGLE PUB_DATE: 2026.03.25

GOOGLE DONATES LLM-D LLM INFERENCE GATEWAY TO CNCF SANDBOX

Google open-sourced llm-d, a Kubernetes-native LLM inference gateway, into the CNCF Sandbox with backing from IBM, Red Hat, NVIDIA, and Anyscale.

llm-d isn’t a model or training stack; it’s the routing and scheduling layer for running LLMs on Kubernetes, with features like intelligent request routing and KV cache reuse to tame bursty, GPU-heavy traffic (WebProNews). The aim is to standardize the “plumbing” of LLM inference so teams don’t keep rebuilding the same gateway layer.

The donation lands llm-d in the CNCF as a vendor-neutral project and arrives alongside IBM and Red Hat’s push for a Kubernetes “blueprint” for LLM inference deployments (The New Stack). For platform teams, this points to a common control plane for multi-model, multi-cluster LLM serving on K8s.
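The cache-aware routing idea above can be sketched in a few lines. This is a minimal illustration of the pattern, not llm-d's actual scheduler: the pod fields (`prefixes`, `load`) and the scoring rule are assumptions made for the example.

```python
def score_pod(pod_cached_prefixes, prompt_tokens):
    """Return the longest cached-prefix overlap (in tokens) for one pod."""
    best = 0
    for prefix in pod_cached_prefixes:
        n = 0
        for a, b in zip(prefix, prompt_tokens):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

def route(pods, prompt_tokens):
    """Pick the pod with the most reusable KV cache; break ties on lower load."""
    return max(pods, key=lambda p: (score_pod(p["prefixes"], prompt_tokens), -p["load"]))
```

The point of the sketch: a request whose prompt shares a long prefix with a pod's cached KV state skips recomputing that prefix, which is where the latency and GPU savings come from under bursty, repetitive traffic.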

[ WHY_IT_MATTERS ]
01.

Inference is now the dominant AI cost center; a common, open gateway could lower latency and GPU burn via smarter routing and cache reuse.

02.

CNCF stewardship increases the odds of broad ecosystem support and less vendor lock-in across model backends and accelerator fleets.

[ WHAT_TO_TEST ]
  • 01.

    Deploy llm-d in a staging Kubernetes cluster and benchmark P50/P95/P99 latency and GPU utilization with and without KV cache reuse under bursty load.

  • 02.

    Route traffic across multiple model backends and compare tail latency, throughput, and failover behavior against your current gateway or direct-to-runtime approach.
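A minimal harness for the latency test above might look like the following. `send_request` is a placeholder for whatever client call you are benchmarking (an HTTP POST to the gateway, for example); the nearest-rank percentile and the loop structure are the only substance here.

```python
import time

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def benchmark(send_request, n=200):
    """Fire n requests sequentially and report P50/P95/P99 latency in ms."""
    latencies = []
    for _ in range(n):
        t0 = time.perf_counter()
        send_request()
        latencies.append((time.perf_counter() - t0) * 1000)
    return {p: percentile(latencies, p) for p in (50, 95, 99)}
```

For the bursty-load part of the test, you would run several of these loops concurrently and compare the reported percentiles with KV cache reuse toggled on and off.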

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Front existing K8s-hosted inference services with llm-d; map auth, quotas, tracing, and metrics to your current stack.

  • 02.

    Plan a phased cutover by shadowing production traffic through llm-d and validating cache hit rates, autoscaling, and SLOs.
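The shadowing step above is usually done at the mesh or gateway layer (traffic mirroring), but the pattern itself is simple enough to sketch: serve every request from the current primary while asynchronously mirroring a copy to the candidate gateway and recording what comes back. The function names here are illustrative, not part of any llm-d API.

```python
import threading

def shadow(primary, candidate, request, results):
    """Serve from the primary; mirror the request to the candidate gateway.

    `results` collects the candidate's responses (or errors) for offline
    comparison of cache hit rates, latency, and correctness.
    """
    def mirror():
        try:
            results.append(("shadow", candidate(request)))
        except Exception as e:
            results.append(("shadow_error", str(e)))

    t = threading.Thread(target=mirror, daemon=True)
    t.start()
    response = primary(request)   # the client only ever sees this response
    t.join()  # joined here for determinism; production mirroring would not block
    return response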

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Adopt llm-d as the default inference entrypoint to standardize routing, caching, and observability from day one.

  • 02.

    Design for multi-tenant clusters and heterogeneous GPUs, letting llm-d handle placement while you keep business logic thin.
