LLM-INFERENCE
30 days · UTC
Google’s TurboQuant targets 6x smaller KV caches and faster LLM serving without quality loss
Google Research unveiled TurboQuant, a KV‑cache compression method claiming up to 6x lower memory and up to 8x speed gains without hurting output quality...
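TurboQuant's algorithm isn't detailed above, but the general idea of KV-cache quantization can be sketched as follows: store cached keys/values in a low-precision integer format with a per-channel scale instead of float32, trading a small reconstruction error for a large memory cut. Everything below (function names, int8 choice, per-channel scaling) is an illustrative assumption, not TurboQuant itself.

```python
import numpy as np

# Illustrative KV-cache quantization (NOT TurboQuant's actual algorithm):
# symmetric per-channel int8 storage with a float scale, ~4x smaller than float32.

def quantize_kv(kv: np.ndarray):
    """Quantize a [tokens, channels] float32 cache slice to int8 + per-channel scale."""
    scale = np.abs(kv).max(axis=0) / 127.0        # one scale per channel
    scale = np.where(scale == 0, 1.0, scale)      # avoid divide-by-zero on empty channels
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Restore an approximate float32 cache from the int8 representation."""
    return q.astype(np.float32) * scale

kv = np.random.randn(1024, 128).astype(np.float32)   # toy KV slice
q, scale = quantize_kv(kv)
restored = dequantize_kv(q, scale)

print(kv.nbytes // q.nbytes)                # 4x smaller storage (float32 -> int8)
print(float(np.max(np.abs(kv - restored)))) # worst-case per-value error stays small
```

Real systems layer further tricks (group-wise scales, sub-4-bit codes, outlier handling) on top of this basic recipe to reach ratios like the claimed 6x.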
Google donates llm-d LLM inference gateway to CNCF Sandbox
Google open-sourced llm-d, a Kubernetes-native LLM inference gateway, into the CNCF Sandbox with backing from IBM, Red Hat, NVIDIA, and Anyscale. llm...
The practical playbook for faster, cheaper LLM inference: vLLM, KV caches, and decoding tricks
A hands-on deep dive shows how to speed up and scale LLM inference with vLLM, KV caching, and modern attention/decoding optimizations. This new chapter...
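The core memory trick behind vLLM is paged attention: the KV cache lives in a pool of fixed-size blocks, and each sequence maps token positions to blocks through a block table, so memory is allocated on demand and reclaimed exactly. A toy sketch of that bookkeeping (not vLLM's actual code; all names here are illustrative):

```python
# Toy sketch of the paged KV-cache bookkeeping behind vLLM (illustrative only).
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # pool of free physical block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a (block, offset) slot for a sequence's next token."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:               # current block full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[-1], n % BLOCK_SIZE      # physical slot for this token's K/V

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token(seq_id=0) for _ in range(20)]
print(len({b for b, _ in slots}))   # 20 tokens fit in 2 blocks of 16
cache.release(0)
print(len(cache.free))              # all 8 blocks free again
```

Because blocks are uniform and indirected through a table, sequences of wildly different lengths share one pool without fragmentation, which is what lets vLLM batch aggressively.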
Efficiency wave: GPT-5.4 mini lands in ChatGPT, and NVIDIA/Hugging Face ship a real-world speculative-decoding benchmark
OpenAI is pushing smaller, faster LLMs in ChatGPT while NVIDIA and Hugging Face release a benchmark to measure real speedups from speculative decoding...
Faster, cheaper LLM serving: prompt caching and P-EAGLE in vLLM
Two practical levers promise big LLM serving gains: prompt caching and a reported P‑EAGLE integration in vLLM for speculative decoding. A clear explanation...
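Prompt caching works because many requests share a long common prefix (system prompt, few-shot examples): if the attention KV state for that prefix is cached, only the suffix needs fresh compute. A minimal sketch of the lookup logic, with a string standing in for the cached KV tensors (class and method names are assumptions for illustration):

```python
# Minimal sketch of prompt (prefix) caching. The cached "state" is a stand-in
# string; real systems cache the attention key/value tensors for the prefix.

class PrefixCache:
    def __init__(self):
        self.store = {}   # tuple of tokens -> cached state

    def put(self, tokens, state):
        self.store[tuple(tokens)] = state

    def longest_prefix(self, tokens):
        """Return (matched_len, state) for the longest cached prefix of tokens."""
        for k in range(len(tokens), 0, -1):
            state = self.store.get(tuple(tokens[:k]))
            if state is not None:
                return k, state
        return 0, None

cache = PrefixCache()
system = ["you", "are", "helpful"]
cache.put(system, state="kv-for-system-prompt")

prompt = system + ["summarize", "this"]
hit, state = cache.longest_prefix(prompt)
print(hit)                   # 3 prefix tokens served from cache
print(len(prompt) - hit)     # only 2 suffix tokens need fresh prefill
```

Production implementations match at block granularity and evict under memory pressure, but the win is the same: prefill cost scales with the uncached suffix, not the whole prompt.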
Speculative decoding: 3x faster LLM serving with a draft-and-verify path
Speculative decoding runs a small draft model to propose tokens and uses the main model to verify them, keeping outputs identical to baseline while cutting...