vLLM
30 days · UTC
Agentic coding grows up: open‑weights MiniMax M2.7 meets Grok’s tool‑calling workflows
Open-weights MiniMax M2.7 and xAI’s tool-calling Grok push agentic coding from demos to production workflows. NVIDIA detailed the open-weights releas...
LLMOps Part 14: Practical LLM Serving and vLLM in Production
A new LLMOps chapter explains how to serve models in production and walks through practical trade-offs, including vLLM-based deployments. Part 14 of ...
The practical playbook for faster, cheaper LLM inference: vLLM, KV caches, and decoding tricks
A hands-on deep dive shows how to speed up and scale LLM inference with vLLM, KV caching, and modern attention/decoding optimizations. This new chapt...
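The core idea behind the KV-cache optimization mentioned above can be shown with a toy sketch (illustrative only, not vLLM internals): during decoding, attention needs keys/values for every previous token, so caching them turns repeated recomputation into a single append per new token. All function names here are hypothetical stand-ins.

```python
# Toy illustration of why a KV cache speeds up autoregressive decoding.
# Without a cache, each step recomputes K/V for the whole sequence
# (quadratic total work); with a cache, each step computes only the
# newest token's K/V and appends it (linear total work).

def kv_for_token(tok):
    # Hypothetical stand-in for the per-token key/value projection.
    return (tok * 2, tok * 3)

def decode_no_cache(prompt, steps):
    seq, work = list(prompt), 0
    for _ in range(steps):
        kvs = [kv_for_token(t) for t in seq]    # recomputed every step
        work += len(kvs)
        seq.append(sum(k for k, _ in kvs) % 7)  # toy "next token" rule
    return seq, work

def decode_with_cache(prompt, steps):
    seq, cache, work = list(prompt), [], 0
    for t in seq:                                # one-time prefill
        cache.append(kv_for_token(t))
        work += 1
    for _ in range(steps):
        nxt = sum(k for k, _ in cache) % 7       # same toy rule
        seq.append(nxt)
        cache.append(kv_for_token(nxt))          # only the new token
        work += 1
    return seq, work
```

Both variants emit identical tokens; only the amount of K/V computation differs, which is the gap real systems like vLLM exploit.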
SWE-CI shifts agent evaluation from one-shot bug fixes to CI-driven maintainability
A new CI-loop benchmark, SWE-CI, measures whether AI coding agents can maintain real repositories over time, not just pass one-off tests. [SWE-CI](ht...
Faster, cheaper LLM serving: prompt caching and P-EAGLE in vLLM
Two practical levers promise big LLM serving gains: prompt caching and a reported P‑EAGLE integration in vLLM for speculative decoding. A clear expla...
GLM 4.7 claims stronger coding agents and tool use
A recent video reports the release of GLM 4.7, an open-source LLM from China, claiming improved reliability for coding agents and tool use. Independen...