VLLM PUB_DATE: 2026.07.05

LOCAL LLM SERVING ON 24GB GPUS: VLLM SCALES, LLAMA.CPP/OLLAMA SURVIVE SPILLS

A new benchmark shows vLLM delivering far higher throughput on a 24GB GPU but failing with out-of-memory (OOM) errors once a model no longer fits in VRAM, while llama.cpp and Ollama spill to system RAM and keep generating, slowly.

In a home-lab test on an RTX 3090, vLLM scaled aggregate throughput 3.9x–5.4x as concurrency rose from 1 to 8 requests with paged attention, beating llama.cpp by 2.9x–3.7x at 8 concurrent requests (c8) when models fit in 24GB. But when models exceeded VRAM, vLLM consistently OOMed, while llama.cpp and Ollama degraded to single-digit tok/s and still produced tokens.
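To put the 3.9x–5.4x figure in context, it helps to compare it with perfectly linear scaling, where 8 concurrent requests would give 8x the single-request throughput. The short sketch below does that arithmetic; the function name `scaling_efficiency` is only illustrative and is not part of the benchmark.

```python
def scaling_efficiency(speedup, concurrency):
    """Fraction of ideal linear scaling achieved at a given concurrency."""
    return speedup / concurrency

# Speedups reported for vLLM going from 1 to 8 concurrent requests.
low = scaling_efficiency(3.9, 8)
high = scaling_efficiency(5.4, 8)

print(f"low end:  {low:.4f} of linear")
print(f"high end: {high:.4f} of linear")
```

By this measure the reported runs reach roughly half to two-thirds of linear scaling at c8.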

During RAM spill, llama.cpp’s manual layer offload beat Ollama’s automatic split by 37x on time-to-first-token (TTFT), with similar steady-state decode speed. Two related videos point to growing local viability: in one, Poolside’s open-weight coding model runs on a MacBook, and in the other, budget models rival pricier ones on some tasks. The choice of serving stack therefore directly affects capacity planning and SLOs.

[ WHY_IT_MATTERS ]

01.

If your model fits in VRAM, vLLM delivers far higher concurrency and throughput than llama.cpp/Ollama.

02.

If it doesn’t, vLLM fails hard while llama.cpp/Ollama degrade gracefully—this alters reliability plans and SLO math.

[ WHAT_TO_TEST ]

Run your target model and context length on vLLM vs llama.cpp/Ollama at concurrency 1, 4, and 8; measure TTFT, P50/P95 latency, and tok/s.
Force a spill (a larger model or a larger KV cache) and observe the failure modes; verify that autoscaling, backpressure, and retries behave as intended.
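The measurements listed above can be reduced from per-request timestamps. The following is a minimal sketch, assuming each request is logged as a dict with hypothetical keys `start`, `first_token`, `end` (seconds) and `tokens` (generated token count); it does not talk to any server and uses a simple nearest-rank percentile.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(records):
    """Reduce per-request timing records to TTFT, latency and throughput.

    Each record is a dict with hypothetical keys:
    start, first_token, end (seconds) and tokens (int).
    """
    ttft = [r["first_token"] - r["start"] for r in records]
    latency = [r["end"] - r["start"] for r in records]
    # Wall-clock window covering the whole batch of concurrent requests.
    wall = max(r["end"] for r in records) - min(r["start"] for r in records)
    total_tokens = sum(r["tokens"] for r in records)
    return {
        "ttft_p50": percentile(ttft, 50),
        "ttft_p95": percentile(ttft, 95),
        "latency_p50": percentile(latency, 50),
        "latency_p95": percentile(latency, 95),
        "aggregate_tok_s": total_tokens / wall,
    }
```

Running `summarize` once per concurrency level (1, 4, 8) and per backend gives comparable rows. Aggregate tok/s is computed over the wall-clock window rather than per request, which is the quantity that matters for capacity planning.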

[ BROWNFIELD_PERSPECTIVE ]

01.
If you have mixed GPU memory footprints, prefer llama.cpp/Ollama for spill-tolerant tiers and vLLM for on-VRAM hot paths.
02.
Add guards: memory headroom checks, circuit breakers, and a fallback route when vLLM approaches OOM.
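One way to picture the guards above is a small wrapper around the vLLM call. This is a sketch under stated assumptions, not a production pattern: `primary` and `fallback` are hypothetical callables standing in for the vLLM and llama.cpp/Ollama clients, the memory numbers are supplied by the caller, and Python's built-in `MemoryError` stands in for whatever OOM signal a real deployment sees (often a failed HTTP call or a dead process).

```python
import time

class CircuitBreaker:
    """Opens after max_failures consecutive failures, closes after cooldown_s."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: reset and let the next request try the primary.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0

def has_headroom(used_gb, total_gb, min_free_frac=0.10):
    """True if at least min_free_frac of GPU memory is still free."""
    return (total_gb - used_gb) / total_gb >= min_free_frac

def serve(prompt, primary, fallback, breaker, used_gb, total_gb):
    """Route to primary unless the breaker is open or headroom is too low."""
    if breaker.is_open() or not has_headroom(used_gb, total_gb):
        return fallback(prompt)
    try:
        result = primary(prompt)
    except MemoryError:  # stand-in for the real OOM signal (assumption)
        breaker.record_failure()
        return fallback(prompt)
    breaker.record_success()
    return result
```

The 10% headroom threshold and the 30-second cooldown are placeholder values; they would need tuning against the failure modes observed in the spill test.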

[ GREENFIELD_PERSPECTIVE ]

01.
If models fit on your cards, standardize on vLLM for throughput and queue efficiency.
02.
Design a two-lane serving plan: vLLM for on-GPU workloads, llama.cpp/Ollama as a spill-tolerant lane for oversized jobs.
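The two-lane plan needs some rule for deciding which lane a job belongs in. A rough sketch, assuming a standard transformer where the KV cache stores one key and one value vector per layer per token, and assuming a hypothetical fixed reserve for activations and runtime overhead; the function names and the 2 GiB reserve are illustrative and do not come from the benchmark.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens,
                 concurrency, bytes_per_elem=2):
    """Estimate KV cache size in GiB for `concurrency` full-length sequences.

    The leading factor of 2 accounts for keys and values;
    bytes_per_elem=2 assumes a 16-bit cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
                   * context_tokens * concurrency)
    return total_bytes / 2**30

def choose_lane(weights_gib, kv_gib, vram_gib=24.0, reserve_gib=2.0):
    """Return 'vllm' if the estimated footprint fits on the card, else 'spill'."""
    needed = weights_gib + kv_gib + reserve_gib
    return "vllm" if needed <= vram_gib else "spill"

# Example: a hypothetical 32-layer model with 8 KV heads of dimension 128,
# 8192-token contexts, 8 concurrent sequences.
kv = kv_cache_gib(32, 8, 128, 8192, 8)
print(kv, choose_lane(8.0, kv), choose_lane(20.0, kv))
```

The estimate is deliberately conservative because it assumes every sequence uses its full context. A job routed to the "spill" lane would go to llama.cpp or Ollama, where generation is slower but does not fail outright.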
