VLLM PUB_DATE: 2026.07.05

LOCAL LLM SERVING ON 24GB GPUS: VLLM SCALES, LLAMA.CPP/OLLAMA SURVIVE SPILLS

A new benchmark shows vLLM crushes throughput on a 24GB GPU but hard-OOMs once models outgrow VRAM, while llama.cpp and Ollama spill to RAM and keep generating slowly.

In a home-lab test on an RTX 3090, vLLM scaled aggregate throughput 3.9x–5.4x going from 1 to 8 concurrent requests with paged attention, and beat llama.cpp by 2.9x–3.7x at 8 concurrent requests when models fit in 24GB. When models exceeded VRAM, vLLM consistently failed with out-of-memory errors, while llama.cpp and Ollama degraded to single-digit tok/s but still produced tokens.
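For capacity planning it helps to turn the aggregate speedup into a per-request figure. The sketch below is plain arithmetic on the reported range, assuming throughput is shared evenly across the concurrent requests (an assumption, not something the benchmark states); `per_request_share` is a hypothetical helper name.

```python
def per_request_share(aggregate_speedup: float, concurrency: int) -> float:
    """Fraction of single-stream speed each request keeps at this concurrency.

    Assumes aggregate throughput is split evenly across requests.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be >= 1")
    return aggregate_speedup / concurrency


# Reported aggregate scaling range for 1 -> 8 concurrent requests.
for speedup in (3.9, 5.4):
    share = per_request_share(speedup, 8)
    print(f"{speedup}x aggregate at 8 concurrent: {share:.0%} of single-stream speed per request")
```

Under that assumption each of the 8 requests runs at roughly half to two-thirds of single-stream speed, which is the number to check against a per-user latency SLO.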

During RAM spill, llama.cpp’s manual layer offload beat Ollama’s automatic split by 37x on time-to-first-token, with similar steady-state decode speed. Two related videos point to growing local viability: one shows Poolside’s open-weight coding model running on a MacBook, and another argues that budget models can rival pricey ones on some tasks. The serving stack choice now directly affects capacity planning and SLOs.

[ WHY_IT_MATTERS ]
01.

If your model fits in VRAM, vLLM delivers far higher concurrency and throughput than llama.cpp/Ollama.

02.

If it doesn’t, vLLM fails hard while llama.cpp/Ollama degrade gracefully, which changes reliability plans and SLO math.

[ WHAT_TO_TEST ]
  • Run your target model/context on vLLM vs llama.cpp/Ollama at 1, 4, and 8 concurrency; measure TTFT, P50/P95 latency, and tok/s.

  • Force a spill (bigger model or KV cache) and observe failure modes; verify autoscaling, backpressure, and retries.
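The measurements listed above can be reduced to comparable numbers with a small summarizer. This is a minimal sketch, not the benchmark's own harness: `RequestSample`, `percentile`, and `summarize` are hypothetical names, timestamps are assumed to come from your own load generator, and percentiles use the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    start_s: float        # when the request was sent
    first_token_s: float  # when the first token arrived
    end_s: float          # when the last token arrived
    output_tokens: int    # tokens generated for this request


def percentile(values, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Reduce one concurrency level to TTFT, latency percentiles, and tok/s."""
    ttfts = [s.first_token_s - s.start_s for s in samples]
    latencies = [s.end_s - s.start_s for s in samples]
    # Wall-clock span of the whole batch, used for aggregate throughput.
    wall = max(s.end_s for s in samples) - min(s.start_s for s in samples)
    total_tokens = sum(s.output_tokens for s in samples)
    return {
        "ttft_p50_s": percentile(ttfts, 50),
        "ttft_p95_s": percentile(ttfts, 95),
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        "aggregate_tok_s": total_tokens / wall if wall > 0 else 0.0,
    }
```

Run it once per server and per concurrency level (1, 4, 8) so the resulting dictionaries can be compared side by side. During the forced-spill run, record failed requests separately: a server that errors out produces no latency sample at all, which would otherwise flatter its percentiles.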

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    If you have mixed GPU memory footprints, prefer llama.cpp/Ollama for spill-tolerant tiers and vLLM for on-VRAM hot paths.

  • 02.

    Add guards: memory headroom checks, circuit breakers, and a fallback route when vLLM approaches OOM.
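One way the guards above could fit together is sketched below, under stated assumptions: `primary` and `fallback` are hypothetical callables wrapping your vLLM and llama.cpp/Ollama clients, and `read_gpu_memory` is a hypothetical hook returning `(free_bytes, total_bytes)` from whatever signal you trust (driver-reported free memory, or the server's own cache-usage metrics). The 10% headroom threshold is illustrative only.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; retries after `cooldown_s`."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let a trial request through once the cooldown has passed.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def has_headroom(free_bytes, total_bytes, min_free_fraction=0.10):
    """True when at least `min_free_fraction` of GPU memory is still free."""
    return total_bytes > 0 and free_bytes / total_bytes >= min_free_fraction


def route(prompt, primary, fallback, breaker, read_gpu_memory):
    """Try the primary (on-VRAM) lane; use the fallback when guarded or failing."""
    free_bytes, total_bytes = read_gpu_memory()
    if breaker.allow() and has_headroom(free_bytes, total_bytes):
        try:
            result = primary(prompt)
            breaker.record_success()
            return "primary", result
        except Exception:
            # Treat any primary failure (including OOM) as a breaker strike.
            breaker.record_failure()
    return "fallback", fallback(prompt)
```

The breaker keeps a struggling primary lane from being hammered with retries, while the headroom check steers traffic away before an OOM rather than after it. In production you would likely narrow the `except` clause to the error types your client actually raises.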

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    If models fit on your cards, standardize on vLLM for throughput and queue efficiency.

  • 02.

    Design a two-lane serving plan: vLLM for on-GPU workloads, llama.cpp/Ollama as a spill-tolerant lane for oversized jobs.
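A two-lane plan needs some rule for deciding which lane a workload belongs in. The sketch below is a rough estimate, not a measured one: it adds weight size to KV cache size (keys and values for every layer, KV head, and cached token) and compares the total against a VRAM budget. `ModelSpec`, `choose_lane`, the lane labels, and the 2 GiB reserve are all hypothetical, and the estimate ignores activations, runtime overhead, and fragmentation.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelSpec:
    params_billion: float   # parameter count, in billions
    bytes_per_param: float  # 2.0 for fp16/bf16, about 0.5 for 4-bit quantization
    layers: int
    kv_heads: int
    head_dim: int
    kv_bytes: int = 2       # bytes per KV cache element (fp16)


def weights_bytes(m):
    return m.params_billion * 1e9 * m.bytes_per_param


def kv_cache_bytes(m, total_tokens):
    # Keys and values (factor 2) for every layer, KV head, and cached token.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.kv_bytes * total_tokens


def choose_lane(m, context_tokens, concurrency, vram_gib=24.0, reserve_gib=2.0):
    """Pick the vLLM lane only if weights plus KV cache fit within the budget."""
    needed = weights_bytes(m) + kv_cache_bytes(m, context_tokens * concurrency)
    budget = (vram_gib - reserve_gib) * GIB
    return "vllm" if needed <= budget else "spill-tolerant"


# Hypothetical model shapes, for illustration only.
small = ModelSpec(params_billion=8, bytes_per_param=2.0, layers=32, kv_heads=8, head_dim=128)
large = ModelSpec(params_billion=32, bytes_per_param=2.0, layers=64, kv_heads=8, head_dim=128)
print(choose_lane(small, context_tokens=4096, concurrency=8))
print(choose_lane(large, context_tokens=4096, concurrency=8))
```

Treat the output as a first-pass placement and confirm it with a real load test, since the actual footprint depends on the serving engine's allocator and settings.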
