GOOGLE PUB_DATE: 2026.03.04

GOOGLE DEBUTS GEMINI 3.1 FLASH LITE: CHEAPER, FASTER MODEL WITH VARIABLE REASONING

Google launched Gemini 3.1 Flash Lite, a cheaper and faster developer-focused model with variable reasoning now in preview via the Gemini API and Vertex AI.
Google’s new model is designed for high-volume tasks: Google claims up to 2.5x faster time-to-first-token and 45% faster output generation than 2.5 Flash, with pricing at $0.25 per 1M input tokens and $1.50 per 1M output tokens, and it lets developers tune how much reasoning to apply per request (TechRadar). Simon Willison notes the same pricing and highlights the four selectable “thinking” levels for controllable latency/cost trade-offs (Simon Willison’s note).
For backend and data workloads, Google is pitching use cases such as translation, content moderation, UI/dashboard generation, and simulations, where throughput and unit economics dominate. The model is available in preview via the Gemini API in Google AI Studio and, for enterprises, in Vertex AI (TechRadar). Note that some early reports omitted pricing and benchmarks, so earlier summaries may have lacked these specifics (The AI Report).
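
As a concrete illustration, here is a minimal sketch of per-request reasoning selection using the google-genai Python SDK. The model id "gemini-3.1-flash-lite" and the level names are taken from the coverage above, and the exact thinking_level field is an assumption based on how current Gemini models expose thinking controls; verify both against the preview docs.

    # Hedged sketch: pick a reasoning ("thinking") level per request.
    # ASSUMPTIONS: model id and thinking_level values come from press
    # coverage, not verified preview docs.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    response = client.models.generate_content(
        model="gemini-3.1-flash-lite",  # assumed preview model id
        contents="Translate to German: The invoice is overdue.",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="low"),
        ),
    )
    print(response.text)

Dropping to a lower level should cut latency and output tokens on easy tasks; raise it only where quality measurably improves.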

[ WHY_IT_MATTERS ]
01.

Lower per-token pricing plus variable reasoning can materially cut GenAI costs for high-throughput services.

02.

Faster first-token and generation speed improve P95 latencies for synchronous backend endpoints.

[ WHAT_TO_TEST ]
  • 01.

    Benchmark the variable reasoning levels (minimal/low/medium/high) against your latency, accuracy, and cost SLOs on representative data tasks (see the harness after this list).

  • 02.

    Model total cost of ownership at preview pricing with projected token volumes and compare it to 2.5 Flash or your current provider baseline (a back-of-envelope sketch follows the harness).
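
A hedged harness for item 01: sweep the four levels over representative prompts and record time-to-first-token and total latency. The level names, model id, and thinking_level field are the same assumptions as in the earlier sketch; the prompts are placeholders.

    # Sweep reasoning levels; measure TTFT and total latency per prompt.
    import time
    from google import genai
    from google.genai import types

    client = genai.Client()
    LEVELS = ["minimal", "low", "medium", "high"]  # names from the article
    PROMPTS = ["Summarize: ...", "Moderate: ..."]  # replace with real tasks

    for level in LEVELS:
        for prompt in PROMPTS:
            cfg = types.GenerateContentConfig(
                thinking_config=types.ThinkingConfig(thinking_level=level))
            start = time.perf_counter()
            ttft = None
            for chunk in client.models.generate_content_stream(
                    model="gemini-3.1-flash-lite",  # assumed preview id
                    contents=prompt, config=cfg):
                if ttft is None:
                    ttft = time.perf_counter() - start  # first token seen
            total = time.perf_counter() - start
            ttft = ttft if ttft is not None else total  # guard empty stream
            print(f"{level:>8}  ttft={ttft:.3f}s  total={total:.3f}s")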

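And a back-of-envelope model for item 02, using the preview pricing quoted above ($0.25 and $1.50 per 1M input/output tokens). The traffic figures are hypothetical placeholders; swap in your own volumes and your incumbent model's rates to get the delta.

    # Back-of-envelope TCO at the quoted preview pricing.
    IN_PRICE = 0.25 / 1_000_000   # USD per input token (preview pricing)
    OUT_PRICE = 1.50 / 1_000_000  # USD per output token (preview pricing)

    requests_per_day = 2_000_000  # hypothetical volume
    avg_in_tokens = 600           # hypothetical request size
    avg_out_tokens = 250          # hypothetical response size

    daily = requests_per_day * (avg_in_tokens * IN_PRICE +
                                avg_out_tokens * OUT_PRICE)
    print(f"~${daily:,.0f}/day, ~${daily * 30:,.0f}/month")
    # Repeat with your current model's per-token rates and compare.
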
[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Pilot Flash Lite behind a feature flag as a drop-in replacement for existing 2.5 Flash calls and run an A/B test on throughput, quality, and spend (see the sketch after this list).

  • 02.

    Update observability to capture per-request reasoning level, token usage, and P95/P99 latencies before widening the rollout.
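
A sketch covering both brownfield steps: deterministic flag bucketing so each request sticks to one arm, plus one structured log record per call carrying the fields item 02 names. The canary percentage, model ids, and log schema are illustrative assumptions.

    # Flag-gated A/B routing with per-request observability.
    import hashlib, json, time

    CANARY_PCT = 5  # start small; widen as quality and spend hold up

    def pick_model(request_id: str) -> str:
        # Hash bucketing: the same request id always lands in the same arm.
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
        return ("gemini-3.1-flash-lite" if bucket < CANARY_PCT
                else "gemini-2.5-flash")

    def log_call(request_id, model, thinking_level,
                 in_tokens, out_tokens, latency_s):
        # One structured record per request; derive P95/P99 downstream.
        print(json.dumps({
            "request_id": request_id,
            "model": model,
            "thinking_level": thinking_level,
            "input_tokens": in_tokens,
            "output_tokens": out_tokens,
            "latency_ms": round(latency_s * 1000),
            "ts": time.time(),
        }))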

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Design request paths to dial reasoning dynamically per task class, balancing cost against quality from day one (see the policy-table sketch after this list).

  • 02.

    Start in AI Studio for rapid prototyping, then move to Vertex AI for governed deployment, quotas, and enterprise controls (a client-configuration sketch follows below).
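
For item 01, a sketch of a single policy table that maps task class to reasoning level at the call site; the class names and level assignments are illustrative, and the thinking_level field is the same assumption as above.

    # One table owns the cost/quality policy per task class.
    from google.genai import types

    LEVEL_BY_TASK = {
        "translation": "minimal",
        "moderation": "low",
        "ui_generation": "medium",
        "simulation": "high",
    }

    def config_for(task_class: str) -> types.GenerateContentConfig:
        level = LEVEL_BY_TASK.get(task_class, "low")  # conservative default
        return types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level))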

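For item 02, the google-genai SDK already supports both surfaces, so the prototype-to-production move can be a one-line client-construction change; the project and location below are placeholders.

    from google import genai

    # Prototyping against the Gemini API with an AI Studio key:
    dev_client = genai.Client(api_key="YOUR_API_KEY")

    # Same SDK pointed at Vertex AI for governed, quota-managed deployment:
    prod_client = genai.Client(vertexai=True,
                               project="my-project",    # placeholder
                               location="us-central1")  # placeholder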