Key Highlights
- ✓ Configurable thinking levels — dial reasoning depth per request
- ✓ $0.50/M input tokens with 90.4% on PhD-level GPQA Diamond
- ✓ 1M token context window with full multimodal support
- ✓ Native tool use and structured output for agentic workflows
The pitch
Gemini 3.1 Flash is Google's answer to a question the industry has been dancing around: why should developers choose between intelligence and speed? Instead of shipping separate Pro and Flash models and forcing you to pick, Google built configurable reasoning into Flash itself.
The thinking_level parameter — minimal, low, medium, or high — lets you control how much internal reasoning the model performs on each request. A simple classification task gets minimal thinking. A multi-step coding problem gets high. Same model, same API call, different compute allocation.
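The per-request routing this enables can be sketched as a small policy function. The task categories and the `choose_thinking_level` helper below are illustrative, not part of any SDK; only the four level names come from the parameter described above.

```python
def choose_thinking_level(task_kind: str) -> str:
    """Map a task category to a thinking level (hypothetical policy)."""
    policy = {
        "classification": "minimal",  # simple label-picking
        "faq": "minimal",
        "summarization": "low",
        "analysis": "medium",
        "coding": "high",             # multi-step reasoning
    }
    return policy.get(task_kind, "medium")  # default to a middle setting

# The chosen level would then be passed in the request config
# (exact config shape depends on the SDK you use):
level = choose_thinking_level("coding")
print(level)  # high
```

The point of the pattern: routing happens in your code, but the model endpoint never changes.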
Why it matters
The AI pricing wars have been about cost per token. Gemini 3.1 Flash shifts the conversation to cost per reasoning step. At $0.50 per million input tokens and $3.00 per million output tokens, it's priced like a utility model but performs like a reasoning one.
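A back-of-envelope calculation at the quoted rates shows what "priced like a utility model" means in practice (the example token counts are illustrative):

```python
# Quoted rates: $0.50 per million input tokens, $3.00 per million output tokens.
INPUT_RATE = 0.50 / 1_000_000
OUTPUT_RATE = 3.00 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the quoted per-token rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 10k-token prompt with a 1k-token reply:
print(f"${request_cost(10_000, 1_000):.4f}")  # $0.0080
```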
On GPQA Diamond (PhD-level reasoning), it scores 90.4%. On MMMU Pro (multimodal understanding), 81.2%. These are numbers that would have been flagship-tier six months ago. Now they're available at Flash pricing.
Architecture
The 1M token context window handles full codebases, long documents, and extended conversations without chunking. Multimodal input support covers text, images, audio, video, and PDFs — all in a single model call.
For developers, the real feature is native tool use and structured output. Gemini 3.1 Flash handles function calling, JSON schema enforcement, and multi-turn agent loops without the reliability issues that plague smaller models. Context caching is automatic, reducing costs further for repeated interactions.
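To make the structured-output idea concrete, here is a minimal client-side sketch: the caller declares a JSON-schema-style contract and validates the model's reply against it. The schema and `validate()` helper are illustrative only; in practice SDKs typically enforce this server-side via a response-schema option in the request config.

```python
import json

# Hypothetical schema for a code-review agent's reply.
REVIEW_SCHEMA = {
    "type": "object",
    "required": ["severity", "summary"],
    "properties": {
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        "summary": {"type": "string"},
    },
}

def validate(payload: str, schema: dict) -> dict:
    """Parse model output and check required fields and enums (simplified)."""
    data = json.loads(payload)
    for field in schema["required"]:
        if field not in data:
            raise ValueError(f"missing field: {field}")
    allowed = schema["properties"]["severity"].get("enum", [])
    if data["severity"] not in allowed:
        raise ValueError("severity outside allowed values")
    return data

# Simulated model reply:
reply = '{"severity": "high", "summary": "Null check missing in parser."}'
print(validate(reply, REVIEW_SCHEMA))
```

Server-side enforcement is what removes the retry-until-valid loops that smaller models force on agent builders.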
Configurable thinking
This is the genuinely novel part. Traditional model selection forces a binary: fast-and-cheap or slow-and-smart. The thinking_level parameter makes this a continuous spectrum within a single deployment.
A chatbot handling customer FAQs runs at minimal. A code review agent runs at high. An agentic workflow might start at low for information gathering, then escalate to high for synthesis. All on the same model endpoint, no routing logic needed.
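The escalation pattern above can be sketched in a few lines. `run_step` is a stand-in for an actual model call; a real implementation would pass the thinking level in the request config.

```python
def run_step(prompt: str, thinking_level: str) -> str:
    """Placeholder for a model call; a real version would hit the API here."""
    return f"[{thinking_level}] {prompt}"

def agent_workflow(sources: list[str]) -> str:
    """Gather cheaply at low, then synthesize once at high."""
    notes = [run_step(f"extract key facts from: {s}", "low") for s in sources]
    return run_step("synthesize a final answer from: " + "; ".join(notes), "high")

print(agent_workflow(["doc A", "doc B"]))
```

Because both steps hit the same endpoint, the escalation is just a config value, not a deployment decision.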
The trade-offs
The model is still in preview, so expect occasional instability and API changes. The 3.1 Flash designation suggests a point release, not a generation leap, so don't expect it to match Gemini 3.1 Pro on the hardest benchmarks.
Google's model naming has become genuinely confusing. Gemini 3 Flash, 3.1 Flash, 3 Pro, 3.1 Pro — each with preview/stable variants. Developers need to track deprecation schedules (3 Pro shuts down March 9, 2026).
Who this is for
Developers building agentic workflows who need reasoning flexibility without model-switching overhead. Teams running multi-turn coding assistants where latency matters. Anyone currently paying for Pro-tier models on tasks that don't consistently need Pro-tier reasoning.
Verdict
Gemini 3.1 Flash's configurable thinking is the kind of feature that seems obvious in retrospect. Instead of maintaining separate model deployments for different complexity tiers, you get one endpoint with a dial. At Flash pricing with near-Pro performance, it's the model Google should have shipped first.