StepFun's Step 3.7 Flash: Revolutionizing AI with Multimodal MoE Model and Cost-Effective Advisor Mode
May 30, 2026
StepFun’s Step 3.7 Flash is a 198B sparse Mixture-of-Experts (MoE) model that activates about 11B parameters per token, with native multimodal support and an Advisor Mode that delivers strong performance at low cost.
In independent benchmarks, Step 3.7 Flash leads on ClawEval-1.1 and on the long-context AA-LCR evaluation.
The model includes multimodal tools: a Visual Search Tool achieving top results on SimpleVQA with Search and a Python Tool for fine-grained visual tasks, with emergent compositional tool use observed.
Architecturally, it uses a separate 1.8B ViT encoder to inject image representations into the language context, along with three selectable reasoning depths (low, medium, high) to balance latency and depth.
Agentic coding shows gains: SWE-Bench Pro at 56.26%, Terminal-Bench 2.1 at 59.55%, and SWE-MTLG at 72.42%, with cross-harness variance narrowed for better predictability.
Advisor Mode enables end-to-end agentic loops with selective escalation to a larger advisor model, achieving about 97% of Claude Opus 4.6’s coding performance for roughly a fifth of the cost per task.
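The summary does not say how Advisor Mode decides when to escalate, so the following is only a minimal sketch of the general "selective escalation" pattern. Every name here (`StepResult`, `run_with_advisor`, the confidence score, the threshold, and the toy models) is hypothetical and not StepFun's API: a small executor model drafts an answer, and a larger advisor is consulted only when the draft looks weak.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class StepResult:
    output: str
    confidence: float  # hypothetical self-reported score in [0, 1]


def run_with_advisor(
    task: str,
    executor: Callable[[str, Optional[str]], StepResult],
    advisor: Callable[[str, str], str],
    threshold: float = 0.6,       # assumed escalation threshold
    max_escalations: int = 2,     # assumed cap on advisor calls
) -> Tuple[StepResult, int]:
    """Run the cheap executor; call the advisor only on low-confidence drafts."""
    hint: Optional[str] = None
    escalations = 0
    while True:
        result = executor(task, hint)
        # Stop when the draft is good enough or the advisor budget is spent.
        if result.confidence >= threshold or escalations >= max_escalations:
            return result, escalations
        hint = advisor(task, result.output)
        escalations += 1


# Toy stand-ins for the two models, for illustration only.
def toy_executor(task: str, hint: Optional[str]) -> StepResult:
    if hint is None:
        return StepResult(output=f"draft for {task}", confidence=0.3)
    return StepResult(output=f"revised {task} using {hint}", confidence=0.9)


def toy_advisor(task: str, draft: str) -> str:
    return "advice"


final, used = run_with_advisor("fix bug", toy_executor, toy_advisor)
print(final.output, used)
```

Because the advisor is called only on the weak drafts, most of the tokens are billed at the smaller model's rate, which is the intuition behind the cost claim above.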
With Step 3.7 Flash, StepFun emphasizes more reliable tool use and handling of vision inputs than in the earlier Step 3.5 Flash iterations.
Pricing and access: token costs are tiered at $0.20 per million input tokens on cache misses, $0.04 per million input tokens on cache hits, and $1.15 per million output tokens, with multiple backends and platforms supporting deployment.
Integrated search and reasoning results include HLE with Tools at 47.20%, alongside evaluations on BrowseComp/DeepSeek variants.
Overall, the architecture combines a 196B language backbone with a 1.8B vision encoder and activates ~11B parameters per token through MoE routing, keeping per-token compute close to that of a dense 11B model despite the 198B total.
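To make the tiered pricing concrete, here is a small cost estimator that uses only the three per-million-token prices quoted above. The function name and the example token counts are illustrative assumptions, not figures from StepFun.

```python
# USD per million tokens, taken from the prices quoted in the summary.
PRICE_PER_MILLION = {
    "input_cache_miss": 0.20,
    "input_cache_hit": 0.04,
    "output": 1.15,
}


def estimate_cost(cache_miss_tokens: int, cache_hit_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request or task."""
    total = (
        cache_miss_tokens * PRICE_PER_MILLION["input_cache_miss"]
        + cache_hit_tokens * PRICE_PER_MILLION["input_cache_hit"]
        + output_tokens * PRICE_PER_MILLION["output"]
    )
    return total / 1_000_000


# Hypothetical agentic task: 200k uncached input, 800k cached input, 50k output tokens.
cost = estimate_cost(200_000, 800_000, 50_000)
print(f"${cost:.4f}")
```

For that hypothetical mix the estimate comes to roughly $0.13. Output tokens account for the largest share even though they are the smallest count, so long generations affect the bill more than cached context does.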
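The parameter arithmetic behind those figures can be checked in a few lines. This sketch uses only the numbers quoted above. Whether the ~11B active count includes the vision encoder is not stated in the summary, so the fraction below is an approximation under that uncertainty.

```python
# Parameter counts in billions, as quoted in the summary.
BACKBONE_B = 196.0   # language backbone
VISION_B = 1.8       # ViT encoder
ACTIVE_B = 11.0      # approximate parameters activated per token

total_b = BACKBONE_B + VISION_B        # rounds to the headline 198B
active_fraction = ACTIVE_B / total_b   # share of weights used per token

print(f"total: {total_b:.1f}B, active per token: {active_fraction:.1%}")
```

Under these assumptions, each token touches only around one-eighteenth of the weights, which is why per-token compute is comparable to a dense 11B model even though the full 198B must still be stored and served.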
Summary based on 1 source
Source

MarkTechPost • May 29, 2026
StepFun Releases Step 3.7 Flash: A 198B MoE Vision-Language Model for Coding Agents and Search Workflows