StepFun's Step 3.7 Flash: Revolutionizing AI with Multimodal MoE Model and Cost-Effective Advisor Mode

May 30, 2026
StepFun's Step 3.7 Flash: Revolutionizing AI with Multimodal MoE Model and Cost-Effective Advisor Mode
  • We’re looking at StepFun’s Step 3.7 Flash, a 198B sparse Mixture-of-Experts (MoE) model with a 11B active parameter footprint per token, featuring native multimodal support and an Advisor Mode that delivers strong performance at low cost.

  • In independent benchmarks, Step 3.7 Flash leads on ClawEval-1.1 and long-context AA-LCR, signaling superior performance across multiple tasks.

  • The model includes multimodal tools: a Visual Search Tool achieving top results on SimpleVQA with Search and a Python Tool for fine-grained visual tasks, with emergent compositional tool use observed.

  • Architecturally, it uses a separate 1.8B ViT encoder to inject image representations into the language context, along with three selectable reasoning depths (low, medium, high) to balance latency and depth.

  • Agentic coding shows gains: SWE-Bench Pro at 56.26%, Terminal-Bench 2.1 at 59.55%, and SWE-MTLG at 72.42%, with cross-harness variance narrowed for better predictability.

  • Advisor Mode enables end-to-end agentic loops with selective escalation to a larger advisor model, achieving about 97% of Claude Opus 4.6’s coding performance for roughly a fifth of the cost per task.

  • StepFun released Step 3.7 Flash to emphasize reliability in tool use and vision inputs compared with earlier 3.5 Flash iterations.

  • Pricing and access: token costs are tiered—$0.20 per million input cache misses, $0.04 per million input cache hits, and $1.15 per million outputs—with multiple backends and platforms supporting deployment.

  • Integrated search and reasoning benchmarks show valuable context, including HLE with Tools at 47.20% and other BrowseComp/DeepSeek variants for evaluation.

  • Overall architecture combines a 196B language backbone with a 1.8B vision encoder and MoE activation of ~11B parameters per token, maintaining compute near a dense 11B model within the 198B budget.

Summary based on 1 source


Get a daily email with more AI stories

More Stories