Stepfun has announced the open-source release of its latest-generation foundation model, Step3. Step3 is built for enterprises and developers who want the best balance between performance and cost, aiming to be the model best suited to the era of inference. It is available on GitHub, Hugging Face, and ModelScope, where developers can download and try it freely.

Step3 adopts a Mixture-of-Experts (MoE) architecture with 321 billion total parameters, of which 38 billion are activated per token. It combines strong visual perception with complex reasoning, handling cross-domain knowledge understanding, joint analysis of mathematical and visual information, and everyday visual-analysis problems. Through Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 significantly improves inference efficiency across a range of chips. The StepMesh communication library for AFD scenarios has been open-sourced alongside the model; it provides a standard deployment interface across hardware and supports stable performance reproduction in production services.
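To make these numbers concrete, a back-of-envelope sizing sketch: all 321B parameters must sit in memory, while per-token compute tracks only the 38B that are activated. The bytes-per-parameter figures below are illustrative assumptions, not official deployment guidance.

```python
# Back-of-envelope sizing for a 321B-total / 38B-activated MoE model.
# Illustrative assumptions: FP8 = 1 byte/param, BF16 = 2 bytes/param;
# KV cache, activations, and runtime buffers are ignored.

TOTAL_PARAMS = 321e9    # every expert must reside in GPU memory
ACTIVE_PARAMS = 38e9    # parameters actually used per token

GiB = 1024 ** 3
for name, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    weights_gib = TOTAL_PARAMS * bytes_per_param / GiB
    print(f"{name} weights: ~{weights_gib:,.0f} GiB")
# FP8 weights: ~299 GiB  -> fits within an 8x48 GB (384 GB) node
# BF16 weights: ~598 GiB -> needs multiple nodes or quantization

# Per-token compute scales with the activated parameters only:
print(f"activated fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")  # ~11.8%
```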


At its core, Step3 uses the self-developed MFA attention mechanism, which substantially reduces KV-cache overhead and compute in the attention layers. The design balances resource utilization and inference efficiency without sacrificing model capability, making high-throughput inference feasible on 8×48 GB GPUs and giving the model a realistic deployment path.

On the multimodal side, Step3 pairs a 5B-parameter vision encoder with a two-layer 2D convolution that downsamples visual features, cutting the number of visual tokens to 1/16 of the original, which relieves context-length pressure and improves inference efficiency. Training proceeds in two stages: the first strengthens the encoder's perception, while the second freezes the vision encoder and optimizes only the language backbone and connector layers to reduce gradient interference. The training data mixes paired, interleaved, and multi-task data; during cleaning, similarity filtering, resampling, and task-ratio control are applied to further improve image-text alignment quality and training robustness.
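The 1/16 figure follows directly from the convolution strides: each stride-2 layer halves both spatial dimensions, quartering the token count, and two such layers compound to a 16× reduction. A minimal PyTorch sketch of the idea (the channel width, kernel size, and activation are illustrative assumptions, not Step3's actual configuration):

```python
import torch
import torch.nn as nn

class VisualTokenDownsampler(nn.Module):
    """Two stride-2 2D convolutions: each halves H and W, so the
    number of visual tokens drops to (1/4) * (1/4) = 1/16."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, H, W) feature map from the vision encoder
        x = torch.relu(self.conv1(x))        # (batch, dim, H/2, W/2)
        x = torch.relu(self.conv2(x))        # (batch, dim, H/4, W/4)
        # Flatten the grid back into a token sequence for the LLM.
        return x.flatten(2).transpose(1, 2)  # (batch, H*W/16, dim)

feats = torch.randn(1, 1024, 32, 32)   # 32*32 = 1024 visual tokens
tokens = VisualTokenDownsampler()(feats)
print(tokens.shape)                    # torch.Size([1, 64, 1024]) -> 1/16
```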

At the system level, Step3 restructures the decoding pipeline to attack the inference bottlenecks and resource mismatches caused by running attention and FFN computations mixed on the same hardware. The team implemented a high-performance AFD scheme that decouples the two kinds of computation into separate subsystems and uses multi-stage pipeline-parallel scheduling to raise overall throughput. Because the decoupled subsystems demand heavy data transfer between them, the team also built the StepMesh communication library for AFD scenarios: based on GPU Direct RDMA, it achieves low-latency, high-bandwidth transfers across devices while consuming no GPU compute resources and remaining compatible with heterogeneous hardware. Under a 50 ms decoding SLA, Step3 reaches 4,039 tokens/GPU/s on Hopper GPUs, about 74% higher than DeepSeek V3 under comparable settings (2,324 tokens/GPU/s), and on specific hardware and in long-context scenarios this gain can be amplified by up to a further 300%.
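The core of AFD can be pictured as a two-stage pipeline: attention workers stream activations to FFN workers over a fast link, so that one micro-batch's attention overlaps with the previous micro-batch's FFN. A toy single-process sketch of this scheduling idea, where queues and threads stand in for the GPU Direct RDMA transport (none of these names come from StepMesh's actual API):

```python
import queue
import threading

# Toy sketch of Attention-FFN Disaggregation (AFD): attention and FFN
# run as separate workers connected by a bounded queue, so the two
# stages overlap across micro-batches. In the real system the link is
# GPU Direct RDMA via StepMesh; this only illustrates the scheduling.

a2f = queue.Queue(maxsize=2)      # attention -> FFN "network link"
results = queue.Queue()

def attention_worker(num_microbatches: int) -> None:
    for i in range(num_microbatches):
        hidden = f"attn_out[{i}]"      # stand-in for attention output
        a2f.put((i, hidden))           # ship activations to the FFN side
    a2f.put(None)                      # end-of-stream sentinel

def ffn_worker() -> None:
    while (item := a2f.get()) is not None:
        i, hidden = item
        results.put(f"ffn({hidden})")  # stand-in for expert computation

t_attn = threading.Thread(target=attention_worker, args=(4,))
t_ffn = threading.Thread(target=ffn_worker)
t_attn.start()
t_ffn.start()
t_attn.join()
t_ffn.join()
while not results.empty():
    print(results.get())
```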

Step3 was evaluated on benchmarks including MMMU, MathVision, SimpleVQA, AIME 2025, GPQA-Diamond, and LiveCodeBench (problems from August 2024 through May 2025), achieving industry-leading results among comparable open-source models. In a "business banquet seating" task, for example, Step3 reads the structure in the image, parses the etiquette rules, role relationships, and spatial logic, and combines them with Chinese social etiquette to infer the full role assignment for twelve people, finally producing a complete seating plan covering hosts, co-hosts, and guests with clear roles, definite positions, and a sensible layout, presented intuitively as a table plus an ASCII diagram. In a calorie-counting task, Step3 parses a complex receipt, categorizes the dishes, matches each to its calorie count, and estimates that the two diners consumed a total of 5,710 kcal in one meal, averaging 2,855 kcal per person, with a clear, end-to-end chain of reasoning from raw data to conclusion.

The Step3 API is now live on the Stepfun Open Platform (platform.stepfun.com). Developers can also try the model on the Stepfun AI website (stepfun.com) and in the Stepfun AI app (available from app stores). A limited-time discount applies: all requests are billed at the lowest tier, as low as 1.5 yuan per million input tokens and 4 yuan per million output tokens.
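For developers integrating the API, a minimal sketch assuming the platform exposes an OpenAI-compatible chat-completions endpoint; the base URL and the model identifier "step-3" are assumptions to verify against the platform documentation:

```python
from openai import OpenAI  # pip install openai

# Assumptions to verify on platform.stepfun.com: OpenAI-compatible
# endpoint, base URL "https://api.stepfun.com/v1", model id "step-3".
client = OpenAI(
    api_key="YOUR_STEPFUN_API_KEY",
    base_url="https://api.stepfun.com/v1",
)

resp = client.chat.completions.create(
    model="step-3",
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)
print(resp.choices[0].message.content)

# Cost at the promotional rates quoted above:
# 1.5 yuan / 1M input tokens, 4 yuan / 1M output tokens.
u = resp.usage
cost_yuan = u.prompt_tokens * 1.5 / 1e6 + u.completion_tokens * 4 / 1e6
print(f"approx. cost: ¥{cost_yuan:.6f}")
```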

GitHub: https://github.com/stepfun-ai/Step3

Hugging Face: https://huggingface.co/stepfun-ai/step3

ModelScope:

https://www.modelscope.cn/models/stepfun-ai/step3

https://www.modelscope.cn/models/stepfun-ai/step3-fp8