On August 31, the Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) announced the open-source release of InternVL3.5 (Chinese name: Shusheng · Wanxiang), a multimodal large model. Through innovative cascade reinforcement learning (Cascade RL), dynamic visual resolution routing, and a decoupled deployment architecture, the model delivers across-the-board upgrades in reasoning ability, deployment efficiency, and general capability. The InternVL3.5 series spans model sizes from 1B to 241B parameters, setting a new performance benchmark for open-source models and achieving leading results across a range of tasks.

The flagship model, InternVL3.5-241B-A28B, achieved the highest open-source score of 77.7 points on the multidisciplinary reasoning benchmark MMMU. It scored 77.9 points on the multimodal general perception benchmark MMStar and 90.7 points on OCRBench, surpassing GPT-5's 75.7 and 80.7 points on those two benchmarks. On the text reasoning benchmarks AIME25 and MMLU-Pro, it reached 75.6 and 81.3 points respectively, comprehensively outperforming existing open-source multimodal large models. Thanks to the cascade reinforcement learning framework, overall reasoning performance across the series improved by an average of 16.0 points over the previous generation. In particular, the comprehensive reasoning score of InternVL3.5-241B-A28B reached 66.9 points, exceeding both the previous generation's 54.6 points and Claude-3.7-Sonnet's 53.9 points, with strong results on complex tasks such as mathematical and logical reasoning.

With the innovative visual resolution routing (ViR) and the decoupled vision-language deployment framework (DvD), the 38B model's response speed at 896-pixel input resolution improved markedly: single-inference latency dropped from 369 ms to 91 ms, a speedup of roughly 4x. Meanwhile, the lightweight InternVL3.5-Flash variant halves the visual sequence length while retaining nearly 100% of the full model's performance.

InternVL3.5 also strengthens core agent capabilities, including GUI agents, embodied agents, and SVG graphics understanding and generation. It surpasses mainstream open-source models on tasks such as ScreenSpot GUI grounding (92.9 points), VSI-Bench spatial reasoning (69.5 points), and SGP-Bench vector graphics understanding (70.6 points).

InternVL3.5 provides nine model sizes ranging from 1 billion to 241 billion parameters, covering different resource scenarios, and includes both dense and mixture-of-experts (MoE) models. It is the first open-source multimodal large model family to support the GPT-OSS language model as a base. The official repository provides example code for running InternVL3.5-8B with `transformers` (see the sketch below). The 8B model can be deployed on a single A100 GPU, the 38B model requires 2 A100 GPUs, and the 241B model requires 8 A100 GPUs.
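For reference, here is a minimal sketch of what such `transformers` usage typically looks like for InternVL models. The model ID, the `model.chat` interface, and the image-loading helper follow the conventions documented in the InternVL repository; check the official example code for the exact names and arguments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Model ID assumed from the InternVL naming convention; verify against the repo.
path = "OpenGVLab/InternVL3_5-8B"

# trust_remote_code loads InternVL's custom modeling code from the hub.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

generation_config = dict(max_new_tokens=512, do_sample=False)

# Text-only chat: pixel_values is None. For image inputs, the repo's
# load_image() helper produces the pixel_values tensor passed here instead.
question = "Hello, who are you?"
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```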

ms-swift, the training and deployment framework for large and multimodal models provided by the ModelScope community, already supports training the InternVL3.5 series. Users can prepare data in ms-swift's expected format for custom-dataset fine-tuning (a sketch of the format follows); after training, they can run inference with the corresponding command and push the model to ModelScope.
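As an illustration, the snippet below writes a tiny JSONL training file in the messages/images schema that ms-swift documents for multimodal custom datasets; the file paths and contents are placeholders, and the exact keys should be verified against the ms-swift documentation for InternVL3.5.

```python
import json

# Each line is one training sample. The "<image>" tag in the user turn
# marks where the corresponding entry from "images" is inserted.
samples = [
    {
        "messages": [
            {"role": "user", "content": "<image>What is shown in this image?"},
            {"role": "assistant", "content": "A cat sitting on a windowsill."},
        ],
        "images": ["images/cat.jpg"],  # placeholder path
    },
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

Fine-tuning and inference are then launched with ms-swift's `swift sft` and `swift infer` CLI commands; consult the ms-swift documentation for the model-specific flags.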

The release of InternVL3.5 marks another important advancement in multimodal large model technology, providing researchers and developers with powerful tools and promoting the development of multimodal artificial intelligence.

Open-source code and usage instructions:

https://github.com/OpenGVLab/InternVL

Model collection:

https://www.modelscope.cn/collections/InternVL35-Full-3871e58bf21349

Online experience:

https://chat.intern-ai.org.cn/