At the Force Link AI Innovation Tour held in Shanghai, ByteDance officially released its latest vision-language multimodal model, Seed1.5-VL. The model was the highlight of the event, drawing attention from industry experts and developers with its strong general multimodal understanding and reasoning capabilities.
The standout feature of Seed1.5-VL is its enhanced multimodal understanding and reasoning. Compared with previous versions, it is markedly faster and more accurate at visual grounding and reasoning, and new video-understanding and multimodal-agent capabilities further improve its handling of complex tasks.
Superior Performance with Cost Efficiency
Despite activating only 20B parameters, Seed1.5-VL performs on par with Gemini 2.5 Pro. Across 60 public benchmarks, it achieved state-of-the-art (SOTA) results on 38 tasks, with particular strength in video understanding, visual reasoning, and multimodal agent capabilities, where it leads the industry.
Inference is also inexpensive: input costs just 0.003 yuan per thousand tokens and output 0.009 yuan per thousand tokens, making the model notably cost-effective to deploy.
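As a rough illustration of those rates, here is a minimal cost estimate in Python; the token counts are hypothetical, chosen only for the example:

```python
# Illustrative cost calculation from the published per-1K-token rates.
INPUT_PRICE_PER_1K = 0.003   # yuan per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.009  # yuan per 1,000 output tokens

input_tokens = 10_000   # hypothetical request size
output_tokens = 2_000   # hypothetical response size

cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K \
     + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
print(f"Estimated cost: {cost:.3f} yuan")  # 0.030 + 0.018 = 0.048 yuan
```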
Convenient API Access
Seed1.5-VL is now fully open for API access on Volcano Engine. Developers can log in, select Doubao-1.5-thinking-vision-pro, and start building their own AI visual assistants, inspection systems, interactive agents, or next-generation smart cameras.
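A minimal sketch of what such a call might look like, assuming Volcano Engine exposes an OpenAI-compatible chat-completions endpoint; the base URL, API-key variable, exact model ID, and image URL below are illustrative and should be checked against the Volcano Engine documentation:

```python
import os
from openai import OpenAI  # OpenAI-compatible client

# Base URL assumed from Volcano Engine's Ark service; verify in the docs.
client = OpenAI(
    base_url="https://ark.cn-beijing.volces.com/api/v3",
    api_key=os.environ["ARK_API_KEY"],  # illustrative env-var name
)

# Send one image plus a text question in a single request.
response = client.chat.completions.create(
    model="doubao-1.5-thinking-vision-pro",  # exact ID may differ on the platform
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/shelf.jpg"}},  # placeholder image
            {"type": "text",
             "text": "Which products are on this shelf, and what do they cost in total?"},
        ],
    }],
)
print(response.choices[0].message.content)
```

The image-plus-text content list mirrors the shelf-recognition scenario described in the tests below.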
To verify Seed1.5-VL's real-world performance, reporters ran several tests. Given an uploaded shelf image, the model quickly identified specific products and calculated their prices. On complex civil-service graphical-reasoning questions, it likewise demonstrated strong reasoning, quickly spotting and extrapolating the underlying patterns to solve difficult logic problems.
As the latest generation of multimodal models in the Seed series, Seed1.5-VL was pre-trained on over 3T tokens of multimodal data and performs strongly on tasks such as image question answering, chart understanding, and visual reasoning. The model consists of three core components: the SeedViT visual encoder, a multilayer perceptron (MLP) adapter that projects visual features into the language model's embedding space, and the Seed1.5-LLM large language model, which is built on a Mixture-of-Experts (MoE) architecture.
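A schematic of how those three components compose, in PyTorch-style pseudocode; this is not the actual implementation, and every class name, dimension, and the MoE placeholder here is an illustrative stand-in:

```python
import torch
import torch.nn as nn

class Seed15VLSketch(nn.Module):
    """Schematic of the three-component layout described above.
    All modules and dimensions are illustrative stand-ins, not the real model."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # 1) SeedViT: vision encoder producing patch-level features (stand-in).
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=16, batch_first=True),
            num_layers=4,
        )
        # 2) MLP adapter: projects visual features into the LLM embedding space.
        self.adapter = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # 3) Seed1.5-LLM: MoE language model, represented here by a placeholder.
        self.llm = nn.Identity()

    def forward(self, image_patches, text_embeddings):
        # Encode the image, project it, then hand visual tokens and text
        # tokens to the language model as one interleaved sequence.
        visual_tokens = self.adapter(self.vision_encoder(image_patches))
        return self.llm(torch.cat([visual_tokens, text_embeddings], dim=1))
```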
GitHub: https://github.com/ByteDance-Seed/Seed1.5-VL
Project page: https://seed.bytedance.com/zh/tech/seed1_5_vl