In the increasingly competitive field of artificial intelligence, ByteDance's Seed team officially released its latest multimodal large model, Seed1.5-VL, on May 13. The model is intended to lay groundwork for advances in agent technology. Pre-trained on more than 3 trillion tokens of multimodal data, it offers strong general multimodal understanding and reasoning capabilities while significantly reducing inference costs.

According to the team, Seed1.5-VL performs on par with Google's recently launched Gemini 2.5 Pro, which supports unified understanding of images, video, audio, and code and leads GPT-4.0 on multiple benchmarks. ByteDance's Seed team stated that, despite having only 20 billion activated parameters, Seed1.5-VL achieved state-of-the-art (SOTA) results on 38 of 60 public evaluation benchmarks, including 14 of 19 video benchmarks and 3 of 7 GUI (graphical user interface) agent tasks.


In terms of specific capabilities, Seed1.5-VL demonstrates strong visual reasoning, image question answering, and video understanding. On agent-related tasks, the model achieved SOTA results on 3 of the 7 GUI tasks evaluated. Seed1.5-VL also uses a simplified architecture that reduces computational requirements, making it better suited to interactive applications. It can smoothly complete complex tasks such as information collection and processing across platforms, including PCs and mobile phones.


However, Seed1.5-VL still faces some challenges. In fine-grained visual perception, the model has difficulty counting objects, identifying differences between images, and explaining complex spatial relationships, especially when objects are irregularly arranged, similar in color, or partially occluded. In high-level reasoning tasks, the model also sometimes makes unsupported assumptions or gives incomplete responses, which indicates room for improvement in these areas.

Despite these challenges, the release of Seed1.5-VL marks ByteDance's continued progress in multimodal technology. The model is now available via API on Volcano Engine, so users can try it directly.