On February 11, Ant Group open-sourced its multi-modal large model Ming-Flash-Omni 2.0. Across public benchmarks, the model performs strongly on key capabilities including visual language understanding, speech-controlled generation, and image generation and editing, surpassing Gemini 2.5 Pro on some metrics and setting a new benchmark for open-source multi-modal large models.

Ming-Flash-Omni 2.0 is also the industry's first unified audio generation model that can simultaneously produce speech, ambient sound effects, and music on the same audio track. With nothing more than natural language instructions, users can finely control parameters such as voice, speaking speed, intonation, volume, emotion, and dialect. The model runs at an ultra-low inference frame rate of 3.1 Hz, enabling real-time, high-fidelity generation of minute-long audio while keeping efficiency and cost at an industry-leading level.
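To put the 3.1 Hz figure in perspective, assuming "inference frame rate" here means the number of audio representation frames the model generates per second of output audio (an interpretation not spelled out in the article), a full minute of audio requires only on the order of two hundred frames:

```python
# Back-of-the-envelope reading of the 3.1 Hz inference frame rate.
# Assumption (not stated in the article): the rate counts audio
# representation frames generated per second of output audio.
FRAME_RATE_HZ = 3.1      # frames generated per second of audio
AUDIO_SECONDS = 60       # "minute-long audio" from the paragraph above

frames_needed = FRAME_RATE_HZ * AUDIO_SECONDS
print(f"{frames_needed:.0f} frames for {AUDIO_SECONDS} s of audio")  # -> 186 frames
```

The lower the frame rate, the fewer generation steps are needed per second of audio, which is what makes real-time, minute-long generation tractable.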


(Figure caption: Ming-Flash-Omni 2.0 reaches open-source-leading levels in core capabilities such as visual language understanding, speech-controlled generation, and image generation and editing.)

Industry experts generally believe that multi-modal large models will eventually converge on a more unified architecture in which different modalities and tasks collaborate more deeply. In practice, however, multi-modal models often struggle to be both general and specialized: on individual capabilities, open-source multi-modal models frequently fall short of dedicated specialized models. Ant Group has invested continuously in multi-modal research for years, and the Ming-Omni series has evolved accordingly: early versions built a unified multi-modal capability base, intermediate versions verified the gains from scaling up, and the latest 2.0 release, through larger-scale data and systematic training optimization, pushes open-source multi-modal understanding and generation to an industry-leading level, even surpassing top specialized models in some areas.

The open-sourcing of Ming-Flash-Omni 2.0 means its core capabilities are released as a "reusable base," providing a unified capability entry point for end-to-end multi-modal application development.

Ming-Flash-Omni 2.0 is trained on the Ling-2.0 architecture (MoE, 100B-A6B: roughly 100B total parameters with about 6B activated per token) and has been comprehensively optimized around three goals: seeing more accurately, hearing more precisely, and generating more stably. For vision, it incorporates billions of fine-grained samples and hard-example training strategies, markedly improving recognition of challenging subjects such as similar animal species, craft details, and rare cultural relics. For audio, it generates speech, sound effects, and music simultaneously on the same track, supports fine-grained natural language control over voice, speaking speed, emotion, and other parameters, and provides zero-shot voice cloning and customization. For images, it improves the stability of complex edits, supporting lighting adjustment, scene replacement, character pose refinement, and one-click photo retouching, while keeping images coherent and details realistic even in dynamic scenes.
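The "100B-A6B" shorthand points to a sparse mixture-of-experts design in which only a fraction of the total parameters is activated for each token. The sketch below is a minimal top-k MoE layer in PyTorch; the expert count, top-k value, and dimensions are illustrative toy values, not Ling-2.0's actual configuration, which the article does not describe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer: each token is routed to only
    k of n experts, so the parameters active per token are a small fraction
    of the total -- the idea behind a "100B total / ~6B active" model."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: [tokens, d_model]
        scores = self.router(x)                    # [tokens, n_experts]
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        gates = F.softmax(topk_val, dim=-1)        # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # naive dispatch, fine for a demo
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(16, 64)
print(moe(tokens).shape)  # torch.Size([16, 64]); only 2 of 8 experts run per token
```

With this kind of routing, per-token compute scales with the active experts rather than the full expert pool, which is how a model with 100B total parameters can keep per-token cost close to that of a much smaller dense model.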

Zhou Jun, head of the Bai Ling model team, said the key to multi-modal technology is achieving deep integration and efficient use of multi-modal capabilities under a unified architecture. With the model open-sourced, developers can reuse its vision, speech, and generation capabilities within a single framework, significantly reducing the complexity and cost of integrating multiple models. Going forward, the team will continue to optimize video temporal understanding, complex image editing, and real-time long-form audio generation, improve the toolchain and evaluation system, and push multi-modal technology toward large-scale use in real business scenarios.

The model weights and inference code of Ming-Flash-Omni 2.0 are now available on open-source communities such as Hugging Face, and the model can also be tried out and called through Ant's official platform Ling Studio.
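For developers who want to fetch the released weights, a minimal sketch using the Hugging Face Hub client could look like the following; the repository id is a placeholder assumption, since the article does not name the exact repo, and the huggingface_hub package must be installed.

```python
# Hypothetical download sketch; the repo id below is a placeholder, not
# confirmed by the article -- check Ant's official Hugging Face page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="inclusionAI/Ming-Flash-Omni-2.0",  # placeholder repo id (assumption)
)
print("weights downloaded to:", local_dir)
```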