On September 1, Meituan officially launched the LongCat-Flash model series and has since open-sourced two versions, LongCat-Flash-Chat and LongCat-Flash-Thinking, which drew considerable attention from developers. Today, the LongCat team announced a new family member: LongCat-Flash-Omni. Building on the original foundation, the model introduces several technical innovations and moves the series into full-modal real-time interaction.
LongCat-Flash-Omni inherits the efficient architecture of the LongCat-Flash series, adopting the Shortcut-Connected MoE (ScMoE) design and integrating efficient multimodal perception modules and a speech reconstruction module. Despite its 560 billion total parameters (about 27 billion activated), it still delivers low-latency, real-time audio-video interaction, giving developers a more efficient foundation for multimodal applications.
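To make the sparse-activation claim concrete, here is a minimal, single-device sketch of top-k expert routing in PyTorch. All dimensions, expert counts, and module layouts are illustrative assumptions, not LongCat's actual configuration; in particular, the defining systems-level trick of ScMoE, a cross-layer shortcut that overlaps dense computation with expert communication, is not modeled in this toy.

```python
import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Sketch of standard top-k MoE routing (an assumption, not LongCat's code).
    Each token is processed by only k of the n experts, which is why the
    activated parameter count is a small fraction of the total."""

    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):  # x: (batch, seq, d_model)
        scores = self.router(x).softmax(dim=-1)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Dispatch each token only to its top-k experts.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[..., k] == e
                if mask.any():
                    out[mask] += topk_scores[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return x + out  # residual connection around the MoE layer
```

With 8 experts and top-2 routing, each token runs only a quarter of the expert parameters; scaled up, the same principle is how a model with 560 billion total parameters can activate only about 27 billion per token.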

According to comprehensive evaluation results, LongCat-Flash-Omni performs strongly on full-modal benchmarks, reaching state-of-the-art (SOTA) level among open-source models. It also remains competitive on key single-modal tasks, including text, image, and video understanding as well as speech perception and generation, achieving the stated goal of "no loss of intelligence across modalities."
LongCat-Flash-Omni adopts a unified full-modal architecture that combines offline multimodal understanding with real-time audio-video interaction. Its design is end-to-end throughout: visual and audio encoders serve as multimodal sensors, the model directly generates text and speech tokens, and a lightweight audio decoder reconstructs natural speech waveforms, keeping real-time interaction latency low.
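The paragraph above describes a dataflow in which encoders feed an LLM backbone that emits text and speech tokens, and a small decoder turns the speech tokens back into audio. The sketch below mirrors that shape with deliberately tiny stand-in modules; every dimension, vocabulary size, and layer choice is a placeholder assumption rather than LongCat's real components.

```python
import torch
import torch.nn as nn

class OmniDataflowSketch(nn.Module):
    """Toy end-to-end dataflow: sensors -> backbone -> text/speech tokens
    -> waveform. Linear layers stand in for real vision/speech encoders."""

    def __init__(self, d=512, text_vocab=32000, speech_vocab=4096, frame=240):
        super().__init__()
        self.vision_enc = nn.Linear(768, d)    # stand-in visual "sensor"
        self.audio_enc = nn.Linear(128, d)     # stand-in audio "sensor"
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(d, text_vocab)      # text tokens out
        self.speech_head = nn.Linear(d, speech_vocab)  # speech tokens out
        self.speech_emb = nn.Embedding(speech_vocab, d)
        # "Lightweight audio decoder": speech tokens -> waveform frames.
        self.audio_dec = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, frame))

    def forward(self, img_feats, aud_feats):
        # Both encoders feed one shared token sequence into the backbone.
        x = torch.cat([self.vision_enc(img_feats), self.audio_enc(aud_feats)], dim=1)
        h = self.backbone(x)
        text_logits = self.text_head(h)
        speech_tokens = self.speech_head(h).argmax(dim=-1)
        # Reconstruct an audio waveform from the generated speech tokens.
        waveform = self.audio_dec(self.speech_emb(speech_tokens)).flatten(1)
        return text_logits, waveform

model = OmniDataflowSketch()
text_logits, waveform = model(torch.randn(1, 16, 768), torch.randn(1, 50, 128))
```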
In addition, the model introduces a progressive early-fusion training strategy to handle the heterogeneous data distributions of different modalities in full-modal training. The strategy ensures that modalities cooperate effectively during training and improves overall model performance.
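As a rough illustration of what a progressive fusion curriculum could look like, the sketch below introduces modalities in stages with shifting mixing ratios. The stage order and ratios here are invented for illustration only; the team's official documentation defines the actual schedule.

```python
import random

# Hypothetical staged curriculum: modalities enter training gradually so
# the model is not confronted with all heterogeneous distributions at once.
STAGES = [
    {"name": "text_pretrain", "mix": {"text": 1.0}},
    {"name": "add_speech",    "mix": {"text": 0.7, "speech": 0.3}},
    {"name": "add_vision",    "mix": {"text": 0.5, "speech": 0.2, "image": 0.3}},
    {"name": "full_omni",     "mix": {"text": 0.4, "speech": 0.2, "image": 0.2, "video": 0.2}},
]

def sample_batch(stage, datasets, batch_size=8):
    """Draw a batch whose modality composition follows the stage's ratios."""
    modalities = list(stage["mix"])
    weights = [stage["mix"][m] for m in modalities]
    picks = random.choices(modalities, weights=weights, k=batch_size)
    return [(m, random.choice(datasets[m])) for m in picks]

datasets = {m: [f"{m}_sample_{i}" for i in range(100)]
            for m in ("text", "speech", "image", "video")}
for stage in STAGES:
    batch = sample_batch(stage, datasets)
    print(stage["name"], [m for m, _ in batch])
```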
In benchmark tests, LongCat-Flash-Omni performs well across domains. Notably, on text and image understanding tasks its capabilities did not regress at all; they improved significantly. Its audio and video processing is equally strong, and it leads many open-source models in the naturalness and smoothness of real-time audio-video interaction.
The LongCat team has also opened new ways to try the model: users can test image and file upload and voice calls on the official website, and the official LongCat App is now available with online search and voice calls, with video calling planned for a future release.
Hugging Face:
https://huggingface.co/meituan-longcat/LongCat-Flash-Omni
GitHub:
https://github.com/meituan-longcat/LongCat-Flash-Omni
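For readers who want to try the weights, a hypothetical quick-start is sketched below. It assumes the checkpoint supports the common Hugging Face trust_remote_code loading path, which is not confirmed here; the model card and GitHub README are authoritative for the supported runtime, and a 560B checkpoint will in practice require a multi-GPU server.

```python
# Hypothetical quick-start; verify the actual entry point on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meituan-longcat/LongCat-Flash-Omni"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # assumption: repo ships custom model code
    torch_dtype="auto",
    device_map="auto",       # requires accelerate; shards across GPUs
)

inputs = tokenizer("Hello, LongCat!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```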

