On April 3, the MiTi team officially released LongCat-Next, a native multimodal large model. The model breaks with the traditional "language foundation + plugin" architecture by converting images, speech, and text into a single shared space of discrete tokens, allowing the AI to natively "see" and "hear" the physical world the same way it processes text.
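The core idea can be sketched in a few lines. This is an illustration only: the vocabulary layout, sizes, and offsets below are assumptions for explanatory purposes, not LongCat-Next's actual configuration. Each modality's tokenizer emits integer codes, which are shifted into disjoint ranges of one shared vocabulary so that a single autoregressive model can consume a mixed-modality sequence:

```python
# Illustrative only: vocabulary sizes and layout are assumed, not LongCat-Next's real config.
TEXT_VOCAB = 50_000    # text tokens occupy ids [0, 50_000)
IMAGE_VOCAB = 8_192    # image codes shift into [50_000, 58_192)
AUDIO_VOCAB = 4_096    # audio codes shift into [58_192, 62_288)

IMAGE_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_VOCAB

def to_shared_ids(modality, codes):
    """Map per-modality tokenizer codes into the single shared vocabulary."""
    offset = {"text": 0, "image": IMAGE_OFFSET, "audio": AUDIO_OFFSET}[modality]
    return [offset + c for c in codes]

# A mixed "read + see" prompt becomes one flat id stream for one autoregressive model:
sequence = (
    to_shared_ids("text", [101, 7, 42])        # e.g. "describe this image:"
    + to_shared_ids("image", [3051, 19, 877])  # discrete image codes from a visual tokenizer
    + to_shared_ids("text", [88, 12])          # the model's textual answer
)
```

Because every id lives in the same vocabulary, the same parameters, attention, and loss apply to all modalities, which is exactly the unification claim the article makes.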

Technical Core: DiNA Architecture Achieves "Modality Internalization"

To break down the barriers between modalities, MiTi has built the DiNA (Discrete Native Autoregressive) architecture, achieving deep unification in multimodal modeling:

  • Full Modality Unification: Whether it's text, images, or audio, the model uses the same set of parameters, attention mechanisms, and loss functions.

  • Symmetry of Understanding and Generation: Under a unified mathematical form, predicting text tokens is "understanding," while predicting image tokens is "generation." The two tasks show significant synergy during training.

  • Extreme Compression: Using the dNaViT visual tokenizer, it supports arbitrary-resolution inputs and achieves up to 28x pixel-space compression through 8 layers of residual vector quantization, while preserving the key details needed for tasks such as OCR and financial report parsing.
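Residual vector quantization (RVQ), the technique named above, is worth unpacking: each level quantizes the residual left over by the previous level, so a vector collapses into a short stack of integer codes. A minimal sketch with toy sizes (the codebook dimensions here are illustrative; only the 8-level depth comes from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LEVELS = 8      # matches the 8 RVQ layers described above
CODEBOOK_SIZE = 16  # toy size; real tokenizers use far larger codebooks
DIM = 4             # toy embedding dimension

# One small codebook per level; in practice these are learned, not random.
codebooks = [rng.normal(size=(CODEBOOK_SIZE, DIM)) for _ in range(NUM_LEVELS)]

def rvq_encode(x, codebooks):
    """Return one code index per level; each level quantizes the running residual."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]  # next level models what this level missed
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the chosen code vector from every level."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

x = rng.normal(size=DIM)
codes = rvq_encode(x, codebooks)      # 8 small integers stand in for the whole vector
x_hat = rvq_decode(codes, codebooks)  # approximate reconstruction
```

The compression comes from replacing a continuous patch embedding with a handful of integer indices; stacking levels lets later codebooks recover detail the earlier ones discarded.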

Empirical Performance: Discrete Modeling Has No "Ceiling"

LongCat-Next demonstrates performance surpassing specialized models across multiple dimensions, effectively refuting the traditional view that "discretization inevitably leads to information loss":

  • Fine-Grained Perception: In dense text scenarios on OmniDocBench, its performance not only exceeds Qwen3-Omni but also outperforms the specialized visual model Qwen3-VL.

  • Visual Reasoning: It achieved an impressive score of 83.1 on MathVista, demonstrating strong visual and logical reasoning capabilities.

  • Cross-Modal Collaboration: While maintaining leading language capabilities (C-Eval 86.80), it supports low-latency parallel generation of text and speech, as well as customizable voice cloning.

Industry Insight: The Foundation for Physical World AI

For a long time, large models have been language-centered systems. The significance of LongCat-Next lies in proving that physical information can be discretized and modeled like language. When AI has a unified "native language," it becomes smarter and more intuitive when calling tools, writing code, and understanding complex charts.

Currently, MiTi has open-sourced the LongCat-Next model and the dNaViT tokenizer. This compact, high-potential native discrete architecture will provide important tools for developers to build AI capable of perceiving and acting upon the real world.