On April 3, the
Technical Core: DiNA Architecture Achieves "Modality Internalization"
To break down the barriers between modalities, the DiNA architecture is built on three design principles:
Full Modality Unification: Whether the input is text, images, or audio, the model uses the same set of parameters, the same attention mechanism, and the same loss function (see the first sketch after this list).
Symmetry of Understanding and Generation: Under one unified mathematical form, predicting text tokens is "understanding" while predicting image tokens is "generation"; the two tasks reinforce each other during training.
Extreme Compression: The dNaViT visual tokenizer supports inputs at arbitrary resolution and achieves up to 28x pixel-space compression through 8 layers of residual vector quantization (see the second sketch after this list), preserving key details in tasks such as OCR and financial-report parsing.
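To make the "one set of parameters, one loss" idea concrete, here is a minimal sketch of unified discrete next-token modeling in PyTorch. Everything in it is hypothetical: the shared vocabulary layout, the tiny transformer, and the `TEXT_VOCAB` / `IMAGE_VOCAB` sizes are illustrative assumptions, not DiNA's actual configuration.

```python
import torch
import torch.nn.functional as F

# Hypothetical shared ID space: text and image tokens live in one vocabulary,
# so a single embedding table, attention stack, and output head serve both.
TEXT_VOCAB = 100_000                     # text IDs:  [0, TEXT_VOCAB)
IMAGE_VOCAB = 8_192                      # image IDs: [TEXT_VOCAB, VOCAB_SIZE)
VOCAB_SIZE = TEXT_VOCAB + IMAGE_VOCAB

class TinyUnifiedLM(torch.nn.Module):
    """Stand-in for the shared transformer: one parameter set for all modalities."""
    def __init__(self, vocab=VOCAB_SIZE, dim=256):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids):
        # Causal mask makes this autoregressive next-token prediction.
        causal = torch.nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.encoder(self.embed(ids), mask=causal))

def next_token_loss(model, sequence):
    """One objective for everything: cross-entropy on the next token, whether
    that token is a word piece ("understanding") or an image code ("generation")."""
    inputs, targets = sequence[:, :-1], sequence[:, 1:]
    logits = model(inputs)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

model = TinyUnifiedLM()
# Interleaved sample: a text prompt followed by image-token IDs.
mixed = torch.cat([torch.randint(0, TEXT_VOCAB, (1, 16)),
                   torch.randint(TEXT_VOCAB, VOCAB_SIZE, (1, 48))], dim=1)
loss = next_token_loss(model, mixed)     # same parameters, same loss, both modalities
```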
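And here is a minimal sketch of the residual vector quantization idea behind the compression claim: each of 8 stages quantizes the error left over by the previous stage, so the stack approximates the latent far more precisely than a single codebook of the same total size. The codebook size and latent dimension below are illustrative assumptions, not dNaViT's actual settings.

```python
import torch

class ResidualVQ(torch.nn.Module):
    """Minimal residual vector quantizer with 8 stages, as described above."""
    def __init__(self, num_stages=8, codebook_size=1024, dim=256):
        super().__init__()
        self.codebooks = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.randn(codebook_size, dim))
             for _ in range(num_stages)]
        )

    def forward(self, z):                          # z: (batch, tokens, dim)
        residual = z
        quantized = torch.zeros_like(z)
        indices = []
        for codebook in self.codebooks:
            # Nearest codeword (L2) for what the previous stages failed to capture.
            books = codebook.unsqueeze(0).expand(z.size(0), -1, -1)
            idx = torch.cdist(residual, books).argmin(dim=-1)   # (batch, tokens)
            chosen = codebook[idx]                              # (batch, tokens, dim)
            quantized = quantized + chosen
            residual = residual - chosen           # hand the leftover error down
            indices.append(idx)
        # 8 small integer codes per visual token replace the full latent vector.
        return quantized, torch.stack(indices, dim=-1)

rvq = ResidualVQ()
latent = torch.randn(1, 64, 256)                   # e.g. 64 visual tokens
recon, codes = rvq(latent)                         # codes: (1, 64, 8)
```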
Empirical Performance: Discrete Modeling Has No "Ceiling"
Fine-Grained Perception: In dense-text scenarios on OmniDocBench, it not only surpasses Qwen3-Omni but also outperforms the specialized vision model Qwen3-VL.
Visual Reasoning: It scores 83.1 on MathVista, demonstrating reasoning strong enough for industrial applications.
Cross-Modal Collaboration: While maintaining leading language capability (86.80 on C-Eval), it supports low-latency parallel generation of text and speech, along with customizable voice cloning.
Industry Insight: The Foundation for Physical World AI
For a long time, large models have been language-centered systems. The significance of a fully unified architecture like DiNA is that it treats images and audio as first-class tokens in the same stream as text, a prerequisite for AI that must perceive and act in the physical world.