In recent years, artificial intelligence has changed rapidly, with large language models (LLMs) making significant progress on multi-modal tasks. These models show strong potential for understanding and generating language, but most current multi-modal models still use autoregressive (AR) architectures, which generate tokens one at a time in a fixed order and leave the inference process rigid. To address this limitation, a research team from The University of Hong Kong and Huawei Noah’s Ark Lab has proposed a new model called FUDOKI.
The core innovation of FUDOKI is its architecture, which is built on non-masked discrete flow matching. Unlike traditional autoregressive models, FUDOKI denoises all tokens in parallel, so each position can draw on context from both directions, which the team reports improves performance on complex reasoning and generation tasks. The model also handles image generation and text understanding within a single unified framework.
A key advantage of the model is its mask-free design, which makes generation more flexible. During inference, FUDOKI can revise tokens it has already generated, much as a person reconsiders and corrects an earlier answer. FUDOKI also performs well in image generation, scoring 0.76 on the GenEval benchmark, which surpasses autoregressive models of the same size and indicates high generation quality and semantic accuracy.
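To make the idea of mask-free revision concrete, here is a toy sketch of a parallel refinement loop in which every position may be rewritten at every step. It is not FUDOKI's sampler: the names refine and toy_denoise, the vocabulary, and the rule "pick the intended token with probability t" are all assumptions invented for illustration, with toy_denoise standing in for a trained model.

```python
import random

VOCAB = ["a", "b", "c", "d"]
TARGET = ["c", "a", "b", "d", "a"]  # toy "clean" sequence, known only to the stand-in denoiser

def toy_denoise(tokens, i, t, rng):
    # Hypothetical stand-in for a trained model: returns the intended token
    # with probability t, otherwise a random vocabulary token.
    return TARGET[i] if rng.random() < t else rng.choice(VOCAB)

def refine(length, vocab, denoise, steps, seed=0):
    rng = random.Random(seed)
    # Start from pure noise: random tokens, no mask symbol anywhere.
    tokens = [rng.choice(vocab) for _ in range(length)]
    for k in range(1, steps + 1):
        t = k / steps  # runs from 1/steps up to exactly 1.0
        # Parallel update: every new token is computed from the previous full
        # sequence, and no position is ever frozen, so earlier choices can change.
        tokens = [denoise(tokens, i, t, rng) for i in range(length)]
    return tokens

result = refine(len(TARGET), VOCAB, toy_denoise, steps=8)
```

In a masked scheme, a position is typically fixed once it has been filled in. In this loop a position that was briefly correct can still be overwritten and later corrected again, which is the kind of flexibility the mask-free design is meant to provide.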
FUDOKI is built on metric-induced probability paths and kinetic-optimal velocities. These techniques let the model take the semantic similarity between tokens into account during generation, resulting in more natural text and images. In addition, FUDOKI is initialized from pre-trained autoregressive models, which reduces training cost and improves efficiency.
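A small sketch can show what a metric-induced probability path might look like. It assumes one common form, in which the probability of a token decays with its embedding distance from the target token, scaled by a sharpness value beta that grows over time. The article does not give FUDOKI's exact formula, so the softmax form, the Euclidean distance, the toy embeddings, and the name metric_induced_probs are all illustrative assumptions.

```python
import numpy as np

def metric_induced_probs(embeddings, target_id, beta):
    # Distance from every vocabulary token to the target token
    # (Euclidean distance is an assumption made for this sketch).
    dist = np.linalg.norm(embeddings - embeddings[target_id], axis=1)
    logits = -beta * dist
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy vocabulary of four tokens; token 1 is close to token 0, token 3 is far away.
toy_embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [3.0, 0.0]])

early = metric_induced_probs(toy_embeddings, target_id=0, beta=0.0)   # uniform noise
mid = metric_induced_probs(toy_embeddings, target_id=0, beta=2.0)     # similar tokens favoured
late = metric_induced_probs(toy_embeddings, target_id=0, beta=50.0)   # concentrated on the target
```

With beta at zero every token is equally likely. As beta grows, probability mass shifts first toward tokens that are semantically close to the target and finally onto the target itself, so an intermediate noisy state is more likely to contain a near-synonym than an unrelated token.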
FUDOKI offers a new perspective on multi-modal generation and understanding and lays a more solid foundation for the development of artificial general intelligence. We look forward to further exploration and breakthroughs from FUDOKI that drive the continued advancement of AI technology.