ByteDance has released a new open-source multimodal foundation model named BAGEL, featuring 7 billion active parameters and a total parameter count of 14 billion.
BAGEL performs exceptionally well on standard multimodal understanding benchmarks, surpassing some of today's top open-source vision-language models such as Qwen2.5-VL and InternVL-2.5. In text-to-image generation quality, BAGEL is comparable to the powerful dedicated generator Stable Diffusion 3 (SD3). More importantly, BAGEL outperforms many leading open-source models on classic image-editing scenarios.
BAGEL adopts an architecture called Mixture of Transformers (MoT), designed to maximize the model's ability to learn from diverse multimodal information. It uses two independent encoders to capture pixel-level and semantic-level features of images. The overall framework follows a "next group of tokens prediction" paradigm: during training, the model is tasked with predicting the next group of language or visual tokens, treating prediction as a compression objective.
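To make the MoT idea concrete, below is a minimal, illustrative sketch in plain Python. It is not ByteDance's implementation: the class names, shapes, and the toy "uniform attention" step are all hypothetical. The point it demonstrates is the routing pattern the architecture describes, where all tokens pass through a shared attention step, but each token's feed-forward computation is handled by modality-specific expert weights.

```python
# Illustrative Mixture-of-Transformers (MoT) routing sketch.
# NOT BAGEL's actual code: names, shapes, and the simplified
# attention step are hypothetical, for exposition only.
import random

def linear(x, w, b):
    """Apply y = Wx + b to a single token vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

class MoTLayer:
    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        def init_weights():
            w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                 for _ in range(dim)]
            return w, [0.0] * dim
        # Separate expert weights per modality; attention is shared.
        self.experts = {"text": init_weights(), "image": init_weights()}
        self.dim = dim

    def shared_attention(self, tokens):
        # Toy shared step: each token attends uniformly to the whole
        # sequence (stands in for real shared self-attention).
        n = len(tokens)
        mean = [sum(t[i] for t in tokens) / n for i in range(self.dim)]
        return [[ti + mi for ti, mi in zip(t, mean)] for t in tokens]

    def forward(self, tokens, modalities):
        attended = self.shared_attention(tokens)
        out = []
        for tok, mod in zip(attended, modalities):
            w, b = self.experts[mod]  # route to modality-specific expert
            out.append(linear(tok, w, b))
        return out

# Usage: a mixed sequence of one text token and one image token.
layer = MoTLayer(dim=4)
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
outputs = layer.forward(tokens, ["text", "image"])
```

The design choice this mirrors is that expert separation lets understanding- and generation-oriented tokens develop specialized parameters, while the shared attention step keeps both modalities in one joint context.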
During pretraining, BAGEL leverages trillions of interleaved multimodal tokens from language, images, videos, and web data. After continuous training and supervised fine-tuning, BAGEL surpasses open-source models on standard understanding and generation benchmarks, showcasing advanced contextual multimodal capabilities such as free-form image editing, future frame prediction, 3D operations, and world navigation.
As BAGEL's pretraining scales up, researchers observed continuous performance improvements across understanding, generation, and editing tasks. Different capabilities emerge at different stages of training: multimodal understanding and generation appear early, while more complex intelligent-editing capabilities manifest later.
Research indicates that combining Variational Autoencoders (VAEs) and Vision Transformers (ViTs) significantly enhances intelligent editing capabilities, underscoring the importance of visual-semantic context in complex multimodal reasoning.
Project: https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT
Key Takeaways:
🌟 BAGEL is an open-source multimodal foundation model with 7 billion active parameters, outperforming rival open-source models on multiple standard benchmarks.
🖼️ The model performs excellently in image generation and editing tasks, capable of free-form image editing and world navigation.
📈 BAGEL's performance improves continuously as multimodal pretraining scales, extending to complex multimodal reasoning tasks.