Apple researchers recently introduced Manzano, a new image model designed to handle both image understanding and generation. Supporting both capabilities in one model is a technical challenge that many open-source models have struggled with, and Apple states that this makes Manzano more comparable to commercial systems such as those from OpenAI and Google in image-processing efficiency and performance.


Manzano has not yet been released or demonstrated publicly. However, Apple's research team shared a paper along with low-resolution image samples that demonstrate the model's ability to handle complex prompts. These samples were compared with outputs from the open-source model DeepSeek Janus Pro and the commercial systems GPT-4o and Gemini 2.5 Flash Image Generation (also known as "Nano Banana"). On three challenging prompts, Manzano performed comparably to OpenAI's GPT-4o and Google's Nano Banana.

Apple points out that most open-source models share a core limitation: they typically must trade off strong image analysis against strong image generation, whereas commercial systems handle both. Existing models struggle most on text-heavy tasks, such as reading documents or interpreting charts.

At the core of Manzano's design is a hybrid image tokenizer that outputs two types of tokens: continuous tokens, which represent images as floating-point values for understanding, and discrete tokens, which map image content to a fixed vocabulary of categories for generation. Because both token types come from the same encoder, conflicts that arise in traditional dual-purpose models are reduced.
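The idea can be illustrated with a toy sketch (this is not Apple's code; the dimensions, the linear encoder, and the VQ-style nearest-codebook lookup are illustrative assumptions): one shared encoder produces patch features, a continuous adapter keeps them as floats, and a discrete adapter snaps each feature to its nearest codebook entry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hybrid image tokenizer (illustrative only, not Apple's implementation).
PATCH_DIM, EMBED_DIM, CODEBOOK = 48, 32, 256
W_enc = rng.normal(size=(PATCH_DIM, EMBED_DIM)) * 0.1   # shared encoder weights
W_cont = rng.normal(size=(EMBED_DIM, EMBED_DIM)) * 0.1  # continuous adapter
codebook = rng.normal(size=(CODEBOOK, EMBED_DIM))       # discrete vocabulary

def tokenize(patches):
    feats = patches @ W_enc                  # one encoder feeds both paths
    continuous = feats @ W_cont              # float tokens (understanding)
    # squared distance to every codebook entry -> nearest id per patch
    d = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    discrete = d.argmin(axis=1)              # integer tokens (generation)
    return continuous, discrete

patches = rng.normal(size=(16, PATCH_DIM))   # 16 flattened image patches
cont, disc = tokenize(patches)
print(cont.shape)   # (16, 32)  float tokens for understanding
print(disc.shape)   # (16,)     integer ids for generation
```

Because `feats` is computed once and shared, the two token streams stay consistent with each other, which is the property the paper credits for reducing understanding/generation conflicts.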

During training, Manzano uses continuous and discrete adapters to adapt the language model's decoder; at inference time, these provide the two data streams needed for understanding and generating images. The architecture consists of three main parts: the hybrid tokenizer, a unified language model, and a separate image decoder for the final output. Apple built three image decoders of varying size (90 million, 175 million, and 352 million parameters), supporting resolutions ranging from 256 to 2048 pixels.
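A minimal data-flow sketch of these three components, with toy placeholder classes (none of these names are Apple's API), shows how the two paths route through the shared language model:

```python
# Toy routing sketch: hybrid tokenizer -> unified LLM -> image decoder.
# All classes are illustrative stand-ins, not Apple's implementation.

class ToyTokenizer:
    def __call__(self, image):
        cont = [float(p) for p in image]           # continuous tokens (floats)
        disc = [int(p * 10) % 256 for p in image]  # discrete token ids (ints)
        return cont, disc

class ToyLLM:
    def answer(self, question, image_tokens):
        # understanding path: continuous tokens condition the text decoder
        return f"answer to {question!r} using {len(image_tokens)} image tokens"

    def generate_image_tokens(self, prompt):
        # generation path: the LLM emits discrete ids autoregressively
        return [hash(prompt + str(i)) % 256 for i in range(4)]

class ToyImageDecoder:
    def render(self, token_ids):
        # a separate pixel decoder turns discrete ids into the final image
        return f"<image from {len(token_ids)} discrete tokens>"

tokenizer, llm, decoder = ToyTokenizer(), ToyLLM(), ToyImageDecoder()

cont, _ = tokenizer([0.1, 0.5, 0.9])             # understanding path
print(llm.answer("what is in the chart?", cont))

ids = llm.generate_image_tokens("a red apple")   # generation path
print(decoder.render(ids))                        # prints "<image from 4 discrete tokens>"
```

The point of the modular split is visible here: swapping in a larger `ToyImageDecoder` (as Apple did with its 90M/175M/352M variants) would not require touching the tokenizer or the language model.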

Apple's test results show Manzano performing strongly across multiple benchmarks, especially on text-heavy tasks such as chart and document analysis, where the 3-billion-parameter version stood out. The study also found that performance improved steadily as the model scaled from 300 million to 3 billion parameters.


Manzano handles not only classic image editing but also newer tasks such as prompt-based editing, style transfer, inpainting, outpainting, and depth estimation. Apple positions Manzano as a viable alternative to existing models and believes its modular design may have a lasting impact on future multimodal AI.

Paper: https://arxiv.org/abs/2509.16197

Key Points:   

🌟 Manzano is a new image model that can perform both image understanding and generation simultaneously.   

🔍 Apple's research shows that Manzano performs well in handling complex text tasks, approaching the level of commercial systems.   

⚙️ The model uses a hybrid image tokenizer, reducing conflicts between image understanding and generation.