Microsoft has officially open-sourced its latest multi-modal reasoning model, Phi-4-reasoning-vision-15B. At 15B parameters, the model stays lightweight while striking a balance between high performance and low cost, offering a new option for complex visual tasks in resource-constrained environments.

A compact powerhouse driven by refined data

Unlike industry models that are typically trained on trillions of tokens, Phi-4-reasoning-vision was trained on only 200B multi-modal tokens. The development team prioritized data quality: deep cleaning of open-source data, targeted synthetic data generation, and careful tuning of the domain mix (for example, increasing the share of math data, which also improved computer-operation capabilities). As a result, the model performs strongly on scientific reasoning and screen-grounding tasks.


Innovative hybrid reasoning strategy

A major highlight of this model is the "hybrid reasoning path" design:

  • Perception tasks: For simple tasks such as image description and OCR, the model defaults to a direct-answer mode, which reduces latency.

  • Reasoning tasks: For complex logic such as mathematical formulas and scientific charts, the model automatically invokes a structured chain-of-thought (CoT) path to ensure accurate answers.

    Users can also manually switch between the two modes with specific guiding words to suit different scenarios.
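The article does not reveal the actual guiding words, so as an illustration only, here is a minimal prompt-builder sketch assuming hypothetical "/think" and "/no_think" control tags (placeholders, not confirmed Phi-4 syntax):

```python
def build_prompt(question: str, mode: str = "auto") -> str:
    """Prepend a hypothetical mode tag to steer the reasoning path.

    The real guiding words for Phi-4-reasoning-vision are not specified
    in the article; "/think" and "/no_think" are placeholders.
    """
    tags = {
        "direct": "/no_think",   # fast path: image description, OCR
        "reasoning": "/think",   # structured chain-of-thought path
        "auto": "",              # let the model pick the path itself
    }
    if mode not in tags:
        raise ValueError(f"unknown mode: {mode!r}")
    tag = tags[mode]
    return f"{tag} {question}" if tag else question
```

In practice, such a string would be placed in the user turn of a chat template before being passed to the model.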

Thanks to the integration of the SigLIP-2 dynamic-resolution encoder, the model perceives small elements in high-resolution screenshots well. This makes it a strong candidate for building computer-use agents (CUA) that can accurately identify and operate buttons and input fields in web or mobile interfaces.
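The article does not detail how SigLIP-2 handles resolution, but dynamic-resolution encoders commonly cover a large screenshot with fixed-size tiles, downscaling only when a tile budget is exceeded. A rough sketch of that general idea (tile size and budget here are illustrative, not the model's actual values):

```python
import math

def plan_tiles(width: int, height: int, tile: int = 384, max_tiles: int = 16):
    """Illustrative dynamic-resolution tiling plan (not the actual
    SigLIP-2 algorithm): cover a screenshot with tile x tile crops,
    shrinking the image uniformly if the grid would exceed max_tiles."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    while cols * rows > max_tiles:
        # Scale both sides by the same factor to preserve aspect ratio.
        scale = math.sqrt(max_tiles / (cols * rows))
        width = max(tile, int(width * scale))
        height = max(tile, int(height * scale))
        cols = math.ceil(width / tile)
        rows = math.ceil(height / tile)
    return cols, rows, (width, height)
```

For a 1920x1080 screenshot this yields a 5x3 grid at full resolution, so small UI elements keep their native pixels; a 4K capture gets downscaled until it fits the budget.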

Phi-4-reasoning-vision-15B is now available on multiple open-source platforms. Microsoft hopes this compact model will prove that in the multi-modal field, "smaller and faster" can coexist with "stronger," further driving the adoption of spatial intelligence and real-time interaction technologies.