Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that builds on the language, vision, and speech research and datasets of the Phi-3.5 and Phi-4 models. It accepts text, image, and audio inputs and generates text outputs, with a context length of 128K tokens.
Multimodal · Transformers · Multiple Languages