PaliGemma 2 is a vision-language model (VLM) that pairs the SigLIP vision encoder with the Gemma 2 language model. It takes an image and text as input and generates text as output, and it supports multiple languages. It suits a range of vision-language tasks, including image and short-video captioning, visual question answering, text reading, object detection, and object segmentation.
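Because every task listed above produces plain text, the task is usually selected through the text prompt, and structured results such as bounding boxes come back encoded as text. The sketch below illustrates that idea under stated assumptions. The task prefixes (`caption`, `answer`, `ocr`, `detect`, `segment`) and the `<locNNNN>` box format (four tokens in `y_min, x_min, y_max, x_max` order, each a bin out of 1024) follow the conventions published for PaliGemma checkpoints and are not stated in this description. `build_prompt` and `parse_detections` are hypothetical helper names, not part of any library.

```python
import re


def build_prompt(task, text="", lang="en"):
    """Return a task-prefixed prompt string (hypothetical helper).

    The prefixes are assumed from the published PaliGemma prompt
    conventions; check the model card of the checkpoint you use.
    """
    if task == "caption":
        return f"caption {lang}"
    if task == "vqa":
        return f"answer {lang} {text}"
    if task == "ocr":
        return "ocr"
    if task == "detect":
        return f"detect {text}"
    if task == "segment":
        return f"segment {text}"
    raise ValueError(f"unknown task: {task}")


# Four location tokens followed by a label; multiple objects are
# separated by ';' in the generated text (assumed output format).
_BOX = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(output, width, height):
    """Convert detection text into pixel-space (x0, y0, x1, y1) boxes."""
    boxes = []
    for y0, x0, y1, x1, label in _BOX.findall(output):
        boxes.append({
            "label": label.strip(),
            # Each token is a bin index out of 1024, scaled to image size.
            "box": (
                int(x0) / 1024 * width,
                int(y0) / 1024 * height,
                int(x1) / 1024 * width,
                int(y1) / 1024 * height,
            ),
        })
    return boxes
```

For example, `build_prompt("vqa", "What color is the car?")` yields a question-answering prompt, and `parse_detections` can be applied to the text generated for a `detect` prompt to recover boxes in the coordinates of the original image.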
Multimodal
Transformers