Google's research team recently introduced PaLI-3, a vision-language model that performs on par with, or better than, models many times its size. Its key ingredient is a contrastively pre-trained image encoder, which enables PaLI-3 to excel at localization and visually situated text understanding tasks. PaLI-3 achieves top results on multiple visual question answering benchmarks, demonstrating strong multimodal understanding. Comparing classification pre-training against contrastive pre-training of the image encoder, the study found that the contrastive approach yields more efficient vision-language models.
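To make the key idea concrete: contrastive pre-training learns image and text encoders jointly so that matched image-text pairs score high and mismatched pairs score low. PaLI-3's encoder follows the SigLIP recipe, which scores every pair with an independent sigmoid loss rather than a batch-wide softmax. The sketch below is purely illustrative, not PaLI-3's actual implementation: the embeddings, batch size, and the temperature/bias values are assumptions for demonstration.

```python
import numpy as np

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Illustrative SigLIP-style pairwise sigmoid loss (not PaLI-3's real code).

    img_emb, txt_emb: (batch, dim) arrays where row i of each is a matched pair.
    t, b are a temperature and bias; the values here are arbitrary stand-ins
    (in practice they are learned parameters).
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # All pairwise logits: matched pairs on the diagonal, mismatches elsewhere
    logits = img @ txt.T * t + b
    # Label +1 for matched (i, i) pairs, -1 for every mismatched (i, j) pair
    labels = 2.0 * np.eye(len(img)) - 1.0
    # Independent log-sigmoid loss per pair, averaged over the whole grid
    return float(np.mean(np.log1p(np.exp(-labels * logits))))

# Toy check: correctly paired embeddings should incur a lower loss
# than deliberately shuffled (mismatched) pairings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = sigmoid_contrastive_loss(emb, emb)
shuffled = sigmoid_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

Because each pair is scored independently, this loss avoids the batch-wide normalization of a softmax-based contrastive objective, which is part of what makes the recipe efficient at scale.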