You've probably heard of Visual Language Models (VLMs). These AI whizzes aren't just good at reading text; they can also "see" and understand images. The truth, however, is not quite that simple. Today, let's take a look under the hood and see whether they truly understand images the way humans do.
First, let's clarify what VLMs are. Simply put, they are large language models, such as GPT-4o and Gemini 1.5 Pro, that excel at processing both images and text and even score high on many visual