According to a report by Science and Technology Daily, researchers at the Institute of Automation, Chinese Academy of Sciences, have reported an important advance: they have confirmed for the first time that multimodal large language models can spontaneously come to "understand" things during training, and that this understanding closely resembles human cognition. The finding not only opens a new path for probing how artificial intelligence "thinks," but also lays groundwork for building AI systems that understand the world the way humans do. The results have been published in the journal Nature Machine Intelligence.
Understanding is at the core of human intelligence. When we see a "dog" or an "apple," we not only recognize appearance features such as size, color, and shape; we also grasp what the object is for, the feelings it evokes, and its cultural significance. This multifaceted ability to understand underpins how we perceive the world. With the rapid development of large models such as ChatGPT, scientists have begun to ask whether these models can learn to "understand" things the way humans do from vast amounts of text and images.
Traditional artificial intelligence research has focused on object-recognition accuracy, with little discussion of whether a model truly "understands" what an object is. He Huiguang, a researcher at the Chinese Academy of Sciences, pointed out that although current AI can tell pictures of cats from pictures of dogs, how this "recognition" differs in essence from a human's understanding of cats and dogs still calls for in-depth study.
In this study, the research team drew inspiration from the cognitive principles of the human brain and designed a simple but telling experiment: having large models play an "odd one out" game alongside humans. Triplets of concepts were drawn from 1,854 common objects, and the player had to pick the one that fit least with the other two. By analyzing 4.7 million such judgments, the researchers built the first "concept map" of large models, a kind of mind map of how they organize objects.
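The article gives no implementation details, but the triplet task itself is easy to illustrate. The sketch below, in Python, shows one simple way such an odd-one-out judgment can be simulated from a model's concept embeddings; the cosine-similarity rule, the placeholder vectors, and the example concepts are assumptions for illustration, not taken from the paper.

```python
# Minimal sketch (not the authors' code): simulating a triplet
# "odd one out" judgment from concept embeddings.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def odd_one_out(names, vectors):
    """Return the concept least similar to the other two in a triplet.

    For each item, sum its similarity to the other two; the item with
    the lowest total similarity is judged the odd one out.
    """
    totals = []
    for i in range(3):
        others = [j for j in range(3) if j != i]
        totals.append(sum(cosine(vectors[i], vectors[j]) for j in others))
    return names[int(np.argmin(totals))]

# Hypothetical 4-d embeddings for three concepts (purely illustrative).
triplet = {
    "dog":   np.array([0.9, 0.8, 0.1, 0.2]),
    "cat":   np.array([0.8, 0.9, 0.2, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9, 0.8]),
}
print(odd_one_out(list(triplet), list(triplet.values())))  # -> "apple"
```

Collecting millions of such judgments, from models and from people, is what lets researchers compare how the two organize the same set of objects.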
From this data, the researchers distilled 66 key dimensions that capture artificial intelligence's "understanding" of things. These dimensions are not only readily interpretable but also align closely with the neural activity patterns in the brain regions responsible for processing objects. More importantly, multimodal models, which handle both text and images, came closer to humans in how they "think" through and make these choices.
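The article does not say how the 66 dimensions were obtained. In this line of research, a common approach (assumed here, not confirmed by the article) is to fit a low-dimensional, sparse, non-negative embedding so that it predicts the odd-one-out choices, and then interpret each learned dimension. The PyTorch sketch below illustrates that idea only; the random triplet data, number of training steps, and hyperparameters are made up, while the 1,854 objects and 66 dimensions come from the article.

```python
# Illustrative sketch (assumption, not the paper's code): learning a
# sparse, non-negative embedding from odd-one-out triplet judgments,
# so that individual dimensions become interpretable.
import torch

n_items, n_dims, n_triplets = 1854, 66, 10_000

# Fake triplet data: each row (i, j, k) records that k was chosen as the
# odd one out, i.e. the pair (i, j) was judged the most similar pair.
triplets = torch.randint(0, n_items, (n_triplets, 3))

emb = torch.nn.Parameter(torch.rand(n_items, n_dims) * 0.1)
opt = torch.optim.Adam([emb], lr=0.01)

for step in range(200):
    i, j, k = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    s_ij = (emb[i] * emb[j]).sum(dim=1)   # similarity of the "kept" pair
    s_ik = (emb[i] * emb[k]).sum(dim=1)
    s_jk = (emb[j] * emb[k]).sum(dim=1)
    # Softmax likelihood that (i, j) is the most similar pair of the three.
    logits = torch.stack([s_ij, s_ik, s_jk], dim=1)
    nll = -torch.log_softmax(logits, dim=1)[:, 0].mean()
    loss = nll + 0.01 * emb.abs().mean()  # L1 penalty encourages sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        emb.clamp_(min=0.0)               # keep the embedding non-negative

print(emb.shape)  # (1854, 66): one sparse, non-negative vector per concept
```

Sparsity and non-negativity are what make each dimension readable as a "perspective" on objects, which is also what allows the comparison with brain activity patterns described above.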
Interestingly, when humans make these judgments they weigh both an object's appearance and its meaning or function, whereas large models lean more heavily on the "text labels" and abstract concepts they have acquired. The finding indicates that large models have indeed developed a way of understanding the world that resembles the human one, opening a new chapter for artificial intelligence's capacity to understand.