In today's rapidly advancing field of artificial intelligence, large language models (LLMs) have demonstrated extraordinary capabilities. Yet scientifically assessing their "mental" characteristics, such as values, personality, and social intelligence, remains an urgent challenge. Recently, Professor Song Guojie's team at Peking University released a comprehensive review paper that systematically surveys research progress in the psychometrics of large language models, offering a new perspective on AI evaluation.

The paper, titled "Psychometrics of Large Language Models: A Systematic Review of Assessment, Validation, and Enhancement," spans 63 pages and references 500 relevant articles. As LLM capabilities iterate rapidly, traditional assessment methods have become insufficient. The paper points out that current evaluations face multiple challenges: the mental characteristics of LLMs exceed the scope of traditional assessments, rapid model iteration renders static benchmarks ineffective, and evaluation results are easily swayed by minor changes in test conditions. To address this, the team proposes introducing psychometrics into AI evaluation.


Psychometrics has long been dedicated to quantifying complex psychological traits, supporting educational, medical, and business decision-making through scientifically designed tests. The researchers found that applying its methodology to LLM assessment helps deepen understanding of, and ultimately enhance, the mental capabilities of AI. This methodological innovation opens a new perspective on AI assessment and drives the development of the interdisciplinary field of "LLM Psychometrics."

The paper proposes three innovative directions: first, adopting a "construct-oriented" assessment approach to probe the latent variables that influence model performance; second, introducing rigorous psychometric methods to improve the scientific rigor and interpretability of testing; and third, using item response theory to dynamically calibrate test item difficulty, making comparisons between different AI systems more scientific and fair.
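For readers unfamiliar with item response theory, a common starting point is the two-parameter logistic (2PL) model, shown here only as background; whether the review adopts this exact form or a variant is not specified in this summary. In the 2PL model, the probability that a test-taker with latent ability $\theta$ answers item $i$ correctly depends on the item's discrimination $a_i$ and difficulty $b_i$:

$$
P(X_i = 1 \mid \theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
$$

Fitting such a model places item difficulties and model abilities on a common scale, which is what allows test difficulty to be calibrated dynamically and different AI systems to be compared even when they answer different subsets of items.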

In addition, the study discusses the human-like psychological constructs exhibited by LLMs, including personality traits and ability constructs, and emphasizes their profound impact on model behavior. Through both structured and unstructured testing formats, the team establishes a methodological foundation for assessing the "mental" capabilities of LLMs, providing theoretical support for future AI development.

Paper URL: https://arxiv.org/pdf/2505.08245