In modern healthcare systems, electronic health records (EHRs) serve as a core data format, recording key patient information from diagnosis through treatment. These records not only provide decision support for physicians but also drive the development of medical artificial intelligence. Recently, a research team from Nanyang Technological University introduced EHRStruct, the first comprehensive benchmark for evaluating the ability of large language models (LLMs) to handle structured EHRs, marking a significant step forward in medical AI research.

The EHRStruct benchmark comprises 11 core tasks with a total of 2,200 samples, organized along three dimensions: clinical scenario, cognitive level, and functional category, together forming a rigorous evaluation framework. The researchers found that general-purpose LLMs performed well on structured EHRs, surpassing models designed specifically for the medical domain. They also found that models performed better on data-driven tasks, and that input format and fine-tuning method significantly influenced performance.
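For illustration, here is a minimal sketch of how a benchmark sample organized along these three dimensions might be represented. The schema, field names, and example values are hypothetical, since the article does not describe EHRStruct's actual data format.

```python
from dataclasses import dataclass

@dataclass
class EHRBenchmarkSample:
    """Hypothetical sample schema; EHRStruct's real format may differ."""
    task_id: str              # one of the 11 core tasks (name is illustrative)
    clinical_scenario: str    # e.g. "outpatient", "ICU"
    cognitive_level: str      # e.g. "retrieval", "reasoning"
    functional_category: str  # e.g. "extraction", "prediction"
    ehr_record: str           # serialized structured record shown to the model
    question: str             # task prompt
    reference_answer: str     # gold answer used for scoring

# A made-up example, purely to show the shape of one sample.
sample = EHRBenchmarkSample(
    task_id="lab_value_extraction",
    clinical_scenario="ICU",
    cognitive_level="retrieval",
    functional_category="extraction",
    ehr_record="patient_id | lab | value | unit\n1042 | creatinine | 1.8 | mg/dL",
    question="What is the patient's most recent creatinine value?",
    reference_answer="1.8 mg/dL",
)
```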

In the evaluation, the research team systematically compared 20 mainstream LLMs and 11 enhancement methods. The results showed that pairing the EHRMaster framework with the Gemini model significantly improved LLM performance on structured EHRs, even surpassing current state-of-the-art models. The work has been accepted at AAAI 2026 and is expected to attract wide attention in upcoming academic exchanges.
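To make the scale of that comparison concrete, the sketch below runs a model-by-method evaluation grid and records exact-match accuracy. The model names, method names, and the run_task helper are placeholders of our own, not the paper's actual harness.

```python
from itertools import product

# Placeholder lists: the study covered 20 LLMs and 11 enhancement methods;
# the names here are illustrative only.
MODELS = ["gemini", "general-llm-a", "medical-llm-b"]
METHODS = ["none", "few-shot-prompting", "ehrmaster"]

def run_task(model: str, method: str, sample: dict) -> str:
    # Placeholder: a real harness would apply `method` to the prompt and
    # query `model`'s API, returning the model's answer string.
    return ""

def evaluate(samples: list[dict]) -> dict[tuple[str, str], float]:
    """Score every (model, method) pair by exact-match accuracy."""
    scores = {}
    for model, method in product(MODELS, METHODS):
        correct = sum(
            run_task(model, method, s) == s["reference_answer"] for s in samples
        )
        scores[(model, method)] = correct / len(samples)
    return scores
```

In a grid like this, the article's headline result would show up as the ("gemini", "ehrmaster") cell leading the table.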

To advance development in this area, the research team also launched the "EHRStruct 2026 - LLM Structured Electronic Health Record Challenge," aiming to give researchers a unified, comparable evaluation platform and to promote in-depth research on the ability of LLMs to process structured EHRs.

The construction of EHRStruct proceeded in four stages: task synthesis, task system construction, task sample extraction, and evaluation process setup. Medical experts and computer scientists collaborated throughout to ensure the clinical relevance and reproducibility of the evaluation. The resulting framework is both scientifically rigorous and a rich source of data for future research.
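The four stages can be read as a simple pipeline. The function names below merely mirror the stage names in the article; the bodies are stubs, not released EHRStruct code.

```python
def synthesize_tasks(clinical_scenarios: list[str]) -> list[dict]:
    """Stage 1: draft candidate tasks from real clinical scenarios."""
    ...

def build_task_system(candidate_tasks: list[dict]) -> dict:
    """Stage 2: organize tasks by scenario, cognitive level, and function."""
    ...

def extract_task_samples(task_system: dict) -> list[dict]:
    """Stage 3: extract and validate concrete samples for each task."""
    ...

def setup_evaluation(samples: list[dict]) -> None:
    """Stage 4: fix prompts, metrics, and scoring for reproducible runs."""
    ...
```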

This study not only provides new tools and methods for advancing medical AI but also offers more reliable support for future clinical decision-making and data analysis. We look forward to more medical AI applications being deployed in clinical practice, delivering more efficient healthcare services.