Stanford University recently released a comprehensive evaluation of large language models for clinical medicine. DeepSeek R1 came out on top of nine advanced models with a 66% win rate and a macro average score of 0.75. What sets this evaluation apart is that it goes beyond traditional medical licensing exam questions and covers the day-to-day work scenarios of practicing clinicians, making the assessment far more practical.
The evaluation team built an integrated evaluation framework called MedHELM, which comprises 35 benchmarks covering 22 subcategories of medical tasks. The framework was validated by 29 practicing physicians from 14 medical specialties to ensure its soundness and clinical relevance. The results showed DeepSeek R1 performing best, followed by o3-mini and Claude 3.7 Sonnet.
DeepSeek R1 in particular delivered stable performance across the benchmarks, with a standard deviation of only 0.10 in its win rate, indicating consistency from test to test. o3-mini stood out in the clinical decision support category, achieving a 64% win rate and the highest macro average score of 0.77, placing it second overall. Claude 3.5 Sonnet and Claude 3.7 Sonnet followed closely with win rates of 63% and 64%, respectively.
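The article reports three aggregate metrics: win rate, macro average, and the spread of win rate across benchmarks. As a rough illustration of how such metrics are typically computed, here is a minimal Python sketch; the model names, scores, and tie-splitting convention are assumptions for illustration, not MedHELM's actual data or code.

```python
from statistics import mean, pstdev

# Hypothetical per-benchmark scores (model -> one score per benchmark).
# Purely illustrative; these are not MedHELM numbers.
scores = {
    "model_a": [0.82, 0.71, 0.64, 0.90],
    "model_b": [0.78, 0.74, 0.60, 0.85],
    "model_c": [0.70, 0.69, 0.66, 0.80],
}

def macro_average(model: str) -> float:
    """Unweighted mean over benchmarks, so every benchmark counts equally."""
    return mean(scores[model])

def per_benchmark_win_rates(model: str) -> list[float]:
    """For each benchmark, the fraction of head-to-head comparisons the model
    wins against every other model (ties counted as half a win)."""
    rivals = [m for m in scores if m != model]
    rates = []
    for i, own in enumerate(scores[model]):
        wins = sum(
            1.0 if own > scores[r][i] else 0.5 if own == scores[r][i] else 0.0
            for r in rivals
        )
        rates.append(wins / len(rivals))
    return rates

for m in scores:
    rates = per_benchmark_win_rates(m)
    print(f"{m}: macro avg {macro_average(m):.2f}, "
          f"win rate {mean(rates):.2%}, spread (std) {pstdev(rates):.2f}")
```

In this scheme, a low standard deviation of the per-benchmark win rates is what the article means by "stable performance across tests."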
Notably, the evaluation adopted an LLM-jury method to score model outputs, and the jury's ratings showed high consistency with scores given by practicing clinicians, supporting the validity of the approach. The research team also ran a cost-benefit analysis, finding that reasoning models are comparatively expensive to run while non-reasoning models are more cost-effective, so different models suit users with different needs and budgets.
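The article does not describe how the jury aggregates its judgments or how consistency with clinicians was measured. The sketch below is a hypothetical illustration of the general LLM-jury idea, not the MedHELM implementation: it averages ratings from several LLM jurors and checks agreement with clinician ratings using a simple Pearson correlation as the consistency measure.

```python
from statistics import mean

def jury_score(juror_ratings: list[int]) -> float:
    """Aggregate independent LLM-juror ratings into one score (here: the mean)."""
    return mean(juror_ratings)

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation, used here as a rough jury-vs-clinician consistency check."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical 1-5 ratings for five responses: three LLM jurors vs. clinician scores.
jury = [jury_score(r) for r in ([4, 5, 4], [3, 3, 4], [5, 5, 5], [2, 3, 2], [4, 4, 3])]
clinicians = [4.5, 3.0, 5.0, 2.5, 3.5]

print(f"jury vs. clinician correlation: {pearson(jury, clinicians):.2f}")
```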
This evaluation not only provides valuable data to support the development of medical AI but also opens up more possibilities and flexibility for future clinical practice.