OpenAI recently released HealthBench, a new open-source evaluation framework for measuring the performance and safety of large language models (LLMs) in realistic medical scenarios. The framework was developed with input from 262 doctors across 60 countries and 26 medical specialties, and aims to address the shortcomings of existing evaluation standards, particularly in real-world relevance, expert validation, and diagnostic coverage.

Existing medical AI benchmarks often rely on narrow, structured formats such as multiple-choice exams. While these are useful for initial assessments, they fail to capture the complexity and subtlety of real clinical interactions. HealthBench instead adopts a more representative setup: 5,000 multi-turn dialogues between models and either general users or medical professionals. Each dialogue ends with a user question, and the model's response is scored against rubric criteria written by doctors.
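The rubric-style scoring can be pictured with a small sketch. The data structures, the example criteria, and the clip-to-[0,1] convention below are illustrative assumptions rather than OpenAI's published implementation: each dialogue carries doctor-written criteria with point values, a grader decides which criteria a response meets, and the score is the earned points divided by the maximum achievable points.

```python
from dataclasses import dataclass


@dataclass
class RubricCriterion:
    """One doctor-written criterion; points may be negative for harmful behavior."""
    description: str
    points: int


def rubric_score(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Score a single model response against its rubric.

    `met[i]` is the grader's judgment of whether the response satisfies
    `criteria[i]`. The score is earned points over the maximum positive
    points, clipped to [0, 1] (an assumed convention).
    """
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    return max(0.0, min(1.0, earned / max_points))


# Hypothetical example: a chest-pain dialogue with three criteria.
criteria = [
    RubricCriterion("Recommends seeking emergency care for chest pain", 5),
    RubricCriterion("Asks about symptom duration and severity", 3),
    RubricCriterion("Gives a definitive diagnosis without examination", -4),
]
print(rubric_score(criteria, met=[True, False, False]))  # 0.625
```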

The HealthBench evaluation is divided into seven key topics: emergency referrals, global health, health data tasks, context seeking, tailored communication, response depth, and responding under uncertainty. Each topic represents a different challenge in medical decision-making and user interaction. In addition to the standard evaluation, OpenAI also introduced two variants:

1. HealthBench Consensus: Focuses on 34 doctor-validated criteria that capture critical aspects of model behavior, such as recommending urgent care or seeking additional context.

2. HealthBench Hard: A more challenging subset of 1,000 selected dialogues, designed to stress-test current state-of-the-art models.

Evaluations were run on a range of models, including GPT-3.5 Turbo, GPT-4o, GPT-4.1, and the updated o3 model. The results show substantial progress: GPT-3.5 Turbo scored 16%, GPT-4o scored 32%, and o3 reached 60%. Notably, the smaller, cost-efficient GPT-4.1 nano model outperformed GPT-4o while cutting inference cost by a factor of 25.

The results also revealed differences in performance across topics and evaluation axes. Emergency referrals and tailored communication were relative strengths, while context seeking and completeness remained more challenging. OpenAI also compared model outputs with responses written by doctors: working unassisted, doctors generally produced lower-scoring responses than the models, but they were able to improve model-generated drafts, especially drafts from earlier model versions.

HealthBench also includes mechanisms for evaluating model consistency, to ensure the reliability of its results. OpenAI's meta-evaluation on more than 60,000 annotated examples indicated that GPT-4.1, used as the default grader, performed no worse than individual doctors in most topics, demonstrating its potential as a consistent automated evaluator.
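In such a model-as-grader setup, the evaluator model is shown the conversation, the candidate response, and one rubric criterion at a time, and asked for a yes/no judgment. The prompt wording and the `complete` callable below are assumptions for illustration; the actual grading prompts live in the simple-evals repository.

```python
from typing import Callable

GRADER_TEMPLATE = """You are grading a model's reply in a health conversation.

Conversation:
{conversation}

Model reply:
{reply}

Criterion: {criterion}

Does the reply satisfy the criterion? Answer with a single word: yes or no."""


def judge_criterion(
    complete: Callable[[str], str],  # e.g. a wrapper around a chat-completions call
    conversation: str,
    reply: str,
    criterion: str,
) -> bool:
    """Ask a grader LLM whether `reply` meets one rubric criterion."""
    prompt = GRADER_TEMPLATE.format(
        conversation=conversation, reply=reply, criterion=criterion
    )
    answer = complete(prompt).strip().lower()
    return answer.startswith("yes")


# The per-criterion judgments can then feed a rubric scorer such as
# rubric_score() from the earlier sketch.
```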

Project: https://github.com/openai/simple-evals

Key Points:

- 🩺 OpenAI launched HealthBench, an evaluation focused on large language models in the medical domain, built with participation and validation from 262 doctors.

- 🔍 HealthBench covers seven key topics and 5,000 realistic dialogues, enabling more detailed analysis of model behavior.

- 📊 Evaluation results show significant differences in model performance, with GPT-4.1 nano performing well at much lower cost, highlighting the potential of these models as clinical tools.