OpenAI recently released HealthBench, a new open-source evaluation framework for measuring the performance and safety of large language models (LLMs) in realistic medical scenarios. The framework was developed with input from 262 doctors across 60 countries and 26 medical specialties, and aims to address the shortcomings of existing evaluation standards, particularly in real-world relevance, expert validation, and diagnostic coverage.

Existing medical AI benchmarks often rely on narrow, structured formats such as multiple-choice exams. While these are useful for initial assessments, they fail to capture the complexity and subtlety of real clinical interactions. HealthBench instead adopts a more representative setup: 5,000 multi-turn dialogues between models and either general users or medical professionals. Each dialogue ends with a user question, and the model's response is scored against rubric criteria written by doctors.
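The rubric-style scoring can be pictured with a small sketch. The data structures, the example criteria, and the clip-to-[0,1] convention below are illustrative assumptions rather than OpenAI's published implementation: each dialogue carries doctor-written criteria with point values, a grader decides which criteria a response meets, and the score is the earned points divided by the maximum achievable points.

```python
from dataclasses import dataclass


@dataclass
class RubricCriterion:
    """One doctor-written criterion; points may be negative for harmful behavior."""
    description: str
    points: int


def rubric_score(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Score a single model response against its rubric.

    `met[i]` is the grader's judgment of whether the response satisfies
    `criteria[i]`. The score is earned points over the maximum positive
    points, clipped to [0, 1] (an assumed convention).
    """
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    return max(0.0, min(1.0, earned / max_points))


# Hypothetical example: a chest-pain dialogue with three criteria.
criteria = [
    RubricCriterion("Recommends seeking emergency care for chest pain", 5),
    RubricCriterion("Asks about symptom duration and severity", 3),
    RubricCriterion("Gives a definitive diagnosis without examination", -4),
]
print(rubric_score(criteria, met=[True, False, False]))  # 0.625
```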

The HealthBench evaluation is divided into seven key topics: emergency referrals, global health, health data tasks, context seeking, tailored communication, response depth, and responding under uncertainty. Each topic represents a different challenge in medical decision-making and user interaction. In addition to the standard evaluation, OpenAI also introduced two variants:

1. HealthBench Consensus: Focuses on 34 doctor-validated criteria that capture critical aspects of model behavior, such as recommending urgent care or seeking additional context.

2. HealthBench Hard: A more challenging subset of 1,000 selected dialogues, designed to stress-test current state-of-the-art models.

Evaluations were run on a range of models, including GPT-3.5 Turbo, GPT-4o, GPT-4.1, and the updated o3 model. The results show substantial progress: GPT-3.5 Turbo scored 16%, GPT-4o scored 32%, and o3 reached 60%. Notably, the smaller, cost-efficient GPT-4.1 nano model outperformed GPT-4o while cutting inference cost by a factor of 25.

The results also revealed differences in performance across topics and evaluation axes. Emergency referrals and tailored communication were relative strengths, while context seeking and completeness remained more challenging. OpenAI also compared model outputs with responses written by doctors: working unassisted, doctors generally produced lower-scoring responses than the models, but they were able to improve model-generated drafts, especially drafts from earlier model versions.

HealthBench also includes mechanisms for evaluating model consistency, to ensure the reliability of its results. OpenAI's meta-evaluation on more than 60,000 annotated examples indicated that GPT-4.1, used as the default grader, performed no worse than individual doctors in most topics, demonstrating its potential as a consistent automated evaluator.
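In such a model-as-grader setup, the evaluator model is shown the conversation, the candidate response, and one rubric criterion at a time, and asked for a yes/no judgment. The prompt wording and the `complete` callable below are assumptions for illustration; the actual grading prompts live in the simple-evals repository.

```python
from typing import Callable

GRADER_TEMPLATE = """You are grading a model's reply in a health conversation.

Conversation:
{conversation}

Model reply:
{reply}

Criterion: {criterion}

Does the reply satisfy the criterion? Answer with a single word: yes or no."""


def judge_criterion(
    complete: Callable[[str], str],  # e.g. a wrapper around a chat-completions call
    conversation: str,
    reply: str,
    criterion: str,
) -> bool:
    """Ask a grader LLM whether `reply` meets one rubric criterion."""
    prompt = GRADER_TEMPLATE.format(
        conversation=conversation, reply=reply, criterion=criterion
    )
    answer = complete(prompt).strip().lower()
    return answer.startswith("yes")


# The per-criterion judgments can then feed a rubric scorer such as
# rubric_score() from the earlier sketch.
```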

Project: https://github.com/openai/simple-evals

Key Points:

- 🩺 OpenAI launched HealthBench, an evaluation focused on large language models in the medical domain, built with participation and validation from 262 doctors.

- 🔍 HealthBench covers seven key topics and 5,000 realistic dialogues, enabling more detailed analysis of model behavior.

- 📊 Evaluation results show significant differences in model performance, with GPT-4.1 nano performing well at much lower cost, highlighting the potential of these models as clinical tools.