Illuminis Labs
Question Everything!

Artificial intelligence is rapidly making inroads into healthcare—from AI chatbots answering patient questions to decision-support systems aiding clinicians. For hospital executives and health IT leaders, this wave of AI brings both tremendous promise and real peril. On one hand, AI tools like large language models (LLMs) could expand access to medical information and streamline clinical workflows.
On the other hand, a single unsafe or incorrect AI-generated recommendation could have dire consequences for patient safety. How can healthcare organizations harness AI’s potential while ensuring trust, accuracy, and safety?
Enter OpenAI’s HealthBench, an ambitious new benchmark designed to evaluate AI systems in healthcare settings. Launched in May 2025, HealthBench provides a standardized, rigorous way to test how well AI models perform on realistic medical scenarios. Experts have lauded this open-source project as “unprecedented” in scope, calling it a milestone toward safer AI in medicine. This report explores what HealthBench is, how it works, why it matters for healthcare leaders, and how it fits into the broader landscape of AI evaluation and patient safety.
“Right now, in healthcare, arguably we already have the technology to fundamentally change what it looks like. The key barrier is going to be adoption, and it's great to see more tools like this come out to help make that easier.”
— Justin Norden, M.D., Stanford Medicine
The Need for Rigorous AI Evaluation in Healthcare
Healthcare is a high-stakes domain. If a generative AI system gives the wrong dosage advice or misses a critical symptom, the fallout can be life-threatening. Yet until recently, there was no universal yardstick to judge whether an AI model is truly ready for clinical use. Many early benchmarks for medical AI were limited in scope. For example, popular benchmarks like MedQA, MedMCQA, or the USMLE exam questions showed that advanced models could score highly on multiple-choice tests – in some cases nearing expert-level performance. But passing an exam is not the same as handling real patient dialogues or clinical decisions. As Dr. Ethan Goh of Stanford AI Research noted, these benchmarks have become “saturated and less useful for measuring improvement (i.e., AI models are scoring close to 100%)”. In other words, today’s top models can ace board-style questions, leaving little room to distinguish further progress.
More importantly, multiple-choice or short-form tests fail to capture the messy reality of healthcare interactions. Patients often describe symptoms in roundabout ways; clinicians must ask follow-up questions; cultural and linguistic nuances abound. Previous evaluations often “do not reflect realistic scenarios, lack rigorous validation against expert medical opinion, or leave no room for state-of-the-art models to improve,” OpenAI observed. This gap means an AI could look “perfect” on paper but still stumble in a real conversation with a frightened patient or a complex case. For healthcare CIOs, CMOs, and risk officers, this is a serious concern – how do you know an AI tool won’t generate a harmful recommendation under unusual circumstances?
Compounding the issue, many organizations have rushed prototypes of AI tools into pilot use without robust evaluation, driven by hype or competitive pressure. The result is a trust gap: clinicians and boards remain (rightfully) skeptical of AI, having seen grand claims but little evidence. “Clinicians in the healthcare system are rightly skeptical… it’s on the onus of the community to show and track performance of these tools,” explains Dr. Justin Norden. In a risk-averse field like medicine, adoption will stall unless we can prove that AI systems are safe, effective, and reliable in the context that matters: real patient care. This is the backdrop against which OpenAI’s HealthBench emerged – to fill the evaluation gap and help set a shared standard for trustworthy AI in health care.
What Is HealthBench? A New Benchmark for Healthcare AI
HealthBench is the industry’s first comprehensive benchmark tailored specifically for healthcare AI. Created by OpenAI’s health AI team, it moves beyond simplistic tests and instead simulates realistic medical dialogues to evaluate models on what physicians care about most: accuracy, safety, clinical relevance, and effective communication. In essence, HealthBench is a massive stress test for medical AI systems, grounded in real-world scenarios.
At its core, HealthBench consists of 5,000 multi-turn clinical conversations that an AI might encounter. These aren’t just trivial Q&As – they cover a broad swath of healthcare situations: primary care consultations, emergency triage calls, mental health dialogues, specialist referrals, telehealth chats, and more. The conversations were generated through a mix of physician scripting, synthetic variation, and even adversarial testing (where experts try to find questions that trick the AI). Crucially, they were selected for difficulty – reflecting thorny, nuanced cases that would challenge even a seasoned clinician. They also span multiple languages, 26 medical specialties, and a range of patient and provider personas (from laypersons with low health literacy to veteran nurses). This diversity ensures that the benchmark isn’t biased toward one narrow use case; instead, it mirrors the complex tapestry of real healthcare interactions.
What truly sets HealthBench apart is its rigorous scoring rubric. For each of the 5,000 scenarios, a panel of physicians crafted a detailed checklist of what an ideal AI response should include or avoid, specific to that case. In total, HealthBench encompasses over 48,000 individual rubric criteria – covering everything from factual medical accuracy, to demonstrating empathy, to avoiding unsafe advice or unnecessary jargon. Each criterion is weighted by importance, and an AI’s answer earns points based on whether it meets each guideline. For example, in a chest pain scenario, criteria might include: Did the AI ask about heart attack warning signs? Did it avoid giving a definitive diagnosis prematurely? Did it advise the patient to seek emergency care if needed? The model’s response is graded against each such item.
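To make these weighted-rubric mechanics concrete, here is a minimal Python sketch of how rubric-based scoring can be aggregated. The criteria and point values are invented for illustration, and the aggregation scheme (sum earned points, with negative points for unsafe behaviors, normalized by the maximum achievable positive total and clamped to the 0–1 range) is an assumption about the general approach rather than HealthBench’s exact published implementation:

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str  # e.g., "Advises emergency care for red-flag chest pain"
    points: int       # positive for desired behaviors, negative for unsafe ones

def score_response(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Aggregate per-criterion met/unmet judgments into a single 0-1 score."""
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    # Clamp so a heavily penalized answer cannot score below zero.
    return max(0.0, min(1.0, earned / max_points))

# Worked example for a hypothetical chest-pain scenario:
criteria = [
    RubricCriterion("Asks about heart attack warning signs", 5),
    RubricCriterion("Advises seeking emergency care if symptoms escalate", 5),
    RubricCriterion("Gives a definitive diagnosis prematurely", -4),
]
print(score_response(criteria, [True, True, False]))  # 1.0 -- all desired behaviors, nothing unsafe
print(score_response(criteria, [True, False, True]))  # 0.1 -- one unsafe behavior erases most credit
```

Note how a single triggered negative criterion can wipe out most of the credit earned elsewhere, mirroring the benchmark’s emphasis on safety over raw factual recall.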
Scoring at scale is handled by another AI – a GPT-4.1-based grader – that has been calibrated to apply the physician-written rubrics to any model’s answer. This automated grading approach allows rapid evaluation of thousands of responses, and OpenAI reports that these AI-driven scores align closely with human physicians’ judgments. In other words, the evaluation process itself has been validated for trustworthiness, with the team finding “similar agreement between the AI grader and doctors as between two doctors” on the same case. The end result is that each model tested on HealthBench gets a score indicating how much of the physician-approved criteria it satisfied, along with a breakdown of performance in different domains (accuracy, safety, communication) and scenarios.
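For readers curious what the automated grading step can look like in code, below is a hedged sketch of asking a grader model to judge a response against one criterion at a time, using the standard OpenAI Python client. The prompt wording, the yes/no output format, and the one-criterion-per-call structure are illustrative assumptions, not OpenAI’s actual grader implementation:

```python
# pip install openai  (and set OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()

GRADER_TEMPLATE = """You are grading a medical AI response against one rubric criterion.

Conversation:
{conversation}

Response being graded:
{response}

Criterion: {criterion}

Does the response meet this criterion? Answer only "yes" or "no"."""

def grade_criterion(conversation: str, response: str, criterion: str,
                    grader_model: str = "gpt-4.1") -> bool:
    """Return True if the grader model judges the criterion to be met."""
    completion = client.chat.completions.create(
        model=grader_model,
        temperature=0,  # deterministic grading
        messages=[{"role": "user", "content": GRADER_TEMPLATE.format(
            conversation=conversation, response=response, criterion=criterion)}],
    )
    return completion.choices[0].message.content.strip().lower().startswith("yes")
```

The per-criterion booleans produced this way feed directly into an aggregation function like the scoring sketch above.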
By designing HealthBench to be meaningful, trustworthy, and “unsaturated,” OpenAI aimed to create a benchmark that truly drives progress. Meaningful means the scores reflect real-world impact – focusing on complex, realistic cases rather than trivia. Trustworthy means the evaluation is grounded in the standards of healthcare professionals (via physician-written rubrics and validation). And unsaturated means even the best current AI models struggle to get perfect scores – leaving ample room for future improvement. In fact, when OpenAI evaluated several of its latest models on HealthBench, they found that while newer models (like GPT-4.1 and others) outperform older ones, there are still many scenarios where even top models falter or only partially succeed. This indicates HealthBench is challenging enough to raise the bar and incentivize developers to create safer and more capable AI systems.
Finally, it’s important to note that HealthBench is open-source and extensible. The full dataset, evaluation code, and rubric definitions are openly available on GitHub. This openness means that the healthcare community at large can scrutinize, improve, and contribute to the benchmark. It invites collaboration – hospitals, universities, and companies can even add new scenarios or criteria relevant to their domain. By releasing HealthBench openly, OpenAI is encouraging a shared standard for evaluating health AI, rather than a proprietary test. As Karan Singhal, who leads OpenAI’s health team, explained, HealthBench was developed both to “shape shared standards” for researchers and to give healthcare organizations “high-quality evidence” about AI tools’ capabilities and limitations. In short, HealthBench aims to be a common yardstick that everyone can use to measure and compare health AI systems – a critical step to building trust in these technologies across the industry.
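For teams that want to explore the data directly, a hedged sketch of iterating over a local export of the benchmark might look like the following. The JSONL layout, the filename, and the field names ("prompt", "rubrics", "criterion", "points") are assumptions about the published schema; consult the GitHub repository for the authoritative format:

```python
import json

def load_examples(path: str):
    """Yield (conversation, rubric list) pairs from a JSONL export of the benchmark."""
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            yield example["prompt"], example["rubrics"]

# Inspect the first scenario and its physician-written rubric criteria.
for prompt, rubrics in load_examples("healthbench_eval.jsonl"):  # hypothetical filename
    for rubric in rubrics:
        print(rubric["points"], rubric["criterion"])
    break
```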
How HealthBench Can Be Used in Practice
A benchmark is only as valuable as its real-world applications. So how can HealthBench help healthcare professionals, executives, and developers in concrete ways? As it turns out, quite a few:
Vetting AI Solutions for Patient Safety: Perhaps the most immediate use is by hospitals, health systems, and telehealth providers to audit third-party AI offerings. Before deploying an AI chatbot or clinical decision tool, an organization can test the model on HealthBench’s scenarios to see whether it meets its safety and quality bar. In fact, some are already doing this – as one telehealth CTO noted, “we rely on HealthBench to vet new AI tools. The standardized scores give us confidence that any solution we deploy meets our rigorous safety standards.” Having a trusted external benchmark means a vendor can’t cherry-pick easy demos; they must prove performance on a broad array of tough cases. For healthcare CIOs and risk officers, this is a powerful way to ensure patient safety is not compromised by a shiny new AI. HealthBench’s focus on metrics like harmful-content avoidance, medical accuracy, and appropriate follow-up recommendations translates directly into safer deployments.
Accelerating Regulatory Approval: Medical device manufacturers and digital health startups can incorporate HealthBench results into their regulatory submissions to the FDA and other agencies. Regulators worldwide (FDA, EMA, MHRA, etc.) are increasingly asking for evidence of an AI’s performance and risk mitigation in real clinical contexts. By including a HealthBench report, companies can demonstrate that their model was rigorously evaluated against physician standards and found to perform acceptably. Priya Singh, a director of regulatory affairs at an AI medtech company, mentioned that including HealthBench results “significantly smoothed” conversations with FDA reviewers. In an era of evolving AI regulations, using a respected benchmark can give both vendors and regulators greater confidence that an AI tool is safe and effective, thereby smoothing the path to approval.
Internal Model Improvement and Monitoring: AI development teams – whether in a university research lab or a health IT vendor – can use HealthBench as a continuous yardstick during model training. For instance, an NLP team at a healthcare startup might benchmark different model versions on HealthBench to see which architecture or fine-tuning approach yields the best medical reasoning and lowest error rates. Because HealthBench has an automated evaluation pipeline, it can be integrated into a CI/CD (continuous integration/continuous delivery) workflow, enforcing performance gates for any model update. An enterprise could set a policy that no AI model is deployed into production unless it scores above a certain threshold on critical HealthBench scenarios (say, zero critical errors on the “Emergency Situations” theme); a minimal version of such a gate is sketched below. This kind of governance by metrics ensures that as models evolve, they don’t regress in safety or quality. Moreover, HealthBench can help pinpoint specific weaknesses – for example, a model might excel in general primary care advice but stumble on mental health questions, alerting the team to where additional training or guardrails are needed.
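Here is that minimal deployment-gate sketch. The threshold, theme name, and results structure are placeholders to be wired into whatever evaluation harness your team runs; only the gating logic itself is the point:

```python
import sys
from dataclasses import dataclass, field

@dataclass
class BenchResults:
    overall_score: float  # fraction of rubric points earned (0-1)
    critical_errors_by_theme: dict[str, int] = field(default_factory=dict)

def passes_deployment_gate(results: BenchResults,
                           overall_threshold: float = 0.60,
                           critical_themes: tuple[str, ...] = ("emergency_situations",)) -> bool:
    """Return True only if the candidate model clears every governance gate."""
    if results.overall_score < overall_threshold:
        print(f"FAIL: overall score {results.overall_score:.2f} < {overall_threshold:.2f}")
        return False
    for theme in critical_themes:
        if results.critical_errors_by_theme.get(theme, 0) > 0:
            print(f"FAIL: critical errors on theme '{theme}'")
            return False
    print("PASS: model meets deployment gates")
    return True

# In CI, feed harness output into the gate and fail the build on regression.
results = BenchResults(overall_score=0.64,
                       critical_errors_by_theme={"emergency_situations": 0})
sys.exit(0 if passes_deployment_gate(results) else 1)
```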
Demonstrating Value to Stakeholders: For health tech companies, having strong HealthBench results can be a selling point. Startups have begun to showcase their benchmark scores in pitches to investors or clients. It’s a way of saying, “Our AI isn’t just cool – it’s been objectively measured against the toughest standard in the industry and proven itself.” This can reassure hospital buyers (the CIOs, CMOs, etc.) that a product has been battle-tested. Similarly, academic researchers publishing a new medical AI algorithm can cite HealthBench evaluations to substantiate claims of improvement. Over time, we may see HealthBench (and similar benchmarks) become a de facto reference in research papers and vendor brochures alike, which in turn drives broader adoption of rigorous evaluation.
Aligning Teams and Building Trust: One underrated benefit of a framework like HealthBench is how it can bridge communication between diverse stakeholders. In healthcare AI projects, you often have data scientists, clinicians, compliance officers, and executives all in the mix. HealthBench provides a common language of quality. A clinician might not grok the intricacies of a transformer model, but they do understand the importance of a test where the AI had to handle a difficult patient case. By reviewing HealthBench scenario results together, teams can collectively identify where the AI needs improvement and agree on whether it’s safe to use. And because the benchmark is open source, teams can share and annotate scenario-level results to support this cross-functional review. The ultimate result is data-driven decisions about AI deployment, rather than decisions based on gut feeling or hype. For boards and C-suites, this kind of transparent evidence can be the difference between trusting an AI enough to green-light it, versus shelving a project over uncertainty.
In summary, HealthBench is not just an academic exercise – it’s already finding practical utility across the healthcare ecosystem, from R&D to regulatory to clinical operations. By adopting HealthBench in evaluation workflows, healthcare organizations can significantly reduce the risk of harmful or inaccurate AI behavior before it ever reaches patients and align their innovation efforts with the highest standards of quality and compliance.
Limitations and Complementary Approaches
While HealthBench is a groundbreaking step forward, it’s not a cure-all for evaluating every aspect of healthcare AI. As leaders considering this framework, it’s important to understand what HealthBench does and doesn’t cover, and how it fits alongside other evaluation needs.
Firstly, HealthBench focuses on conversational and decision-support tasks – essentially, how well an AI can handle medical Q&A and give advice in a chat/dialogue format. This is crucial for patient-facing tools (like symptom triage chatbots or virtual health assistants) and clinician support tools that answer questions. However, many real-world healthcare AI applications are not conversations at all, but rather involve automation of administrative or clinical workflows. For example, an AI system might auto-generate clinical notes, extract billing codes from charts, or flag abnormal results in a lab stream. These workflow-centric tasks are not explicitly tested by HealthBench. As James Griffin, a healthcare AI CEO, pointed out, “It doesn’t measure whether your model assigns the correct ICD-10 codes, generates compliant SOAP notes, or correctly parses a referral fax… The majority of real-world healthcare AI today is about reducing administrative burden”. In other words, if you’re building or buying an AI to automate prior auth paperwork, HealthBench’s patient dialogue scenarios won’t tell you if that AI does its job well.
This means that healthcare organizations will still need domain-specific benchmarks and tests for their particular use cases. HealthBench can be one pillar of your evaluation strategy – especially for anything involving clinical reasoning or communication – but it won’t replace, say, an internal test suite to verify that an AI documentation assistant correctly captures all critical patient information in the EHR. “If you’re developing clinical AI tools, your internal benchmark should reflect the actual tasks your system is performing…tied directly to the operational outcomes your clients care about,” Griffin advises. For a hospital CIO, this might mean defining KPIs like “AI-generated notes require <5% physician edits” or “coding suggestions have 98% billing acceptance,” and testing those specifically – on top of using HealthBench for the more general safety and knowledge checks.
Secondly, HealthBench itself, while comprehensive, is still a first iteration. It covers a wide range of scenarios, but there may be gaps. For instance, certain ultra-specialized medical domains or rare edge-case scenarios might not be fully represented in the 5,000 dialogues. The good news is that the open-source nature of the benchmark allows the community to extend it. We’re likely to see new scenario packs or rubric criteria contributed over time as practitioners identify new needs. Already, academic groups like Stanford’s Center for Research on Foundation Models have developed MedHELM (Medical HELM), a benchmark targeting clinical tasks within health systems (like reading EHR notes or clinical decision-making). One of MedHELM’s creators, Dr. Nigam Shah, notes that HealthBench is “directionally aligned in spirit” with their efforts and is “highly complementary” – focusing more on patient-facing interactions outside the hospital, whereas MedHELM looks at inside-the-hospital use cases. Together, such benchmarks start to cover the spectrum of healthcare AI contexts. For executives, the takeaway is that no single benchmark will cover every angle – but collectively, a set of well-designed evaluations (like HealthBench, MedHELM, etc.) can give robust assurance.
Thirdly, we should remember that benchmarks are proxies for reality. HealthBench’s simulated chats, no matter how realistic, are still not live clinical deployments. An AI that scores well on the benchmark is likely to be safer and more effective than one that doesn’t, but it’s not a guarantee of flawless real-world performance. Actual deployment introduces factors like integration with IT systems, real patient data quality, user interface design, and the unpredictability of human behavior. Therefore, ongoing monitoring remains critical. Organizations should treat a benchmark like HealthBench as necessary but not sufficient – a high score is a green light to proceed to careful pilot testing, not to skip directly to full deployment. Continuous post-deployment surveillance (for example, monitoring AI outputs on real cases for any anomalies) is recommended, echoing the “monitoring across health systems” approach championed by the Coalition for Health AI (CHAI).
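As one illustration of what such surveillance can look like in practice, the hedged sketch below samples live transcripts for clinician re-review and raises an alert when the average quality score of re-graded samples drifts below a pilot-phase baseline. The sampling rate, baseline, and tolerance values are placeholders to be tuned per organization:

```python
import random

def sample_for_review(transcripts: list[dict], rate: float = 0.05,
                      seed=None) -> list[dict]:
    """Randomly sample a fraction of live AI transcripts for rubric
    re-grading and clinician review (simple post-deployment surveillance)."""
    rng = random.Random(seed)
    return [t for t in transcripts if rng.random() < rate]

def scores_have_drifted(recent_scores: list[float], baseline: float,
                        tolerance: float = 0.05) -> bool:
    """Alert when the average score of re-graded samples falls more than
    `tolerance` below the baseline established during pilot testing."""
    if not recent_scores:
        return False
    return sum(recent_scores) / len(recent_scores) < baseline - tolerance

# Example: pilot baseline 0.62; recent re-graded samples average ~0.54 -> alert.
print(scores_have_drifted([0.55, 0.52, 0.56], baseline=0.62))  # True
```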
In summary, HealthBench is a milestone, not a panacea. It establishes a much-needed common foundation and pushes the industry standard upward, but healthcare leaders must still do their due diligence beyond it. That means creating internal benchmarks tied to your workflows, engaging in industry efforts to define best practices (such as CHAI’s work on AI model cards and assurance frameworks), and staying mindful of the contexts not covered by any given test. Used appropriately, HealthBench can be a powerful tool in your arsenal – as long as you understand its scope and integrate it into a broader strategy for AI governance.
The Road Ahead: Toward Safer AI and Shared Standards
OpenAI’s HealthBench has set into motion a new chapter in healthcare AI – one focused on accountability and evidence. Its emergence reflects a broader maturation in the field: a recognition that we must measure what matters when it comes to AI in medicine. So what comes next, and what should healthcare organizations be doing?
For one, we can expect evaluation frameworks to continue evolving. HealthBench itself may expand (it’s open to community contributions), and we’ll likely see more specialized benchmarks for different niches of healthcare. The ultimate vision is an ecosystem of benchmarks covering everything from chatbot bedside manner to diagnostic image analysis – all openly shared. As Dr. Nigam Shah pointed out, when multiple institutions use and contribute to shared benchmarks, it “creates a strong incentive to create [even more] shared benchmarking datasets” and a virtuous cycle of improvement. In the near future, asking “What’s the HealthBench score of this model?” could become as common as asking about its HIPAA compliance or its cloud security posture.
We also anticipate that regulators and professional bodies will incorporate benchmarks into their guidance. The FDA, for example, has been working on an adaptive regulatory framework for AI/ML in health. If benchmarks like HealthBench gain wide acceptance, agencies could reference them as suggested evaluation methods or even require evidence from such tests in approval processes. Likewise, hospital accrediting organizations and malpractice insurers might start looking for benchmark testing as part of risk management protocols. All this will further normalize the idea that rigorous AI validation is part and parcel of healthcare innovation, not an optional extra.
For healthcare leaders today, a key next step is to embrace a culture of AI evaluation and transparency. This means asking tough questions of any AI tool: How do we know it works? What’s the evidence? Insist that vendors provide validation data – and if they cite something like HealthBench, understand what that entails. Encourage your internal teams to test AI models independently, and share results (successes and failures alike) with the community. The field is moving fast, and nobody has all the answers, so a spirit of openness will benefit everyone. As OpenAI’s team put it, “evaluations like HealthBench are part of our ongoing efforts to understand model behavior in high-impact settings and help ensure progress is directed toward real-world benefit”. We should all be part of those efforts.
Finally, never lose sight of why we pursue AI in healthcare: to improve patient outcomes, increase access, and ease the burden on providers. Benchmarks and tests are a means to that end – guardrails to ensure we do no harm on the way to doing a lot of good. The fact that the latest models can match or even surpass physician-written answers on many HealthBench scenarios is astonishing, but equally important is identifying where they still fail and how to close those gaps. The path to trustworthy AI is a journey of continuous improvement. With initiatives like HealthBench, we’ve taken an important step on that path, but many steps remain. By continuing to “question everything” and hold our technology to the highest standards, we can navigate toward a future where AI truly fulfills its promise in healthcare – safely, ethically, and effectively.
Illuminis Labs: Guiding Your AI Strategy for Safety and Impact
At Illuminis Labs, our mission is to bridge cutting-edge technology with real-world healthcare needs – safely and strategically. We’ve seen the excitement around AI in medicine, but also the confusion and concern it can bring to leadership teams. Whether you’re a hospital CIO exploring an AI chatbot, a COO looking to streamline operations with automation, or a CISO worried about the risks, having the right AI strategy and guardrails is essential. That’s where we come in:
AI Strategy & Evaluation Consulting: We help you develop a clear AI roadmap tailored to your organization’s goals, aligning innovative tools with patient safety and compliance. Our experts can introduce frameworks like HealthBench and other benchmarks into your evaluation process, ensuring you have data-driven evidence before making deployment decisions. We’ll work with your clinicians, IT staff, and compliance officers to identify the metrics that matter for your use case and set up the internal benchmarks and testing protocols to hit those targets.
Safety Audits & Risk Mitigation: Drawing on our deep expertise in AI safety and cybersecurity, Illuminis Labs can stress-test your AI systems before they go live. Through AI red teaming, adversarial simulations, and rigorous scenario testing, we identify failure points or vulnerabilities in AI models. We’ll check not only performance on benchmarks like HealthBench, but also things like data security, privacy (HIPAA compliance), and resilience against manipulation. Our goal is to ensure that any AI tool you adopt is hardened against errors and exploits – giving you and your board peace of mind.
Implementation & Continuous Monitoring: Adopting AI isn’t a one-time event – it’s a lifecycle. Illuminis Labs provides end-to-end support for integrating AI solutions into your workflows. From vendor selection and pilot programs to training your staff on new AI-driven processes, we stand by you through rollout. We also help set up continuous monitoring systems so that as the AI is used in practice, you’ll get alerts to any drift in performance or emerging risks (for example, if an updated model starts behaving unexpectedly, you’ll know right away). By establishing this proactive oversight, we ensure your AI deployments remain reliable, effective, and compliant over time.
In a rapidly evolving landscape, having a partner who “questions everything” – and focuses on practical, safe outcomes – can make all the difference. Illuminis Labs can be that partner for you. We combine expertise in AI, healthcare, and security to help you navigate the complexities of modern health technology. From leveraging benchmarks like HealthBench to crafting bespoke evaluation metrics for your unique challenges, we equip your organization to innovate with confidence.
Ready to take the next step? If you’re looking to implement or scale AI in your healthcare enterprise and want to ensure you’re doing it right, reach out to us. We’ll help you sort out the hype from reality, build a solid strategy rooted in evidence and safety, and ultimately, harness AI in a way that truly benefits your clinicians and patients. In the quest to transform healthcare with AI, don’t go it alone – let Illuminis Labs help you engineer success at every stage.
Research Citations & Resources
OpenAI releases HealthBench to evaluate AI in healthcare
OpenAI unveils AI benchmark to evaluate health care models | STAT
HealthBench - Revolutionizing AI Evaluation in Healthcare
Introducing HealthBench | OpenAI
Why HealthTech Companies Should Ignore OpenAI's HealthBench
Illuminis Labs: Unmasking the Hidden Risks of Chinese LLMs in Critical Infrastructure