In Harvard Study, AI Outperformed Human Doctors in Emergency Room Diagnoses

Study Overview

A new study led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center was published this week in the journal Science. The research team conducted a series of experiments to evaluate how OpenAI’s AI models compared to human physicians across various medical scenarios.

Experimental Design

The study focused on 76 patients who visited the Beth Israel emergency room, comparing diagnoses from two internal medicine attending physicians to those generated by OpenAI’s o1 and 4o models. The diagnoses were independently assessed by two additional attending physicians who were blinded to whether each diagnosis came from a human or an AI.

The researchers emphasized that they did not “pre-process the data at all” — the AI models received the exact same information available in the electronic medical records at the time of each diagnosis.

Key Findings

The o1 model performed particularly well at the initial ER triage stage, the point at which patient information is most limited and decisions are most urgent:

The o1 model provided the “exact or very close diagnosis” in 67% of triage cases
The first physician did so in 55% of triage cases
The second physician did so in 50% of triage cases

“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said Arjun Manrai, who heads an AI lab at Harvard Medical School and is one of the study’s lead authors.

Important Limitations

The researchers explicitly stated that the study does not claim AI is ready to make real life-or-death decisions in the emergency room. Instead, they pointed to “an urgent need for prospective trials to evaluate these technologies in real-world patient care settings.”

The study also noted that it only tested how models performed with text-based information, and that “existing studies suggest that current foundation models are more limited in reasoning over non-text inputs.”

Ethical and Accountability Concerns

Adam Rodman, a Beth Israel physician and co-lead author, warned that there is “no formal framework right now for accountability” around AI diagnoses, and that patients still “want humans to guide them through life or death decisions and through challenging treatment decisions.”

Kristen Panthagani, an emergency physician commenting on the study, called it “an interesting AI study that has led to some very overhyped headlines,” noting that the comparison was against internal medicine physicians rather than ER specialists. “If we’re going to compare AI tools to physicians’ clinical ability, we should start by comparing to physicians who actually practice that specialty,” she said.

She added: “As an ER doctor seeing a patient for the first time, my primary goal is not to guess your ultimate diagnosis. My primary goal is to determine if you have a condition that could kill you.”

Looking Ahead

The study suggests that AI models can match or exceed physicians on text-based diagnostic tasks. However, substantial hurdles remain before clinical deployment, including regulatory approval, liability frameworks, and the ability of models to reason over non-text medical data such as imaging.

Sources: TechCrunch, Science
