AI outperformed doctors in real-world diagnoses.

A patient arrives at the hospital with a pulmonary embolism—a blood clot in the lungs. Although they initially improve, their condition soon worsens, leading doctors to suspect the treatment isn’t working.

An AI system reviews the patient’s records and identifies a possible history of lupus, an autoimmune disease that can cause heart inflammation, as the underlying issue.

The AI’s assessment turns out to be correct.

Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center found that an AI reasoning model developed by OpenAI performed exceptionally well in diagnosing patients and guiding care decisions, often matching or surpassing physicians as well as earlier models such as GPT-4.

The team tested the model using real clinical cases, including a lupus patient treated in the emergency department in Boston. They evaluated its diagnostic accuracy at different stages, from initial ER triage to hospital admission.

Overall, the AI outperformed two experienced physicians, despite having access only to the same limited electronic health record data available to them.

“This is the key takeaway—it works with messy, real-world emergency department data,” said Dr. Adam Rodman, noting its effectiveness in making accurate diagnoses in real clinical settings.

Other parts of the study examined case reports from the New England Journal of Medicine and clinical vignettes to evaluate whether the AI could meet established diagnostic benchmarks and handle complex cases.

According to Raj Manrai, the model outperformed a large group of physicians in these tests. However, the researchers noted that the AI relied solely on text, whereas real-world clinicians also interpret images, sounds, and nonverbal cues during diagnosis and treatment.

Even so, the results highlight how far AI has advanced. Earlier models often struggled with uncertainty and generating accurate differential diagnoses.

“This study clearly shows how much progress has been made,” said Dr. David Reich of Mount Sinai Health System, who was not involved in the research. He added that while the technology appears highly accurate and potentially ready for real use, the challenge now is integrating it into clinical workflows in ways that genuinely improve care.

Experts also caution that excelling at complex diagnoses does not fully reflect real-world medicine, where outcomes are often more nuanced. Emergency care represents only a small part of a patient’s overall treatment journey, and performance may vary with more extensive medical histories.

Importantly, the researchers stress that these findings do not support replacing doctors with AI, but rather highlight its potential as a supportive tool in clinical practice, one that signals a major technological shift that could reshape medicine.

However, the findings also highlight the need for rigorous testing of AI models—ideally through prospective clinical trials—to better understand their real-world impact on healthcare.

Designing such trials is complex, Reich noted, but he described the study as a strong call to action.