Events Calendar

Electronic Medical Records Boot Camp
2025-06-30 - 2025-07-01, 10:30 am - 5:30 pm
The Electronic Medical Records Boot Camp is a two-day intensive program of seminars and hands-on analytical sessions providing an overview of electronic health [...]

AI in Healthcare Forum
2025-07-10 - 2025-07-11, 10:00 am - 5:00 pm, New York
Jeff Thomas, Senior Vice President and Chief Technology Officer, shares how the migration not only saved the organization millions of dollars but also led to [...]

28th World Congress on Nursing, Pharmacology and Healthcare
2025-07-21 - 2025-07-22, 10:00 am - 5:00 pm
A conference to bring together scientific professionals from around the world. Conference date: July 21-22, 2025.

5th World Congress on Cardiovascular Medicine Pharmacology
2025-07-24 - 2025-07-25, 10:00 am - 5:00 pm
The 5th World Congress on Cardiovascular Medicine Pharmacology, scheduled for July 24-25, 2025 in Paris, France, invites experts, researchers, and clinicians to explore [...]

National Standard Unveiled for Scalable, Safe Healthcare AI

EMR Industry

Researchers at Duke University School of Medicine have developed two innovative frameworks to assess the performance, safety, and reliability of large language models in healthcare.

Published in npj Digital Medicine and the Journal of the American Medical Informatics Association (JAMIA), two new studies present a novel approach to ensuring that AI systems used in clinical environments adhere to the highest standards of quality, safety, and accountability.

As large language models become more integrated into healthcare—supporting tasks such as clinical note generation, conversation summarization, and patient communication—health systems face increasing challenges in evaluating these technologies in a rigorous yet scalable way. The Duke University-led research, headed by Chuan Hong, Ph.D., assistant professor in Biostatistics and Bioinformatics, aims to address this critical need.

The study published in npj Digital Medicine introduces SCRIBE, a structured evaluation framework for Ambient Digital Scribing tools. These AI-driven systems are designed to generate clinical documentation by capturing real-time conversations between patients and providers. SCRIBE combines expert clinical review, automated performance scoring, and simulated edge-case testing to assess tools across key metrics such as accuracy, fairness, coherence, and resilience.
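The combination of dimensions described above can be pictured as a weighted scorecard. The sketch below is purely illustrative: the four metric names come from the article, but the 0-1 scales, the weights, and the aggregation rule are assumptions, not the published SCRIBE methodology.

```python
from dataclasses import dataclass

# Hypothetical sketch of a SCRIBE-style multi-metric evaluation.
# Metric names are from the article; weights, scales, and the
# aggregation rule are illustrative assumptions only.

@dataclass
class TranscriptEval:
    accuracy: float    # 0-1, agreement with expert-reviewed documentation
    fairness: float    # 0-1, parity of note quality across patient subgroups
    coherence: float   # 0-1, automated scoring of note structure
    resilience: float  # 0-1, performance on simulated edge-case conversations

def scribe_score(e: TranscriptEval, weights=(0.4, 0.2, 0.2, 0.2)) -> float:
    """Weighted aggregate across the four dimensions (assumed scheme)."""
    dims = (e.accuracy, e.fairness, e.coherence, e.resilience)
    return sum(w * d for w, d in zip(weights, dims))

# Example: one transcript scored 0.9 / 0.8 / 0.85 / 0.7 across dimensions.
ex = TranscriptEval(accuracy=0.9, fairness=0.8, coherence=0.85, resilience=0.7)
print(round(scribe_score(ex), 3))  # prints 0.83
```

A single aggregate like this is convenient for dashboards, though a real framework would also report each dimension separately so a high accuracy score cannot mask a fairness failure.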

“Ambient AI has significant potential to ease documentation burdens for clinicians,” Hong noted. “But careful evaluation is crucial. Without it, there’s a risk of deploying systems that introduce bias, omit vital details, or compromise care quality. SCRIBE is built to safeguard against those risks.”

A second, related study published in JAMIA introduces a complementary framework for evaluating large language models integrated into the Epic electronic medical record system, specifically those used to generate draft responses to patient messages. The study assesses these AI-generated replies by comparing clinician feedback with automated evaluation metrics, focusing on attributes such as clarity, completeness, and safety.

While the models demonstrated strong performance in tone and readability, the study identified notable gaps in response completeness—highlighting the critical need for ongoing evaluation in real-world settings.
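One way such completeness gaps surface is when automated metrics and clinician ratings diverge on the same attribute. The sketch below illustrates that comparison; the attribute names (clarity, completeness, safety) come from the article, but the scores and the divergence threshold are invented for illustration and do not reflect the JAMIA study's data or method.

```python
# Hypothetical comparison of clinician ratings vs. automated metric scores
# per attribute. Attribute names are from the article; the numbers and the
# gap threshold are illustrative assumptions.

clinician = {"clarity": 0.92, "completeness": 0.61, "safety": 0.88}
automated = {"clarity": 0.90, "completeness": 0.85, "safety": 0.86}

def flag_gaps(human: dict, auto: dict, threshold: float = 0.15) -> list:
    """Return attributes where automated scores diverge from clinician ratings."""
    return [a for a in human if abs(human[a] - auto[a]) > threshold]

print(flag_gaps(clinician, automated))  # prints ['completeness']
```

Here the automated metric rates completeness far higher than clinicians do, which is exactly the kind of blind spot that motivates keeping human review in the evaluation loop.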

“This research helps bridge the gap between cutting-edge algorithms and meaningful clinical application,” said Michael Pencina, Ph.D., Chief Data Scientist at Duke Health and co-author of both studies. “It underscores that responsible AI implementation requires rigorous, ongoing evaluation as part of the technology’s entire life cycle—not just as a final step.”

Together, these two frameworks provide a robust foundation for the responsible integration of AI in healthcare. They equip clinical leaders, developers, and regulators with the tools necessary to evaluate AI models prior to deployment and to continuously monitor their performance—ensuring that these technologies enhance care delivery without compromising patient safety or trust.