Researchers at Duke University School of Medicine have developed two innovative frameworks to assess the performance, safety, and reliability of large language models in healthcare.
Published in npj Digital Medicine and the Journal of the American Medical Informatics Association (JAMIA), the two studies present approaches for ensuring that AI systems used in clinical environments meet high standards of quality, safety, and accountability.
As large language models become more integrated into healthcare, supporting tasks such as clinical note generation, conversation summarization, and patient communication, health systems face increasing challenges in evaluating these technologies in a rigorous yet scalable way. The Duke University research, led by Chuan Hong, Ph.D., assistant professor in Biostatistics and Bioinformatics, aims to address this need.
The study published in npj Digital Medicine introduces SCRIBE, a structured evaluation framework for Ambient Digital Scribing tools. These AI-driven systems are designed to generate clinical documentation by capturing real-time conversations between patients and providers. SCRIBE combines expert clinical review, automated performance scoring, and simulated edge-case testing to assess tools across key metrics such as accuracy, fairness, coherence, and resilience.
“Ambient AI has significant potential to ease documentation burdens for clinicians,” Hong noted. “But careful evaluation is crucial. Without it, there’s a risk of deploying systems that introduce bias, omit vital details, or compromise care quality. SCRIBE is built to safeguard against those risks.”
A second, related study published in JAMIA introduces a complementary framework for evaluating large language models integrated into the Epic electronic medical record system, specifically those used to generate draft responses to patient messages. The study assesses these AI-generated replies by comparing clinician feedback with automated evaluation metrics, focusing on attributes such as clarity, completeness, and safety.
While the models demonstrated strong performance in tone and readability, the study identified notable gaps in response completeness—highlighting the critical need for ongoing evaluation in real-world settings.
“This research helps bridge the gap between cutting-edge algorithms and meaningful clinical application,” said Michael Pencina, Ph.D., Chief Data Scientist at Duke Health and co-author of both studies. “It underscores that responsible AI implementation requires rigorous, ongoing evaluation as part of the technology’s entire life cycle—not just as a final step.”
Together, these two frameworks provide a robust foundation for the responsible integration of AI in healthcare. They equip clinical leaders, developers, and regulators with the tools necessary to evaluate AI models prior to deployment and to continuously monitor their performance—ensuring that these technologies enhance care delivery without compromising patient safety or trust.