When it Comes to Healthcare, AI Still Needs Human Supervision

Photo by Tara Winstead on Pexels

State-of-the-art artificial intelligence systems known as large language models (LLMs) are poor medical coders, according to researchers at the Icahn School of Medicine at Mount Sinai. Their study, published in NEJM AI, emphasises the necessity for refinement and validation of these technologies before considering clinical implementation.

The study extracted a list of more than 27 000 unique diagnosis and procedure codes from 12 months of routine care in the Mount Sinai Health System, while excluding identifiable patient data. Using the description for each code, the researchers prompted models from OpenAI, Google, and Meta to output the most accurate medical codes. The generated codes were compared with the original codes and errors were analysed for any patterns.

The investigators reported that all of the studied large language models, including GPT-4, GPT-3.5, Gemini-pro, and Llama-2-70b, showed limited accuracy (below 50%) in reproducing the original medical codes, highlighting a significant gap in their usefulness for medical coding. GPT-4 demonstrated the best performance, with the highest exact match rates for ICD-9-CM (45.9%), ICD-10-CM (33.9%), and CPT codes (49.8%).

GPT-4 also produced the highest proportion of incorrectly generated codes that still conveyed the correct meaning. For example, when given the ICD-9-CM description “nodular prostate without urinary obstruction,” GPT-4 generated a code for “nodular prostate,” showcasing its comparatively nuanced understanding of medical terminology. However, even considering these technically correct codes, an unacceptably large number of errors remained.

The next best-performing model, GPT-3.5, had the greatest tendency toward being vague. It had the highest proportion of incorrectly generated codes that were accurate but more general in nature compared to the precise codes. In this case, when provided with the ICD-9-CM description “unspecified adverse effect of anesthesia,” GPT-3.5 generated a code for “other specified adverse effects, not elsewhere classified.”

“Our findings underscore the critical need for rigorous evaluation and refinement before deploying AI technologies in sensitive operational areas like medical coding,” says study corresponding author Ali Soroush, MD, MS, Assistant Professor of Data-Driven and Digital Medicine (D3M), and Medicine (Gastroenterology), at Icahn Mount Sinai. “While AI holds great potential, it must be approached with caution and ongoing development to ensure its reliability and efficacy in health care.”

Source: The Mount Sinai Hospital / Mount Sinai School of Medicine